WO2023059402A1 - Array geometry agnostic multi-channel personalized speech enhancement - Google Patents
Array geometry agnostic multi-channel personalized speech enhancement
- Publication number
- WO2023059402A1 (PCT/US2022/040979)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- speaker
- pse
- model
- data
- target speaker
- Prior art date
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0272—Voice signal separating
- G10L2021/02082—Noise filtering the noise being echo, reverberation of the speech
- G10L2021/02087—Noise filtering the noise being separate speech, e.g. cocktail party
Definitions
- Speech enhancement models are widely used in online communication tools to remove background noise while letting human speech through.
- One limitation of some existing systems is that they cannot remove interfering speakers, at least because they are trained to preserve all human speech and are not capable of selecting a target speaker.
- Speech enhancement in such existing systems therefore may not include the capability of removing other speakers (e.g., interfering speakers).
- Examples of array geometry agnostic multi-channel personalized speech enhancement include: extracting speaker embeddings from enrollment data for at least a first target speaker; extracting spatial features from input audio captured by a microphone array, the input audio including a mixture of speech data of the first target speaker and an interfering speaker; providing the input audio, the extracted spatial features, and the extracted speaker embeddings to a trained geometry-agnostic PSE model; and using the trained geometry-agnostic PSE model without geometry information for the microphone array, producing output data, the output data comprising estimated clean speech data of the first target speaker with a reduction of speech data of the interfering speaker.
- FIG. 1 illustrates an example arrangement that advantageously provides personalized speech enhancement (PSE), and in some examples, array geometry agnostic multi-channel PSE;
- FIG. 2 illustrates various exemplary microphone array geometries that may be used in the arrangement of FIG. 1;
- FIG. 3 illustrates an example architecture for a PSE model that may be used in the arrangement of FIG. 1;
- FIG. 4 illustrates an example encoder/decoder that may be used in various architectures described herein;
- FIG. 5 illustrates an example bottleneck block that may be used in various architectures described herein;
- FIG. 6 illustrates an example architecture for a PSE model that may be used in the arrangement of FIG. 1;
- FIG. 7 illustrates another example architecture for a PSE model that may be used in the arrangement of FIG. 1;
- FIG. 8 illustrates another example architecture for a PSE model that may be used in the arrangement of FIG. 1;
- FIG. 9 illustrates another example architecture for a PSE model that may be used in the arrangement of FIG. 1;
- FIG. 10A illustrates an example complex encoder/decoder that may be used in various architectures described herein;
- FIG. 10B illustrates the complex encoder/decoder of FIG. 10A with inter-stream processing
- FIG. 11 is a flowchart illustrating exemplary operations that may be performed using the arrangement of FIG. 1;
- FIG. 12 is another flowchart illustrating exemplary operations that may be performed using the arrangement of FIG. 1;
- FIG. 13 is another flowchart illustrating exemplary operations that may be performed using the arrangement of FIG. 1;
- FIG. 14 is a block diagram of an example computing environment suitable for implementing some of the various examples disclosed herein.
- Examples of array geometry agnostic multi-channel PSE include using a trained geometry-agnostic PSE model to process input audio data and produce output audio data, representing clean speech, in real-time.
- Some examples extract speaker embeddings from enrollment data for at least a first target speaker, and extract spatial features from input audio captured by a microphone array.
- the input audio includes a mixture of speech data of the first target speaker and an interfering speaker.
- the input audio, the extracted spatial features, and the extracted speaker embeddings are provided to the trained geometry-agnostic PSE model.
- aspects of the disclosure produce output data comprising estimated clean speech data of the first target speaker with a reduction of speech data of the interfering speaker.
- Examples of multi-channel PSE include extracting speaker embeddings from enrollment data for at least a first target speaker, and extracting spatial features from input audio captured by a microphone array.
- the input audio includes a mixture of speech data of the first target speaker and an interfering speaker.
- the input audio, the extracted spatial features, and the extracted speaker embeddings are provided to a trained PSE model.
- output data is produced comprising estimated clean speech data of the first target speaker with a reduction of speech data of the interfering speaker.
- aspects of the disclosure improve the operations of computing devices at least by improving computerized speech enhancement, specifically PSE that outputs estimated clean speech data of a target speaker with a reduction of speech data of an interfering speaker.
- Multi-channel PSE uses inputs from multiple microphones, and geometry agnostic operation indicates that the PSE solution does not need to know the positions of the microphones in order to reduce or eliminate non-speech noise and speech data of an interfering speaker. Since different users may use different computer models, each with its own microphone geometry, this approach permits a single trained PSE model to be used on, for example, multiple different computer models without the training of the PSE model having to be tailored to each specific computer model.
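- As a rough orientation, the overall inference flow summarized above may be sketched as follows. This is a minimal illustration rather than the disclosed implementation, and the names `extract_dvector`, `geometry_agnostic_features`, and `pse_model` are hypothetical placeholders for the components detailed later in this description.

```python
import torch

def personalized_enhance(mic_waveforms, enrollment_wave,
                         extract_dvector, geometry_agnostic_features, pse_model):
    """mic_waveforms: (M, N) multi-channel capture; enrollment_wave: (N_e,) single-channel audio."""
    d_vec = extract_dvector(enrollment_wave)            # speaker embedding for the target speaker
    feats = geometry_agnostic_features(mic_waveforms)   # stacked STFTs plus spatial (IPD) features
    mask = pse_model(feats, d_vec)                      # estimated complex ratio mask
    return mask                                         # applied to the (virtual) microphone STFT
```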
- aspects of the disclosure provide systems and methods for improved PSE, including, in some examples, personalized noise suppression and echo cancellation.
- The PSE described herein is capable of removing both interfering speakers and environmental noises. It provides speech enhancement that utilizes enrollment data of the target speaker (e.g., around one minute of audio in some examples, although different examples are operable with longer and shorter intervals) and uses this enrollment data as a cue to isolate the target speaker's voice and filter out other audio.
- With PSE deployed to online communication tools, communication security and user privacy are improved because PSE filters out the voices of other household members. Additionally, PSE improves communication efficiency and the user experience by removing background speakers that may be distracting during a meeting (e.g., an audio and/or video conference).
- A multi-channel PSE network may improve results compared with a single-channel PSE network.
- Training a PSE model in such a way that it functions with different multi-channel array geometries is disclosed. This permits the PSE model to be trained with a geometry-agnostic approach, so that it functions with multi-channel (e.g., multi-microphone) hardware that had not been encountered during training (e.g., new hardware introduced to the market after training). This eliminates the need to retrain PSE models for each variation of microphone array geometry that may be encountered, permitting use of the same array-agnostic model for various microphone array geometries.
- enrollment data from two target speakers may be input into a trained PSE model.
- the two target speakers may be sharing a computer (using it at different times), co-presenting (using it at the same time), debating, hosting, etc.
- the PSE model filters out audio other than the audio matching the voices of the two enrolled target speakers. Examples may be run in real-time on standard personal computers (PCs).
- FIG. 1 illustrates an example arrangement 100 that advantageously provides multi-channel PSE, and in some examples, array geometry agnostic multi-channel PSE.
- a microphone array 200 captures input audio 112, which may include a mixture of speech data 112a of a first target speaker 102, speech data 112b of a second target speaker 102a, speech data 112c of one or more interfering speakers 104 (together, a set of speakers 106), and background noise 112d from a noise source 108.
- An audio recorder 124 is used to record audio signals when PSE is to be applied to recorded audio.
- PSE is performed by a trained PSE model 110, which outputs output data 114, which includes estimated clean speech data of enrolled target speaker(s), such as target speaker 102 and (in some examples) also target speaker 102a.
- output data 114 includes audio data, which may be used in an audio conferencing arrangement 120, so that remote user 122 is able to hear target speakers 102 and 102a clearly over interfering speaker 104 and noise source 108.
- target speaker 102 may be at a range of 0.3 to 1.3 meters away from a single channel microphone arrangement (e.g., microphone array 200 is reduced to just a single microphone), while interfering speaker 104 may be in excess of 2 meters away from the microphone, resulting in a voice volume of interfering speaker 104 that is 0 to 10 decibels (dB) lower than the voice volume of target speaker 102.
- target speaker 102 may be at a range of 0.5 to 2.5 meters away from a multichannel microphone arrangement, with interfering speaker 104 being at approximately the same distance.
- output data 114 includes a transcript 132, produced by a transcription service 130.
- transcription service 130 employs speech recognition, which is included in trained PSE model 110.
- Trained PSE model 110 is thus valuable in improving accuracy and completeness of transcript 132 by shielding transcription service 130 from confusing input, such as speech data 112c (from interfering speaker 104) and background noise 112d.
- arrangement 100 may be used for live audio conferencing (real-time PSE) and recorded audio (post-processing) to improve speech quality, and also real-time transcription and transcription of recorded audio.
- a trainer 140 uses training data 142, which may be tagged audio data (e.g., audio clips annotated with ground truth data), to train an input PSE model 110a.
- trainer 140 provides multi-task (MT) training, such that trained PSE model 110 is able to perform echo cancelation in addition to PSE.
- FIG. 2 illustrates various exemplary microphone array geometries that may be used in arrangement 100, with each dot representing a microphone and the circles merely representing a notional alignment reference. These arrangements may be used during training and/or operation, with the understanding that examples of the disclosure permit different microphone array geometries to be used during operation (use of the trained model) than were used for training, because the trained model operates without any geometry dependence. It should be further understood that a wide variety of other microphone array geometries may be used, including different numbers of microphones and irregular spacing, and that FIG. 2 is merely exemplary.
- Microphone array geometry 203a is a 3-channel circular array
- microphone array geometry 203b is a variation of microphone array geometry 203a with an additional center microphone.
- Microphone array geometry 203c is a 3-channel linear array.
- Microphone array geometry 204a is a 4-channel rectangular array.
- Microphone array geometry 204b is a 4-channel circular array, and microphone array geometry 204c is a variation of microphone array geometry 204b with an additional center microphone.
- Microphone array geometry 206a is a 6-channel circular array, and microphone array geometry 206b is a variation of microphone array geometry 206a with an additional center microphone.
- Microphone array geometry 208a is an 8-channel circular array, and microphone array geometry 208b is a variation of microphone array geometry 208a with an additional center microphone.
- Other circular arrays with, for example, 5 or 7 microphones may also be used, along with other shapes and numbers of microphones. Distances between microphones may vary between 3 and 10 centimeters (cm), and the radii of the array circles may also vary on approximately the same scale.
- Examples of the disclosure use deep neural networks (DNNs) for PSE and are causal, processing only past frames.
- a metric measures the target speaker over-suppression (TSOS), which assists in developing PSE models.
- MT training with a speech recognition back-end may help render PSE models robust against TSOS. Examples are thus able to provide improved transcription and speech quality.
- TSOS is a common limitation of PSE models. Due to inherent ambiguity of speaker embedding, PSE models, especially causal ones, may sometimes become confused and remove the target speaker’s voice, rather than background noise and interfering speakers. The TSOS issue may be even more severe for communication, for example when the target speaker’s voice is suppressed, even for a short period. Aspects of the disclosure provide a method to measure the TSOS reliably to provide methods to reduce or remove it.
- Example architectures described below include a deep complex convolutional recurrent network (DCCRN), a personalized DCCRN (pDCCRN), and a personalized deep convolution attention U-Net (pDCATTUNET).
- the pDCATTUNET architecture contains an encoder and decoder with 2-D convolutions and U-net-style skip connections.
- conversion of the DCCRN includes introducing the d-vector as an input to the original DCCRN architecture.
- Multiple configurations are possible, such as concatenating the d-vectors to the input of a complex long short-term memory (LSTM) layer (see FIG. 7).
- the d-vectors are concatenated to the real and imaginary parts of the tensors coming from the last layer of the encoder and fed to a complex LSTM. This modification increases the input size of the complex LSTM layer only, minimizing the additional computational cost.
- This model is referred to as pDCCRN.
- Test sets may be used to measure different aspects of the PSE models.
- a metric identified as TSOS measure estimates the over-suppression issue.
- Automatic speech recognition (ASR) based MT training may reduce the frequency of TSOS.
- MT training may further improve transcription quality and reduce the TSOS percentage with limited speech quality degradation.
- FIG. 3 illustrates an example architecture 300 for a PSE model that may be used in arrangement 100, for example when trained PSE model 110 comprises a multi-channel personalized noise suppression (MCPNS) model.
- pDCATTUNET is implemented using architecture 300.
- the speaker embeddings 308 (d-vector) are extracted from enrollment data for target speaker 102 (and also, in some examples, target speaker 102a) and concatenated with the output from final encoder block 4001f.
- the d-vectors represent the acoustic characteristics of a target speaker.
- a pre-trained Res2Net speaker ID model is used for d-vector extraction. Further detail for encoder blocks 4001a and 4001f is provided in relation to FIG. 4.
- the output of final encoder block 4001f is concatenated with the output of bottleneck block 500a to form the input to initial decoder block 4002f.
- a single-channel PSE model (e.g., architecture 300) performs complex spectral mapping based on pDCCRN.
- a pDCCRN approach uses a U-Net architecture with encoder and decoder blocks and two complex LSTM layers in-between. Each block contains complex 2-D convolutional layers followed by complex batch normalization. Complex layers are formed by two separate real- valued layers that operate on the real and imaginary parts of the layer input.
- the pDCCRN uses a d-vector and the mixture signal Y in the short-time Fourier transform (STFT) domain. The real and imaginary parts of the mixture signal are concatenated and fed as the input shown in Eq. (1), where C, F, and T are the number of channels, frequency bins, and time frames, respectively.
- the d-vector is replicated through time dimension, concatenated with the encoder’s output, and fed into a first LSTM layer.
- the model outputs a complex ratio mask which is multiplied with the input mixture to estimate the clean speech.
- the model is trained with a power-law compressed phase-aware mean-squared error (MSE) loss function. Operations are causal and may operate in real-time.
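- The following sketch illustrates, under assumed STFT parameters (512-point FFT, 256-sample hop at 16 kHz), how a pDCCRN-style model may assemble its input, append the d-vector before the LSTM, and apply a complex ratio mask. It is an illustrative approximation rather than the exact implementation of the disclosure.

```python
import torch

def stft_real_imag(y, n_fft=512, hop=256):
    """Return real and imaginary parts of the STFT of a 1-D waveform, each shaped (F, T)."""
    spec = torch.stft(y, n_fft=n_fft, hop_length=hop,
                      window=torch.hann_window(n_fft), return_complex=True)
    return spec.real, spec.imag

def build_network_input(y):
    """Concatenate real and imaginary parts channel-wise -> tensor of shape (1, 2, F, T)."""
    re, im = stft_real_imag(y)
    return torch.stack([re, im], dim=0).unsqueeze(0)

def append_dvector(encoder_out, d_vec):
    """Replicate the d-vector over the time dimension and concatenate it with the
    encoder output before the (complex) LSTM. encoder_out: (B, C, T); d_vec: (D,)."""
    B, _, T = encoder_out.shape
    d = d_vec.view(1, -1, 1).expand(B, -1, T)
    return torch.cat([encoder_out, d], dim=1)

def apply_complex_ratio_mask(mix_re, mix_im, mask_re, mask_im):
    """Complex multiplication of the mixture STFT by the estimated complex ratio mask."""
    est_re = mix_re * mask_re - mix_im * mask_im
    est_im = mix_re * mask_im + mix_im * mask_re
    return est_re, est_im
```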
- FIG. 4 illustrates an example encoder/decoder that may be used in various architectures described herein, for example as encoder blocks 4001a and 4001f, and decoder blocks 4002a and 4002f.
- a decoder performs an inverse of an encoder.
- the example encoder and decoder architectures include convolution layers and are supported by temporal modeling.
- the encoder/decoder blocks feed their input to a convolution layer with a two-dimensional (2-D) kernel, and it is followed by a parametric rectified linear unit (pReLU) and batch normalization (BN).
- Input 402 is provided to block 404, which performs 2-D convolution followed by a pReLU and BN.
- A pReLU is an activation function that generalizes a traditional rectified linear activation function with a slope for negative values.
- A rectified linear activation function is a piecewise linear function that outputs the input directly if the input is positive and zero if the input is negative.
- BN is used with deep neural networks to standardize inputs to a layer for each mini-batch, in order to reduce the number of training epochs required.
- Block 404 is followed by block 406, which is a max-pooling layer for encoders and a nearest-neighbor upsampling layer for decoders. Downsampling and upsampling operations are applied to the frequency dimension.
- Another convolution block 408 processes the results from block 406 and outputs three channels: query (Q) 410, key (K) 412, and value (V) 414. These intermediate spectrograms are used as the query, key, and value for multi-head attention module 416, Multihead(Q, K, V), which uses a causal attention mask.
- Multi-head attention is a module for attention mechanisms that runs an attention mechanism several times in parallel. Multiple attention heads allow attending to parts of a sequence differently (e.g., longer-term dependencies versus shorter-term dependencies).
- the output of multi-head attention module 416 is normalized by a layer normalization 418 and concatenated to the intermediate results coming from block 406 (e.g., from the first convolution block).
- a third convolution block 420 takes the concatenated results as input and sends its output 422 to the next layer.
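- A rough PyTorch sketch of an encoder block of the kind described above (FIG. 4) follows. The channel counts, kernel sizes, and number of attention heads are assumptions for illustration, and the number of output channels must be divisible by the number of heads; a decoder block would replace the max-pooling layer with nearest-neighbor upsampling.

```python
import torch
import torch.nn as nn

class AttentionEncoderBlock(nn.Module):
    def __init__(self, in_ch, out_ch, n_heads=4):
        super().__init__()
        # First convolution followed by a pReLU and batch normalization.
        self.conv1 = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=(3, 3), padding=(1, 1)),
            nn.PReLU(), nn.BatchNorm2d(out_ch))
        # Downsampling along the frequency dimension (encoder variant).
        self.pool = nn.MaxPool2d(kernel_size=(2, 1))
        # Second convolution producing query / key / value maps.
        self.to_qkv = nn.Conv2d(out_ch, 3 * out_ch, kernel_size=1)
        self.attn = nn.MultiheadAttention(out_ch, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(out_ch)
        # Third convolution fusing the attention output with the pooled features.
        self.conv3 = nn.Sequential(nn.Conv2d(2 * out_ch, out_ch, kernel_size=1), nn.PReLU())

    def forward(self, x):                       # x: (B, C, F, T)
        h = self.pool(self.conv1(x))            # (B, out_ch, F/2, T)
        b, c, f, t = h.shape
        q, k, v = self.to_qkv(h).chunk(3, dim=1)

        def flat(z):                            # fold frequency into the batch; attend over time
            return z.permute(0, 2, 3, 1).reshape(b * f, t, c)

        causal = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)
        a, _ = self.attn(flat(q), flat(k), flat(v), attn_mask=causal)
        a = self.norm(a).reshape(b, f, t, c).permute(0, 3, 1, 2)
        return self.conv3(torch.cat([h, a], dim=1))

# Example: a block mapping a 2-channel (real/imaginary) spectrogram to 32 feature maps.
# block = AttentionEncoderBlock(2, 32); out = block(torch.randn(1, 2, 64, 100))
```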
- FIG. 5 illustrates an example bottleneck block 500 that may be used in various architectures described herein, for example as bottleneck blocks 500a and 500d.
- bottleneck block 500 is designed using convolution layers with one-dimensional (1-D) kernels and multi-head attention modules.
- Input 502 to bottleneck block 500 is processed by a convolution layer followed by a PReLU (block 504) and a layer normalization 506.
- the intermediate results are fed to a multi-head attention module 508, followed by a layer normalization 510.
- a layer normalization 514 is applied, and the output 516 is sent to the next layer.
- 1-D batch normalization is applied to real and imaginary parts of the spectrogram after the STFT (STFT block 302).
- the network’s input is the concatenation of the real and imaginary parts of the STFT, channel-wise, resulting in a four-dimensional (4-D) tensor including a batch dimension.
- the complex ratio mask is applied to the original real and imaginary parts of the noisy spectrogram (e.g., the output of decoder block 4002a) to produce output data 114.
- In some examples, further filtering (e.g., bandpass filtering) is performed.
- a loss function is determined.
- A power-law compressed phase-aware (PLCPA) loss function is effective for both ASR and speech quality improvement, and may be written as
  L_PLCPA(s, ŝ) = (1 / (T·F)) Σ_t Σ_f [ α · ( |S(t,f)|^p − |Ŝ(t,f)|^p )² + (1 − α) · | |S(t,f)|^p · e^{jφ(S(t,f))} − |Ŝ(t,f)|^p · e^{jφ(Ŝ(t,f))} |² ]   (2)
  where Ŝ and S are the estimated and reference (clean) spectrograms, respectively. Parameters t and f represent the time and frequency index, while T and F stand for the total numbers of time and frequency frames, respectively.
- Hyper-parameter p is a spectral compression factor and is set to 0.3 in some examples. Operator φ calculates the argument of a complex number. Parameter α is the weighting coefficient between the amplitude and phase-aware components.
- asymmetric loss is adapted to the amplitude part of the loss function defined by Eq. (2).
- the asymmetric loss penalizes the T-F bins where the target speaker's voice is removed.
- The power-law compressed phase-aware asymmetric (PLCPA-ASYM) loss function may be written as
  L_PLCPA-ASYM(s, ŝ) = L_PLCPA(s, ŝ) + β · L_OS(s, ŝ), with L_OS(s, ŝ) = (1 / (T·F)) Σ_t Σ_f [ max( |S(t,f)|^p − |Ŝ(t,f)|^p, 0 ) ]²   (3)
  where β is the positive weighting coefficient for L_OS.
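- The loss terms above may be approximated in code as follows; this is a hedged sketch based on the definitions in this description (compression factor p, amplitude/phase weighting α, and over-suppression weight β), not a verbatim reproduction of the disclosed training code.

```python
import torch

def plcpa_loss(est, ref, p=0.3, alpha=0.5):
    """est, ref: complex STFTs of the estimated and reference (clean) speech."""
    est_c = est.abs().clamp(min=1e-8) ** p          # power-law compressed magnitudes
    ref_c = ref.abs().clamp(min=1e-8) ** p
    amp = (est_c - ref_c).pow(2)                    # amplitude component
    phase = (torch.polar(est_c, est.angle()) -      # phase-aware component on compressed spectra
             torch.polar(ref_c, ref.angle())).abs().pow(2)
    return (alpha * amp + (1 - alpha) * phase).mean()

def over_suppression_loss(est, ref, p=0.3):
    """Penalize T-F bins where the estimate has less energy than the reference,
    i.e., where the target speaker's voice was removed."""
    diff = ref.abs().pow(p) - est.abs().pow(p)
    return torch.clamp(diff, min=0.0).pow(2).mean()

def plcpa_asym_loss(est, ref, p=0.3, alpha=0.5, beta=1.0):
    return plcpa_loss(est, ref, p, alpha) + beta * over_suppression_loss(est, ref, p)
```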
- multi-task (MT) training is adapted using a frozen ASR back-end to update the parameters of the PSE network. Because enrollment data is not available for the ASR data, the d-vectors are extracted directly from the enhanced utterance.
- The MT loss is denoted as L_MT.
- FIG. 6 illustrates an example architecture 600 that may also be used in arrangement 100 as a version of trained PSE model 110.
- Architecture 600 leverages spatial features 660 to further improve the ability to distinguish between target speaker 102 and interfering speaker 104.
- inter-channel phase difference (IPD) captures the relative phase difference between microphones of microphone array 200, which reflects the time difference of arrival (TDoA) for different microphones, based on the different angles of target speaker 102 and interfering speaker 104 relative to the microphone array.
- speech enhancement (SE) performance may be improved further by using microphone arrays, such as microphone array 200.
- spatial information may be extracted and combined with spectral information for obtaining superior SE models.
- the disclosure below extends PSE to utilize the microphone arrays for environments where strong noise, reverberation, and an interfering speaker are present.
- the combination of the enrolled speaker embedding and the spatial features significantly improves SE performance.
- the impact of speaker embeddings and spatial features is examined in challenging conditions where target and interfering speakers have similar angles or distances to the microphone array.
- an array geometry agnostic PSE model is disclosed that works regardless of the number of microphones and the array shape.
- An advantage of this model is the simplification of the development process: the model is trained once and used across multiple microphone array devices without needing to train the model for each user device model (e.g., different models of notebook computers, with different numbers and/or placement of microphones).
- the geometry agnostic model yields consistent improvements over a model that is trained on specific array geometry.
- the effectiveness of the proposed geometry agnostic model for unseen array geometries is also described below.
- stream pooling layers, which are based on the averaging and concatenation of feature maps, are introduced for convolutional neural networks.
- multiple approaches are employed to extend a single-channel PSE model to M microphones with a fixed array geometry.
- real and imaginary parts of short time Fourier transform (STFT) values for all microphones are stacked in the channel dimension to create the input shown in Eq. (4) and fed to a multi-channel PSE model, as shown in FIGs. 6-9.
- IPD is used for the spatial features and is defined for the microphone pair (i, j) and a mixture signal Y as IPD_{i,j}(t, f) = φ(Y_i(t, f)) − φ(Y_j(t, f)), where φ again denotes the argument of a complex value.
- the first microphone is paired with the rest of the microphones to extract M−1 IPD features.
- the cosine and sine values of the IPD features are stacked with the real and imaginary parts of the first microphone STFT, respectively, to form the input shown in Eq. (4).
- the estimated mask is applied to the first microphone in both approaches.
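- A short sketch of the fixed-geometry input construction described above: the first microphone is paired with every other microphone to produce M−1 IPD features, whose cosine and sine values are stacked with the real and imaginary parts of the first microphone's STFT. The exact stacking order below is an assumption for illustration.

```python
import torch

def fixed_geometry_features(specs):
    """specs: complex STFTs for M microphones, shape (M, F, T)."""
    ref = specs[0]
    ipd = specs[1:].angle() - ref.angle().unsqueeze(0)     # (M-1, F, T) phase differences
    feats = torch.cat([ref.real.unsqueeze(0), torch.cos(ipd),
                       ref.imag.unsqueeze(0), torch.sin(ipd)], dim=0)
    return feats                                           # (2M, F, T) input to the PSE model
```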
- the PSE model is adaptable at run-time (e.g., on the platform device) based on the use of a speaker profile, adapted to device audio geometry, and supports acoustic echo cancellation.
- the PSE model may be trained to perform echo cancellation in addition to PSE.
- the virtual microphone signal Y_v is created by taking the average of all microphones of a microphone array (e.g., microphone array 200), and is given as Y_v(t, f) = (1/M) Σ_{i=1}^{M} Y_i(t, f).   (6)
- the IPD features (e.g., spatial features 660) for each microphone with respect to the virtual microphone are extracted as IPD_i(t, f) = φ(Y_i(t, f)) − φ(Y_v(t, f)).
- the IPD features are normalized using an unbiased exponentially weighted moving average to increase the robustness of the model to arbitrary array geometries.
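- A hedged sketch of the geometry-agnostic features follows: the virtual microphone is the average of all microphone STFTs, per-microphone IPDs are taken with respect to it, and the IPDs are normalized with an unbiased exponentially weighted moving average. The decay constant and the subtraction-based normalization are illustrative assumptions.

```python
import torch

def virtual_mic(specs):
    """specs: complex STFTs for M microphones, shape (M, F, T)."""
    return specs.mean(dim=0)                               # Y_v, shape (F, T)

def ipd_to_virtual(specs):
    yv = virtual_mic(specs)
    return specs.angle() - yv.angle().unsqueeze(0)         # (M, F, T)

def ewma_normalize(ipd, decay=0.99):
    """Subtract an unbiased exponentially weighted moving average, frame by frame."""
    out = torch.empty_like(ipd)
    mean = torch.zeros_like(ipd[..., 0])
    weight = 0.0
    for t in range(ipd.shape[-1]):
        weight = decay * weight + (1 - decay)
        mean = decay * mean + (1 - decay) * ipd[..., t]
        out[..., t] = ipd[..., t] - mean / weight          # bias-corrected running mean
    return out
```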
- a fourth dimension that contains specific stream (microphone) information is introduced.
- FIG. 6 illustrates geometry agnostic model architecture 600.
- Each stream information is processed independently using pDCCRN.
- a stream pooling layer is included after the encoder and decoder blocks of pDCCRN.
- the channel dimension is split into two parts containing stream-specific and cross-stream information.
- the average across the cross-stream channels is calculated and then concatenated to the stream-specific part of each stream.
- a global-pooling layer is used to average across the streams and channels to estimate a complex mask.
- the estimated complex mask is applied to the STFT for virtual microphone signal Y v .
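- A rough sketch of the stream pooling idea: with an explicit stream (microphone) axis, half of the channels are treated as stream-specific and the other half as cross-stream; the cross-stream half is averaged over streams and concatenated back to every stream. Which half plays which role is an assumption here.

```python
import torch

def stream_pooling(x):
    """x: feature maps with a stream axis, shape (B, S, C, F, T)."""
    c = x.shape[2] // 2
    specific, shared = x[:, :, :c], x[:, :, c:]
    pooled = shared.mean(dim=1, keepdim=True).expand_as(shared)   # average across streams
    return torch.cat([specific, pooled], dim=2)                   # same shape as the input

def global_pooling(x):
    """Average over the stream axis before estimating the complex mask."""
    return x.mean(dim=1)                                          # (B, C, F, T)
```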
- the output from LSTM 608 is concatenated with the output from encoder block 10001f and fed to a decoder block 10002f, which is followed by an upsampling block 610f.
- There are N stages (N = 6, in this example), concluding with a final decoder block 10002a, followed by an upsampling block 610a.
- the encoding/decoding is unwound, such that the output of encoder block 10001a is concatenated with the output of the penultimate upsampling block to form the input to final decoder block 10002a.
- Decoder blocks 10002a and 10002f are versions of complex encoder/decoder block 1000 of FIG. 10A.
- a global pooling layer 612 follows upsampling block 610a.
- a real mask 614, an imaginary mask 616, and virtual microphone signal Y_v 620 (Eq. (6)) are combined (e.g., by concatenation), and fed into an inverse STFT block 618 to produce output data 114. In this manner the estimated mask is applied to the virtual microphone. In some examples, further filtering (e.g., bandpass) is performed.
- FIG. 7 illustrates an example architecture 700 for trained PSE model 110 that may be used in arrangement 100, and comprises an example of a personalized DCCRN.
- the d-vectors are concatenated to the real and imaginary parts of the tensors coming from the last layer of the final encoder, and fed to the complex LSTM.
- This approach increases the input size of the complex LSTM layer only; minimizing the additional computational cost.
- the output of complex LSTM 704 is adjusted by a dense layer 706.
- a dense layer is a common, fully connected neural network layer, and is used when the output of an LSTM is not a softmax (e.g., a normalized exponential function).
- the output from LSTM 704 (as adjusted by dense layer 706) is concatenated with the output from encoder block 10001f and fed to a decoder block 10002f (which may be followed by an upsampling block, as shown in FIG. 6).
- the encoding/decoding is unwound, such that the output of encoder block 10001a is concatenated with the output of the penultimate upsampling block to form the input to final decoder block 10002a.
- global pooling layer follows decoder block 10002a (as shown in FIG. 6).
- a real mask 710, an imaginary mask 712, and the STFT and IPD features from block 702 are combined (e.g., by concatenation), and fed into an inverse STFT block 722 to produce output data 114.
- In some examples, further filtering (e.g., bandpass filtering) is performed.
- FIG. 8 illustrates an example architecture 800 that may be used in arrangement 100 in place of architecture 700.
- speaker embeddings 606 are introduced into each complex encoder and complex decoder, rather than only immediately prior to complex LSTM 704.
- FIG. 9 illustrates an example architecture 900 that may be used in arrangement 100 in place of architecture 700.
- speaker embeddings 606 are introduced prior to first complex encoder block 10001a, concatenated with the output of block 702.
- FIG. 10A illustrates an example complex encoder/decoder block 1000 that may be used in various architectures described herein, for example as encoder blocks 10001a and 10001f and decoder blocks 10002a and 10002f.
- An input 1002 is divided into a real component 1002r and an imaginary component 1002i.
- a real convolution 1004r and an imaginary convolution 1004i are performed and subject to complex BN 1006. This is followed by a pReLU 1008, to produce an output 1010.
- FIG. 10B illustrates the complex encoder/decoder block 1000 of FIG. 10A with inter-stream processing 1052 in an arrangement 1050.
- a virtual microphone is created by taking the average of all selected microphones (e.g., virtual microphone signal Y_v of Eq. (6)).
- a fifth dimension is introduced that includes specific stream (microphone) information.
- intra-stream processing processes each stream independently by combining the stream and batch dimensions, whereas inter-stream processing performs averaging across streams on the upper half of the channels and concatenates the results to the lower half of the channels of each stream. This is repeated for all layers in the encoders and decoders, and the estimated mask is applied to the virtual microphone.
- Clean speech data of the deep noise suppression (DNS) challenge (INTERSPEECH 2020/2021) was used for evaluating aspects of the disclosure.
- the data includes approximately 544 hours of speech samples from the LibriVox corpus that were obtained by selecting high mean opinion score (MOS) samples.
- Noise files and internal room impulse response (RIR) files were used to create noisy mixtures.
- the noise and RIR files were split into training, validation, and test sets. 60% of the dataset contained samples including the target speaker, interfering speaker, and noise; the other 40% contained samples comprising the target speaker and noise. General SE samples (one speaker plus noise) were included so as not to degrade the performance in scenarios with no interfering speakers.
- the target speaker was assumed to be closer to the microphone than the interfering speaker.
- the target speaker was placed randomly between 0 to 1.3 meters away from the microphone and the interfering speaker was placed more than 2 meters away. 2000 and 50 hours of data for the training and validation, respectively, were simulated.
- Test sets were created to capture single-channel PSE models’ performance for three scenarios: 1) the target scenario (target + interfering speakers + noise), 2) the scenario in which no interfering speakers (target speaker + noise) are present, and 3) the scenario in which there are neither interfering speakers nor noise (target speaker only).
- these scenarios are identified as “TS1”, “TS2”, and “TS3”, respectively.
- TS2 is a benchmark test set with which to compare the PSE models with the unconditional SE models in a fair way.
- TS3 is helpful to determine the oversuppression rate of the target speaker. All the test sets contained reverberation.
- Additional test sets were created from internal conversational recordings (ICR) and the voice cloning toolkit (VCTK) corpus.
- the VCTK corpus includes 109 speakers with different English accents. For each speaker, 30 samples were set aside and used to extract the speaker’s d-vector. The noisy mixtures were simulated using the test noise and RIRs with the rest of the files. These noisy mixtures were concatenated to generate a single long audio file for each speaker. The average duration of the files was 27.5 minutes. This test set served the purpose of evaluating the models with long duration audio to which the models were not exposed during training and under changing environmental conditions.
- DNSMOS is a neural network-based mean opinion score (MOS) estimator that has been shown to correlate well with human scorers. The deletion rate (DEL) was employed in addition to the word error rate (WER) to measure the over-suppression (OS) of the target speaker. While DEL alone cannot measure OS definitively, the new metric below is used to measure the target speaker OS reliably.
- the metric is defined using a threshold value and L_OS as defined in Eq. (3), as described below.
- the over-suppression rate is calculated by subtracting the power-law compressed predicted magnitude from the reference one. The negative values are set to zero.
- a positive value indicates that the reference signal has more energy for the given T-F bin and, therefore, that the target speaker's voice was removed.
- the frequency bins of each time frame for the reference and OS spectrograms are summed and the reference is multiplied by a threshold value. Values for each time frame for both spectrograms are thus obtained.
- a logical comparison is applied to obtain binary values for each frame: if the OS is greater than the reference value, that frame is marked as over-suppressed. Now with TSOS measure, the percentage of the OS per sample, total OS duration, and maximum OS duration may be calculated.
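- A sketch of the TSOS measure, following the frame-wise procedure described above (power-law compression with p, negatives clipped to zero, per-frame comparison against a threshold times the reference energy); the specific threshold value below is an illustrative assumption.

```python
import torch

def tsos_frames(est, ref, p=0.3, threshold=0.1):
    """est, ref: complex STFTs of the enhanced and reference speech, shape (F, T).
    Returns a boolean vector marking the over-suppressed frames."""
    os_spec = torch.clamp(ref.abs() ** p - est.abs() ** p, min=0.0)   # negatives set to zero
    os_per_frame = os_spec.sum(dim=0)                                 # sum over frequency bins
    ref_per_frame = threshold * (ref.abs() ** p).sum(dim=0)
    return os_per_frame > ref_per_frame     # e.g., result.float().mean() gives the OS percentage
```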
- Example models were compared with VoiceFilter.
- the VoiceFilter model was converted to a causal one by making the paddings for convolution layers causal and the LSTM layers unidirectional.
- the numbers of filters were [16; 32; 64; 128; 128; 128] for the encoder and decoder (inverse of the encoder) layers, and the kernel size was set to 5 × 2 and the stride to 2 × 1 for each layer.
- the LSTM layer’s hidden size was set to 128.
- the numbers of filters were [32; 64; 128; 128; 128; 128] for the encoder and [128; 128; 128; 64; 32; 16] for the decoder.
- the hidden size of the bottleneck layer was set to 128.
- the best model was selected according to the validation loss.
- the parameter α (see Eq. (2)) was set to 0.5 and 0.9, respectively; β (see Eq. (3)) was set to 1.0; and γ (see Eq. (8)) was set to 0.1.
- pDCATTUNET yields superior results for both datasets compared to pDCCRN and the baseline models. While pDCCRN yields similar results to DCCRN in the TS2 scenario, pDCATTUNET can improve the results even with no interfering speech. However, the personalized models generally have a higher TSOS % across scenarios when trained with the same loss function. Using the asymmetric loss function significantly improves the TSOS % and speech intelligibility. Furthermore, MT training with the L_PLCPA loss provides notable improvement to TSOS % compared with the asymmetric loss. However, MT training slightly reduces the speech quality. Additionally, the combination of MT training and the asymmetric loss typically provides superior TSOS % for both pDCCRN and pDCATTUNET.
- MT training may degrade the overlapped speech WER for the pDCATTUNET, although the DEL is reduced.
- the insertion error is increased, indicating that the suppression of the interfering speech is degraded. This may be related to the usage of a single speaker for the ASR data during MT training. Adding interfering speech to ASR data can alleviate this issue.
- the microphone array was located in the center of the room.
- Target and interfering speakers were positioned randomly around the microphone array within [0.5, 2.5] meters with the assumption that the target speaker was the closest to the microphone array.
- 2000 and 50 hours of audio were simulated for training and validation, respectively, based on the clean speech data from the DNS challenge dataset.
- 60% of utterances contained the target and interfering speakers with a signal-to-distortion ratio (SDR) between 0 and 10 dB.
- the simulated audio was mixed with directional and isotropic noises from the Audioset and Freesound datasets with a signal-to-noise ratio (SNR) in the range of [0, 15] dB.
- the sampling rate for all utterances was 16 kHz.
- the geometry agnostic model was trained with the 7-channel circular array and with three additional geometries derived from it: 4-channel tri
- Two 10-hour test sets were created: dataset A and dataset B.
- Dataset A contained utterances mixed only with noise and reverberation.
- dataset B contained utterances mixed with noise, reverberation, and interfering speech.
- Clean utterances were selected from internal conversational style speech recordings with a high neural network-based mean opinion score (MOS).
- the SDR and SNR ranges were the same as the training dataset.
- the test utterances were convolved with RIRs from 8 different geometries. Four geometries were the same as the ones that were used for the training dataset.
- the enhanced speech signal was evaluated based on speech recognition accuracy and speech intelligibility.
- Two baselines were used for comparison with the geometry agnostic model. For each array geometry that was used in training, a fixed geometry model was trained based on IPD features. The other baseline was based on processing each microphone independently with a single-channel PSE model followed by averaging the enhanced signals. Although this approach was computationally expensive, it may be considered to be an acceptable alternative for unknown array geometries. Aspects of the disclosure are also operable with MVDR beamforming followed by a single-channel PSE. However, since the assumption is that there is no knowledge about the microphone array geometries, it may be challenging to accomplish beamforming in real-time.
- Multi-channel PSE models substantially outperformed the single-channel PSE model.
- Single-channel PSE introduces processing artifacts that yield worse WER scores compared to the unprocessed noisy mixture for dataset A. By contrast, the multi-channel PSE models improve the speech recognition performance.
- the model trained with the IPD features performed consistently better than a model based on stacked STFTs.
- a multi-channel PSE was trained based on the IPD features without using d-vectors. The results demonstrated that spatial information was helpful regardless of the presence of d-vectors, in some examples.
- Two versions of test dataset B were created.
- In one version, speakers were randomly positioned such that the difference of their angles with respect to the array was less than 5 degrees, while their distance difference was more than 1 meter.
- In the other version, the angle difference of the speakers was more than 45 degrees, while the distance difference was less than 10 cm.
- the multi-channel PSE models performed the worst for the dataset with similar speaker angles. This result indicates that the model learned to discriminate essentially based on the angle rather than the distance. Also, when the two speakers were at similar distances, using d-vectors substantially improved the performance of the multi-channel PSE.
- the geometry agnostic model outperformed fixed geometry models trained with their dedicated array geometries in both test datasets. This result validates that this approach could effectively decouple the array geometry and the model architecture. Without requiring changes in the model architecture for an individual array, a single model can be shared between multiple arrays with different shapes and different numbers of microphones.
- the geometry agnostic model still outperformed the fixed geometry model for a 3-channel triangular array, which has fewer microphones than the arrays included in the training.
- The improvement provided by PSE with the geometry agnostic model was significant, and its performance was comparable to the results when the array geometries had been seen in training.
- the geometry agnostic model showed consistent improvements over the average of enhanced single-channel signals despite not seeing the front-back ambiguity of the linear arrays during the training.
- the geometry agnostic model improved the performance compared with the average of the enhanced signals to a smaller extent, and the results for dataset A were worse in terms of WER and short-time objective intelligibility (STOI).
- Spatial aliasing might be part of the reason for the poor performance of the 8-channel circular array.
- a large inter-microphone distance leads to spatial aliasing, and this can introduce unseen patterns for the IPD features.
- spatial aliasing occurs in array geometries for which the inter-microphone distance is longer than 4.28 cm at a 16 kHz sampling rate. If the model was trained without IPD normalization, the performance degraded significantly, suggesting the spatial aliasing problem may be mitigated by IPD normalization.
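- As a rough check (assuming a speed of sound of about 343 m/s), the highest frequency representable at a 16 kHz sampling rate is 8 kHz, whose wavelength is 343 / 8000 ≈ 4.29 cm, consistent with the stated 4.28 cm threshold.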
- Examples of the disclosure utilize spatial features along with speaker embeddings for PSE. Based on test results, this combination significantly improves the performance for both communication and transcription quality, consistently outperforming geometry dependent models. Some examples further include stream pooling layers for multi-channel PSE that is invariant to the number of microphones and their arrangement.
- FIG. 11 is a flowchart 1100 illustrating exemplary operations that may be performed using arrangement 100.
- operations described for flowchart 1100 are performed by computing apparatus 1418 of FIG. 14.
- Flowchart 1100 commences with operation 1102, which trains PSE model 110a to become trained PSE model 110, using operations 1104-1118.
- a training microphone array is set up with a training geometry in operation 1104. See FIG. 2 for various microphone geometries. Training will iterate through multiple microphone geometries, in order to remove geometry-specific learning from PSE model 110a.
- One benefit of the current disclosure is that trained PSE model 110 may be used with target speakers 102 and 102a that differ from the target speakers used for training (e.g., the training target speakers).
- Operation 1108 extracts speaker embeddings from enrollment data for the training speakers, and operation 1110 collects training audio and truth information (training data 142).
- Operation 1112 extracts spatial features from training input audio captured by microphone array 200, the training input audio including a mixture of speech data of a training target speaker and a training interfering speaker.
- the spatial features comprise an IPD.
- Operation 1114 provides the training input audio, the extracted spatial features, and the extracted speaker embeddings to PSE model 110a (the PSE model being trained).
- Operation 1116 includes training PSE model 110a (which becomes trained PSE model 110) with training data 142.
- training PSE model 110a comprises extracting spatial features from microphone array 200, and training PSE model 110a with the extracted spatial features.
- Decision operation 1118 determines whether training is completed with enough different microphone geometries. If not, flowchart 1100 returns to operation 1104 to continue training PSE model 110a using a plurality of microphone array geometries to produce trained PSE model 110.
- the training of PSE model 110a comprises MT training.
- the MT training comprises echo cancellation.
- training PSE model 110a does not include training PSE model 110a with speech data of target speaker 102, target speaker 102a, or interfering speakers that will be encountered later (e.g., interfering speaker 104).
- the plurality of microphone array geometries used during training does not include a microphone array geometry used to capture input audio 112.
- operation 1120 deploys trained PSE model 110.
- trained PSE model 110 comprises a trained geometry-agnostic PSE model.
- Microphone array 200 is set up in operation 1122. In some examples, this occurs when a notebook computer that will be used by target speaker 102 is built.
- Target speaker 102 is enrolled in operation 1124, which may also include enrolling target speaker 102a and any other target speakers.
- Trained PSE model 110 does not require re-training for target speaker 102 or 102a, only enrollment with a short voice sample of a few minutes in order to capture voice characteristics. Trained PSE model 110 will suppress any speaker that is not enrolled.
- Operation 1126 includes extracting speaker embeddings (e.g., speaker embeddings 308, 606) from enrollment data 326 for at least target speaker 102.
- enrollment data 326 is single-channel enrollment data.
- the speaker embeddings are extracted from enrollment data 326 for target speaker 102 and a second target speaker.
- output data 114 further comprises estimated clean speech data of target speaker 102a.
- the extracted speaker embeddings represent acoustic characteristics of target speaker 102.
- the extracted speaker embeddings represent acoustic characteristics of both target speaker 102 and target speaker 102a.
- the extracted speaker embeddings are expressed as d-vectors.
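- An illustrative enrollment sketch follows: d-vectors are extracted from short segments of the enrollment recording with a pre-trained speaker-ID model and averaged into a single embedding per target speaker. The callable `speaker_id_model`, the segment lengths, and the unit-length normalization are hypothetical assumptions standing in for the pre-trained Res2Net speaker ID model mentioned earlier.

```python
import torch

def enroll_speaker(enrollment_wave, speaker_id_model, frame_len=48000, hop=24000):
    """enrollment_wave: 1-D tensor of single-channel enrollment audio at 16 kHz (3-second
    segments with 50% overlap); speaker_id_model: callable mapping a segment to a d-vector."""
    d_vectors = []
    for start in range(0, max(len(enrollment_wave) - frame_len, 1), hop):
        segment = enrollment_wave[start:start + frame_len]
        d_vectors.append(speaker_id_model(segment))        # one d-vector per segment
    d = torch.stack(d_vectors).mean(dim=0)                 # average over segments
    return d / d.norm()                                    # unit-length speaker embedding
```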
- Decision 1128 determines whether PSE will be performed in real-time (on live audio), or on recorded audio. If on recorded audio, operation 1130 records the captured audio in a multi-channel format to preserve individual microphone audio streams. Operation 1132 includes receiving input audio 112. In some examples, input audio 112 comprises recorded audio, and operation 1132 includes playing the recorded audio. In some examples, the input audio comprises real-time audio. In such examples, producing output data 114 comprises producing output data 114 in real-time.
- Operation 1134 includes extracting spatial features 660 from input audio 112 captured by microphone array 200, input audio 112 including a mixture of speech data of target speaker 102 and interfering speaker 104.
- the spatial features comprise an IPD.
- Operation 1136 includes providing input audio 112, the extracted spatial features, and the extracted speaker embeddings to a trained geometry-agnostic personalized speech enhancement (PSE) model.
- Operation 1138 includes using trained PSE model 110, producing output data 114, output data 114 comprising estimated clean speech data of target speaker 102 with a reduction of speech data of interfering speaker 104.
- operation 1138 includes producing output data 114 using a geometry-agnostic version of trained PSE model 110 without geometry information for microphone array 200.
- output data 114 comprises estimated clean speech data of target speaker 102 without speech data from interfering speaker 104. In some examples, output data 114 comprises audio of the estimated clean speech data of target speaker 102. In some examples, output data 114 further comprises estimated clean speech data of target speaker 102a. In some examples, output data 114 comprises audio of the estimated clean speech data of target speaker 102a. In some examples, producing output data 114 using trained PSE model 110 comprises performing an inference with trained PSE model 110. In some examples, performing an inference with trained PSE model 110 comprises using stream pooling layers that are based on at least averaging and concatenation of feature maps.
- output data 114 is sent to transcription service 130 to produce transcript 132 of the estimated clean speech data of target speaker 102 and/or target speaker 102a.
- operation 1138 includes generating, from output data 114, a transcript of the estimated clean speech data of first target speaker 102 and/or second target speaker 102a.
- producing output data 114 using trained PSE model 110 comprises isolating speech data of target speaker 102 in a manner that is agnostic of a configuration of microphone array 200. In some examples, producing output data 114 comprises, based on at least the provided input data, receiving a mask from trained PSE model 110, and applying the mask to input audio 112. In some examples, producing output data 114 using trained PSE model 110 comprises isolating speech data of target speaker 102 from at least interfering speaker 104. In some examples, producing output data 114 using trained PSE model 110 comprises isolating speech data of target speaker 102 from at least interfering speaker 104 and background noise 112d.
- producing output data 114 using trained PSE model 110 comprises isolating speech data of target speaker 102 in the input audio data using a profile of a voice of target speaker 102.
- FIG. 12 is a flowchart 1200 illustrating exemplary operations associated with arrangement 100. In some examples, operations described for flowchart 1200 are performed by computing apparatus 1418 of FIG. 14. Flowchart 1200 commences with operation 1202, which includes extracting speaker embeddings from enrollment data for at least a first target speaker. Operation 1204 includes extracting spatial features from input audio captured by a microphone array, the input audio including a mixture of speech data of the first target speaker and an interfering speaker.
- Operation 1206 includes providing the input audio, the extracted spatial features, and the extracted speaker embeddings to a trained geometry-agnostic PSE model.
- Operation 1208 includes using the trained geometry-agnostic PSE model without geometry information for the microphone array, producing output data, the output data comprising estimated clean speech data of the first target speaker with a reduction of speech data of the interfering speaker.
- FIG. 13 is a flowchart 1300 illustrating exemplary operations associated with arrangement 100.
- operations described for flowchart 1300 are performed by computing apparatus 1418 of FIG. 14.
- Flowchart 1300 commences with operation 1302, which includes extracting speaker embeddings from enrollment data for at least a first target speaker (an illustrative sketch of this extraction follows these operations).
- Operation 1304 includes extracting spatial features from input audio captured by a microphone array, the input audio including a mixture of speech data of the first target speaker and an interfering speaker.
- Operation 1306 includes providing the input audio, the extracted spatial features, and the extracted speaker embeddings to a trained PSE model.
- Operation 1308 includes using the trained PSE model, producing output data, the output data comprising estimated clean speech data of the first target speaker with a reduction of speech data of the interfering speaker.
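Operation 1302, the extraction of speaker embeddings from enrollment data, can be sketched as below. The sketch assumes a pretrained speaker_encoder that maps enrollment features to frame-level embeddings; the encoder, the mean pooling, and the unit normalization are illustrative assumptions rather than requirements of the disclosure.

```python
import numpy as np

def extract_speaker_embedding(enrollment_features: np.ndarray, speaker_encoder) -> np.ndarray:
    """enrollment_features: (frames, feat_dim) features computed from the target
    speaker's enrollment audio. Returns a fixed-size embedding vector that
    represents the speaker's acoustic characteristics."""
    frame_embeddings = speaker_encoder(enrollment_features)   # (frames, embed_dim)
    embedding = frame_embeddings.mean(axis=0)                 # pool over time
    return embedding / (np.linalg.norm(embedding) + 1e-8)     # unit-normalize
```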
- An example computerized method comprises: extracting speaker embeddings from enrollment data for at least a first target speaker; extracting spatial features from input audio captured by a microphone array, the input audio including a mixture of speech data of the first target speaker and an interfering speaker; providing the input audio, the extracted spatial features, and the extracted speaker embeddings to a trained geometry-agnostic PSE model; and using the trained geometry-agnostic PSE model without geometry information for the microphone array, producing output data, the output data comprising estimated clean speech data of the first target speaker with a reduction of speech data of the interfering speaker.
- Another example method comprises: extracting speaker embeddings from enrollment data for at least a first target speaker; extracting spatial features from input audio captured by a microphone array, the input audio including a mixture of speech data of the first target speaker and an interfering speaker; providing the input audio, the extracted spatial features, and the extracted speaker embeddings to a trained PSE model; and using the trained PSE model, producing output data, the output data comprising estimated clean speech data of the first target speaker with a reduction of speech data of the interfering speaker.
- An example system for array geometry agnostic multi-channel PSE comprises: a processor; and a computer-readable medium storing instructions that are operative upon execution by the processor to: extract speaker embeddings from enrollment data for at least a first target speaker; extract spatial features from input audio captured by a microphone array, the input audio including a mixture of speech data of the first target speaker and an interfering speaker; provide the input audio, the extracted spatial features, and the extracted speaker embeddings to a trained geometry-agnostic PSE model; and using the trained geometry-agnostic PSE model without geometry information for the microphone array, produce output data, the output data comprising estimated clean speech data of the first target speaker with a reduction of speech data of the interfering speaker.
- Another example system comprises: a processor; and a computer-readable medium storing instructions that are operative upon execution by the processor to: extract speaker embeddings from enrollment data for at least a first target speaker; extract spatial features from input audio captured by a microphone array, the input audio including a mixture of speech data of the first target speaker and an interfering speaker; provide the input audio, the extracted spatial features, and the extracted speaker embeddings to a trained PSE model; and using the trained PSE model, produce output data, the output data comprising estimated clean speech data of the first target speaker with a reduction of speech data of the interfering speaker.
- One or more example computer storage devices have computer-executable instructions stored thereon, which, on execution by a computer, cause the computer to perform operations comprising: extracting speaker embeddings from enrollment data for at least a first target speaker; extracting spatial features from input audio captured by a microphone array, the input audio including a mixture of speech data of the first target speaker and an interfering speaker; providing the input audio, the extracted spatial features, and the extracted speaker embeddings to a trained geometry-agnostic PSE model; and using the trained geometry-agnostic PSE model without geometry information for the microphone array, producing output data, the output data comprising estimated clean speech data of the first target speaker with a reduction of speech data of the interfering speaker.
- One or more example computer storage devices have computer-executable instructions stored thereon, which, on execution by a computer, cause the computer to perform operations comprising: extracting speaker embeddings from enrollment data for at least a first target speaker; extracting spatial features from input audio captured by a microphone array, the input audio including a mixture of speech data of the first target speaker and an interfering speaker; providing the input audio, the extracted spatial features, and the extracted speaker embeddings to a trained PSE model; and using the trained PSE model, producing output data, the output data comprising estimated clean speech data of the first target speaker with a reduction of speech data of the interfering speaker.
- Examples include any combination of the following:
- the plurality of microphone array geometries used during training does not include a microphone array geometry used to capture the input audio
- training the PSE model does not include training the PSE model with speech data of the first target speaker, or the second target speaker, or the interfering speaker
- the output data comprises audio of the estimated clean speech data of the first target speaker; generating, from the output data, a transcript of the estimated clean speech data of the first target speaker;
- the speaker embeddings are extracted from enrollment data for the first target speaker and a second target speaker, and wherein the output data further comprises estimated clean speech data of the second target speaker;
- the input audio comprises real-time audio
- producing the output data comprises producing the output data in real-time
- the spatial features comprise an inter-channel phase difference (IPD), an illustrative computation of which is sketched after this list
- the extracted speaker embeddings represent acoustic characteristics of the first target speaker
- the extracted speaker embeddings represent acoustic characteristics of the first target speaker and the second target speaker
- extracting the spatial features comprises stacking, in a channel dimension, real and imaginary parts of a short time Fourier transform for all microphones in the microphone array;
- extracting the spatial features comprises implicitly learning spectral and spatial information from the microphone array;
- extracting the spatial features comprises explicitly extracting spatial features from the microphone array;
- the output data comprises audio of the estimated clean speech data of the second target speaker;
- generating, from the output data, a transcript of the estimated clean speech data of the second target speaker;
- the output data comprises estimated clean speech data of the first target speaker without speech data from the interfering speaker;
- producing the output data using the trained geometry-agnostic PSE model comprises performing an inference with the trained geometry-agnostic PSE model;
- performing an inference with the trained geometry-agnostic PSE model comprises using stream pooling layers that are based on at least averaging and concatenation of feature maps;
- producing the output data using the trained geometry-agnostic PSE model comprises isolating speech data of the first target speaker in a manner that is agnostic of a configuration of the microphone array;
- producing the output data comprises, based on at least the provided input data, receiving a mask from the trained geometry-agnostic PSE model, and applying the mask to the input audio;
- producing the output data using the trained geometry-agnostic PSE model comprises isolating speech data of the first target speaker from at least the interfering speaker;
- producing the output data using the trained geometry-agnostic PSE model comprises isolating speech data of the first target speaker from at least the interfering speaker and background noise;
- the input audio comprises recorded audio
- the training of the PSE model comprises MT training
- the MT training comprises echo cancellation
- training the PSE model does not include training the PSE model with speech data of the interfering speaker; and
- training the PSE model comprises extracting spatial features from the microphone array, and training the PSE model with the extracted spatial features.
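As an illustration of the IPD spatial features listed above, the sketch below computes inter-channel phase differences against a reference microphone and encodes them as cosine/sine pairs, which keeps the representation independent of any particular array geometry. The reference-channel choice and the cos/sin encoding are assumptions made for this sketch, not prescribed by the disclosure.

```python
import numpy as np

def ipd_features(spec: np.ndarray, ref: int = 0) -> np.ndarray:
    """spec: (num_mics, freq, frames) complex STFT of the multi-channel input audio.
    Returns (num_mics - 1, 2, freq, frames): cos and sin of the phase difference
    between each non-reference microphone and the reference microphone."""
    phase = np.angle(spec)
    diffs = np.delete(phase, ref, axis=0) - phase[ref]       # IPD per non-reference microphone
    return np.stack([np.cos(diffs), np.sin(diffs)], axis=1)  # bounded, geometry-independent encoding
```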
- The present disclosure is operable with a computing device according to an embodiment; components of such a computing apparatus are illustrated as a functional block diagram 1400 in FIG. 14.
- In some examples, the computing apparatus 1418 is implemented as a part of an electronic device according to one or more embodiments described in this specification.
- The computing apparatus 1418 comprises one or more processors 1419, which may be microprocessors, controllers, or any other suitable type of processors for processing computer executable instructions to control the operation of the electronic device.
- In some examples, the processor 1419 is any technology capable of executing logic or instructions, such as a hardcoded machine.
- In some examples, platform software comprising an operating system 1420 or any other suitable platform software is provided on the apparatus 1418 to enable application software 1421 to be executed on the device.
- Computer executable instructions are provided using any computer-readable media that are accessible by the computing apparatus 1418.
- Computer-readable media include, for example, computer storage media such as a memory 1422 and communications media.
- Computer storage media, such as a memory 1422, include volatile and non-volatile, removable, and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or the like.
- Computer storage media include, but are not limited to, Random Access Memory (RAM), Read-Only Memory (ROM), Erasable Programmable Read-Only Memory (EPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), persistent memory, phase change memory, flash memory or other memory technology, Compact Disk Read-Only Memory (CD-ROM), digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage, shingled disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information for access by a computing apparatus.
- In contrast to computer storage media, communication media may embody computer readable instructions, data structures, program modules, or the like in a modulated data signal, such as a carrier wave, or other transport mechanism.
- Computer storage media do not include communication media. Therefore, a computer storage medium should not be interpreted to be a propagating signal per se. Propagated signals per se are not examples of computer storage media.
- Although the computer storage medium (the memory 1422) is shown within the computing apparatus 1418, it will be appreciated by a person skilled in the art that, in some examples, the storage is distributed or located remotely and accessed via a network or other communication link (e.g., using a communication interface 1423).
- In some examples, the computing apparatus 1418 comprises an input/output controller 1424 configured to output information to one or more output devices 1425, for example a display or a speaker, which are separate from or integral to the electronic device. Additionally, or alternatively, the input/output controller 1424 is configured to receive and process an input from one or more input devices 1426, for example, a keyboard, a microphone, or a touchpad. In one example, the output device 1425 also acts as the input device. An example of such a device is a touch sensitive display. The input/output controller 1424 may also output data to devices other than the output device, e.g., a locally connected printing device. In some examples, a user provides input to the input device(s) 1426 and/or receives output from the output device(s) 1425.
- In some examples, the computing apparatus 1418 is configured by the program code when executed by the processor 1419 to execute the embodiments of the operations and functionality described.
- In some examples, the functionality described herein can be performed, at least in part, by one or more hardware logic components. Illustrative types of hardware logic components include Field-Programmable Gate Arrays (FPGAs), Application-Specific Integrated Circuits (ASICs), Application-Specific Standard Products (ASSPs), System-on-a-Chip systems (SOCs), Complex Programmable Logic Devices (CPLDs), and Graphics Processing Units (GPUs).
- Examples of well-known computing systems, environments, and/or configurations that are suitable for use with aspects of the disclosure include, but are not limited to, mobile or portable computing devices (e.g., smartphones), personal computers, server computers, hand-held (e.g., tablet) or laptop devices, multiprocessor systems, gaming consoles or controllers, microprocessor-based systems, set top boxes, programmable consumer electronics, mobile telephones, mobile computing and/or communication devices in wearable or accessory form factors (e.g., watches, glasses, headsets, or earphones), network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
- The disclosure is operable with any device with processing capability such that it can execute instructions such as those described herein.
- Such systems or devices accept input from the user in any way, including from input devices such as a keyboard or pointing device, via gesture input, proximity input (such as by hovering), and/or via voice input.
- Examples of the disclosure may be described in the general context of computer-executable instructions, such as program modules, executed by one or more computers or other devices in software, firmware, hardware, or a combination thereof.
- In some examples, the computer-executable instructions may be organized into one or more computer-executable components or modules.
- Program modules include, but are not limited to, routines, programs, objects, components, and data structures that perform particular tasks or implement particular abstract data types.
- Aspects of the disclosure may be implemented with any number and organization of such components or modules. For example, aspects of the disclosure are not limited to the specific computer-executable instructions, or the specific components or modules illustrated in the figures and described herein. Other examples of the disclosure include different computer-executable instructions or components having more or less functionality than illustrated and described herein.
- Aspects of the disclosure transform the general-purpose computer into a special-purpose computing device when configured to execute the instructions described herein.
- Any range or device value given herein may be extended or altered without losing the effect sought, as will be apparent to the skilled person.
- Examples have been described with reference to data monitored and/or collected from the users.
- In some examples, notice is provided to the users of the collection of the data (e.g., via a dialog box or preference setting) and users are given the opportunity to give or deny consent for the monitoring and/or collection.
- In some examples, the consent takes the form of opt-in consent or opt-out consent.
- The term “comprising” is used in this specification to mean including the feature(s) or act(s) followed thereafter, without excluding the presence of one or more additional features or acts.
- The operations illustrated in the figures are implemented as software instructions encoded on a computer readable medium, in hardware programmed or designed to perform the operations, or both.
- Aspects of the disclosure are implemented as a system on a chip or other circuitry including a plurality of interconnected, electrically conductive elements.
- The order of execution or performance of the operations in examples of the disclosure illustrated and described herein is not essential, unless otherwise specified. That is, the operations may be performed in any order, unless otherwise specified, and examples of the disclosure may include additional or fewer operations than those disclosed herein. For example, it is contemplated that executing or performing a particular operation before, contemporaneously with, or after another operation is within the scope of aspects of the disclosure.
- The articles “a,” “an,” “the,” and “said” are intended to mean that there are one or more of the elements.
- The terms “comprising,” “including,” and “having” are intended to be inclusive and mean that there may be additional elements other than the listed elements.
- The term “exemplary” is intended to mean “an example of.”
- The phrase “one or more of the following: A, B, and C” means “at least one of A and/or at least one of B and/or at least one of C.”
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Quality & Reliability (AREA)
- Signal Processing (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Cable Transmission Systems, Equalization Of Radio And Reduction Of Echo (AREA)
Abstract
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202280065219.2A CN118020101A (zh) | 2021-10-05 | 2022-08-22 | 与阵列几何形状无关的多通道个性化语音增强 |
EP22769448.6A EP4413566A1 (fr) | 2021-10-05 | 2022-08-22 | Amélioration de la parole personnalisée multicanaux indépendante de la géométrie de réseau |
Applications Claiming Priority (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US202163252493P | 2021-10-05 | 2021-10-05 | |
US63/252,493 | 2021-10-05 | ||
US17/555,332 | 2021-12-17 | ||
US17/555,332 US20230116052A1 (en) | 2021-10-05 | 2021-12-17 | Array geometry agnostic multi-channel personalized speech enhancement |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2023059402A1 true WO2023059402A1 (fr) | 2023-04-13 |
Family
ID=83283407
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2022/040979 WO2023059402A1 (fr) | 2021-10-05 | 2022-08-22 | Amélioration de la parole personnalisée multicanaux indépendante de la géométrie de réseau |
Country Status (2)
Country | Link |
---|---|
EP (1) | EP4413566A1 (fr) |
WO (1) | WO2023059402A1 (fr) |
- 2022
- 2022-08-22 EP EP22769448.6A patent/EP4413566A1/fr active Pending
- 2022-08-22 WO PCT/US2022/040979 patent/WO2023059402A1/fr active Application Filing
Non-Patent Citations (5)
Title |
---|
LUO YI ET AL: "End-to-end Microphone Permutation and Number Invariant Multi-channel Speech Separation", ICASSP 2020 - 2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), IEEE, 4 May 2020 (2020-05-04), pages 6394 - 6398, XP033793811, DOI: 10.1109/ICASSP40776.2020.9054177 * |
PIERRE-AMAURY GRUMIAUX ET AL: "A Review of Sound Source Localization with Deep Learning Methods", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 8 September 2021 (2021-09-08), XP091051924 * |
TAHERIAN HASSAN ET AL: "One Model to Enhance Them All: Array Geometry Agnostic Multi-Channel Personalized Speech Enhancement", ICASSP 2022, 2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 20 October 2021 (2021-10-20), pages 271 - 275, XP055979951, ISBN: 978-1-6654-0540-9, Retrieved from the Internet <URL:https://arxiv.org/pdf/2110.10330.pdf> DOI: 10.1109/ICASSP43922.2022.9747395 * |
WANG ZHONG-QIU ET AL: "Combining Spectral and Spatial Features for Deep Learning Based Blind Speaker Separation", IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, IEEE, USA, vol. 27, no. 2, 1 February 2019 (2019-02-01), pages 457 - 468, XP011699493, ISSN: 2329-9290, [retrieved on 20181206], DOI: 10.1109/TASLP.2018.2881912 * |
WANG ZHONG-QIU ET AL: "Multi-Microphone Complex Spectral Mapping for Speech Dereverberation", ICASSP 2020 - 2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), IEEE, 4 May 2020 (2020-05-04), pages 486 - 490, XP033793269, DOI: 10.1109/ICASSP40776.2020.9053610 * |
Also Published As
Publication number | Publication date |
---|---|
EP4413566A1 (fr) | 2024-08-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Žmolíková et al. | Speakerbeam: Speaker aware neural network for target speaker extraction in speech mixtures | |
US9666183B2 (en) | Deep neural net based filter prediction for audio event classification and extraction | |
Bahmaninezhad et al. | A comprehensive study of speech separation: spectrogram vs waveform separation | |
Zhang et al. | Deep learning for environmentally robust speech recognition: An overview of recent developments | |
Wang et al. | Deep learning based target cancellation for speech dereverberation | |
Erdogan et al. | Improved MVDR beamforming using single-channel mask prediction networks. | |
US9008329B1 (en) | Noise reduction using multi-feature cluster tracker | |
US20190172480A1 (en) | Voice activity detection systems and methods | |
EP3501026B1 (fr) | Séparation aveugle de sources utilisant une mesure de similarité | |
US20230116052A1 (en) | Array geometry agnostic multi-channel personalized speech enhancement | |
CN114203163A (zh) | 音频信号处理方法及装置 | |
KR20210137146A (ko) | 큐의 클러스터링을 사용한 음성 증강 | |
JP6348427B2 (ja) | 雑音除去装置及び雑音除去プログラム | |
Sivasankaran et al. | A combined evaluation of established and new approaches for speech recognition in varied reverberation conditions | |
CN113870893A (zh) | 一种多通道双说话人分离方法及系统 | |
Martín-Doñas et al. | Dual-channel DNN-based speech enhancement for smartphones | |
Pertilä | Online blind speech separation using multiple acoustic speaker tracking and time–frequency masking | |
Mirsamadi et al. | A generalized nonnegative tensor factorization approach for distant speech recognition with distributed microphones | |
RU2616534C2 (ru) | Ослабление шума при передаче аудиосигналов | |
Taherian et al. | Multi-resolution location-based training for multi-channel continuous speech separation | |
Kalkhorani et al. | CrossNet: Leveraging Global, Cross-Band, Narrow-Band, and Positional Encoding for Single-and Multi-Channel Speaker Separation | |
Yoshioka et al. | Picknet: Real-time channel selection for ad hoc microphone arrays | |
CN107919136B (zh) | 一种基于高斯混合模型的数字语音采样频率估计方法 | |
WO2023059402A1 (fr) | Amélioration de la parole personnalisée multicanaux indépendante de la géométrie de réseau | |
Mallidi et al. | Robust speaker recognition using spectro-temporal autoregressive models. |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 22769448; Country of ref document: EP; Kind code of ref document: A1 |
| WWE | Wipo information: entry into national phase | Ref document number: 202280065219.2; Country of ref document: CN |
| WWE | Wipo information: entry into national phase | Ref document number: 2022769448; Country of ref document: EP |
| NENP | Non-entry into the national phase | Ref country code: DE |
| ENP | Entry into the national phase | Ref document number: 2022769448; Country of ref document: EP; Effective date: 20240506 |