WO2023044961A1 - Multi-feature fusion echo cancellation method and system based on a self-attention transformation network - Google Patents

Multi-feature fusion echo cancellation method and system based on a self-attention transformation network Download PDF

Info

Publication number
WO2023044961A1
WO2023044961A1 PCT/CN2021/122348 CN2021122348W WO2023044961A1 WO 2023044961 A1 WO2023044961 A1 WO 2023044961A1 CN 2021122348 W CN2021122348 W CN 2021122348W WO 2023044961 A1 WO2023044961 A1 WO 2023044961A1
Authority
WO
WIPO (PCT)
Prior art keywords
features
attention
signal
echo cancellation
self
Prior art date
Application number
PCT/CN2021/122348
Other languages
English (en)
French (fr)
Inventor
涂卫平
刘雅洁
韩畅
肖立
杨玉红
刘陈建树
Original Assignee
武汉大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 武汉大学
Publication of WO2023044961A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272 Voice signal separating
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003 Changing voice quality, e.g. pitch or formants
    • G10L21/007 Changing voice quality, e.g. pitch or formants characterised by the process used
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L2021/02082 Noise filtering the noise being echo, reverberation of the speech

Definitions

  • the invention belongs to the field of audio technology, and relates to an echo cancellation method and system, in particular to a multi-feature fusion echo cancellation method and system based on a deep self-attention transformation network.
  • the local microphone will simultaneously collect the far-end signal played by the loudspeaker and the voice of the near-end speaker to form a near-end mixed signal.
  • This mixed signal is sent to the far end so that the speaker at the far end hears what he has just said.
  • This kind of sound signal that has been transmitted - played - collected again and transmitted back is called an echo.
  • the presence of echoes can severely degrade communication quality.
  • the goal of acoustic echo cancellation is to remove the echo signal formed by the far-end signal contained in the near-end mixed signal to the greatest extent, while preserving the speech information of the near-end speaker.
  • the present invention provides a multi-feature fusion echo cancellation method and system based on a deep self-attention transformation network.
  • the technical solution adopted by the method of the present invention is: a multi-feature fusion echo cancellation method based on a self-attention transformation network, comprising the following steps:
  • Step 1 Calculate the delay between the near-end mixed signal and the far-end reference signal, and align the double-ended signals;
  • Step 2 Extract latent features from the near-end mixed signal and the far-end reference signal respectively, calculate the attention weight matrix of the latent features of the near-end mixed signal and the latent features of the far-end reference signal, splice the mixed-signal features, the attention weight matrix and the reference-signal features, and then generate fusion features;
  • Step 3 Divide the fusion features extracted in step 2 into blocks of a specified size, splitting the fusion features into two path forms: intra-block features and inter-block features;
  • Step 4 Send the intra-block features from step 3 into the deep dynamic self-attention transformation network, add the output of the network to the attention weight matrix calculated in step 2 through a residual connection, convert the result into inter-block features, and send it into the deep dynamic self-attention transformation network again; repeat the above intra-block and inter-block operations to calculate the mask value;
  • Step 5 Use the mask value calculated in step 4 to mask the latent features of the near-end mixed signal, obtaining the echo-cancelled signal features;
  • Step 6 Decode the masked signal features from step 5 and reconstruct the signal to obtain the echo-cancelled near-end signal.
  • the technical solution adopted by the system of the present invention is: a multi-feature fusion echo cancellation system based on a deep self-attention transformation network, comprising the following modules:
  • Module 1, used to calculate the time delay between the near-end mixed signal and the far-end reference signal and align the double-ended signals;
  • Module 2, used to extract latent features from the near-end mixed signal and the far-end reference signal, calculate the attention weight matrix of the latent features of the near-end mixed signal and the latent features of the far-end reference signal, splice the mixed-signal features, the attention weight matrix and the reference-signal features, and then generate fusion features;
  • Module 3, used to divide the fusion features extracted in module 2 into blocks of a specified size and split the fusion features into two path forms, intra-block features and inter-block features;
  • Module 4, used to send the intra-block features from module 3 into the deep dynamic self-attention transformation network, add the output of the network to the attention weight matrix calculated in module 2 through a residual connection, convert the result into inter-block features, and send it into the deep dynamic self-attention transformation network again; the above intra-block and inter-block operations are repeated to calculate the mask value;
  • Module 5, used to mask the latent features of the near-end mixed signal with the mask value calculated in module 4 to obtain the echo-cancelled signal features;
  • Module 6, configured to decode the masked signal features from module 5 and reconstruct the signal to obtain the echo-cancelled near-end signal.
  • the present invention provides a multi-feature fusion echo cancellation method and system based on a deep self-attention transformation network, which enables the latent features of the double-ended signals to be fused more fully in the echo cancellation network and introduces a deep self-attention transformation network (Transformer) to fit the signal.
  • the present invention compensates for the information loss in the deep self-attention transformation network by adopting a multi-feature fusion residual network, accelerates the training process of the network, greatly improves the effect and application range of echo cancellation in complex environments such as background noise, double-talk and nonlinear distortion, and makes the echo cancellation network generalize better in complex environments.
  • Fig. 1 is a flowchart of the method of an embodiment of the present invention;
  • Fig. 2 is a structural diagram of the system of an embodiment of the present invention;
  • Fig. 3 is a flowchart of fusing the latent features of the reference signal and the mixed signal in an embodiment of the present invention;
  • Fig. 4 is a structural diagram of the deep dynamic self-attention transformation network in an embodiment of the present invention;
  • Fig. 5 is a structural diagram of the speech energy control component in an embodiment of the present invention.
  • This embodiment first calculates the time delay between the near-end mixed signal and the far-end reference signal for alignment, then extracts latent features independently from the aligned double-ended signals, and uses a multi-head attention mechanism and a depthwise separable network for feature fusion.
  • using a path-splitting scheme based on matrix dimension transformation, the fusion features can be divided into intra-block features and inter-block features.
  • the intra-block features are sent into the deep self-attention transformation network, the output of the network is added to the attention weight matrix through a residual connection, converted into inter-block features, and sent into the deep self-attention transformation network again. The above intra-block and inter-block operations are repeated 6 times to calculate the mask value.
  • finally, the near-end signal is masked, decoded and reconstructed using the mask to obtain the echo-cancelled near-end signal.
  • a multi-feature fusion echo cancellation method based on a deep self-attention transformation network comprises the following steps:
  • Step 1 Use the delay estimation method based on the generalized cross-correlation function to calculate the delay between the near-end mixed signal and the far-end reference signal, and align the double-ended signals;
  • the near end refers to the signal picked up by the local microphone;
  • the mixed signal refers to a signal that records both the voice of the local speaker and the far-end signal played by the local loudspeaker;
  • the reference signal refers to the far-end signal, which serves as the reference of the network and participates in its training, because the echo part of the signal picked up by the local microphone is the far-end signal after nonlinear distortion.
  • the delay estimation method is specifically the generalized cross-correlation-phase transformation method (GCC-PHAT):
  • the peak value of the cross-correlation function of the near-end mixed signal and the far-end reference signal is calculated to determine the delay value.
  • the cross-correlation function is the sum of sliding products of the two sequences, reflecting how well the two functions match at different relative positions. Because of the strong correlation between the far-end reference signal and the near-end mixed signal, the delay of the reference signal can ideally be calculated accurately.
  • the sampling frequency of the signals collected and processed by the microphone is 16 kHz.
  • Step 2 Extract latent features from the near-end mixed signal and the far-end reference signal through the corresponding encoders, and use the multi-head attention mechanism to calculate the attention weight matrix of the latent features of the near-end mixed signal and the latent features of the far-end reference signal.
  • the mixed-signal features, the attention weight matrix and the reference-signal features are spliced, and a fusion feature is then generated through a depthwise separable network;
  • step 2 includes the following sub-steps:
  • Step 2.1 The near-end mixed signal and the far-end reference signal independently pass through encoders to extract their respective latent features;
  • the encoder used in this embodiment is a one-dimensional convolutional layer with a relu activation function, where the convolution kernel size is twice the stride and the window length is determined according to the available GPU memory to balance performance and memory usage; in this embodiment its value is 20. The latent features extracted by the encoder are processed by group normalization (Group Normalization) and a bottleneck layer (Bottleneck Layer), where the bottleneck layer is a 1×1 convolutional neural network; the number of convolutional layers and activation functions can also be increased according to the training performance of the network to better fit the high-dimensional nonlinear latent features of the signal.
  • Step 2.2 Calculate the attention weight matrix from the latent features of the near-end mixed signal and the latent features of the far-end reference signal of step 2.1 through the multi-head attention mechanism;
  • Step 2.3 Splice the latent features calculated in step 2.1 with the attention weight matrix calculated in step 2.2 along the same dimension to obtain a splicing matrix;
  • Step 2.4 Use the depthwise separable network to perform a grouping operation on the splicing matrix of step 2.3, reducing its output channels to 1/3 of the original and forming deep fusion features that fully combine the information of the near-end mixed signal and the far-end reference signal.
  • the depthwise separable convolutional network in this embodiment consists of a depthwise convolutional layer and a pointwise convolutional layer, which greatly reduces the amount of computation required.
  • the multi-head attention mechanism is used to calculate the attention weight matrix of the latent features of the double-ended signals, and this matrix is spliced with the latent features of the double-ended signals to form a multi-feature splicing matrix; a depthwise separable convolution performs a grouping operation on the multi-feature splicing matrix, reducing the output channels to 1/3 of the original.
  • the formulas can be expressed as:
  • Q_i = Enc(mix), K_i = Enc(mix), V_i = Enc(far);
  • Attention(Q_i, K_i, V_i) = softmax(Q_i · K_i^T / √d) · V_i;
  • M_i = Pointwise(Depthwise(J_i));
  • mix and far represent the near-end mixed signal and the far-end reference signal respectively, whose latent features are obtained through the convolutional encoder Enc(); the latent features of the near-end mixed signal mix serve as the query Q and key K required by multi-head attention, and the latent features of the far-end reference signal far serve as the value V, where the subscript i denotes the head index in multi-head attention;
  • the attention weight matrix is calculated through the multi-head attention Attention(), specifically the scaled dot-product model: Q is multiplied by the transpose of K and divided by the square root of the vector dimension d, the scores are computed through the softmax activation function, and the result is multiplied by V to obtain the final attention weight matrix.
  • the latent features of the near-end and far-end signals and the attention weight matrix between them are spliced together to obtain the splicing matrix J; finally, J is sent into the depthwise separable convolution composed of a depthwise convolution layer and a pointwise convolution layer to calculate the fusion feature M.
  • the above attention weight matrix will also be connected to the training of the inter-block feature matrix in step 4 by a residual network.
  • Step 3 Divide the fusion features extracted in step 2 into blocks of a specified size, apply layer normalization to the divided fusion features, and use a matrix dimension transformation to split the fusion features into two path forms, intra-block features and inter-block features;
  • the fusion features form a long sequence input, which is divided into smaller blocks so that the input length is close to the square root of the original sequence length, optimizing the data space; the divided fusion features are layer-normalized; a dimension transformation applied to the processed fusion features generates intra-block and inter-block features of the same data along different dimensions.
  • Step 4 Send the intra-block features from step 3 into the deep dynamic self-attention transformation network, add the output of the network to the attention weight matrix calculated in step 2 through a residual connection, convert the result into inter-block features, and send it into the deep dynamic self-attention transformation network again; repeat the above intra-block and inter-block operations 6 times to calculate the mask value, making full use of the double-ended features for local and global modeling;
  • the deep dynamic self-attention transformation network of the present embodiment is a sequential hierarchical structure of a dynamic mask attention network (DMAN), a self-attention network and a feed-forward neural network; the feed-forward neural network consists of a long short-term memory network, an activation function and a linear layer.
  • This embodiment introduces a new dynamic mask attention network (DMAN), combined with the Transformer's original self-attention network (SAN) and feed-forward network (FFN), with data flowing through the hierarchical structure in the order DMAN → SAN → FFN.
  • the dynamic mask attention module of the improved network is formulated as follows:
  • A_M(Q, K, V) = S_M(Q, K) · V;
  • S_M(Q, K)_{i,j} = M_{i,j} · exp(Q_i · K_j^T / √d_k) / Σ_l (M_{i,l} · exp(Q_i · K_l^T / √d_k));
  • Q, K, and V are the query, key, and value in the attention mechanism respectively;
  • the attention A_M(Q, K, V) is the product of the attention scoring function S_M(Q, K) and the value V;
  • d_k is the vector dimension;
  • M_{i,j} is a number between 0 and 1, which can be dynamic or static. When M is an all-ones matrix, the MAN degenerates into SAN, and when it is the identity matrix, it degenerates into FFN.
  • FFN attends only to its own information and cannot perceive neighboring information, while in SAN each token has an equal connection to any other token.
  • it has been shown theoretically that DMAN remedies SAN's drawback of introducing noise and models local information better; therefore, adding DMAN to the echo cancellation network makes the noise other than the echo more stationary, helping the network cope with low signal-to-noise-ratio environments.
  • the deep dynamic self-attention transformation network of this embodiment also retains the self-attention network and the feedforward neural network to ensure the modeling effect of the entire network at different scales.
  • the feed-forward neural network consists of a long short-term memory network, an activation function, and a linear layer.
  • the long short-term memory network is used to capture the temporal information of the sequence.
  • Step 5 Use the mask value calculated in step 4 and the potential features of the near-end mixed signal to mask to obtain the signal features for echo cancellation;
  • the mask value passes through a two-dimensional convolution block, consisting of a prelu activation function and a two-dimensional convolutional layer, which maps the features into the hidden layer; the feature sequence is then restored following the matrix-splitting scheme of step 3; finally it passes through an activation component, which includes a speech energy control component composed of convolutional layers and the tanh, sigmoid and relu activation functions.
  • the mask value first passes through two parallel links, which are one-dimensional convolutional layer + tanh function and one-dimensional convolutional layer + sigmoid function.
  • the results of the two link outputs are multiplied, and the dot product is passed through the activation function relu again.
  • the mask value is finally limited to between 0 and 1.
  • the formula for the speech energy control component is as follows:
  • c_mask = relu(tanh(1d_conv(mask)) * sigmoid(1d_conv(mask)));
  • the original mask value mask passes through a one-dimensional convolutional layer 1d_conv() and then through the activation functions tanh() and sigmoid() respectively; the product of the two branches passes through the activation function relu() to obtain the energy-controlled c_mask.
  • Step 6 Decode the masked signal features in step 5 and reconstruct the signal to obtain the near-end signal after echo cancellation.
  • the decoding process is a linear layer;
  • signal reconstruction specifically restores the high-dimensional matrix to a one-dimensional speech sequence, similar to the overlap-add process of frame-by-frame synthesis; the echo-cancelled near-end speaker signal is finally obtained.
  • the invention makes full use of the high-dimensional feature information of the far-end reference signal in its network structure, solves the network degradation problem caused by increasing depth in the deep self-attention transformation network, and compensates for part of the irreversible information loss; the structure also accelerates the training process of the entire network and greatly improves the effect and application range of echo cancellation in complex environments with background noise, double-talk and nonlinear distortion.

Landscapes

  • Engineering & Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Cable Transmission Systems, Equalization Of Radio And Reduction Of Echo (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The present invention discloses a multi-feature fusion echo cancellation method and system based on a self-attention transformation network. Independent convolutional encoders extract latent features from the near-end signal and the far-end signal respectively; an attention weight matrix is calculated from the encoded double-ended signals by multi-head attention and spliced with the latent features of the double-ended signals, and a depthwise separable convolutional network fuses the spliced signal; after a dimension transformation operation, the fused signal yields intra-block features, which pass through a deep dynamic self-attention transformation network and are then added to the attention weight matrix through a residual connection; the result is converted into inter-block features and passes through the deep dynamic self-attention transformation network again; the intra-block and inter-block operations are repeated to calculate a mask value; the masked encoded signal is decoded to obtain the echo-cancelled near-end signal. The present invention can cancel echo in a variety of scenarios and can greatly improve the echo cancellation effect while preserving the integrity of the near-end speech.

Description

Multi-feature fusion echo cancellation method and system based on self-attention transformation network
Technical Field
The present invention belongs to the field of audio technology and relates to an echo cancellation method and system, in particular to a multi-feature fusion echo cancellation method and system based on a deep self-attention transformation network.
Background Art
In a full-duplex voice communication system, when the local loudspeaker plays the received far-end signal, the local microphone simultaneously picks up the far-end signal played by the loudspeaker and the voice of the near-end speaker, forming a near-end mixed signal. This mixed signal is sent to the far end, so that the far-end speaker hears the words he has just spoken. A sound signal that has been transmitted, played, picked up again and transmitted back in this way is called an echo. The presence of echo severely degrades communication quality. The goal of acoustic echo cancellation is to remove, to the greatest extent possible, the echo signal formed by the far-end signal contained in the near-end mixed signal, while preserving the speech information of the near-end speaker.
Traditional acoustic echo cancellation algorithms usually take the received far-end signal as a reference signal and use a finite impulse response filter to adaptively estimate the echo signal, which is then subtracted from the mixed signal picked up by the microphone. However, traditional methods have difficulty estimating the echo signal accurately in complex environments with nonlinear echo and noise.
In recent years, methods based on deep neural networks have been applied in the field of echo cancellation. Compared with traditional echo cancellation algorithms, deep-neural-network-based methods fit nonlinear echo better and can remove background noise, making them more competitive at low signal-to-noise ratios. Deep networks fit nonlinear features very well, but increasing network depth leads to network degradation and partly irreversible information loss, especially in some complex deep network structures.
Summary of the Invention
To solve the above technical problems, the present invention provides a multi-feature fusion echo cancellation method and system based on a deep self-attention transformation network.
The technical solution adopted by the method of the present invention is a multi-feature fusion echo cancellation method based on a self-attention transformation network, comprising the following steps:
Step 1: calculate the delay between the near-end mixed signal and the far-end reference signal, and align the double-ended signals;
Step 2: extract latent features from the near-end mixed signal and the far-end reference signal respectively, calculate the attention weight matrix of the latent features of the near-end mixed signal and the latent features of the far-end reference signal, splice the mixed-signal features, the attention weight matrix and the reference-signal features, and then generate fusion features;
Step 3: divide the fusion features extracted in step 2 into blocks of a specified size, splitting the fusion features into two path forms, intra-block features and inter-block features;
Step 4: send the intra-block features from step 3 into the deep dynamic self-attention transformation network, add the output of the network to the attention weight matrix calculated in step 2 through a residual connection, convert the result into inter-block features, and send it into the deep dynamic self-attention transformation network again; repeat the above intra-block and inter-block operations to calculate the mask value;
Step 5: mask the latent features of the near-end mixed signal with the mask value calculated in step 4 to obtain the echo-cancelled signal features;
Step 6: decode the masked signal features from step 5 and reconstruct the signal to obtain the echo-cancelled near-end signal.
The technical solution adopted by the system of the present invention is a multi-feature fusion echo cancellation system based on a deep self-attention transformation network, comprising the following modules:
Module 1, configured to calculate the delay between the near-end mixed signal and the far-end reference signal and align the double-ended signals;
Module 2, configured to extract latent features from the near-end mixed signal and the far-end reference signal respectively, calculate the attention weight matrix of the latent features of the near-end mixed signal and the latent features of the far-end reference signal, splice the mixed-signal features, the attention weight matrix and the reference-signal features, and then generate fusion features;
Module 3, configured to divide the fusion features extracted by module 2 into blocks of a specified size, splitting the fusion features into two path forms, intra-block features and inter-block features;
Module 4, configured to send the intra-block features from module 3 into the deep dynamic self-attention transformation network, add the output of the network to the attention weight matrix calculated in module 2 through a residual connection, convert the result into inter-block features, and send it into the deep dynamic self-attention transformation network again; and to repeat the above intra-block and inter-block operations to calculate the mask value;
Module 5, configured to mask the latent features of the near-end mixed signal with the mask value calculated in module 4 to obtain the echo-cancelled signal features;
Module 6, configured to decode the masked signal features from module 5 and reconstruct the signal to obtain the echo-cancelled near-end signal.
The present invention provides a multi-feature fusion echo cancellation method and system based on a deep self-attention transformation network, which enables the latent features of the double-ended signals to be fused more fully in the echo cancellation network and introduces a deep self-attention transformation network (Transformer) to fit the signal. By adopting a multi-feature fusion residual network, the present invention compensates for the information loss in the deep self-attention transformation network, accelerates the training process of the network, greatly improves the effect and application range of echo cancellation in complex environments such as background noise, double-talk and nonlinear distortion, and makes the echo cancellation network generalize better in complex environments.
Brief Description of the Drawings
Fig. 1 is a flowchart of the method of an embodiment of the present invention;
Fig. 2 is a structural diagram of the system of an embodiment of the present invention;
Fig. 3 is a flowchart of fusing the latent features of the reference signal and the mixed signal in an embodiment of the present invention;
Fig. 4 is a structural diagram of the deep dynamic self-attention transformation network in an embodiment of the present invention;
Fig. 5 is a structural diagram of the speech energy control component in an embodiment of the present invention.
Detailed Description of the Embodiments
The technical solutions in the embodiments of the present invention will be described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
This embodiment first calculates the delay between the near-end mixed signal and the far-end reference signal for alignment, then extracts latent features independently from the aligned double-ended signals and fuses them using a multi-head attention mechanism and a depthwise separable network. Using a path-splitting scheme based on matrix dimension transformation, the fusion features can be divided into intra-block features and inter-block features. The intra-block features are sent into the deep self-attention transformation network, the output of the network is added to the attention weight matrix through a residual connection, converted into inter-block features, and sent into the deep self-attention transformation network again. The above intra-block and inter-block operations are repeated 6 times to calculate the mask value. Finally, the near-end signal is masked, decoded and reconstructed with the mask to obtain the echo-cancelled near-end signal.
Referring to Fig. 1, the multi-feature fusion echo cancellation method based on a deep self-attention transformation network provided by the present invention comprises the following steps:
Step 1: use a delay estimation method based on the generalized cross-correlation function to calculate the delay between the near-end mixed signal and the far-end reference signal, and align the double-ended signals;
In this technical field, the near end refers to the signal picked up by the local microphone; the mixed signal means that this signal records both the voice of the local speaker and the far-end signal played by the local loudspeaker; the reference signal means that the far-end signal serves as the reference of the network and participates in its training, because the echo part of the signal picked up by the local microphone is the far-end signal after nonlinear distortion.
In this embodiment, the delay estimation method is specifically the generalized cross-correlation with phase transform (GCC-PHAT) method:
In this embodiment, the peak of the cross-correlation function of the near-end mixed signal and the far-end reference signal is calculated to determine the delay value; the cross-correlation function is the sum of sliding products of the two sequences and reflects how well the two functions match at different relative positions. Because the far-end reference signal and the near-end mixed signal are strongly correlated, the delay of the reference signal can ideally be calculated accurately.
In this embodiment, the signals picked up and processed by the microphone are all sampled at 16 kHz.
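As an illustration of the GCC-PHAT computation described above, here is a minimal numpy sketch; the function name, the 16 kHz default and the half-second search window are assumptions for illustration, not taken from the patent:

```python
import numpy as np

def gcc_phat_delay(mix, far, fs=16000, max_delay_s=0.5):
    """Estimate the delay between two signals from the peak of their
    PHAT-weighted generalized cross-correlation."""
    n = len(mix) + len(far)               # FFT size for a linear correlation
    f_mix = np.fft.rfft(mix, n=n)
    f_far = np.fft.rfft(far, n=n)
    cross = f_mix * np.conj(f_far)        # cross-power spectrum
    cross /= np.abs(cross) + 1e-10        # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n=n)
    max_shift = int(fs * max_delay_s)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))  # lags -max..+max
    return int(np.argmax(np.abs(cc))) - max_shift  # correlation peak = delay in samples

# d = gcc_phat_delay(near_mix, far_ref)   # shift far_ref by d samples to align
```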
Step 2: extract latent features from the near-end mixed signal and the far-end reference signal through the corresponding encoders, use the multi-head attention mechanism to calculate the attention weight matrix of the latent features of the near-end mixed signal and the latent features of the far-end reference signal, splice the mixed-signal features, the attention weight matrix and the reference-signal features, and then generate fusion features through a depthwise separable network;
In this embodiment, the specific implementation of step 2 comprises the following sub-steps:
Step 2.1: the near-end mixed signal and the far-end reference signal independently pass through encoders to extract their respective latent features;
The encoder used in this embodiment is a one-dimensional convolutional layer with a relu activation function, where the convolution kernel size is twice the stride and the window length is determined according to the available GPU memory to balance performance and memory usage; in this embodiment its value is 20. The latent features extracted by the encoder are processed by group normalization (Group Normalization) and a bottleneck layer (Bottleneck Layer), where the bottleneck layer is a 1×1 convolutional neural network; the number of convolutional layers and activation functions can also be increased according to the training performance of the network to better fit the high-dimensional nonlinear latent features of the signal.
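One possible reading of this encoder in PyTorch is sketched below; the channel widths are illustrative assumptions, while the kernel/stride relation, the window length 20, the group normalization and the 1×1 bottleneck follow the description above:

```python
import torch
import torch.nn as nn

class SignalEncoder(nn.Module):
    """1-D convolutional encoder: kernel size = 2 * stride, window length 20."""
    def __init__(self, in_ch=1, feat_ch=256, bottleneck_ch=64, win=20):
        super().__init__()
        self.conv = nn.Conv1d(in_ch, feat_ch, kernel_size=win,
                              stride=win // 2, bias=False)
        self.relu = nn.ReLU()
        self.norm = nn.GroupNorm(1, feat_ch)                 # group normalization
        self.bottleneck = nn.Conv1d(feat_ch, bottleneck_ch,  # 1x1 bottleneck layer
                                    kernel_size=1)

    def forward(self, x):                      # x: (batch, 1, samples)
        latent = self.relu(self.conv(x))       # (batch, feat_ch, frames)
        return self.bottleneck(self.norm(latent))

# separate encoders for the near-end mixture and the far-end reference:
# enc_mix, enc_far = SignalEncoder(), SignalEncoder()
```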
Step 2.2: calculate the attention weight matrix from the latent features of the near-end mixed signal and the latent features of the far-end reference signal of step 2.1 through the multi-head attention mechanism;
Step 2.3: splice the latent features calculated in step 2.1 with the attention weight matrix calculated in step 2.2 along the same dimension to obtain a splicing matrix;
Step 2.4: use the depthwise separable network to perform a grouping operation on the splicing matrix of step 2.3, reducing its output channels to 1/3 of the original and forming deep fusion features that fully combine the information of the near-end mixed signal and the far-end reference signal.
Referring to Fig. 3, the depthwise separable convolutional network of this embodiment consists of a depthwise convolutional layer and a pointwise convolutional layer, which greatly reduces the amount of computation required.
In this embodiment, the multi-head attention mechanism is used to calculate the attention weight matrix of the latent features of the double-ended signals, and this matrix is spliced with the latent features of the double-ended signals to form a multi-feature splicing matrix; a depthwise separable convolution performs a grouping operation on the multi-feature splicing matrix, reducing the output channels to 1/3 of the original. The formulas can be expressed as:
Q_i = Enc(mix), K_i = Enc(mix), V_i = Enc(far)
Attention(Q_i, K_i, V_i) = softmax(Q_i · K_i^T / √d) · V_i
M_i = Pointwise(Depthwise(J_i))
where mix and far denote the near-end mixed signal and the far-end reference signal respectively, whose latent features are obtained through the convolutional encoder Enc(); the latent features of the near-end mixed signal mix serve as the query Q and key K required by multi-head attention, and the latent features of the far-end reference signal far serve as the value V, where the subscript i denotes the head index in multi-head attention. The attention weight matrix is calculated through the multi-head attention Attention(), specifically the scaled dot-product model: Q is multiplied by the transpose of K and divided by the square root of the vector dimension d, the scores are computed through the softmax activation function, and the result is multiplied by V to obtain the final attention weight matrix.
The latent features of the near-end and far-end signals and the attention weight matrix between them are then spliced to obtain the splicing matrix J; finally, J is sent into the depthwise separable convolution composed of a depthwise convolutional layer and a pointwise convolutional layer to calculate the fusion feature M.
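Under the formulas above, the fusion module might look like the following PyTorch sketch; the head count, channel width and depthwise kernel size are assumptions, and torch's built-in multi-head attention stands in for Attention():

```python
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    """Attention over double-ended latent features, splicing, and a
    depthwise separable convolution that reduces channels to one third."""
    def __init__(self, ch=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(ch, heads, batch_first=True)
        self.depthwise = nn.Conv1d(3 * ch, 3 * ch, kernel_size=3,
                                   padding=1, groups=3 * ch)
        self.pointwise = nn.Conv1d(3 * ch, ch, kernel_size=1)

    def forward(self, mix_feat, far_feat):     # both: (batch, ch, frames)
        q = k = mix_feat.transpose(1, 2)       # query/key from the mixture
        v = far_feat.transpose(1, 2)           # value from the reference
        a, _ = self.attn(q, k, v)              # attention weight matrix (scores x V)
        j = torch.cat([mix_feat, a.transpose(1, 2), far_feat], dim=1)  # splicing matrix J
        m = self.pointwise(self.depthwise(j))  # fusion feature M
        return m, a.transpose(1, 2)            # keep attention for the residual link
```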
The above attention weight matrix is also connected, through a residual network, to the training of the inter-block feature matrix in step 4.
Step 3: divide the fusion features extracted in step 2 into blocks of a specified size, apply layer normalization to the divided fusion features, and use a matrix dimension transformation to split the fusion features into two path forms, intra-block features and inter-block features;
In this embodiment, the fusion features form a long sequence input, which is divided into smaller blocks so that the input length is close to the square root of the original sequence length, optimizing the data space; the divided fusion features are layer-normalized; a dimension transformation applied to the processed fusion features generates intra-block and inter-block features of the same data along different dimensions.
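A minimal sketch of this segmentation; the 50% block overlap is an assumption (a common dual-path choice), while the block size near the square root of the sequence length follows the text:

```python
import math
import torch
import torch.nn.functional as F

def segment(features, block_size=None):
    """Split (batch, ch, frames) into blocks for dual-path processing;
    returns (batch, ch, block_size, n_blocks). Assumes frames >= block size."""
    b, ch, t = features.shape
    k = block_size or max(2, round(math.sqrt(t)))      # block size near sqrt(T)
    hop = k // 2                                       # assumed 50% overlap
    pad = (hop - (t - k) % hop) % hop
    x = F.pad(features, (0, pad))
    blocks = x.unfold(dimension=2, size=k, step=hop)   # (b, ch, n_blocks, k)
    return blocks.permute(0, 1, 3, 2).contiguous()

# intra-block path: attend along dim 2; inter-block path: transpose dims 2 and 3
# intra = segment(m);  inter = intra.transpose(2, 3)
```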
Step 4: send the intra-block features from step 3 into the deep dynamic self-attention transformation network, add the output of the network to the attention weight matrix calculated in step 2 through a residual connection, convert the result into inter-block features, and send it into the deep dynamic self-attention transformation network again; repeat the above intra-block and inter-block operations 6 times to calculate the mask value, making full use of the double-ended features for local and global modeling;
Referring to Fig. 4, the deep dynamic self-attention transformation network of this embodiment is a sequential hierarchical structure of a dynamic mask attention network (DMAN), a self-attention network and a feed-forward neural network; the feed-forward neural network consists of a long short-term memory network, an activation function and a linear layer.
This embodiment introduces a new dynamic mask attention network (DMAN), combined with the Transformer's original self-attention network (SAN) and feed-forward network (FFN), with the data flowing through the hierarchical structure in the order DMAN → SAN → FFN.
The dynamic mask attention module of the improved network is formulated as follows:
A_M(Q, K, V) = S_M(Q, K) · V;
S_M(Q, K)_{i,j} = M_{i,j} · exp(Q_i · K_j^T / √d_k) / Σ_l (M_{i,l} · exp(Q_i · K_l^T / √d_k));
where Q, K and V are the query, key and value of the attention mechanism respectively; the attention A_M(Q, K, V) is the product of the attention scoring function S_M(Q, K) and the value V; d_k is the vector dimension; and M_{i,j} is a number between 0 and 1, which can be dynamic or static. When M is an all-ones matrix, the MAN degenerates into SAN; when it is the identity matrix, it degenerates into FFN. Intuitively, FFN attends only to its own information and cannot perceive neighboring information, while in SAN every token has an equal connection to any other token.
In addition, it has been shown theoretically that DMAN remedies SAN's drawback of introducing noise and models local information better; therefore, adding DMAN to the echo cancellation network makes the noise other than the echo more stationary, helping the network cope with low signal-to-noise-ratio environments.
The deep dynamic self-attention transformation network of this embodiment also retains the self-attention network and the feed-forward neural network, guaranteeing the modeling capability of the whole network at different scales. The feed-forward neural network consists of a long short-term memory network, an activation function and a linear layer, where the long short-term memory network captures the temporal information of the sequence.
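A sketch of one layer in the DMAN → SAN → FFN order under the formulas above. The residual additions, the single-head DMAN and the static band mask (standing in for the learned dynamic mask M) are illustrative assumptions:

```python
import torch
import torch.nn as nn

def masked_attention(q, k, v, mask):
    """A_M(Q,K,V) = S_M(Q,K)·V with the mask-weighted scoring function above."""
    d_k = q.size(-1)
    logits = q @ k.transpose(-2, -1) / d_k ** 0.5
    logits = logits - logits.max(dim=-1, keepdim=True).values  # numerical stability
    scores = torch.exp(logits) * mask                # M_ij-weighted exp scores
    s_m = scores / (scores.sum(dim=-1, keepdim=True) + 1e-10)
    return s_m @ v

class DynamicTransformerLayer(nn.Module):
    """DMAN -> SAN -> FFN, the FFN wrapping an LSTM as described above."""
    def __init__(self, ch=64, heads=4, local_width=8):
        super().__init__()
        self.local_width = local_width
        self.qkv = nn.Linear(ch, 3 * ch)
        self.san = nn.MultiheadAttention(ch, heads, batch_first=True)
        self.lstm = nn.LSTM(ch, ch, batch_first=True, bidirectional=True)
        self.ffn_out = nn.Sequential(nn.ReLU(), nn.Linear(2 * ch, ch))

    def forward(self, x):                            # x: (batch, seq, ch)
        t = x.size(1)
        idx = torch.arange(t, device=x.device)       # static band mask as a
        mask = ((idx[None, :] - idx[:, None]).abs()  # stand-in for dynamic M
                <= self.local_width).float()
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        x = x + masked_attention(q, k, v, mask)      # DMAN (single head here)
        x = x + self.san(x, x, x)[0]                 # SAN
        x = x + self.ffn_out(self.lstm(x)[0])        # FFN with LSTM
        return x
```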
Step 5: mask the latent features of the near-end mixed signal with the mask value calculated in step 4 to obtain the echo-cancelled signal features;
In this embodiment, the mask value passes through a two-dimensional convolution block consisting of a prelu activation function and a two-dimensional convolutional layer, which maps the features into the hidden layer; the feature sequence is then restored following the matrix-splitting scheme of step 3; finally it passes through an activation component, which includes a speech energy control component composed of convolutional layers and the tanh, sigmoid and relu activation functions.
As shown in Fig. 5, the mask value first passes through two parallel branches, namely a one-dimensional convolutional layer + tanh function and a one-dimensional convolutional layer + sigmoid function. The outputs of the two branches are multiplied, and their product passes through the relu activation function again.
Since the range of tanh is (-1, 1), the range of sigmoid is (0, 1) and the range of relu is [0, ∞), the mask value is finally limited to between 0 and 1. The formula of the speech energy control component is as follows:
c_mask = relu(tanh(1d_conv(mask)) * sigmoid(1d_conv(mask)));
The formulas and ranges of the activation functions are:
y = tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x)), y ∈ (−1, 1);
y = sigmoid(x) = 1 / (1 + e^(−x)), y ∈ (0, 1);
y = relu(x) = max(0, x), y ∈ [0, ∞);
The original mask value mask passes through a one-dimensional convolutional layer 1d_conv() and then through the activation functions tanh() and sigmoid() respectively; the product of the two branches passes through the activation function relu() to obtain the energy-controlled mask c_mask.
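A sketch of the speech energy control component of Fig. 5; the 1×1 kernels of the two parallel convolutions are an assumption, since the patent does not give their size:

```python
import torch
import torch.nn as nn

class SpeechEnergyControl(nn.Module):
    """Two parallel branches (conv + tanh, conv + sigmoid), multiplied,
    then relu, limiting the mask values to the interval [0, 1)."""
    def __init__(self, ch=64):
        super().__init__()
        self.conv_t = nn.Conv1d(ch, ch, kernel_size=1)   # branch for tanh
        self.conv_s = nn.Conv1d(ch, ch, kernel_size=1)   # branch for sigmoid

    def forward(self, mask):                             # (batch, ch, frames)
        gated = torch.tanh(self.conv_t(mask)) * torch.sigmoid(self.conv_s(mask))
        return torch.relu(gated)                         # c_mask

# masking the latent features of the near-end mixture:
# clean_feat = SpeechEnergyControl()(mask) * enc_mix_out
```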
Step 6: decode the masked signal features from step 5 and reconstruct the signal to obtain the echo-cancelled near-end signal.
In this embodiment, the decoding process is a linear layer, and signal reconstruction specifically restores the high-dimensional matrix to a one-dimensional speech sequence, similar to the overlap-add process of frame-by-frame synthesis; the echo-cancelled near-end speaker signal is finally obtained.
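The decoding and overlap-add reconstruction could be sketched as below, assuming the window length 20 and hop 10 that mirror the step-2 encoder:

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """Linear decoding layer plus frame-by-frame overlap-add synthesis."""
    def __init__(self, ch=64, win=20):
        super().__init__()
        self.win, self.hop = win, win // 2
        self.linear = nn.Linear(ch, win)          # the linear decoding layer

    def forward(self, feat):                      # feat: (batch, ch, frames)
        frames = self.linear(feat.transpose(1, 2))        # (batch, frames, win)
        b, n, _ = frames.shape
        out = torch.zeros(b, self.hop * (n - 1) + self.win,
                          device=feat.device)
        for i in range(n):                        # overlap-add, hop = win // 2
            out[:, i * self.hop:i * self.hop + self.win] += frames[:, i, :]
        return out                                # echo-cancelled waveform
```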
Referring to Fig. 2, the multi-feature fusion echo cancellation system based on a deep self-attention transformation network provided by the present invention comprises the following modules:
Module 1, configured to calculate the delay between the near-end mixed signal and the far-end reference signal and align the double-ended signals;
Module 2, configured to extract latent features from the near-end mixed signal and the far-end reference signal respectively, calculate the attention weight matrix of the latent features of the near-end mixed signal and the latent features of the far-end reference signal, splice the mixed-signal features, the attention weight matrix and the reference-signal features, and then generate fusion features;
Module 3, configured to divide the fusion features extracted by module 2 into blocks of a specified size, splitting the fusion features into two path forms, intra-block features and inter-block features;
Module 4, configured to send the intra-block features from module 3 into the deep dynamic self-attention transformation network, add the output of the network to the attention weight matrix calculated in module 2 through a residual connection, convert the result into inter-block features, and send it into the deep dynamic self-attention transformation network again; and to repeat the above intra-block and inter-block operations to calculate the mask value;
Module 5, configured to mask the latent features of the near-end mixed signal with the mask value calculated in module 4 to obtain the echo-cancelled signal features;
Module 6, configured to decode the masked signal features from module 5 and reconstruct the signal to obtain the echo-cancelled near-end signal.
This conclusion has been preliminarily verified through the above method and training on the AEC-Challenge dataset. With a noisy far-end signal, the SI-SNR of the AEC-Challenge baseline is 12.20 dB and the best result of the DTLN-aec network is 13.59 dB, while the network of the present invention achieves 15.28 dB in testing.
In its network structure, the present invention makes full use of the high-dimensional feature information of the far-end reference signal, solves the network degradation problem that arises with increasing depth in deep self-attention transformation networks, and compensates for part of the irreversible information loss; this structure also accelerates the training process of the whole network and greatly improves the effect and application range of echo cancellation in complex environments with background noise, double-talk and nonlinear distortion.
It should be understood that the parts not elaborated in this specification belong to the prior art.
It should be understood that the above description of the preferred embodiments is relatively detailed and should not therefore be regarded as limiting the scope of patent protection of the present invention. Under the inspiration of the present invention and without departing from the scope protected by the claims, a person of ordinary skill in the art may also make substitutions or variations, which all fall within the protection scope of the present invention; the claimed scope of protection of the present invention shall be subject to the appended claims.

Claims (10)

  1. A multi-feature fusion echo cancellation method based on a self-attention transformation network, characterized by comprising the following steps:
    Step 1: calculating the delay between the near-end mixed signal and the far-end reference signal, and aligning the double-ended signals;
    Step 2: extracting latent features from the near-end mixed signal and the far-end reference signal respectively, calculating the attention weight matrix of the latent features of the near-end mixed signal and the latent features of the far-end reference signal, splicing the mixed-signal features, the attention weight matrix and the reference-signal features, and then generating fusion features;
    Step 3: dividing the fusion features extracted in step 2 into blocks of a specified size, splitting the fusion features into two path forms, intra-block features and inter-block features;
    Step 4: sending the intra-block features from step 3 into the deep dynamic self-attention transformation network, adding the output of the network to the attention weight matrix calculated in step 2 through a residual connection, converting the result into inter-block features, and sending it into the deep dynamic self-attention transformation network again; repeating the above intra-block and inter-block operations to calculate the mask value;
    Step 5: masking the latent features of the near-end mixed signal with the mask value calculated in step 4 to obtain the echo-cancelled signal features;
    Step 6: decoding the masked signal features from step 5 and reconstructing the signal to obtain the echo-cancelled near-end signal.
  2. The multi-feature fusion echo cancellation method based on a self-attention transformation network according to claim 1, characterized in that: in step 1, a delay estimation method based on the generalized cross-correlation function is used to calculate the delay between the near-end mixed signal and the far-end reference signal.
  3. The multi-feature fusion echo cancellation method based on a self-attention transformation network according to claim 1, characterized in that the specific implementation of step 2 comprises the following sub-steps:
    Step 2.1: the near-end mixed signal and the far-end reference signal independently pass through encoders to extract their respective latent features;
    Step 2.2: calculating the attention weight matrix from the latent features of the near-end mixed signal and the latent features of the far-end reference signal of step 2.1 through the multi-head attention mechanism;
    Step 2.3: splicing the latent features calculated in step 2.1 with the attention weight matrix calculated in step 2.2 along the same dimension to obtain a splicing matrix;
    Step 2.4: using the depthwise separable convolutional network to perform a grouping operation on the splicing matrix of step 2.3, reducing its output channels to 1/3 of the original and forming deep fusion features that fully combine the information of the near-end mixed signal and the far-end reference signal.
  4. The multi-feature fusion echo cancellation method based on a self-attention transformation network according to claim 3, characterized in that: the encoder of step 2.1 consists of convolutional layers and relu activation functions, where the number of convolutional layers and activation functions is determined according to the training performance; the convolution kernel size is twice the stride, and the window length is determined according to the available GPU memory to balance performance and memory usage; the latent features extracted by the encoder are processed by group normalization and a bottleneck layer, where the bottleneck layer is a 1×1 convolutional neural network.
  5. The multi-feature fusion echo cancellation method based on a self-attention transformation network according to claim 3, characterized in that: the depthwise separable convolutional network of step 2.4 consists of a depthwise convolutional layer and a pointwise convolutional layer.
  6. The multi-feature fusion echo cancellation method based on a self-attention transformation network according to claim 1, characterized in that: in step 3, after layer normalization of the divided fusion features, a matrix dimension transformation is used to split the fusion features into two path forms, intra-block features and inter-block features.
  7. The multi-feature fusion echo cancellation method based on a self-attention transformation network according to claim 1, characterized in that: the deep dynamic self-attention transformation network of step 4 is a sequential hierarchical structure of a dynamic mask attention network, a self-attention network and a feed-forward neural network; the feed-forward neural network consists of a long short-term memory network, an activation function and a linear layer.
  8. The multi-feature fusion echo cancellation method based on a self-attention transformation network according to claim 1, characterized in that: in step 5, the mask value passes through a two-dimensional convolution block that maps the features into the hidden layer, the two-dimensional convolution block consisting of a prelu activation function and a two-dimensional convolutional layer; the feature sequence is then restored following the splitting scheme of step 3, and the mask value passes through two parallel branches, namely a one-dimensional convolutional layer + tanh function and a one-dimensional convolutional layer + sigmoid function; finally it passes through an activation component, which is a speech energy control component composed of convolutional layers and the tanh, sigmoid and relu activation functions; the outputs of the two branches are multiplied and their product passes through the relu activation function again; since the range of tanh is (-1, 1), the range of sigmoid is (0, 1) and the range of relu is [0, ∞), the mask value is finally limited to between 0 and 1.
  9. The multi-feature fusion echo cancellation method based on a self-attention transformation network according to any one of claims 1-8, characterized in that: in step 6, the decoding process is a linear layer, and signal reconstruction specifically restores the high-dimensional matrix to a one-dimensional speech sequence, finally obtaining the echo-cancelled near-end speaker signal.
  10. A multi-feature fusion echo cancellation system based on a self-attention transformation network, characterized by comprising the following modules:
    Module 1, configured to calculate the delay between the near-end mixed signal and the far-end reference signal and align the double-ended signals;
    Module 2, configured to extract latent features from the near-end mixed signal and the far-end reference signal respectively, calculate the attention weight matrix of the latent features of the near-end mixed signal and the latent features of the far-end reference signal, splice the mixed-signal features, the attention weight matrix and the reference-signal features, and then generate fusion features;
    Module 3, configured to divide the fusion features extracted by module 2 into blocks of a specified size, splitting the fusion features into two path forms, intra-block features and inter-block features;
    Module 4, configured to send the intra-block features from module 3 into the deep dynamic self-attention transformation network, add the output of the network to the attention weight matrix calculated in module 2 through a residual connection, convert the result into inter-block features, and send it into the deep dynamic self-attention transformation network again; and to repeat the above intra-block and inter-block operations to calculate the mask value;
    Module 5, configured to mask the latent features of the near-end mixed signal with the mask value calculated in module 4 to obtain the echo-cancelled signal features;
    Module 6, configured to decode the masked signal features from module 5 and reconstruct the signal to obtain the echo-cancelled near-end signal.
PCT/CN2021/122348 2021-09-23 2021-09-30 Multi-feature fusion echo cancellation method and system based on self-attention transformation network WO2023044961A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111113340.1A CN113870874A (zh) 2021-09-23 2021-09-23 Multi-feature fusion echo cancellation method and system based on self-attention transformation network
CN202111113340.1 2021-09-23

Publications (1)

Publication Number Publication Date
WO2023044961A1 true WO2023044961A1 (zh) 2023-03-30

Family

ID=78993406

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/122348 WO2023044961A1 (zh) 2021-09-23 2021-09-30 Multi-feature fusion echo cancellation method and system based on self-attention transformation network

Country Status (2)

Country Link
CN (1) CN113870874A (zh)
WO (1) WO2023044961A1 (zh)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116660992A (zh) * 2023-06-05 2023-08-29 北京石油化工学院 一种基于多特征融合的地震信号处理方法
CN117290809A (zh) * 2023-11-22 2023-12-26 小舟科技有限公司 多源异构生理信号融合方法及装置、设备、存储介质
CN117437929A (zh) * 2023-12-21 2024-01-23 睿云联(厦门)网络通讯技术有限公司 一种基于神经网络的实时回声消除方法
CN117711417A (zh) * 2024-02-05 2024-03-15 武汉大学 一种基于频域自注意力网络的语音质量增强方法及系统
CN117798654A (zh) * 2024-02-29 2024-04-02 山西漳电科学技术研究院(有限公司) 汽轮机轴系中心智能调整系统

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10636434B1 (en) * 2018-09-28 2020-04-28 Apple Inc. Joint spatial echo and noise suppression with adaptive suppression criteria
CN111353258A (zh) * 2020-02-10 2020-06-30 厦门快商通科技股份有限公司 基于编码解码神经网络的回声抑制方法及音频装置及设备
US20200312346A1 (en) * 2019-03-28 2020-10-01 Samsung Electronics Co., Ltd. System and method for acoustic echo cancellation using deep multitask recurrent neural networks
CN112151059A (zh) * 2020-09-25 2020-12-29 南京工程学院 面向麦克风阵列的通道注意力加权的语音增强方法
CN113299306A (zh) * 2021-07-27 2021-08-24 北京世纪好未来教育科技有限公司 回声消除方法、装置、电子设备及计算机可读存储介质

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10636434B1 (en) * 2018-09-28 2020-04-28 Apple Inc. Joint spatial echo and noise suppression with adaptive suppression criteria
US20200312346A1 (en) * 2019-03-28 2020-10-01 Samsung Electronics Co., Ltd. System and method for acoustic echo cancellation using deep multitask recurrent neural networks
CN111755019A (zh) * 2019-03-28 2020-10-09 三星电子株式会社 用深度多任务递归神经网络来声学回声消除的系统和方法
CN111353258A (zh) * 2020-02-10 2020-06-30 厦门快商通科技股份有限公司 基于编码解码神经网络的回声抑制方法及音频装置及设备
CN112151059A (zh) * 2020-09-25 2020-12-29 南京工程学院 面向麦克风阵列的通道注意力加权的语音增强方法
CN113299306A (zh) * 2021-07-27 2021-08-24 北京世纪好未来教育科技有限公司 回声消除方法、装置、电子设备及计算机可读存储介质

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WEN SUO: "Linear Echo Cancellation and Convex Reconstruction of Incomplete Transfer Function Based on DNN", JISUANJI CELIANG YU KONGZHI - COMPUTER MEASUREMENT & CONTROL, JISUANJI CELIANG YU KONGZHI ZAZHISHE, BEIJING, CN, no. 6, 30 June 2020 (2020-06-30), CN , pages 108 - 112, XP093053541, ISSN: 1671-4598 *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116660992A (zh) * 2023-06-05 2023-08-29 北京石油化工学院 一种基于多特征融合的地震信号处理方法
CN116660992B (zh) * 2023-06-05 2024-03-05 北京石油化工学院 一种基于多特征融合的地震信号处理方法
CN117290809A (zh) * 2023-11-22 2023-12-26 小舟科技有限公司 多源异构生理信号融合方法及装置、设备、存储介质
CN117290809B (zh) * 2023-11-22 2024-03-12 小舟科技有限公司 多源异构生理信号融合方法及装置、设备、存储介质
CN117437929A (zh) * 2023-12-21 2024-01-23 睿云联(厦门)网络通讯技术有限公司 一种基于神经网络的实时回声消除方法
CN117437929B (zh) * 2023-12-21 2024-03-08 睿云联(厦门)网络通讯技术有限公司 一种基于神经网络的实时回声消除方法
CN117711417A (zh) * 2024-02-05 2024-03-15 武汉大学 一种基于频域自注意力网络的语音质量增强方法及系统
CN117711417B (zh) * 2024-02-05 2024-04-30 武汉大学 一种基于频域自注意力网络的语音质量增强方法及系统
CN117798654A (zh) * 2024-02-29 2024-04-02 山西漳电科学技术研究院(有限公司) 汽轮机轴系中心智能调整系统
CN117798654B (zh) * 2024-02-29 2024-05-03 山西漳电科学技术研究院(有限公司) 汽轮机轴系中心智能调整系统

Also Published As

Publication number Publication date
CN113870874A (zh) 2021-12-31

Similar Documents

Publication Publication Date Title
WO2023044961A1 (zh) 基于自注意力变换网络的多特征融合回声消除方法及系统
CN110619885B (zh) 基于深度完全卷积神经网络的生成对抗网络语音增强方法
WO2021042870A1 (zh) 语音处理的方法、装置、电子设备及计算机可读存储介质
Zhang et al. Multi-scale temporal frequency convolutional network with axial attention for speech enhancement
JP5554893B2 (ja) 音声特徴ベクトル変換方法及び装置
CN102804747B (zh) 多通道回波对消器
CN111292759A (zh) 一种基于神经网络的立体声回声消除方法及系统
CN107274908A (zh) 基于新阈值函数的小波语音去噪方法
CN110739003A (zh) 基于多头自注意力机制的语音增强方法
CN112687288B (zh) 回声消除方法、装置、电子设备和可读存储介质
CN106157964A (zh) 一种确定回声消除中系统延时的方法
CN111968658A (zh) 语音信号的增强方法、装置、电子设备和存储介质
CN114792524B (zh) 音频数据处理方法、装置、程序产品、计算机设备和介质
Kim et al. Attention Wave-U-Net for Acoustic Echo Cancellation.
CN115602184A (zh) 回声消除方法、装置、电子设备及存储介质
Watcharasupat et al. End-to-end complex-valued multidilated convolutional neural network for joint acoustic echo cancellation and noise suppression
Shu et al. Joint echo cancellation and noise suppression based on cascaded magnitude and complex mask estimation
CN111179920A (zh) 一种端到端远场语音识别方法及系统
Indenbom et al. DeepVQE: Real time deep voice quality enhancement for joint acoustic echo cancellation, noise suppression and dereverberation
WO2021147237A1 (zh) 语音信号处理方法、装置、电子设备及存储介质
CN111353258A (zh) 基于编码解码神经网络的回声抑制方法及音频装置及设备
CN109215635B (zh) 用于语音清晰度增强的宽带语音频谱倾斜度特征参数重建方法
CN110958417A (zh) 一种基于语音线索的视频通话类视频去除压缩噪声的方法
CN115295002A (zh) 一种基于交互性时频注意力机制的单通道语音增强方法
Lan et al. Research on speech enhancement algorithm of multiresolution cochleagram based on skip connection deep neural network

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21958076

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE