CN112863535A - Residual echo and noise elimination method and device - Google Patents

Residual echo and noise elimination method and device

Info

Publication number
CN112863535A
Authority
CN
China
Prior art keywords
domain signal
echo
voice
noise
frequency domain
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110008502.9A
Other languages
Chinese (zh)
Other versions
CN112863535B (en)
Inventor
李军锋
顾建军
颜永红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Acoustics CAS
Original Assignee
Institute of Acoustics CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Acoustics CAS filed Critical Institute of Acoustics CAS
Priority claimed from CN202110008502.9A
Publication of CN112863535A
Application granted
Publication of CN112863535B
Legal status: Active
Anticipated expiration


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 - Noise filtering
    • G10L21/0216 - Noise filtering characterised by the method used for estimating noise
    • G10L21/0232 - Processing in the frequency domain
    • G10L2021/02082 - Noise filtering the noise being echo, reverberation of the speech

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Cable Transmission Systems, Equalization Of Radio And Reduction Of Echo (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The embodiments of this application disclose a method and a device for eliminating residual echo and noise. The method includes: performing framing, windowing, and a Fourier transform on a received speech time-domain signal containing echo and noise and on a far-end reference time-domain signal to obtain the corresponding frequency-domain signals; determining an echo frequency-domain signal and, from it, a speech frequency-domain signal containing residual echo and noise; performing energy normalization on the magnitude spectra of the speech frequency-domain signal containing residual echo and noise, the echo frequency-domain signal, and the far-end reference frequency-domain signal to obtain the corresponding features; determining a target speech frequency-domain signal from those features and a trained cascade network; and performing an inverse Fourier transform on the target speech frequency-domain signal to obtain the target speech time-domain signal. In the embodiments, a feature attention model assigns different importance to the input features and reduces the redundant information in them, and the cascade network is trained with a multi-domain loss function, which reduces the model's sensitivity to signal energy.

Description

A Method and Device for Eliminating Residual Echo and Noise

Technical Field

The present invention relates to the field of echo and noise cancellation, and in particular to a residual echo and noise cancellation method and device.

Background

Echo cancellation removes the echo formed by the far-end reference signal from the speech signal, while speech noise reduction removes background noise and directional noise interference from it. Both techniques aim to improve speech quality and intelligibility. In echo cancellation, combining adaptive filtering based on traditional signal processing with deep-learning-based residual echo cancellation can effectively improve the generalization performance of the system.

However, in traditional methods residual echo cancellation and noise cancellation are usually performed independently, without considering the correlation between the two tasks. Multiple signal features are available in the residual echo cancellation task, each with a different physical meaning and importance, yet traditional methods ignore these differences in importance. When training residual echo and noise cancellation models, most existing techniques use the mean squared error between the target magnitude spectrum and the estimated magnitude spectrum as the loss function; this loss depends on the signal energy, so its scale differs for signals of different energies.

Summary of the Invention

In view of the above problems with existing methods, embodiments of the present application provide a residual echo and noise cancellation method and device.

In a first aspect, an embodiment of the present application provides a residual echo and noise cancellation method, including:

receiving a speech time-domain signal containing echo and noise and a far-end reference time-domain signal;

performing framing, windowing, and a Fourier transform on the speech time-domain signal containing echo and noise and on the far-end reference time-domain signal, respectively, to obtain a speech frequency-domain signal containing echo and noise and a far-end reference frequency-domain signal;

determining an echo frequency-domain signal from the speech frequency-domain signal containing echo and noise and the far-end reference frequency-domain signal;

determining a speech frequency-domain signal containing residual echo and noise from the speech frequency-domain signal containing echo and noise and the echo frequency-domain signal;

performing energy normalization on the magnitude spectra of the speech frequency-domain signal containing residual echo and noise, the echo frequency-domain signal, and the far-end reference frequency-domain signal to obtain features of the speech frequency-domain signal containing residual echo and noise, features of the echo frequency-domain signal, and features of the far-end reference frequency-domain signal;

concatenating the features of the speech frequency-domain signal containing residual echo and noise with the features of the far-end reference frequency-domain signal to obtain a first concatenation result, and concatenating the features of the speech frequency-domain signal containing residual echo and noise with the features of the echo frequency-domain signal to obtain a second concatenation result;

feeding the first concatenation result and the second concatenation result into the trained feature attention model in the trained cascade network to obtain a first attention weight corresponding to the features of the far-end reference frequency-domain signal and a second attention weight corresponding to the features of the echo frequency-domain signal;

multiplying the features of the far-end reference frequency-domain signal by the first attention weight to obtain first attention-fused features, and multiplying the features of the echo frequency-domain signal by the second attention weight to obtain second attention-fused features;

concatenating the first attention-fused features, the second attention-fused features, and the features of the speech frequency-domain signal containing residual echo and noise to obtain a first fused concatenation result;
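The attention weighting in the steps above amounts to a sigmoid gate per feature dimension: weights in (0, 1) are produced from the concatenated inputs and multiplied element-wise onto the reference and echo features before the final concatenation. A minimal numpy sketch with random, untrained stand-ins for the attention model (in the patent, one GRU layer plus one fully connected layer) follows; all weight matrices here are hypothetical:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
n_bins = 257
e_feat = rng.standard_normal(n_bins)   # speech-with-residual-echo-and-noise feature
x_feat = rng.standard_normal(n_bins)   # far-end reference feature
c_feat = rng.standard_normal(n_bins)   # echo feature

# Untrained linear stand-ins for the feature attention model.
A1 = rng.standard_normal((n_bins, 2 * n_bins)) * 0.01
A2 = rng.standard_normal((n_bins, 2 * n_bins)) * 0.01

w1 = sigmoid(A1 @ np.concatenate([e_feat, x_feat]))   # first attention weight
w2 = sigmoid(A2 @ np.concatenate([e_feat, c_feat]))   # second attention weight

# Attention-fused features concatenated with the residual-echo-and-noise feature.
fused = np.concatenate([w1 * x_feat, w2 * c_feat, e_feat])
print(fused.shape)  # (771,)
```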

feeding the first fused concatenation result into the trained residual echo and noise cancellation model in the trained cascade network to obtain a mask estimate for the target speech frequency-domain signal;

obtaining the target speech frequency-domain signal from the mask estimate and the speech frequency-domain signal containing residual echo and noise;

performing an inverse Fourier transform on the target speech frequency-domain signal to obtain the target speech time-domain signal.
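The last two steps can be sketched as applying the mask to the residual spectrum and synthesizing the waveform by per-frame inverse FFT with overlap-add. Magnitude masking on the complex spectrum and a hop size of 256 are assumptions; the patent does not fix either:

```python
import numpy as np

def apply_mask_and_istft(E, M, frame_len=512, hop=256):
    """Apply a [0,1] mask to the residual spectrum and overlap-add back
    to the time domain. Masking the complex spectrum with a real-valued
    mask is an assumed interpretation of the patent's mask estimate."""
    S = M * E                                         # target frequency-domain signal
    frames = np.fft.irfft(S, n=frame_len, axis=-1)    # per-frame inverse FFT
    out = np.zeros((len(frames) - 1) * hop + frame_len)
    for i, fr in enumerate(frames):                   # overlap-add synthesis
        out[i * hop : i * hop + frame_len] += fr
    return out

E = np.fft.rfft(np.random.randn(6, 512), axis=-1)     # residual spectrum, 6 frames
M = np.random.rand(6, 257)                            # mask estimate in [0, 1]
y = apply_mask_and_istft(E, M)
print(y.shape)  # (1792,)
```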

In one possible implementation, performing framing, windowing, and a Fourier transform on the speech time-domain signal containing echo and noise and on the far-end reference time-domain signal includes:

taking a preset number of sampling points from each of the two signals as one frame, zero-padding a frame to the preset number if it is shorter;

windowing each frame, where the window function is a Hamming window;

performing a Fourier transform on each windowed frame.
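The framing, windowing, and Fourier-transform steps above can be sketched as follows. The 512-point frame length and Hamming window follow the embodiment described later in this document; the hop size of 256 is an assumption not fixed by the text:

```python
import numpy as np

def stft_frames(x, frame_len=512, hop=256):
    """Split x into frames, zero-pad the tail, apply a Hamming window,
    and take the one-sided FFT of each frame (512 points -> 257 bins)."""
    n_frames = max(1, int(np.ceil((len(x) - frame_len) / hop)) + 1)
    pad = (n_frames - 1) * hop + frame_len - len(x)
    x = np.concatenate([x, np.zeros(pad)])            # zero-pad to a whole frame
    win = np.hamming(frame_len)                       # Hamming analysis window
    frames = np.stack([x[i * hop : i * hop + frame_len] * win
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=-1)               # one-sided spectrum

spec = stft_frames(np.random.randn(4000))
print(spec.shape)  # (15, 257)
```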

In one possible implementation, determining the echo frequency-domain signal from the speech frequency-domain signal containing echo and noise and the far-end reference frequency-domain signal includes:

feeding the speech frequency-domain signal containing echo and noise and the far-end reference frequency-domain signal into a Kalman filter to obtain the filter coefficients and the echo frequency-domain signal;

where the echo frequency-domain signal is the product of the filter coefficients and the far-end reference frequency-domain signal.

In one possible implementation, determining the speech frequency-domain signal containing residual echo and noise from the speech frequency-domain signal containing echo and noise and the echo frequency-domain signal includes:

subtracting the echo frequency-domain signal from the speech frequency-domain signal containing echo and noise to obtain the speech frequency-domain signal containing residual echo and noise.
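Given per-bin filter coefficients already estimated by the Kalman filter (the Kalman update itself is omitted here), the echo estimate and the residual signal of the two implementations above are simple element-wise operations. The random signals below are placeholders for the actual microphone and reference spectra:

```python
import numpy as np

rng = np.random.default_rng(0)
n_frames, n_bins = 10, 257

# Placeholder complex spectra: microphone signal D, far-end reference X.
D = rng.standard_normal((n_frames, n_bins)) + 1j * rng.standard_normal((n_frames, n_bins))
X = rng.standard_normal((n_frames, n_bins)) + 1j * rng.standard_normal((n_frames, n_bins))
# Per-bin filter coefficients, assumed already estimated by the Kalman filter.
W = rng.standard_normal(n_bins) + 1j * rng.standard_normal(n_bins)

C = W * X          # estimated echo: C(k,f) = W(k,f) * X(k,f)
E = D - C          # residual signal: speech + residual echo + noise
print(E.shape)     # (10, 257)
```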

In one possible implementation, performing energy normalization on the magnitude spectrum of the speech frequency-domain signal containing residual echo and noise, the magnitude spectrum of the echo frequency-domain signal, and the magnitude spectrum of the far-end reference frequency-domain signal to obtain the three sets of features includes:

determining a first function, a second function, and a third function corresponding, respectively, to the magnitude spectrum of the speech frequency-domain signal containing residual echo and noise, the magnitude spectrum of the echo frequency-domain signal, and the magnitude spectrum of the far-end reference frequency-domain signal;

determining the features of the speech frequency-domain signal containing residual echo and noise from the first function and from the mean and variance of those features;

determining the features of the echo frequency-domain signal from the second function and from the mean and variance of those features;

determining the features of the far-end reference frequency-domain signal from the third function and from the mean and variance of those features.
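The text leaves the three functions of the magnitude spectra unspecified; a common choice in speech enhancement, assumed here for illustration only, is the log-magnitude followed by mean/variance (energy) normalization:

```python
import numpy as np

def normalized_feature(spec, eps=1e-8):
    """Log-magnitude feature with per-bin mean/variance normalization.
    The log is an assumed choice for the 'function' of the magnitude
    spectrum; the patent text does not name it."""
    feat = np.log(np.abs(spec) + eps)                # function of the magnitude spectrum
    mean = feat.mean(axis=0, keepdims=True)          # per-bin mean over frames
    std = feat.std(axis=0, keepdims=True)            # per-bin standard deviation
    return (feat - mean) / (std + eps)

spec = np.fft.rfft(np.random.randn(8, 512), axis=-1)
feat = normalized_feature(spec)
print(feat.shape)  # (8, 257)
```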

In one possible implementation, the trained cascade network is obtained through the following training steps:

receiving a first speech time-domain signal containing echo and noise, a first far-end reference time-domain signal, and a first target speech time-domain signal;

performing framing, windowing, and a Fourier transform on each of the three signals to obtain a first speech frequency-domain signal containing echo and noise, a first far-end reference frequency-domain signal, and a first target speech frequency-domain signal;

determining a first echo frequency-domain signal from the first speech frequency-domain signal containing echo and noise and the first far-end reference frequency-domain signal;

determining a first speech frequency-domain signal containing residual echo and noise from the first speech frequency-domain signal containing echo and noise and the first echo frequency-domain signal;

performing energy normalization on the magnitude spectra of the first speech frequency-domain signal containing residual echo and noise, the first echo frequency-domain signal, and the first far-end reference frequency-domain signal to obtain first residual-echo-and-noise speech features, first echo features, and first far-end reference features;

concatenating the first residual-echo-and-noise speech features with the first far-end reference features to obtain a first concatenated feature, and concatenating the first residual-echo-and-noise speech features with the first echo features to obtain a second concatenated feature;

feeding the first concatenated feature and the second concatenated feature into the feature attention model in the cascade network, so as to jointly train the feature attention model and the residual echo and noise cancellation model in the cascade network, and obtaining a first weight corresponding to the first far-end reference features and a second weight corresponding to the first echo features;

multiplying the first far-end reference features by the first weight to obtain first fused features, and multiplying the first echo features by the second weight to obtain second fused features;

concatenating the first fused features, the second fused features, and the first residual-echo-and-noise speech features to obtain a first fused concatenated feature;

feeding the first fused concatenated feature into the residual echo and noise cancellation model in the cascade network to obtain a mask estimate for a second target speech frequency-domain signal;

determining the second target speech frequency-domain signal from the mask estimate and the first speech frequency-domain signal containing residual echo and noise;

determining a multi-domain loss function from at least two loss functions, where the at least two loss functions include an energy-independent magnitude-spectrum loss function and an objective speech quality assessment score loss function; the energy-independent magnitude-spectrum loss function takes the magnitude spectrum of the first target speech frequency-domain signal as its training target and is computed from the second target speech frequency-domain signal, while the objective speech quality assessment score loss function is determined with improved perceptual speech quality as its training target;

obtaining the trained cascade network by iteratively updating the model parameters to reduce the multi-domain loss function.
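One way to make the magnitude-spectrum loss energy-independent, sketched below, is to normalize the target and estimated magnitude spectra by their own RMS energy before taking the MSE; scaling a signal then leaves the loss unchanged. The objective-quality term is represented only by a hypothetical externally supplied score, and the combination weights are assumptions, since neither is specified by the text (in practice a differentiable surrogate of the quality score would be needed for gradient training):

```python
import numpy as np

def energy_independent_mag_loss(S_est, S_tgt, eps=1e-8):
    """MSE between magnitude spectra after normalizing each by its own
    RMS energy, so the loss scale does not depend on signal level."""
    a = np.abs(S_est)
    b = np.abs(S_tgt)
    a = a / (np.sqrt(np.mean(a ** 2)) + eps)
    b = b / (np.sqrt(np.mean(b ** 2)) + eps)
    return np.mean((a - b) ** 2)

def multi_domain_loss(S_est, S_tgt, quality_score, alpha=1.0, beta=0.1):
    """Weighted sum of the energy-independent magnitude loss and a
    (hypothetical, externally supplied) objective quality score;
    alpha and beta are illustrative weights, not from the patent."""
    return alpha * energy_independent_mag_loss(S_est, S_tgt) - beta * quality_score

S = np.fft.rfft(np.random.randn(4, 512), axis=-1)
# Rescaling the estimate by 3x leaves the magnitude loss essentially at zero.
print(round(energy_independent_mag_loss(3.0 * S, S), 8))  # 0.0
```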

In a second aspect, an embodiment of the present application further provides a residual echo and noise cancellation device, including:

a receiving module, configured to receive a speech time-domain signal containing echo and noise and a far-end reference time-domain signal;

a processing module, configured to perform framing, windowing, and a Fourier transform on the speech time-domain signal containing echo and noise and on the far-end reference time-domain signal, respectively, to obtain a speech frequency-domain signal containing echo and noise and a far-end reference frequency-domain signal;

a determining module, configured to determine an echo frequency-domain signal from the speech frequency-domain signal containing echo and noise and the far-end reference frequency-domain signal;

the determining module being further configured to determine a speech frequency-domain signal containing residual echo and noise from the speech frequency-domain signal containing echo and noise and the echo frequency-domain signal;

an energy normalization module, configured to perform energy normalization on the magnitude spectra of the speech frequency-domain signal containing residual echo and noise, the echo frequency-domain signal, and the far-end reference frequency-domain signal to obtain the three corresponding sets of features;

a concatenation module, configured to concatenate the features of the speech frequency-domain signal containing residual echo and noise with the features of the far-end reference frequency-domain signal to obtain a first concatenation result, and to concatenate the features of the speech frequency-domain signal containing residual echo and noise with the features of the echo frequency-domain signal to obtain a second concatenation result;

a weight obtaining module, configured to feed the first concatenation result and the second concatenation result into the trained feature attention model in the trained cascade network to obtain a first attention weight corresponding to the features of the far-end reference frequency-domain signal and a second attention weight corresponding to the features of the echo frequency-domain signal;

an attention-fused feature obtaining module, configured to multiply the features of the far-end reference frequency-domain signal by the first attention weight to obtain first attention-fused features, and to multiply the features of the echo frequency-domain signal by the second attention weight to obtain second attention-fused features;

the concatenation module being further configured to concatenate the first attention-fused features, the second attention-fused features, and the features of the speech frequency-domain signal containing residual echo and noise to obtain a first fused concatenation result;

a mask estimate obtaining module, configured to feed the first fused concatenation result into the trained residual echo and noise cancellation model in the trained cascade network to obtain a mask estimate for the target speech frequency-domain signal;

a target speech frequency-domain signal obtaining module, configured to obtain the target speech frequency-domain signal from the mask estimate and the speech frequency-domain signal containing residual echo and noise;

an inverse Fourier transform module, configured to perform an inverse Fourier transform on the target speech frequency-domain signal to obtain the target speech time-domain signal.

In a third aspect, an embodiment of the present application further provides a residual echo and noise cancellation device including at least one processor configured to execute a program stored in a memory; when the program is executed, the device performs the steps of the first aspect and of its possible implementations.

In a fourth aspect, an embodiment of the present application further provides a non-transitory computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the first aspect and of its possible implementations.

A beneficial effect of the embodiments of the present application is that residual echo and noise need not be eliminated separately: both are eliminated in a single pass by the trained cascade network. The trained feature attention model assigns different importance to the input features, reducing redundant information in them and improving the cascade network's residual echo and noise cancellation performance. Training the cascade network with a multi-domain loss function that combines an energy-independent magnitude-spectrum loss and an objective speech quality assessment score loss reduces the model's sensitivity to signal energy and improves the perceptual quality of the output speech.

Brief Description of the Drawings

To describe the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present application; those of ordinary skill in the art can derive other drawings from them without creative effort.

Fig. 1 is a schematic diagram of the process of training a cascade network capable of eliminating residual echo and noise according to an embodiment of the present application;

Fig. 2 is a schematic flowchart of eliminating residual echo and noise using the trained cascade network according to an embodiment of the present application;

Fig. 3 is a schematic structural diagram of a residual echo and noise cancellation device according to an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application are described below with reference to the accompanying drawings. The following embodiments are intended only to illustrate the technical solutions of the present application more clearly, and do not limit its scope of protection.

In traditional methods, residual echo cancellation and noise cancellation are usually performed independently, and the residual echo suppression task does not take into account the individual importance of the multiple available signal features. When training residual echo and noise cancellation models, the mean squared error between the target and estimated magnitude spectra is mostly used as the loss function, and this loss depends on the signal energy. The present application therefore proposes a residual echo and noise cancellation method and device that assigns different importance to the multiple signal features and reduces the redundant information among them, while training the cascade network with a multi-domain loss function to reduce the model's sensitivity to signal energy.

In an embodiment of the present application, the process of training a cascade network capable of eliminating residual echo and noise is shown in Fig. 1 and comprises steps S101-S113, where the cascade network includes a feature attention model and a residual echo and noise cancellation model.

The feature attention model consists of one gated recurrent unit (GRU) layer followed by one fully connected layer. The GRU layer has 200 hidden nodes; the fully connected output layer has 257 nodes, and each neuron uses the sigmoid activation function.

The residual echo and noise cancellation model consists of two GRU layers followed by one fully connected layer. The GRU layers contain 400 hidden nodes; the fully connected output layer has 257 nodes, and each neuron uses the sigmoid activation function.
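The shapes of the cancellation model described above can be sketched with random, untrained weights as follows; a standard GRU cell is assumed, and only the dimensions (two 400-node GRU layers feeding a 257-node sigmoid output layer) follow the text:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_layer(xs, h_dim, rng):
    """One GRU layer run over a sequence of frames (standard GRU equations,
    randomly initialized for illustration)."""
    in_dim = xs.shape[1]
    Wz, Wr, Wh = (rng.standard_normal((h_dim, in_dim)) * 0.01 for _ in range(3))
    Uz, Ur, Uh = (rng.standard_normal((h_dim, h_dim)) * 0.01 for _ in range(3))
    h = np.zeros(h_dim)
    out = []
    for x in xs:
        z = sigmoid(Wz @ x + Uz @ h)                 # update gate
        r = sigmoid(Wr @ x + Ur @ h)                 # reset gate
        h_new = np.tanh(Wh @ x + Uh @ (r * h))       # candidate state
        h = (1 - z) * h + z * h_new
        out.append(h)
    return np.stack(out)

rng = np.random.default_rng(0)
feats = rng.standard_normal((10, 3 * 257))           # 10 frames of concatenated features
h = gru_layer(gru_layer(feats, 400, rng), 400, rng)  # two GRU layers, 400 hidden nodes
Wo = rng.standard_normal((257, 400)) * 0.01
mask = sigmoid(h @ Wo.T)                             # 257-node sigmoid output layer
print(mask.shape)  # (10, 257)
```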

S101,接收第一含有回声及噪声的语音时域信号、第一远端参考声时域信号和第一目标语音时域信号。其中,上述第一远端参考声时域信号经过非线性变换再与相应房间传递函数卷积形成上述第一含有回声及噪声的语音时域信号中的回声时域信号。S101: Receive a first voice time domain signal containing echo and noise, a first remote reference acoustic time domain signal, and a first target voice time domain signal. Wherein, the first remote reference acoustic time domain signal is nonlinearly transformed and then convolved with a corresponding room transfer function to form an echo time domain signal in the first voice time domain signal containing echo and noise.

S102: Frame and window the received first speech time-domain signal containing echo and noise, the first far-end reference time-domain signal, and the first target speech time-domain signal. Specifically, for each of the three signals, 512 samples are taken as one frame; a frame shorter than 512 samples is first zero-padded to 512 points. Each frame is then windowed with a Hamming window. A Fourier transform is applied to each windowed frame, yielding the first speech frequency-domain signal containing echo and noise, the first far-end reference frequency-domain signal, and the first target speech frequency-domain signal.
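The framing described in S102 can be sketched as follows. This is a minimal NumPy illustration; the frame shift (here 256 samples, i.e. 50% overlap) is not specified in the text and is an assumption of this sketch:

```python
import numpy as np

def frames_to_spectra(signal, frame_len=512, hop=256):
    """Split a time-domain signal into 512-sample frames (zero-padding the
    tail), apply a Hamming window, and take the FFT of each frame."""
    n_frames = max(1, int(np.ceil((len(signal) - frame_len) / hop)) + 1)
    pad = (n_frames - 1) * hop + frame_len - len(signal)
    padded = np.concatenate([signal, np.zeros(pad)])  # zero-fill to a 512-point boundary
    window = np.hamming(frame_len)
    spectra = np.array([
        np.fft.rfft(padded[i * hop: i * hop + frame_len] * window)
        for i in range(n_frames)
    ])
    return spectra  # shape (n_frames, 257)

spec = frames_to_spectra(np.random.default_rng(0).standard_normal(4000))
print(spec.shape)  # (15, 257)
```

A 512-point real FFT yields 257 unique frequency bins, which is exactly the 257 output nodes of the two network models described above.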

S103: Input the first speech frequency-domain signal containing echo and noise and the first far-end reference frequency-domain signal into a Kalman filter, which estimates the first filter coefficients and the first echo frequency-domain signal in real time. The first echo frequency-domain signal estimated by the Kalman filter is:

C(k,f) = W(k,f) * X(k,f)

where W(k,f) is the first filter coefficient, X(k,f) is the far-end reference frequency-domain signal, and k and f denote the k-th frame and the frequency f respectively.

S104: Subtract the first echo frequency-domain signal from the first speech frequency-domain signal containing echo and noise, obtaining the first speech frequency-domain signal containing residual echo and noise, which serves as the output of the Kalman filter stage:

E(k,f) = Y(k,f) - C(k,f)

where Y(k,f) is the first speech frequency-domain signal containing echo and noise, C(k,f) is the first echo frequency-domain signal, and k and f denote the k-th frame and the frequency f respectively.
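Steps S103-S104 (estimate the echo as a filtered reference, then subtract it from the microphone spectrum) can be sketched with a simplified per-bin adaptive filter. Note that this substitutes a one-tap NLMS update for the patent's Kalman filter, which additionally tracks state and observation noise covariances; the function name and the step size mu are assumptions of this sketch:

```python
import numpy as np

def adaptive_echo_cancel(Y, X, mu=0.5, eps=1e-8):
    """Per-frequency adaptive estimate of the echo C(k,f) = W(k,f) * X(k,f)
    and of the error signal E(k,f) = Y(k,f) - C(k,f).  A one-tap NLMS
    update stands in here for the Kalman recursion of S103."""
    n_frames, n_bins = Y.shape
    W = np.zeros(n_bins, dtype=complex)   # filter coefficient per frequency bin
    E = np.empty_like(Y)
    for k in range(n_frames):
        C = W * X[k]                      # echo estimate
        E[k] = Y[k] - C                   # near-end speech + residual echo + noise
        W += mu * np.conj(X[k]) * E[k] / (np.abs(X[k]) ** 2 + eps)  # NLMS step
    return E, W

# Synthetic check: the echo is a fixed scaling of the reference, no near-end speech
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 257)) + 1j * rng.standard_normal((200, 257))
Y = 0.8 * X
E, W = adaptive_echo_cancel(Y, X)
print(np.mean(np.abs(E[-10:])) < np.mean(np.abs(E[:10])))  # True: the error decays
```

On this synthetic example the filter converges to the true echo path (0.8 per bin), so the error spectrum E(k,f) decays toward zero over the frames.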

S105: Perform energy normalization on the magnitude spectrum of the first speech frequency-domain signal containing residual echo and noise, the magnitude spectrum of the first echo frequency-domain signal, and the magnitude spectrum of the first far-end reference frequency-domain signal, obtaining the feature gFD(f(|E(k,f)|)) of the first speech frequency-domain signal containing residual echo and noise, the feature gFD(f(|C(k,f)|)) of the first echo frequency-domain signal, and the feature gFD(f(|X(k,f)|)) of the first far-end reference frequency-domain signal, where a feature v(k,f) is normalized by its running mean and variance:

gFD(v(k,f)) = (v(k,f) - μv(k,f)) / √(σ²v(k,f))

The mean and variance of the feature of the first speech frequency-domain signal containing residual echo and noise are defined recursively as:

μf(e)(k,f) = c1·μf(e)(k-1,f) + (1-c1)·f(|E(k,f)|)

σ²f(e)(k,f) = c1·σ²f(e)(k-1,f) + (1-c1)·(f(|E(k,f)|) - μf(e)(k,f))²

gFD(f(|E(k,f)|)) = (f(|E(k,f)|) - μf(e)(k,f)) / √(σ²f(e)(k,f))

The mean and variance of the feature of the first echo frequency-domain signal are defined as:

μf(c)(k,f) = c1·μf(c)(k-1,f) + (1-c1)·f(|C(k,f)|)

σ²f(c)(k,f) = c1·σ²f(c)(k-1,f) + (1-c1)·(f(|C(k,f)|) - μf(c)(k,f))²

gFD(f(|C(k,f)|)) = (f(|C(k,f)|) - μf(c)(k,f)) / √(σ²f(c)(k,f))

The mean and variance of the feature of the first far-end reference frequency-domain signal are defined as:

μf(x)(k,f) = c1·μf(x)(k-1,f) + (1-c1)·f(|X(k,f)|)

σ²f(x)(k,f) = c1·σ²f(x)(k-1,f) + (1-c1)·(f(|X(k,f)|) - μf(x)(k,f))²

Here |E(k,f)|, |C(k,f)| and |X(k,f)| denote the magnitude spectra of the first speech frequency-domain signal containing residual echo and noise, the first echo frequency-domain signal and the first far-end reference frequency-domain signal respectively, and c1 is a preset constant (forgetting factor).
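The frame-recursive normalization of S105 can be sketched as follows (NumPy). The initial values of the running statistics and the small eps guarding the division are assumptions of this sketch, since the text does not specify them:

```python
import numpy as np

def energy_normalize(feat, c1=0.99, eps=1e-8):
    """Frame-recursive mean/variance normalization of a feature sequence
    feat[k, f] (e.g. f(|E(k,f)|)):
        mu(k,f)     = c1*mu(k-1,f)     + (1-c1)*feat(k,f)
        sigma2(k,f) = c1*sigma2(k-1,f) + (1-c1)*(feat(k,f) - mu(k,f))**2
        g(k,f)      = (feat(k,f) - mu(k,f)) / sqrt(sigma2(k,f) + eps)
    """
    mu = np.zeros(feat.shape[1])
    sigma2 = np.ones(feat.shape[1])
    out = np.empty_like(feat)
    for k in range(feat.shape[0]):
        mu = c1 * mu + (1 - c1) * feat[k]
        sigma2 = c1 * sigma2 + (1 - c1) * (feat[k] - mu) ** 2
        out[k] = (feat[k] - mu) / np.sqrt(sigma2 + eps)
    return out

# A feature track with a large constant offset is mapped to roughly zero mean
feat = 5.0 + 0.1 * np.random.default_rng(0).standard_normal((2000, 257))
g = energy_normalize(feat)
print(g.shape)  # (2000, 257)
```

Because the statistics are tracked per frequency and per frame, the normalized features no longer depend on the absolute level of the input, which is the stated point of the energy normalization.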

S106: Splice the feature of the first speech frequency-domain signal containing residual echo and noise with the feature of the first far-end reference frequency-domain signal to obtain a first spliced feature, and splice the feature of the first speech frequency-domain signal containing residual echo and noise with the feature of the first echo frequency-domain signal to obtain a second spliced feature.

S107: Input the first spliced feature and the second spliced feature into the feature attention model of the cascade network, so as to jointly train the feature attention model and the residual echo and noise cancellation model, obtaining a first weight α(k,f) corresponding to the feature of the first far-end reference frequency-domain signal and a second weight β(k,f) corresponding to the feature of the first echo frequency-domain signal.

S108: Multiply the feature of the first far-end reference frequency-domain signal by the first weight, obtaining the first fused feature:

Xatt(k,f) = gFD(f(|X(k,f)|)) · α(k,f)

and multiply the feature of the first echo frequency-domain signal by the second weight, obtaining the second fused feature:

Catt(k,f) = gFD(f(|C(k,f)|)) · β(k,f)
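Steps S106-S108, together with the splicing of S109, reduce to a few array operations per frame. In the sketch below the attention weights α and β are random Sigmoid outputs standing in for the trained feature attention model, and all variable names are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n_bins = 257
e_feat = rng.standard_normal(n_bins)   # g_FD(f(|E(k,f)|))
x_feat = rng.standard_normal(n_bins)   # g_FD(f(|X(k,f)|))
c_feat = rng.standard_normal(n_bins)   # g_FD(f(|C(k,f)|))

# S106: the two spliced inputs of the feature attention model
first_splice = np.concatenate([e_feat, x_feat])    # 514-dim
second_splice = np.concatenate([e_feat, c_feat])   # 514-dim

# S107: the attention model maps each splice to a 257-dim weight in (0, 1);
# random Sigmoid outputs stand in for the trained network here
alpha = 1.0 / (1.0 + np.exp(-rng.standard_normal(n_bins)))
beta = 1.0 / (1.0 + np.exp(-rng.standard_normal(n_bins)))

# S108: element-wise fusion X_att = x_feat * alpha, C_att = c_feat * beta
x_att = x_feat * alpha
c_att = c_feat * beta

# S109: splice into the input of the residual echo and noise cancellation model
fused = np.concatenate([x_att, c_att, e_feat])
print(fused.shape)  # (771,)
```

Because α and β lie in (0, 1), the fusion can only attenuate each frequency bin of the reference and echo features, which is how the attention mechanism down-weights redundant information before the cancellation model sees it.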

S109: Splice the first fused feature Xatt(k,f), the second fused feature Catt(k,f), and the feature gFD(f(|E(k,f)|)) of the first speech frequency-domain signal containing residual echo and noise, obtaining a first fused spliced feature.

S110: Input the first fused spliced feature into the residual echo and noise cancellation model of the cascade network, whose output is the masking estimate G(k,f) of the second target speech frequency-domain signal.

S111: Enhance the first speech frequency-domain signal containing residual echo and noise with the masking estimate G(k,f) of the second target speech frequency-domain signal, obtaining the second target speech frequency-domain signal:

Ŝ(k,f) = G(k,f) · E(k,f)

S112: Determine a multi-domain loss function from at least two loss functions. For example, taking the magnitude spectrum of the first target speech frequency-domain signal as the training target, an energy-independent magnitude-spectrum loss function LMag is determined from the second target speech frequency-domain signal. Taking improved perceptual speech quality as the training target, an objective speech-quality-assessment score loss function LQ is determined, where S(k,f) is the first target speech frequency-domain signal and Ŝ(k,f) is the second target speech frequency-domain signal. The energy-independent magnitude-spectrum loss function and the objective speech-quality-assessment score loss function are added with weighting, giving the multi-domain loss function:

L = LMag + λ·LQ

where λ is a preset constant.
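The two-term structure of the loss can be sketched as follows. The exact formulas for the energy-independent magnitude-spectrum loss and the objective speech-quality score loss are published as equation images and are not reproduced in the text, so both terms below are assumed stand-ins (a magnitude MSE after per-utterance energy normalization, and a log-spectral distance in place of a differentiable PESQ-style score); only the weighted combination L = LMag + λ·LQ follows the text:

```python
import numpy as np

def magnitude_loss(S, S_hat, eps=1e-8):
    """Assumed stand-in for the energy-independent magnitude-spectrum loss:
    MSE between magnitude spectra after normalizing each by its own energy,
    so that scaling either signal leaves the loss unchanged."""
    a = np.abs(S) / (np.linalg.norm(np.abs(S)) + eps)
    b = np.abs(S_hat) / (np.linalg.norm(np.abs(S_hat)) + eps)
    return np.mean((a - b) ** 2)

def quality_loss(S, S_hat):
    """Placeholder for the objective speech-quality score term; a simple
    log-spectral distance stands in for a differentiable quality score."""
    return np.mean((np.log1p(np.abs(S)) - np.log1p(np.abs(S_hat))) ** 2)

def multi_domain_loss(S, S_hat, lam=0.1):
    """Weighted combination L = L_Mag + lambda * L_Q (lambda is preset)."""
    return magnitude_loss(S, S_hat) + lam * quality_loss(S, S_hat)

rng = np.random.default_rng(0)
S = rng.standard_normal((100, 257)) + 1j * rng.standard_normal((100, 257))
# Scaling the estimate does not change the energy-independent term
print(np.isclose(magnitude_loss(S, 10.0 * S), 0.0))  # True
```

The check at the end illustrates the stated property of the first term: it is insensitive to the absolute energy of the signals, which is what decouples training from the recording level.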

S113: The multi-domain loss function is reduced through continued iteration of the model parameters, yielding the trained cascade network, which includes the trained feature attention model and the trained residual echo and noise cancellation model.

In an embodiment of the present application, the flow of using the trained cascade network to cancel residual echo and noise is shown schematically in FIG. 2 and comprises steps S201-S212.

S201: Receive a speech time-domain signal containing echo and noise and a far-end reference time-domain signal. The far-end reference time-domain signal, after a nonlinear transformation followed by convolution with the corresponding room transfer function, forms the echo time-domain signal contained in the speech time-domain signal containing echo and noise.

S202: Frame and window the speech time-domain signal containing echo and noise and the far-end reference time-domain signal respectively. Specifically, for each of the two signals, 512 samples are taken as one frame; a frame shorter than 512 samples is first zero-padded to 512 points. Each frame is then windowed with a Hamming window. A Fourier transform is applied to each windowed frame, yielding the speech frequency-domain signal containing echo and noise and the far-end reference frequency-domain signal.

S203: Input the speech frequency-domain signal containing echo and noise and the far-end reference frequency-domain signal into a Kalman filter, which estimates the filter coefficients and the echo frequency-domain signal in real time. The echo frequency-domain signal estimated by the Kalman filter is:

C3(k,f) = W3(k,f) * X3(k,f)

where W3(k,f) is the filter coefficient, X3(k,f) is the far-end reference frequency-domain signal, and k and f denote the k-th frame and the frequency f respectively.

S204: Subtract the echo frequency-domain signal from the speech frequency-domain signal containing echo and noise, obtaining the speech frequency-domain signal containing residual echo and noise:

E3(k,f) = Y3(k,f) - C3(k,f)

where Y3(k,f) is the speech frequency-domain signal containing echo and noise, C3(k,f) is the echo frequency-domain signal, and k and f denote the k-th frame and the frequency f respectively.

S205: Perform energy normalization on the magnitude spectrum of the speech frequency-domain signal containing residual echo and noise, the magnitude spectrum of the echo frequency-domain signal, and the magnitude spectrum of the far-end reference frequency-domain signal, obtaining the feature gFD(f(|E3(k,f)|)) of the speech frequency-domain signal containing residual echo and noise, the feature gFD(f(|C3(k,f)|)) of the echo frequency-domain signal, and the feature gFD(f(|X3(k,f)|)) of the far-end reference frequency-domain signal. The means and variances of the three features are defined by the same recursions as in S105, with E3, C3 and X3 in place of E, C and X and with the constant c2 in place of c1. Here |E3(k,f)|, |C3(k,f)| and |X3(k,f)| denote the magnitude spectra of the speech frequency-domain signal containing residual echo and noise, the echo frequency-domain signal and the far-end reference frequency-domain signal respectively, and c2 is a preset constant.

S206: Splice the feature of the speech frequency-domain signal containing residual echo and noise with the feature of the far-end reference frequency-domain signal to obtain a first splicing result, and splice the feature of the speech frequency-domain signal containing residual echo and noise with the feature of the echo frequency-domain signal to obtain a second splicing result.

S207: Input the first splicing result and the second splicing result into the trained cascade network, that is, into the trained feature attention model, obtaining a first attention weight α3(k,f) corresponding to the feature of the far-end reference frequency-domain signal and a second attention weight β3(k,f) corresponding to the feature of the echo frequency-domain signal.

S208: Multiply the feature of the far-end reference frequency-domain signal by the first attention weight α3(k,f), obtaining the first fused attention feature:

X3,att(k,f) = gFD(f(|X3(k,f)|)) · α3(k,f)

and multiply the feature of the echo frequency-domain signal by the second attention weight β3(k,f), obtaining the second fused attention feature:

C3,att(k,f) = gFD(f(|C3(k,f)|)) · β3(k,f)

S209: Splice the first fused attention feature, the second fused attention feature, and the feature of the speech frequency-domain signal containing residual echo and noise, obtaining a first fused splicing result.

S210: Input the first fused splicing result into the trained residual echo and noise cancellation model of the trained cascade network, obtaining the masking estimate G3(k,f) of the target speech frequency-domain signal.

S211: Multiply the masking estimate G3(k,f) of the target speech frequency-domain signal by the speech frequency-domain signal E3(k,f) containing residual echo and noise, obtaining the target speech frequency-domain signal:

Ŝ3(k,f) = G3(k,f) · E3(k,f)

S212: Apply an inverse Fourier transform to the target speech frequency-domain signal to obtain the target speech time-domain signal.
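S212's inverse transform, paired with the framing of S202, can be sketched as an overlap-add reconstruction. The 50% hop and the window-compensation scheme (dividing by the summed squared window) are assumptions of this sketch; the text only states framing, Hamming windowing, FFT, and inverse FFT:

```python
import numpy as np

def overlap_add(spectra, frame_len=512, hop=256):
    """Inverse-FFT each frame of the enhanced spectrum and overlap-add the
    frames back into a time-domain signal, compensating for the gain of the
    overlapped analysis/synthesis windows."""
    n_frames = spectra.shape[0]
    out = np.zeros((n_frames - 1) * hop + frame_len)
    wsum = np.zeros_like(out)
    window = np.hamming(frame_len)
    for i, spec in enumerate(spectra):
        frame = np.fft.irfft(spec, n=frame_len)
        out[i * hop: i * hop + frame_len] += frame * window
        wsum[i * hop: i * hop + frame_len] += window ** 2
    return out / np.maximum(wsum, 1e-8)

# Round trip: analyze with the S202 framing, resynthesize, compare the interior
sig = np.sin(2 * np.pi * np.arange(4096) * 440 / 16000)
window = np.hamming(512)
frames = np.array([np.fft.rfft(sig[i:i + 512] * window)
                   for i in range(0, 4096 - 512 + 1, 256)])
rec = overlap_add(frames)
err = np.max(np.abs(rec[512:3072] - sig[512:3072]))
print(err < 1e-6)  # True: near-perfect reconstruction away from the edges
```

With no spectral modification the round trip is exact to floating-point precision; in the actual system the spectra would first be multiplied by the mask G3(k,f) before resynthesis.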

In the embodiments of the present application, the residual echo and the noise need not be cancelled separately; instead, they are cancelled in one pass by the trained cascade network. The trained feature attention model assigns different importance to the input features, reducing the redundant information among them and improving the cascade network's residual echo and noise cancellation performance. Training the cascade network with a multi-domain loss function that combines the energy-independent magnitude-spectrum loss function and the objective speech-quality-assessment score loss function reduces the model's sensitivity to signal energy and improves the perceptual quality of the output speech.

An embodiment of the present application provides a residual echo and noise cancellation device, whose structure is shown schematically in FIG. 3 and which includes:

a receiving module 301, a processing module 302, a determination module 303, an energy normalization module 304, a splicing module 305, a weight obtaining module 306, a fused attention feature obtaining module 307, a masking estimate obtaining module 308, a target speech frequency-domain signal obtaining module 309, and an inverse Fourier transform module 310;

the receiving module 301 is configured to receive a speech time-domain signal containing echo and noise and a far-end reference time-domain signal;

the processing module 302 is configured to frame, window, and Fourier-transform the speech time-domain signal containing echo and noise and the far-end reference time-domain signal respectively, obtaining a speech frequency-domain signal containing echo and noise and a far-end reference frequency-domain signal;

the determination module 303 is configured to determine an echo frequency-domain signal from the speech frequency-domain signal containing echo and noise and the far-end reference frequency-domain signal;

the determination module 303 is further configured to determine a speech frequency-domain signal containing residual echo and noise from the speech frequency-domain signal containing echo and noise and the echo frequency-domain signal;

the energy normalization module 304 is configured to perform energy normalization on the magnitude spectrum of the speech frequency-domain signal containing residual echo and noise, the magnitude spectrum of the echo frequency-domain signal, and the magnitude spectrum of the far-end reference frequency-domain signal, obtaining the feature of the speech frequency-domain signal containing residual echo and noise, the feature of the echo frequency-domain signal, and the feature of the far-end reference frequency-domain signal;

the splicing module 305 is configured to splice the feature of the speech frequency-domain signal containing residual echo and noise with the feature of the far-end reference frequency-domain signal to obtain a first splicing result, and to splice the feature of the speech frequency-domain signal containing residual echo and noise with the feature of the echo frequency-domain signal to obtain a second splicing result;

the weight obtaining module 306 is configured to input the first splicing result and the second splicing result into the trained feature attention model of the trained cascade network, obtaining a first attention weight corresponding to the feature of the far-end reference frequency-domain signal and a second attention weight corresponding to the feature of the echo frequency-domain signal;

the fused attention feature obtaining module 307 is configured to multiply the feature of the far-end reference frequency-domain signal by the first attention weight to obtain a first fused attention feature, and to multiply the feature of the echo frequency-domain signal by the second attention weight to obtain a second fused attention feature;

the splicing module 305 is further configured to splice the first fused attention feature, the second fused attention feature, and the feature of the speech frequency-domain signal containing residual echo and noise to obtain a first fused splicing result;

the masking estimate obtaining module 308 is configured to input the first fused splicing result into the trained residual echo and noise cancellation model of the trained cascade network, obtaining a masking estimate of the target speech frequency-domain signal;

the target speech frequency-domain signal obtaining module 309 is configured to obtain the target speech frequency-domain signal from the masking estimate of the target speech frequency-domain signal and the speech frequency-domain signal containing residual echo and noise;

the inverse Fourier transform module 310 is configured to apply an inverse Fourier transform to the target speech frequency-domain signal to obtain the target speech time-domain signal.

An embodiment of the present application provides a residual echo and noise cancellation device comprising at least one processor configured to execute a program stored in a memory; when the program is executed, the device performs the following steps:

receiving a speech time-domain signal containing echo and noise and a far-end reference time-domain signal;

framing, windowing, and Fourier-transforming the speech time-domain signal containing echo and noise and the far-end reference time-domain signal respectively, obtaining a speech frequency-domain signal containing echo and noise and a far-end reference frequency-domain signal;

determining an echo frequency-domain signal from the speech frequency-domain signal containing echo and noise and the far-end reference frequency-domain signal;

determining a speech frequency-domain signal containing residual echo and noise from the speech frequency-domain signal containing echo and noise and the echo frequency-domain signal;

performing energy normalization on the magnitude spectrum of the speech frequency-domain signal containing residual echo and noise, the magnitude spectrum of the echo frequency-domain signal, and the magnitude spectrum of the far-end reference frequency-domain signal, obtaining the feature of the speech frequency-domain signal containing residual echo and noise, the feature of the echo frequency-domain signal, and the feature of the far-end reference frequency-domain signal;

splicing the feature of the speech frequency-domain signal containing residual echo and noise with the feature of the far-end reference frequency-domain signal to obtain a first splicing result, and splicing the feature of the speech frequency-domain signal containing residual echo and noise with the feature of the echo frequency-domain signal to obtain a second splicing result;

inputting the first splicing result and the second splicing result into the trained feature attention model of the trained cascade network, obtaining a first attention weight corresponding to the feature of the far-end reference frequency-domain signal and a second attention weight corresponding to the feature of the echo frequency-domain signal;

multiplying the feature of the far-end reference frequency-domain signal by the first attention weight to obtain a first fused attention feature, and multiplying the feature of the echo frequency-domain signal by the second attention weight to obtain a second fused attention feature;

splicing the first fused attention feature, the second fused attention feature, and the feature of the speech frequency-domain signal containing residual echo and noise to obtain a first fused splicing result;

inputting the first fused splicing result into the trained residual echo and noise cancellation model of the trained cascade network, obtaining a masking estimate of the target speech frequency-domain signal;

obtaining the target speech frequency-domain signal from the masking estimate of the target speech frequency-domain signal and the speech frequency-domain signal containing residual echo and noise;

applying an inverse Fourier transform to the target speech frequency-domain signal to obtain the target speech time-domain signal.

An embodiment of the present application provides a non-transitory computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the following steps are implemented:

receiving a speech time-domain signal containing echo and noise and a far-end reference time-domain signal;

framing, windowing, and Fourier-transforming the speech time-domain signal containing echo and noise and the far-end reference time-domain signal respectively, obtaining a speech frequency-domain signal containing echo and noise and a far-end reference frequency-domain signal;

determining an echo frequency-domain signal from the speech frequency-domain signal containing echo and noise and the far-end reference frequency-domain signal;

determining a speech frequency-domain signal containing residual echo and noise from the speech frequency-domain signal containing echo and noise and the echo frequency-domain signal;

performing energy normalization on the magnitude spectrum of the speech frequency-domain signal containing residual echo and noise, the magnitude spectrum of the echo frequency-domain signal, and the magnitude spectrum of the far-end reference frequency-domain signal, obtaining the feature of the speech frequency-domain signal containing residual echo and noise, the feature of the echo frequency-domain signal, and the feature of the far-end reference frequency-domain signal;

splicing the feature of the speech frequency-domain signal containing residual echo and noise with the feature of the far-end reference frequency-domain signal to obtain a first splicing result, and splicing the feature of the speech frequency-domain signal containing residual echo and noise with the feature of the echo frequency-domain signal to obtain a second splicing result;

inputting the first splicing result and the second splicing result into the trained feature attention model of the trained cascade network, obtaining a first attention weight corresponding to the feature of the far-end reference frequency-domain signal and a second attention weight corresponding to the feature of the echo frequency-domain signal;

multiplying the feature of the far-end reference frequency-domain signal by the first attention weight to obtain a first fused attention feature, and multiplying the feature of the echo frequency-domain signal by the second attention weight to obtain a second fused attention feature;

splicing the first fused attention feature, the second fused attention feature, and the feature of the speech frequency-domain signal containing residual echo and noise to obtain a first fused splicing result;

inputting the first fused splicing result into the trained residual echo and noise cancellation model of the trained cascade network, obtaining a masking estimate of the target speech frequency-domain signal;

obtaining the target speech frequency-domain signal from the masking estimate of the target speech frequency-domain signal and the speech frequency-domain signal containing residual echo and noise;

applying an inverse Fourier transform to the target speech frequency-domain signal to obtain the target speech time-domain signal.

The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement it without creative effort.

通过以上的实施方式的描述,本领域的技术人员可以清楚地了解到各实施方式可借助软件加必需的通用硬件平台的方式来实现,当然也可以通过硬件。基于这样的理解,上述技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来,该计算机软件产品可以存储在计算机可读存储介质中,如ROM/RAM、磁碟、光盘等,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行各个实施例或者实施例的某些部分所述的方法。From the description of the above embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by means of software plus a necessary general hardware platform, and certainly can also be implemented by hardware. Based on this understanding, the above-mentioned technical solutions can be embodied in the form of software products in essence or the parts that make contributions to the prior art, and the computer software products can be stored in computer-readable storage media, such as ROM/RAM, magnetic A disc, an optical disc, etc., includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform the methods described in various embodiments or some parts of the embodiments.

应说明的是:以上实施例仅用以说明本申请的技术方案,而非对其限制;尽管参照前述实施例对本申请进行了详细的说明,本领域的普通技术人员应当理解:其依然可以对前述各实施例所记载的技术方案进行修改,或者对其中部分技术特征进行等同替换;而这些修改或者替换,并不使相应技术方案的本质脱离本申请各实施例技术方案的精神和范围。It should be noted that: the above embodiments are only used to illustrate the technical solutions of the present application, but not to limit them; although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand: it can still be used for The technical solutions described in the foregoing embodiments are modified, or some technical features thereof are equivalently replaced; and these modifications or replacements do not make the essence of the corresponding technical solutions deviate from the spirit and scope of the technical solutions in the embodiments of the present application.

Claims (9)

1. A residual echo and noise cancellation method, comprising:
receiving a speech time-domain signal containing echo and noise and a far-end reference time-domain signal;
performing framing, windowing and Fourier transform on the speech time-domain signal containing echo and noise and on the far-end reference time-domain signal, respectively, to obtain a speech frequency-domain signal containing echo and noise and a far-end reference frequency-domain signal;
determining an echo frequency-domain signal from the speech frequency-domain signal containing echo and noise and the far-end reference frequency-domain signal;
determining a speech frequency-domain signal containing residual echo and noise from the speech frequency-domain signal containing echo and noise and the echo frequency-domain signal;
performing energy normalization on the amplitude spectra of the speech frequency-domain signal containing residual echo and noise, the echo frequency-domain signal and the far-end reference frequency-domain signal, to obtain a residual-echo-and-noise speech feature, an echo feature and a far-end reference feature;
concatenating the residual-echo-and-noise speech feature with the far-end reference feature to obtain a first concatenation result, and concatenating the residual-echo-and-noise speech feature with the echo feature to obtain a second concatenation result;
feeding the first concatenation result and the second concatenation result into the trained feature-attention model of the trained cascade network to obtain a first attention weight corresponding to the far-end reference feature and a second attention weight corresponding to the echo feature;
multiplying the far-end reference feature by the first attention weight to obtain a first attention-fused feature, and multiplying the echo feature by the second attention weight to obtain a second attention-fused feature;
concatenating the first attention-fused feature, the second attention-fused feature and the residual-echo-and-noise speech feature to obtain a first fused concatenation result;
feeding the first fused concatenation result into the trained residual echo and noise cancellation model of the trained cascade network to obtain a mask estimate of the target speech frequency-domain signal;
obtaining the target speech frequency-domain signal from the mask estimate and the speech frequency-domain signal containing residual echo and noise;
performing an inverse Fourier transform on the target speech frequency-domain signal to obtain a target speech time-domain signal.
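The framing, windowing and Fourier transform step of claim 1 (a preset frame length, zero-padding short frames, a Hamming window) might be sketched as follows; the 512-sample frame and 256-sample hop are illustrative assumptions, since the claim only fixes a preset number of points per frame:

```python
import numpy as np

def frame_window_fft(x, frame_len=512, hop=256):
    """Split x into overlapping frames of frame_len samples, zero-padding
    the tail to a full frame, apply a Hamming window, and FFT each frame."""
    n_frames = max(1, int(np.ceil((len(x) - frame_len) / hop)) + 1)
    pad = (n_frames - 1) * hop + frame_len - len(x)
    x = np.pad(x, (0, max(0, pad)))             # zero-pad to the preset length
    win = np.hamming(frame_len)                 # the claimed Hamming window
    frames = np.stack([x[i * hop:i * hop + frame_len] * win
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=-1)         # (n_frames, frame_len // 2 + 1)

X = frame_window_fft(np.ones(1000))
print(X.shape)  # (3, 257)
```

A 1000-sample input yields three half-overlapping 512-point frames (the last one zero-padded), each mapped to 257 one-sided frequency bins.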
2. The method according to claim 1, wherein performing framing, windowing and Fourier transform on the speech time-domain signal containing echo and noise and on the far-end reference time-domain signal comprises:
taking a preset number of sampling points of each signal as one frame, zero-padding a frame to the preset number if it is too short;
windowing each frame, the window function being a Hamming window;
applying a Fourier transform to each windowed frame.

3. The method according to claim 1, wherein determining the echo frequency-domain signal from the speech frequency-domain signal containing echo and noise and the far-end reference frequency-domain signal comprises:
feeding the speech frequency-domain signal containing echo and noise and the far-end reference frequency-domain signal into a Kalman filter to obtain filter coefficients and the echo frequency-domain signal;
the echo frequency-domain signal being the product of the filter coefficients and the far-end reference frequency-domain signal.

4. The method according to claim 1, wherein determining the speech frequency-domain signal containing residual echo and noise from the speech frequency-domain signal containing echo and noise and the echo frequency-domain signal comprises:
subtracting the echo frequency-domain signal from the speech frequency-domain signal containing echo and noise to obtain the speech frequency-domain signal containing residual echo and noise.

5. The method according to claim 1, wherein the energy normalization of the amplitude spectra of the speech frequency-domain signal containing residual echo and noise, the echo frequency-domain signal and the far-end reference frequency-domain signal comprises:
determining, from those three amplitude spectra, a corresponding first function, second function and third function, respectively;
determining the residual-echo-and-noise speech feature from the first function and the mean and variance of that feature;
determining the echo feature from the second function and the mean and variance of that feature;
determining the far-end reference feature from the third function and the mean and variance of that feature.

6. The method according to claim 1, wherein the trained cascade network is obtained by the following training steps:
receiving a first speech time-domain signal containing echo and noise, a first far-end reference time-domain signal and a first target speech time-domain signal;
performing framing, windowing and Fourier transform on each of the three signals to obtain a first speech frequency-domain signal containing echo and noise, a first far-end reference frequency-domain signal and a first target speech frequency-domain signal;
determining a first echo frequency-domain signal from the first speech frequency-domain signal containing echo and noise and the first far-end reference frequency-domain signal;
determining a first speech frequency-domain signal containing residual echo and noise from the first speech frequency-domain signal containing echo and noise and the first echo frequency-domain signal;
performing energy normalization on the amplitude spectra of the first speech frequency-domain signal containing residual echo and noise, the first echo frequency-domain signal and the first far-end reference frequency-domain signal, to obtain a first residual-echo-and-noise speech feature, a first echo feature and a first far-end reference feature;
concatenating the first residual-echo-and-noise speech feature with the first far-end reference feature to obtain a first concatenated feature, and with the first echo feature to obtain a second concatenated feature;
feeding the first concatenated feature and the second concatenated feature into the feature-attention model of the cascade network, so as to jointly train the feature-attention model and the residual echo and noise cancellation model of the cascade network, obtaining a first weight corresponding to the first far-end reference feature and a second weight corresponding to the first echo feature;
multiplying the first far-end reference feature by the first weight to obtain a first fused feature, and the first echo feature by the second weight to obtain a second fused feature;
concatenating the first fused feature, the second fused feature and the first residual-echo-and-noise speech feature to obtain a first fused concatenated feature;
feeding the first fused concatenated feature into the residual echo and noise cancellation model of the cascade network to obtain a mask estimate of a second target speech frequency-domain signal;
determining the second target speech frequency-domain signal from that mask estimate and the first speech frequency-domain signal containing residual echo and noise;
determining a multi-domain loss function from at least two loss functions, the at least two loss functions comprising an energy-independent amplitude-spectrum loss and an objective speech quality assessment score loss, wherein the energy-independent amplitude-spectrum loss takes the amplitude spectrum of the first target speech frequency-domain signal as its training target and is computed from the second target speech frequency-domain signal, and the objective speech quality assessment score loss is determined with improved perceived speech quality as its training target;
iteratively updating the model parameters to reduce the multi-domain loss, thereby obtaining the trained cascade network.

7. A residual echo and noise cancellation device, comprising:
a receiving module for receiving a speech time-domain signal containing echo and noise and a far-end reference time-domain signal;
a processing module for performing framing, windowing and Fourier transform on the two signals, respectively, to obtain a speech frequency-domain signal containing echo and noise and a far-end reference frequency-domain signal;
a determining module for determining an echo frequency-domain signal from the speech frequency-domain signal containing echo and noise and the far-end reference frequency-domain signal, the determining module further being for determining a speech frequency-domain signal containing residual echo and noise from the speech frequency-domain signal containing echo and noise and the echo frequency-domain signal;
an energy normalization module for energy-normalizing the amplitude spectra of the speech frequency-domain signal containing residual echo and noise, the echo frequency-domain signal and the far-end reference frequency-domain signal, to obtain a residual-echo-and-noise speech feature, an echo feature and a far-end reference feature;
a concatenation module for concatenating the residual-echo-and-noise speech feature with the far-end reference feature into a first concatenation result, and with the echo feature into a second concatenation result;
a weight obtaining module for feeding the two concatenation results into the trained feature-attention model of the trained cascade network, to obtain a first attention weight corresponding to the far-end reference feature and a second attention weight corresponding to the echo feature;
an attention fusion module for multiplying the far-end reference feature by the first attention weight into a first attention-fused feature, and the echo feature by the second attention weight into a second attention-fused feature;
the concatenation module further being for concatenating the first attention-fused feature, the second attention-fused feature and the residual-echo-and-noise speech feature into a first fused concatenation result;
a mask estimation module for feeding the first fused concatenation result into the trained residual echo and noise cancellation model of the trained cascade network, to obtain a mask estimate of the target speech frequency-domain signal;
a target signal module for obtaining the target speech frequency-domain signal from the mask estimate and the speech frequency-domain signal containing residual echo and noise;
an inverse Fourier transform module for inverse-Fourier-transforming the target speech frequency-domain signal into the target speech time-domain signal.

8. A residual echo and noise cancellation device, comprising at least one processor configured to execute a program stored in a memory, the program when executed causing the device to perform the method of any one of claims 1-6.

9. A non-transitory computer-readable storage medium storing a computer program which, when executed by a processor, implements the method of any one of claims 1-6.
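Claim 3's echo estimate is the product of adaptive filter coefficients and the far-end reference in each frequency bin. In the sketch below a per-bin NLMS update stands in for the Kalman filter the claim actually names (same coefficients-times-reference output structure, but a much simpler recursion); signal shapes and the synthetic echo path are illustrative:

```python
import numpy as np

def fd_echo_estimate(D, X, mu=0.5, eps=1e-8):
    """Per-frequency-bin adaptive echo estimation.
    D: microphone spectra (frames x bins); X: far-end reference spectra.
    Returns the echo estimate E[t] = W * X[t] at every frame."""
    n_frames, n_bins = D.shape
    W = np.zeros(n_bins, dtype=complex)    # one adaptive coefficient per bin
    E = np.empty_like(D)
    for t in range(n_frames):
        E[t] = W * X[t]                    # echo = coefficients x far-end reference
        err = D[t] - E[t]                  # microphone minus echo estimate
        # NLMS coefficient update (stand-in for the claimed Kalman recursion)
        W += mu * err * np.conj(X[t]) / (np.abs(X[t]) ** 2 + eps)
    return E

rng = np.random.default_rng(2)
X = rng.standard_normal((60, 8)) + 1j * rng.standard_normal((60, 8))
W_true = rng.standard_normal(8) + 1j * rng.standard_normal(8)
D = W_true * X                             # pure echo: no near-end speech or noise
E = fd_echo_estimate(D, X)
print(np.allclose(E[-1], D[-1], atol=1e-6))  # True — filter has converged
```

With a stationary echo path and no near-end signal, the per-bin coefficients converge geometrically to the true path, so the last-frame echo estimate matches the microphone spectrum; subtracting E from D then yields the residual signal the rest of the method cleans up.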
CN202110008502.9A 2021-01-05 2021-01-05 Residual echo and noise elimination method and device Active CN112863535B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110008502.9A CN112863535B (en) 2021-01-05 2021-01-05 Residual echo and noise elimination method and device


Publications (2)

Publication Number Publication Date
CN112863535A true CN112863535A (en) 2021-05-28
CN112863535B CN112863535B (en) 2022-04-26

Family

ID=76003795

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110008502.9A Active CN112863535B (en) 2021-01-05 2021-01-05 Residual echo and noise elimination method and device

Country Status (1)

Country Link
CN (1) CN112863535B (en)


Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107636758A (en) * 2015-05-15 2018-01-26 哈曼国际工业有限公司 Acoustic echo cancellation system and method
US20200105287A1 (en) * 2017-04-14 2020-04-02 Industry-University Cooperation Foundation Hanyang University Deep neural network-based method and apparatus for combining noise and echo removal
CN111161752A (en) * 2019-12-31 2020-05-15 歌尔股份有限公司 Echo cancellation method and device
CN111341336A (en) * 2020-03-16 2020-06-26 北京字节跳动网络技术有限公司 Echo cancellation method, device, terminal equipment and medium
US20200312346A1 (en) * 2019-03-28 2020-10-01 Samsung Electronics Co., Ltd. System and method for acoustic echo cancellation using deep multitask recurrent neural networks
CN111768796A (en) * 2020-07-14 2020-10-13 中国科学院声学研究所 Method and device for acoustic echo cancellation and de-reverberation
CN111768795A (en) * 2020-07-09 2020-10-13 腾讯科技(深圳)有限公司 Noise suppression method, device, equipment and storage medium for voice signal


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Wang Dongxia et al., "Echo and noise suppression algorithm based on BLSTM neural network", Journal of Signal Processing *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113436636A (en) * 2021-06-11 2021-09-24 深圳波洛斯科技有限公司 Acoustic echo cancellation method and system based on adaptive filter and neural network
CN113489854A (en) * 2021-06-30 2021-10-08 北京小米移动软件有限公司 Sound processing method, sound processing device, electronic equipment and storage medium
CN113489854B (en) * 2021-06-30 2024-03-01 北京小米移动软件有限公司 Sound processing method, device, electronic equipment and storage medium
CN113539291A (en) * 2021-07-09 2021-10-22 北京声智科技有限公司 Method and device for reducing noise of audio signal, electronic equipment and storage medium
CN113744762A (en) * 2021-08-09 2021-12-03 杭州网易智企科技有限公司 Signal-to-noise ratio determining method and device, electronic equipment and storage medium
CN113744762B (en) * 2021-08-09 2023-10-27 杭州网易智企科技有限公司 Signal-to-noise ratio determining method and device, electronic equipment and storage medium
CN114337908A (en) * 2022-01-05 2022-04-12 中国科学院声学研究所 Method and device for generating interference signal of target speech signal
CN114337908B (en) * 2022-01-05 2024-04-12 中国科学院声学研究所 Method and device for generating interference signal of target voice signal
CN114974281A (en) * 2022-05-24 2022-08-30 云知声智能科技股份有限公司 Training method and device of voice noise reduction model, storage medium and electronic device
WO2023226592A1 (en) * 2022-05-25 2023-11-30 青岛海尔科技有限公司 Noise signal processing method and apparatus, and storage medium and electronic apparatus
CN115294997A (en) * 2022-06-30 2022-11-04 北京达佳互联信息技术有限公司 Voice processing method and device, electronic equipment and storage medium
CN114974286A (en) * 2022-06-30 2022-08-30 北京达佳互联信息技术有限公司 Signal enhancement method, model training method, device, equipment, sound box and medium
CN115294997B (en) * 2022-06-30 2024-10-29 北京达佳互联信息技术有限公司 Voice processing method, device, electronic equipment and storage medium
CN117437929A (en) * 2023-12-21 2024-01-23 睿云联(厦门)网络通讯技术有限公司 Real-time echo cancellation method based on neural network
CN117437929B (en) * 2023-12-21 2024-03-08 睿云联(厦门)网络通讯技术有限公司 Real-time echo cancellation method based on neural network

Also Published As

Publication number Publication date
CN112863535B (en) 2022-04-26

Similar Documents

Publication Publication Date Title
CN112863535B (en) Residual echo and noise elimination method and device
CN107452389B (en) Universal single-track real-time noise reduction method
KR101934636B1 (en) Method and apparatus for integrating and removing acoustic echo and background noise based on deepening neural network
Williamson et al. Time-frequency masking in the complex domain for speech dereverberation and denoising
CN107993670B (en) Microphone array speech enhancement method based on statistical model
CN112700786B (en) Speech enhancement method, device, electronic equipment and storage medium
Carbajal et al. Multiple-input neural network-based residual echo suppression
CN109841206A (en) A kind of echo cancel method based on deep learning
Zhao et al. Late reverberation suppression using recurrent neural networks with long short-term memory
CN111986660B (en) A single-channel speech enhancement method, system and storage medium based on neural network sub-band modeling
Schmid et al. Variational Bayesian inference for multichannel dereverberation and noise reduction
CN112735456A (en) Speech enhancement method based on DNN-CLSTM network
CN108172231A (en) A method and system for removing reverberation based on Kalman filter
CN112037809A (en) Residual echo suppression method based on deep neural network with multi-feature flow structure
CN111048061B (en) Method, device and equipment for obtaining step length of echo cancellation filter
Lei et al. Deep neural network based regression approach for acoustic echo cancellation
Shankar et al. Efficient two-microphone speech enhancement using basic recurrent neural network cell for hearing and hearing aids
CN115424627A (en) Voice enhancement hybrid processing method based on convolution cycle network and WPE algorithm
WO2022256577A1 (en) A method of speech enhancement and a mobile computing device implementing the method
Pfeifenberger et al. Deep complex-valued neural beamformers
CN111883154B (en) Echo cancellation method and device, computer-readable storage medium, and electronic device
CN113838471A (en) Noise reduction method and system based on neural network, electronic device and storage medium
Schwartz et al. Nested generalized sidelobe canceller for joint dereverberation and noise reduction
Chen Noise reduction of bird calls based on a combination of spectral subtraction, Wiener filtering, and Kalman filtering
CN111883155B (en) Echo cancellation method, device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant