CN112863535A - Residual echo and noise elimination method and device - Google Patents
- Publication number
- CN112863535A (application CN202110008502.9A)
- Authority
- CN
- China
- Prior art keywords
- domain signal
- frequency domain
- echo
- noise
- far
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L21/0232—Processing in the frequency domain
- G10L2021/02082—Noise filtering the noise being echo, reverberation of the speech
Abstract
The embodiments of the application disclose a method and a device for eliminating residual echo and noise. The method comprises the following steps: performing framing, windowing and Fourier transformation on a received speech time-domain signal containing echo and noise and on a far-end reference time-domain signal to obtain the corresponding frequency-domain signals; determining an echo frequency-domain signal, and from it the speech frequency-domain signal containing residual echo and noise; performing energy normalization on the magnitude spectra of the speech frequency-domain signal containing residual echo and noise, the echo frequency-domain signal and the far-end reference frequency-domain signal to obtain the corresponding features; determining a target speech frequency-domain signal from these features with a trained cascade network; and applying an inverse Fourier transform to the target speech frequency-domain signal to obtain the target speech time-domain signal. In the embodiments of the application, a feature attention model assigns different importance to the input features and reduces redundant information in them, and the cascade network is trained with a multi-domain loss function, which reduces the model's sensitivity to signal energy.
Description
Technical Field
The present invention relates to the field of echo and noise cancellation, and more particularly to a method and an apparatus for residual echo and noise cancellation.
Background
At present, echo cancellation technology mainly removes the echo formed by the far-end reference signal from the speech signal, while speech noise reduction mainly removes background noise and directional interference. Both aim to improve the quality and intelligibility of speech. In echo cancellation, combining adaptive filtering based on traditional signal processing with residual echo cancellation based on deep learning can effectively improve the generalization performance of the system.
However, conventional methods perform residual echo cancellation and noise cancellation independently and separately, ignoring the correlation between the two tasks. The residual echo cancellation task also has many available signal features with different physical meanings and different importance, which conventional methods likewise do not take into account. Moreover, when training a residual echo and noise elimination model, the prior art mostly uses the mean square error between the target and estimated magnitude spectra as the loss function; this loss depends on signal energy, and signals with different energies have different scales.
Disclosure of Invention
To address the above problems in existing methods, the present application provides a method and an apparatus for eliminating residual echo and noise.
In a first aspect, an embodiment of the present application provides a residual echo and noise cancellation method, including:
receiving a voice time domain signal containing echo and noise and a far-end reference sound time domain signal;
respectively performing framing, windowing and Fourier transformation on the voice time domain signal containing the echo and the noise and the far-end reference sound time domain signal to obtain a voice frequency domain signal containing the echo and the noise and a far-end reference sound frequency domain signal;
determining an echo frequency domain signal according to the voice frequency domain signal containing the echo and the noise and the far-end reference audio frequency domain signal;
determining a voice frequency domain signal containing residual echo and noise according to the voice frequency domain signal containing echo and noise and the echo frequency domain signal;
performing energy normalization processing on the amplitude spectrum of the voice frequency domain signal containing the residual echo and the noise, the amplitude spectrum of the echo frequency domain signal and the amplitude spectrum of the far-end reference audio frequency domain signal to obtain a voice frequency domain signal characteristic containing the residual echo and the noise, an echo frequency domain signal characteristic and a far-end reference audio frequency domain signal characteristic;
splicing the voice frequency domain signal characteristics containing the residual echo and the noise with the far-end reference audio domain signal characteristics to obtain a first splicing result, and splicing the voice frequency domain signal characteristics containing the residual echo and the noise with the echo frequency domain signal characteristics to obtain a second splicing result;
inputting the first splicing result and the second splicing result into a trained feature attention model in the trained cascade network to obtain a first attention weight corresponding to the far-end reference audio domain signal feature and a second attention weight corresponding to the echo frequency domain signal feature;
multiplying the far-end reference audio frequency domain signal characteristic by a first attention weight to obtain a first fused attention mechanism characteristic, and multiplying the echo frequency domain signal characteristic by a second attention weight to obtain a second fused attention mechanism characteristic;
splicing the first fusion attention mechanism characteristic, the second fusion attention mechanism characteristic and the voice frequency domain signal characteristic containing residual echo and noise to obtain a first fusion splicing result;
inputting the first fusion splicing result into a trained residual echo and noise elimination model in the trained cascade network to obtain a masking estimation value of a target voice frequency domain signal;
obtaining the target voice frequency domain signal according to the masking estimation value of the target voice frequency domain signal and the voice frequency domain signal containing the residual echo and the noise;
and carrying out inverse Fourier transform on the target voice frequency domain signal to obtain a target voice time domain signal.
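The claimed steps form a single signal path. The following is a minimal numpy sketch of that path, not the patent's implementation: it assumes non-overlapping 512-sample frames, and `est_filter` / `mask_model` are hypothetical placeholder callables standing in for the Kalman filter and the trained cascade network.

```python
import numpy as np

FRAME = 512  # sampling points per frame, as specified in the text

def stft(x, frame=FRAME):
    """Frame the signal, zero-pad the last frame, apply a Hamming window, FFT."""
    n = int(np.ceil(len(x) / frame)) * frame
    x = np.pad(x, (0, n - len(x)))
    frames = x.reshape(-1, frame) * np.hamming(frame)
    return np.fft.rfft(frames, axis=1)  # shape (num_frames, 257)

def istft(X, frame=FRAME):
    """Inverse FFT per frame; window compensation is omitted in this sketch."""
    return np.fft.irfft(X, n=frame, axis=1).reshape(-1)

def cancel(y, x, est_filter, mask_model):
    """y: near-end microphone signal (echo + noise), x: far-end reference."""
    Y, X = stft(y), stft(x)
    W = est_filter(Y, X)       # e.g. Kalman-filter coefficients (placeholder)
    C = W * X                  # echo frequency-domain signal
    E = Y - C                  # speech with residual echo and noise
    G = mask_model(E, C, X)    # masking estimate from the cascade network
    return istft(G * E)        # target speech time-domain signal
```

As a sanity check: if the microphone picks up a pure copy of the reference and the filter is exact (W = 1), the output is silence.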
In one possible implementation, the framing, windowing, and fourier transforming the echo and noise containing speech time domain signal and the far-end reference acoustic time domain signal respectively includes:
dividing the voice time domain signal containing the echo and the noise and the far-end reference sound time domain signal into frames of a preset number of sampling points, zero-padding any short frame to the preset number;
windowing each frame signal, using a Hamming window as the window function;
and performing Fourier transform on each windowed frame signal.
In one possible implementation, the determining an echo frequency domain signal according to the echo and noise-containing speech frequency domain signal and the far-end reference audio frequency domain signal includes:
inputting the voice frequency domain signal containing the echo and the noise and the far-end reference audio frequency domain signal into a Kalman filter to obtain a filter coefficient and the echo frequency domain signal;
the echo frequency domain signal is a result of multiplying the filter coefficient and the far-end reference sound frequency domain signal.
In one possible implementation, the determining a voice frequency domain signal containing residual echo and noise according to the voice frequency domain signal containing echo and noise and the echo frequency domain signal includes:
and subtracting the echo frequency domain signal from the voice frequency domain signal containing the echo and the noise to obtain the voice frequency domain signal containing the residual echo and the noise.
In a possible implementation, performing the energy normalization processing on the amplitude spectrum of the voice frequency domain signal containing the residual echo and the noise, the amplitude spectrum of the echo frequency domain signal and the amplitude spectrum of the far-end reference audio domain signal, to obtain the voice frequency domain signal feature containing the residual echo and the noise, the echo frequency domain signal feature and the far-end reference audio domain signal feature, includes:
respectively determining a first function, a second function and a third function corresponding to the amplitude spectrum of the voice frequency domain signal containing the residual echo and the noise, the amplitude spectrum of the echo frequency domain signal and the amplitude spectrum of the far-end reference audio frequency domain signal;
determining the voice frequency domain signal characteristics containing the residual echo and the noise according to a first function corresponding to the amplitude spectrum of the voice frequency domain signal containing the residual echo and the noise, and the mean value and the variance of the voice frequency domain signal characteristics containing the residual echo and the noise;
determining the echo frequency domain signal characteristics according to a second function corresponding to the amplitude spectrum of the echo frequency domain signal, and the mean value and the variance of the echo frequency domain signal characteristics;
and determining the characteristics of the far-end reference audio frequency domain signal according to a third function corresponding to the amplitude spectrum of the far-end reference audio frequency domain signal and the mean value and the variance of the characteristics of the far-end reference audio frequency domain signal.
In one possible implementation, the trained cascade network is trained by:
receiving a first voice time domain signal containing echo and noise, a first far-end reference sound time domain signal and a first target voice time domain signal;
performing framing, windowing and Fourier transformation on the first voice time domain signal containing the echo and the noise, the first far-end reference sound time domain signal and the first target voice time domain signal respectively to obtain a first voice frequency domain signal containing the echo and the noise, a first far-end reference audio frequency domain signal and a first target voice frequency domain signal;
determining a first echo frequency domain signal according to the first voice frequency domain signal containing echo and noise and the first far-end reference audio frequency domain signal;
determining a first voice frequency domain signal containing residual echo and noise according to the first voice frequency domain signal containing echo and noise and the first echo frequency domain signal;
performing energy normalization processing on the amplitude spectrum of the first voice frequency domain signal containing residual echo and noise, the amplitude spectrum of the first echo frequency domain signal and the amplitude spectrum of the first far-end reference audio frequency domain signal to obtain a first voice frequency domain signal characteristic containing residual echo and noise, a first echo frequency domain signal characteristic and a first far-end reference audio frequency domain signal characteristic;
splicing the first voice frequency domain signal characteristic containing residual echo and noise with the first far-end reference audio domain signal characteristic to obtain a first splicing characteristic, and splicing the first voice frequency domain signal characteristic containing residual echo and noise with the first echo frequency domain signal characteristic to obtain a second splicing characteristic;
inputting the first splicing feature and the second splicing feature into a feature attention model in a cascade network so as to jointly train the feature attention model and a residual echo and noise elimination model in the cascade network, and obtaining a first weight corresponding to the first far-end reference audio domain signal feature and a second weight corresponding to the first echo frequency domain signal feature;
multiplying the first far-end reference audio domain signal characteristic by the first weight to obtain a first fused characteristic, and multiplying the first echo frequency domain signal characteristic by the second weight to obtain a second fused characteristic;
splicing the first fusion characteristic, the second fusion characteristic and the first voice frequency domain signal characteristic containing residual echo and noise to obtain a first fusion splicing characteristic;
inputting the first fusion splicing characteristic into a residual echo and noise elimination model in the cascade network to obtain a masking estimation value of a second target voice frequency domain signal;
determining a second target voice frequency domain signal according to the masking estimation value of the second target voice frequency domain signal and the first voice frequency domain signal containing residual echo and noise;
determining a multi-domain loss function from at least two loss functions; wherein the at least two loss functions comprise an energy-independent magnitude spectrum loss function and an objective speech quality assessment score loss function; the energy-independent magnitude spectrum loss function takes the magnitude spectrum of the first target voice frequency domain signal as its training target and is determined from the second target voice frequency domain signal; the objective speech quality assessment score loss function takes the improvement of perceptual speech quality as its training target;
and continuously updating the model parameters to iteratively reduce the multi-domain loss function, obtaining the trained cascade network.
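The multi-domain loss above can be sketched in numpy. This is an illustrative assumption, not the patent's formula: the energy-independent term is realized here as a log-magnitude MSE (one common way to reduce energy sensitivity), and `quality_score_fn` is a hypothetical stand-in for an objective speech quality scorer such as a PESQ-style metric.

```python
import numpy as np

def energy_independent_mag_loss(est_mag, target_mag, eps=1e-8):
    """MSE between log-magnitude spectra. A global gain shifts both log
    spectra by a constant, so this is far less sensitive to absolute signal
    energy than raw-magnitude MSE (this normalization choice is an assumption)."""
    return np.mean((np.log(est_mag + eps) - np.log(target_mag + eps)) ** 2)

def multi_domain_loss(est_mag, target_mag, quality_score_fn, w=(1.0, 0.1)):
    """Combine the magnitude loss with an objective-quality penalty.
    quality_score_fn (hypothetical) returns a score where higher is better,
    so it enters the loss with a negative sign; w holds assumed weights."""
    mag = energy_independent_mag_loss(est_mag, target_mag)
    quality = -quality_score_fn(est_mag, target_mag)
    return w[0] * mag + w[1] * quality
```

With a perfect estimate the magnitude term vanishes, and scaling the estimate's energy produces a nonzero but bounded penalty.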
In a second aspect, an embodiment of the present application further provides a residual echo and noise cancellation apparatus, including:
the receiving module is used for receiving a voice time domain signal containing echo and noise and a far-end reference sound time domain signal;
the processing module is used for respectively performing framing, windowing and Fourier transform on the voice time domain signal containing the echo and the noise and the far-end reference sound time domain signal to obtain a voice frequency domain signal containing the echo and the noise and a far-end reference sound frequency domain signal;
the determining module is used for determining an echo frequency domain signal according to the voice frequency domain signal containing the echo and the noise and the far-end reference audio frequency domain signal;
the determining module is further configured to determine a voice frequency domain signal containing residual echo and noise according to the voice frequency domain signal containing echo and noise and the echo frequency domain signal;
the energy normalization module is used for carrying out energy normalization processing on the amplitude spectrum of the voice frequency domain signal containing the residual echo and the noise, the amplitude spectrum of the echo frequency domain signal and the amplitude spectrum of the far-end reference audio frequency domain signal to obtain a voice frequency domain signal characteristic containing the residual echo and the noise, an echo frequency domain signal characteristic and a far-end reference audio frequency domain signal characteristic;
the splicing module is used for splicing the voice frequency domain signal characteristics containing the residual echo and the noise with the far-end reference audio domain signal characteristics to obtain a first splicing result, and splicing the voice frequency domain signal characteristics containing the residual echo and the noise with the echo frequency domain signal characteristics to obtain a second splicing result;
a weight obtaining module, configured to input the first splicing result and the second splicing result into a trained feature attention model in the trained cascade network, and obtain a first attention weight corresponding to the far-end reference audio-domain signal feature and a second attention weight corresponding to the echo frequency-domain signal feature;
a fused attention mechanism feature obtaining module, configured to multiply the far-end reference audio-domain signal feature by a first attention weight to obtain a first fused attention mechanism feature, and multiply the echo frequency-domain signal feature by a second attention weight to obtain a second fused attention mechanism feature;
the splicing module is further configured to splice the first fusion attention mechanism feature, the second fusion attention mechanism feature and the voice frequency domain signal feature containing the residual echo and the noise to obtain a first fusion splicing result;
a masking estimation value obtaining module, configured to input the first fusion splicing result into the trained residual echo and noise cancellation model in the trained cascade network to obtain the masking estimation value of the target voice frequency domain signal;
a target voice frequency domain signal obtaining module, configured to obtain a target voice frequency domain signal according to the masking estimation value of the target voice frequency domain signal and the voice frequency domain signal containing the residual echo and the noise;
and the inverse Fourier transform module is used for performing inverse Fourier transform on the target voice frequency domain signal to obtain a target voice time domain signal.
In a third aspect, an embodiment of the present application further provides a residual echo and noise cancellation apparatus, including at least one processor, where the processor is configured to execute a program stored in a memory, and when the program is executed, the apparatus is caused to perform the steps as in the first aspect and in various possible implementations.
In a fourth aspect, embodiments of the present application further propose a non-transitory computer-readable storage medium having stored thereon a computer program, which when executed by a processor implements the steps as in the first aspect and various possible implementations.
The beneficial effect of the embodiment of the application is that the residual echo and the noise do not need to be eliminated separately, but are eliminated once through the trained cascade network. The trained feature attention model is used for endowing the input features with different importance, redundant information in the input features is reduced, and the performance of eliminating residual echo and noise of the cascade network is improved. The multi-domain loss function combining the energy-independent amplitude spectrum loss function and the objective voice quality assessment score loss function is used for training the cascade network, the sensitivity of the model to signal energy is reduced, and the auditory perception quality of output voice is improved.
Drawings
To illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below are obviously only some embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic process diagram of training a cascade network capable of eliminating residual echo and noise according to an embodiment of the present disclosure;
fig. 2 is a schematic flowchart illustrating a process of eliminating residual echo and noise by using a post-training cascade network according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of a residual echo and noise cancellation device according to an embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application. The following examples are only for illustrating the technical solutions of the present application more clearly, and the protection scope of the present application is not limited thereby.
In the traditional method, residual echo and noise elimination are always independently and separately carried out. The importance of each of the plurality of signal features is not taken into account in the residual echo suppression task. When training a residual echo and noise elimination model, the mean square error of a target amplitude spectrum and an estimated amplitude spectrum is mostly adopted as a loss function. The loss function depends on the magnitude of the signal energy. Therefore, the application provides a method and a device for eliminating residual echo and noise, which can endow different importance to a plurality of signal characteristics, reduce redundant information in the plurality of signal characteristics, and simultaneously, adopt a multi-domain loss function to train a cascade network, thereby reducing the sensitivity of a model to signal energy.
In the embodiment of the present application, a schematic process diagram of training a cascade network capable of eliminating residual echo and noise is shown in fig. 1, and includes: S101-S113; the cascade network comprises a feature attention model and a residual echo and noise elimination model.
The feature attention model consists of one gated recurrent unit (GRU) layer connected to one fully connected layer. The GRU layer has 200 hidden nodes, the fully connected output layer has 257 nodes, and each neuron uses the Sigmoid activation function.
The residual echo and noise elimination model consists of two GRU layers connected to one fully connected layer. The GRU layers have 400 hidden nodes, the fully connected output layer has 257 nodes, and each neuron uses the Sigmoid activation function.
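Both models share the same building blocks. Below is a minimal numpy sketch of one standard GRU step followed by a Sigmoid output layer, using the feature attention model's sizes from the text (200 hidden units, 257 outputs); the input width of 514 (two spliced 257-bin features) and all weights are illustrative assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x, h, W, U, b):
    """One GRU time step (standard GRU equations); W, U, b hold the
    update ('z'), reset ('r') and candidate ('h') parameters."""
    z = sigmoid(W["z"] @ x + U["z"] @ h + b["z"])        # update gate
    r = sigmoid(W["r"] @ x + U["r"] @ h + b["r"])        # reset gate
    h_tilde = np.tanh(W["h"] @ x + U["h"] @ (r * h) + b["h"])
    return (1.0 - z) * h + z * h_tilde                   # new hidden state

rng = np.random.default_rng(0)
IN, HID, OUT = 514, 200, 257  # input width 514 is an assumption (two 257-bin features)
W = {k: 0.1 * rng.standard_normal((HID, IN)) for k in "zrh"}
U = {k: 0.1 * rng.standard_normal((HID, HID)) for k in "zrh"}
b = {k: np.zeros(HID) for k in "zrh"}
Wo, bo = 0.1 * rng.standard_normal((OUT, HID)), np.zeros(OUT)

h = gru_step(rng.standard_normal(IN), np.zeros(HID), W, U, b)
weights = sigmoid(Wo @ h + bo)  # per-frequency-bin outputs in (0, 1)
```

The Sigmoid output keeps every value strictly between 0 and 1, which is what makes the outputs usable as attention weights or masking values.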
S101, receiving a first voice time domain signal containing echo and noise, a first far-end reference sound time domain signal and a first target voice time domain signal. The first far-end reference sound time domain signal is subjected to nonlinear transformation and then convolved with a corresponding room transfer function to form an echo time domain signal in the first voice time domain signal containing echo and noise.
S102, framing and windowing the received first voice time domain signal containing echo and noise, the first far-end reference sound time domain signal and the first target voice time domain signal. Specifically, each signal is divided into frames of 512 sampling points, zero-padding to 512 points when a frame is short; each frame signal is then windowed, using a Hamming window as the window function. Fourier transforming each windowed frame signal yields the first voice frequency domain signal containing echo and noise, the first far-end reference audio domain signal and the first target voice frequency domain signal.
S103, inputting the first voice frequency domain signal containing echo and noise and the first far-end reference audio frequency domain signal into a Kalman filter, and estimating a first filter coefficient and a first echo frequency domain signal in real time. Wherein, the first echo frequency domain signal estimated by the Kalman filter is:
C(k,f)=W(k,f)*X(k,f)
where W(k,f) is the first filter coefficient, X(k,f) is the first far-end reference audio domain signal, and k and f denote the k-th frame and frequency f, respectively.
S104, subtracting the first echo frequency domain signal from the first voice frequency domain signal containing echo and noise to obtain a first voice frequency domain signal containing residual echo and noise, and using the first voice frequency domain signal containing residual echo and noise as an output result of the Kalman filter. The first voice frequency domain signal containing residual echo and noise is:
E(k,f)=Y(k,f)-C(k,f)
wherein, Y (k, f) is the first speech frequency domain signal containing echo and noise, C (k, f) is the first echo frequency domain signal, and k and f represent the k-th frame and frequency f respectively.
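The two equations above (echo estimate C = W·X and error E = Y − C) are the core of the adaptive stage. The sketch below uses a per-bin NLMS update as a simplified stand-in for the Kalman filter named in the text — it shows the same estimate/subtract/adapt structure but is not the Kalman recursion itself.

```python
import numpy as np

def nlms_step(W, X, Y, mu=0.5, eps=1e-8):
    """One frequency-domain adaptive-filter step per bin (NLMS, shown as a
    simplified stand-in for the Kalman filter in the text).
    X, Y: current far-end and microphone spectra; W: filter coefficients."""
    C = W * X                 # echo estimate   C(k,f) = W(k,f) * X(k,f)
    E = Y - C                 # error signal    E(k,f) = Y(k,f) - C(k,f)
    W = W + mu * np.conj(X) * E / (np.abs(X) ** 2 + eps)  # normalized update
    return W, C, E
```

If the microphone signal is a pure scaled echo, Y = g·X, the coefficients converge to g and the error signal vanishes.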
S105, performing energy normalization processing on the magnitude spectrum of the first voice frequency domain signal containing residual echo and noise, the magnitude spectrum of the first echo frequency domain signal and the magnitude spectrum of the first far-end reference audio domain signal, to obtain the first voice frequency domain signal feature g_FD(f(|E(k,f)|)) containing residual echo and noise, the first echo frequency domain signal feature g_FD(f(|C(k,f)|)) and the first far-end reference audio domain signal feature g_FD(f(|X(k,f)|)). Wherein,
the mean of the first voice frequency domain signal feature containing residual echo and noise is defined recursively as:
μ_f(e)(k,f) = c₁·μ_f(e)(k−1,f) + (1−c₁)·f(|E(k,f)|)
the mean of the first echo frequency domain signal feature is defined recursively as:
μ_f(c)(k,f) = c₁·μ_f(c)(k−1,f) + (1−c₁)·f(|C(k,f)|)
the mean of the first far-end reference audio domain signal feature is defined recursively as:
μ_f(x)(k,f) = c₁·μ_f(x)(k−1,f) + (1−c₁)·f(|X(k,f)|)
|E(k,f)|, |C(k,f)| and |X(k,f)| denote the magnitude spectrum of the first voice frequency domain signal containing residual echo and noise, the magnitude spectrum of the first echo frequency domain signal and the magnitude spectrum of the first far-end reference audio domain signal, respectively; c₁ is a preset constant. The variance of each feature is maintained with a similar recursive update.
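The recursive-mean update can be sketched directly in numpy. The variance recursion is not reproduced in the text, so the analogous exponential update below is an assumption, as is the final (feature − mean)/std normalization step.

```python
import numpy as np

def energy_normalize(feat, mu, var, c1=0.99, eps=1e-8):
    """One-frame energy normalization of a magnitude-spectrum feature f(|.|).
    mu follows the recursion given in the text; the variance update and the
    (feat - mu) / std output are assumed forms for illustration."""
    mu = c1 * mu + (1.0 - c1) * feat          # mu(k,f) = c1*mu(k-1,f) + (1-c1)*feat
    var = c1 * var + (1.0 - c1) * (feat - mu) ** 2   # assumed variance recursion
    return (feat - mu) / np.sqrt(var + eps), mu, var
```

The statistics `mu` and `var` are carried across frames, so each of the three features keeps its own running pair.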
And S106, splicing the first voice frequency domain signal characteristic containing the residual echo and the noise with the first far-end reference audio domain signal characteristic to obtain a first splicing characteristic, and splicing the first voice frequency domain signal characteristic containing the residual echo and the noise with the first echo frequency domain signal characteristic to obtain a second splicing characteristic.
S107, inputting the first splicing feature and the second splicing feature into a feature attention model in the cascade network, so as to jointly train the feature attention model and the residual echo and noise elimination model in the cascade network, and obtain a first weight α(k,f) corresponding to the first far-end reference audio domain signal feature and a second weight β(k,f) corresponding to the first echo frequency domain signal feature.
S108, multiplying the first far-end reference audio domain signal feature by the first weight to obtain a first fused feature:
X_att(k,f) = g_FD(f(|X(k,f)|))·α(k,f)
and multiplying the first echo frequency domain signal feature by the second weight to obtain a second fused feature:
C_att(k,f) = g_FD(f(|C(k,f)|))·β(k,f).
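The weighting of S108 and the splicing of S109 amount to element-wise multiplication followed by concatenation. A minimal sketch (concatenation along the last feature axis is an assumption about how "splicing" is realized):

```python
import numpy as np

def fuse_and_splice(e_feat, x_feat, c_feat, alpha, beta):
    """S108-S109: weight the far-end reference and echo features by the
    attention weights, then splice with the residual-echo-and-noise
    speech feature. Last-axis concatenation is an assumption."""
    x_att = x_feat * alpha   # X_att(k,f) = g_FD(f(|X(k,f)|)) * alpha(k,f)
    c_att = c_feat * beta    # C_att(k,f) = g_FD(f(|C(k,f)|)) * beta(k,f)
    return np.concatenate([x_att, c_att, e_feat], axis=-1)
```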
S109, splicing the first fused feature X_att(k,f), the second fused feature C_att(k,f) and the first voice frequency domain signal feature g_FD(f(|E(k,f)|)) containing residual echo and noise to obtain a first fusion splicing feature.
S110, inputting the first fusion splicing feature into the residual echo and noise elimination model in the cascade network; the output of the residual echo and noise elimination model is a masking estimation value G(k,f) of a second target voice frequency domain signal.
S111, using the masking estimation value G(k,f) of the second target voice frequency domain signal to enhance the first voice frequency domain signal containing residual echo and noise, and obtaining a second target voice frequency domain signal:
Ŝ(k,f) = G(k,f)·E(k,f)
S112, determining a multi-domain loss function according to the at least two loss functions. For example, using the amplitude spectrum of the first target voice frequency domain signal as a training target, an energy-independent amplitude spectrum loss function L_mag is determined according to the second target voice frequency domain signal;
and an objective speech quality evaluation score loss function L_pesq is determined by taking the improvement of speech audibility quality as a training target. Wherein S(k,f) is the first target voice frequency domain signal and Ŝ(k,f) is the second target voice frequency domain signal. Weighting and adding the energy-independent amplitude spectrum loss function and the objective voice quality evaluation score loss function, the multi-domain loss function is determined as:
L = L_mag + λ·L_pesq
wherein λ is a preset constant.
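The weighted combination of the two losses can be sketched as follows. The text does not give the exact form of either loss, so the norm-normalized squared error below (which removes overall energy from the comparison) and the perceptual_score_loss placeholder are illustrative assumptions.

```python
import numpy as np

def magnitude_loss(S, S_hat, eps=1e-8):
    """Energy-independent magnitude spectrum loss: compare magnitude
    spectra after normalizing each to unit norm, so overall signal
    energy does not affect the loss (exact form is an assumption)."""
    a = np.abs(S) / (np.linalg.norm(np.abs(S)) + eps)
    b = np.abs(S_hat) / (np.linalg.norm(np.abs(S_hat)) + eps)
    return np.mean((a - b) ** 2)

def multi_domain_loss(S, S_hat, perceptual_score_loss, lam=0.5):
    """Weighted sum L = L_mag + lambda * L_pesq, with lambda a preset
    constant; perceptual_score_loss stands in for the objective speech
    quality evaluation score loss term."""
    return magnitude_loss(S, S_hat) + lam * perceptual_score_loss
```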
S113, iteratively reducing the multi-domain loss function by continuously updating model parameters to obtain the trained cascade network. The trained cascade network comprises a trained feature attention model and a trained residual echo and noise elimination model.
In the embodiment of the present application, a schematic flow chart of using the trained cascade network to eliminate residual echo and noise is shown in fig. 2, and includes S201-S212:
S201, receiving a voice time domain signal containing echo and noise and a far-end reference sound time domain signal. The far-end reference sound time domain signal is subjected to nonlinear transformation and then convolved with a corresponding room transfer function to form the echo time domain signal in the voice time domain signal containing the echo and the noise.
S202, framing and windowing are respectively carried out on the voice time domain signal containing the echo and the noise and the far-end reference sound time domain signal. Specifically, 512 sampling points are respectively taken as one frame signal for the received voice time domain signal containing echo and noise and the far-end reference sound time domain signal; if a frame is shorter than 512 points, it is first zero-padded to 512 points. Each frame signal is then windowed, with a Hamming window as the windowing function. Fourier transform is performed on each windowed frame signal to obtain a voice frequency domain signal containing echo and noise and a far-end reference audio frequency domain signal.
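The framing, zero-padding, windowing and Fourier transform of S202 can be sketched as follows. The 512-point frame length and Hamming window follow the text; the hop length and the helper name are illustrative assumptions.

```python
import numpy as np

FRAME_LEN = 512  # sampling points per frame, as in S202

def frames_to_spectra(signal, hop=256):
    """Split a time-domain signal into frames, zero-pad short frames to
    FRAME_LEN, apply a Hamming window, and Fourier-transform each frame."""
    window = np.hamming(FRAME_LEN)
    spectra = []
    for start in range(0, len(signal), hop):
        frame = signal[start:start + FRAME_LEN]
        if len(frame) < FRAME_LEN:  # pad a short tail frame with zeros
            frame = np.pad(frame, (0, FRAME_LEN - len(frame)))
        spectra.append(np.fft.rfft(frame * window))
    return np.array(spectra)        # shape: (num_frames, FRAME_LEN // 2 + 1)
```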
S203, inputting the voice frequency domain signal containing the echo and the noise and the far-end reference audio frequency domain signal into a Kalman filter, and estimating a filter coefficient and an echo frequency domain signal in real time. The echo frequency domain signal estimated by the Kalman filter is as follows:
C3(k,f)=W3(k,f)*X3(k,f)
wherein W3(k,f) is the filter coefficient, X3(k,f) is the far-end reference audio frequency domain signal, and k and f represent the k-th frame and the frequency f, respectively.
S204, subtracting the echo frequency domain signal from the voice frequency domain signal containing echo and noise to obtain a voice frequency domain signal containing residual echo and noise:
E3(k,f)=Y3(k,f)-C3(k,f)
wherein Y3(k,f) is the voice frequency domain signal containing echo and noise, C3(k,f) is the echo frequency domain signal, and k and f respectively represent the k-th frame and the frequency f.
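Steps S203-S204 estimate the echo spectrum as C3(k,f) = W3(k,f)·X3(k,f) and subtract it from Y3(k,f). The Kalman recursion for W3 is not detailed in the text, so the sketch below substitutes a per-frequency NLMS coefficient update as a stand-in, keeping the same filtering and subtraction structure.

```python
import numpy as np

def cancel_echo(Y, X, mu=0.5, eps=1e-8):
    """Per-frame, per-frequency adaptive echo cancellation:
    C3(k,f) = W3(k,f) * X3(k,f),  E3(k,f) = Y3(k,f) - C3(k,f).
    The NLMS update of W is an illustrative stand-in for the
    Kalman filter recursion referenced in the text."""
    num_frames, num_bins = Y.shape
    W = np.zeros(num_bins, dtype=complex)
    E = np.empty_like(Y)
    for k in range(num_frames):
        C = W * X[k]        # estimated echo spectrum
        E[k] = Y[k] - C     # residual echo + noise (+ near-end speech)
        # normalized LMS step toward reducing |E| (stand-in update)
        W += mu * np.conj(X[k]) * E[k] / (np.abs(X[k]) ** 2 + eps)
    return E, W
```

On a pure echo (Y exactly W0·X), the coefficients converge to W0 and the residual E shrinks toward zero.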
S205, performing energy normalization processing on the amplitude spectrum of the voice frequency domain signal containing the residual echo and the noise, the amplitude spectrum of the echo frequency domain signal and the amplitude spectrum of the far-end reference audio frequency domain signal to obtain a voice frequency domain signal feature g_FD(f(|E3(k,f)|)) containing the residual echo and the noise, an echo frequency domain signal feature g_FD(f(|C3(k,f)|)) and a far-end reference audio domain signal feature g_FD(f(|X3(k,f)|)). Wherein,
the mean and variance of the voice frequency domain signal feature containing residual echo and noise are respectively defined as:
μ_f(e3)(k,f) = c2·μ_f(e3)(k-1,f) + (1-c2)·f(|E3(k,f)|)
the mean and variance of the echo frequency domain signal feature are respectively defined as:
μ_f(c3)(k,f) = c2·μ_f(c3)(k-1,f) + (1-c2)·f(|C3(k,f)|)
the mean and variance of the far-end reference audio domain signal feature are respectively defined as:
μ_f(x3)(k,f) = c2·μ_f(x3)(k-1,f) + (1-c2)·f(|X3(k,f)|)
|E3(k,f)|, |C3(k,f)| and |X3(k,f)| respectively represent the amplitude spectrum of the voice frequency domain signal containing residual echo and noise, the amplitude spectrum of the echo frequency domain signal and the amplitude spectrum of the far-end reference audio frequency domain signal, and c2 is a preset constant.
And S206, splicing the voice frequency domain signal characteristics containing the residual echo and the noise with the far-end reference audio domain signal characteristics to obtain a first splicing result, and splicing the voice frequency domain signal characteristics containing the residual echo and the noise with the echo frequency domain signal characteristics to obtain a second splicing result.
S207, inputting the first splicing result and the second splicing result into the trained feature attention model in the trained cascade network to obtain a first attention weight α3(k,f) corresponding to the far-end reference audio domain signal feature and a second attention weight β3(k,f) corresponding to the echo frequency domain signal feature.
S208, multiplying the far-end reference audio domain signal feature by the first attention weight α3(k,f) to obtain a first fused attention mechanism feature:
X3_att(k,f) = g_FD(f(|X3(k,f)|))·α3(k,f)
and multiplying the echo frequency domain signal feature by the second attention weight β3(k,f) to obtain a second fused attention mechanism feature:
C3_att(k,f) = g_FD(f(|C3(k,f)|))·β3(k,f).
and S209, splicing the first fusion attention mechanism characteristic, the second fusion attention mechanism characteristic and the voice frequency domain signal characteristic containing the residual echo and the noise to obtain a first fusion splicing result.
S210, inputting the first fusion splicing result into the trained residual echo and noise elimination model in the trained cascade network to obtain a masking estimation value G3(k,f) of the target voice frequency domain signal.
S211, multiplying the masking estimation value G3(k,f) of the target voice frequency domain signal by the voice frequency domain signal E3(k,f) containing residual echo and noise to obtain the target voice frequency domain signal:
Ŝ3(k,f) = G3(k,f)·E3(k,f)
and S212, performing inverse Fourier transform on the target voice frequency domain signal to obtain a target voice time domain signal.
In the embodiment of the application, the residual echo and the noise do not need to be eliminated separately in two independent stages; they are eliminated in one pass through the trained cascade network. The trained feature attention model assigns different importance to the input features, reduces redundant information in the input features, and improves the residual echo and noise elimination performance of the cascade network. Training the cascade network with a multi-domain loss function that combines the energy-independent amplitude spectrum loss function and the objective voice quality assessment score loss function reduces the sensitivity of the model to signal energy and improves the auditory perception quality of the output voice.
An embodiment of the present application provides a residual echo and noise cancellation device, a schematic structural diagram of which is shown in fig. 3, including:
a receiving module 301, a processing module 302, a determining module 303, an energy normalization module 304, a splicing module 305, a weight obtaining module 306, a fused attention mechanism feature obtaining module 307, a masking estimation value obtaining module 308, a target voice frequency domain signal obtaining module 309 and an inverse Fourier transform module 310;
a receiving module 301, configured to receive a voice time domain signal containing echo and noise and a far-end reference sound time domain signal;
a processing module 302, configured to perform framing, windowing, and Fourier transform on the voice time-domain signal containing the echo and the noise and the far-end reference sound time-domain signal, respectively, to obtain a voice frequency-domain signal containing the echo and the noise and a far-end reference audio frequency-domain signal;
a determining module 303, configured to determine an echo frequency domain signal according to the voice frequency domain signal containing echo and noise and the far-end reference audio frequency domain signal;
the determining module 303 is further configured to determine a voice frequency domain signal containing residual echo and noise according to the voice frequency domain signal containing echo and noise and the echo frequency domain signal;
an energy normalization module 304, configured to perform energy normalization processing on the magnitude spectrum of the voice frequency domain signal containing the residual echo and the noise, the magnitude spectrum of the echo frequency domain signal, and the magnitude spectrum of the far-end reference audio domain signal, so as to obtain a voice frequency domain signal feature containing the residual echo and the noise, an echo frequency domain signal feature, and a far-end reference audio domain signal feature;
a splicing module 305, configured to splice the voice frequency domain signal feature containing the residual echo and the noise with the far-end reference audio domain signal feature to obtain a first splicing result, and splice the voice frequency domain signal feature containing the residual echo and the noise with the echo frequency domain signal feature to obtain a second splicing result;
a weight obtaining module 306, configured to input the first splicing result and the second splicing result into a trained feature attention model in the trained cascade network, and obtain a first attention weight corresponding to the far-end reference audio-domain signal feature and a second attention weight corresponding to the echo frequency-domain signal feature;
a fused attention mechanism feature obtaining module 307, configured to multiply the far-end reference audio domain signal feature by a first attention weight to obtain a first fused attention mechanism feature, and multiply the echo frequency domain signal feature by a second attention weight to obtain a second fused attention mechanism feature;
the splicing module 305 is further configured to splice the first fusion attention mechanism feature, the second fusion attention mechanism feature, and the voice frequency domain signal feature containing the residual echo and the noise to obtain a first fusion splicing result;
a masking estimation value obtaining module 308, configured to input the first fusion splicing result into a post-training residual echo and noise cancellation model in the post-training cascade network, so as to obtain a masking estimation value of the target voice frequency domain signal;
a target voice frequency domain signal obtaining module 309, configured to obtain the target voice frequency domain signal according to the masking estimation value of the target voice frequency domain signal and the voice frequency domain signal containing the residual echo and the noise;
and an inverse Fourier transform module 310, configured to perform inverse Fourier transform on the target voice frequency domain signal to obtain a target voice time domain signal.
An embodiment of the present application provides a residual echo and noise cancellation apparatus, including at least one processor, where the processor is configured to execute a program stored in a memory, and when the program is executed, the apparatus is enabled to perform the following steps:
receiving a voice time domain signal containing echo and noise and a far-end reference sound time domain signal;
respectively performing framing, windowing and Fourier transformation on the voice time domain signal containing the echo and the noise and the far-end reference sound time domain signal to obtain a voice frequency domain signal containing the echo and the noise and a far-end reference sound frequency domain signal;
determining an echo frequency domain signal according to the voice frequency domain signal containing the echo and the noise and the far-end reference audio frequency domain signal;
determining a voice frequency domain signal containing residual echo and noise according to the voice frequency domain signal containing echo and noise and the echo frequency domain signal;
performing energy normalization processing on the amplitude spectrum of the voice frequency domain signal containing the residual echo and the noise, the amplitude spectrum of the echo frequency domain signal and the amplitude spectrum of the far-end reference audio frequency domain signal to obtain a voice frequency domain signal characteristic containing the residual echo and the noise, an echo frequency domain signal characteristic and a far-end reference audio frequency domain signal characteristic;
splicing the voice frequency domain signal characteristics containing the residual echo and the noise with the far-end reference audio domain signal characteristics to obtain a first splicing result, and splicing the voice frequency domain signal characteristics containing the residual echo and the noise with the echo frequency domain signal characteristics to obtain a second splicing result;
inputting the first splicing result and the second splicing result into a trained feature attention model in the trained cascade network to obtain a first attention weight corresponding to the far-end reference audio domain signal feature and a second attention weight corresponding to the echo frequency domain signal feature;
multiplying the far-end reference audio frequency domain signal characteristic by a first attention weight to obtain a first fused attention mechanism characteristic, and multiplying the echo frequency domain signal characteristic by a second attention weight to obtain a second fused attention mechanism characteristic;
splicing the first fusion attention mechanism characteristic, the second fusion attention mechanism characteristic and the voice frequency domain signal characteristic containing residual echo and noise to obtain a first fusion splicing result;
inputting the first fusion splicing result into a trained residual echo and noise elimination model in the trained cascade network to obtain a masking estimation value of a target voice frequency domain signal;
obtaining the target voice frequency domain signal according to the masking estimation value of the target voice frequency domain signal and the voice frequency domain signal containing the residual echo and the noise;
and carrying out inverse Fourier transform on the target voice frequency domain signal to obtain a target voice time domain signal.
An embodiment of the application provides a non-transitory computer readable storage medium, on which a computer program is stored, the computer program, when executed by a processor, implementing the steps of:
receiving a voice time domain signal containing echo and noise and a far-end reference sound time domain signal;
respectively performing framing, windowing and Fourier transformation on the voice time domain signal containing the echo and the noise and the far-end reference sound time domain signal to obtain a voice frequency domain signal containing the echo and the noise and a far-end reference sound frequency domain signal;
determining an echo frequency domain signal according to the voice frequency domain signal containing the echo and the noise and the far-end reference audio frequency domain signal;
determining a voice frequency domain signal containing residual echo and noise according to the voice frequency domain signal containing echo and noise and the echo frequency domain signal;
performing energy normalization processing on the amplitude spectrum of the voice frequency domain signal containing the residual echo and the noise, the amplitude spectrum of the echo frequency domain signal and the amplitude spectrum of the far-end reference audio frequency domain signal to obtain a voice frequency domain signal characteristic containing the residual echo and the noise, an echo frequency domain signal characteristic and a far-end reference audio frequency domain signal characteristic;
splicing the voice frequency domain signal characteristics containing the residual echo and the noise with the far-end reference audio domain signal characteristics to obtain a first splicing result, and splicing the voice frequency domain signal characteristics containing the residual echo and the noise with the echo frequency domain signal characteristics to obtain a second splicing result;
inputting the first splicing result and the second splicing result into a trained feature attention model in the trained cascade network to obtain a first attention weight corresponding to the far-end reference audio domain signal feature and a second attention weight corresponding to the echo frequency domain signal feature;
multiplying the far-end reference audio frequency domain signal characteristic by a first attention weight to obtain a first fused attention mechanism characteristic, and multiplying the echo frequency domain signal characteristic by a second attention weight to obtain a second fused attention mechanism characteristic;
splicing the first fusion attention mechanism characteristic, the second fusion attention mechanism characteristic and the voice frequency domain signal characteristic containing residual echo and noise to obtain a first fusion splicing result;
inputting the first fusion splicing result into a trained residual echo and noise elimination model in the trained cascade network to obtain a masking estimation value of a target voice frequency domain signal;
obtaining the target voice frequency domain signal according to the masking estimation value of the target voice frequency domain signal and the voice frequency domain signal containing the residual echo and the noise;
and carrying out inverse Fourier transform on the target voice frequency domain signal to obtain a target voice time domain signal.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
It should be noted that: the above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.
Claims (9)
1. A method for residual echo and noise cancellation, comprising:
receiving a voice time domain signal containing echo and noise and a far-end reference sound time domain signal;
respectively performing framing, windowing and Fourier transformation on the voice time domain signal containing the echo and the noise and the far-end reference sound time domain signal to obtain a voice frequency domain signal containing the echo and the noise and a far-end reference sound frequency domain signal;
determining an echo frequency domain signal according to the voice frequency domain signal containing the echo and the noise and the far-end reference audio frequency domain signal;
determining a voice frequency domain signal containing residual echo and noise according to the voice frequency domain signal containing echo and noise and the echo frequency domain signal;
performing energy normalization processing on the amplitude spectrum of the voice frequency domain signal containing the residual echo and the noise, the amplitude spectrum of the echo frequency domain signal and the amplitude spectrum of the far-end reference audio frequency domain signal to obtain a voice frequency domain signal characteristic containing the residual echo and the noise, an echo frequency domain signal characteristic and a far-end reference audio frequency domain signal characteristic;
splicing the voice frequency domain signal characteristics containing the residual echo and the noise with the far-end reference audio domain signal characteristics to obtain a first splicing result, and splicing the voice frequency domain signal characteristics containing the residual echo and the noise with the echo frequency domain signal characteristics to obtain a second splicing result;
inputting the first splicing result and the second splicing result into a trained feature attention model in the trained cascade network to obtain a first attention weight corresponding to the far-end reference audio domain signal feature and a second attention weight corresponding to the echo frequency domain signal feature;
multiplying the far-end reference audio frequency domain signal characteristic by a first attention weight to obtain a first fused attention mechanism characteristic, and multiplying the echo frequency domain signal characteristic by a second attention weight to obtain a second fused attention mechanism characteristic;
splicing the first fusion attention mechanism characteristic, the second fusion attention mechanism characteristic and the voice frequency domain signal characteristic containing residual echo and noise to obtain a first fusion splicing result;
inputting the first fusion splicing result into a trained residual echo and noise elimination model in the trained cascade network to obtain a masking estimation value of a target voice frequency domain signal;
obtaining the target voice frequency domain signal according to the masking estimation value of the target voice frequency domain signal and the voice frequency domain signal containing the residual echo and the noise;
and carrying out inverse Fourier transform on the target voice frequency domain signal to obtain a target voice time domain signal.
2. The method of claim 1, wherein the framing, windowing, and fourier transforming the echo and noise containing speech time domain signal and the far-end reference acoustic time domain signal, respectively, comprises:
respectively taking a preset number of sampling points as one frame signal for the voice time domain signal containing the echo and the noise and the far-end reference sound time domain signal; if a frame is shorter than the preset number, it is first zero-padded to the preset number;
windowing each frame signal; wherein, the windowing function adopts a Hamming window;
and performing Fourier transform on each windowed frame signal.
3. The method according to claim 1, wherein determining an echo frequency domain signal from the echo and noise containing speech frequency domain signal and the far-end reference audio frequency domain signal comprises:
inputting the voice frequency domain signal containing the echo and the noise and the far-end reference audio frequency domain signal into a Kalman filter to obtain a filter coefficient and the echo frequency domain signal;
the echo frequency domain signal is a result of multiplying the filter coefficient and the far-end reference sound frequency domain signal.
4. The method of claim 1, wherein determining the voice frequency domain signal containing the residual echo and the noise according to the voice frequency domain signal containing the echo and the noise and the echo frequency domain signal comprises:
and subtracting the echo frequency domain signal from the voice frequency domain signal containing the echo and the noise to obtain the voice frequency domain signal containing the residual echo and the noise.
5. The method according to claim 1, wherein the energy normalization processing is performed on the magnitude spectrum of the speech frequency domain signal containing the residual echo and the noise, the magnitude spectrum of the echo frequency domain signal, and the magnitude spectrum of the far-end reference audio domain signal to obtain the speech frequency domain signal characteristic containing the residual echo and the noise, the echo frequency domain signal characteristic, and the far-end reference audio domain signal characteristic, and comprises:
respectively determining a first function, a second function and a third function corresponding to the amplitude spectrum of the voice frequency domain signal containing the residual echo and the noise, the amplitude spectrum of the echo frequency domain signal and the amplitude spectrum of the far-end reference audio frequency domain signal;
determining the voice frequency domain signal characteristics containing the residual echo and the noise according to a first function corresponding to the amplitude spectrum of the voice frequency domain signal containing the residual echo and the noise, and the mean value and the variance of the voice frequency domain signal characteristics containing the residual echo and the noise;
determining the echo frequency domain signal characteristics according to a second function corresponding to the amplitude spectrum of the echo frequency domain signal, and the mean value and the variance of the echo frequency domain signal characteristics;
and determining the characteristics of the far-end reference audio frequency domain signal according to a third function corresponding to the amplitude spectrum of the far-end reference audio frequency domain signal and the mean value and the variance of the characteristics of the far-end reference audio frequency domain signal.
6. The method of claim 1, wherein the trained cascade network is trained by:
receiving a first voice time domain signal containing echo and noise, a first far-end reference sound time domain signal and a first target voice time domain signal;
performing framing, windowing and Fourier transformation on the first voice time domain signal containing the echo and the noise, the first far-end reference sound time domain signal and the first target voice time domain signal respectively to obtain a first voice frequency domain signal containing the echo and the noise, a first far-end reference audio frequency domain signal and a first target voice frequency domain signal;
determining a first echo frequency domain signal according to the first voice frequency domain signal containing echo and noise and the first far-end reference audio frequency domain signal;
determining a first voice frequency domain signal containing residual echo and noise according to the first voice frequency domain signal containing echo and noise and the first echo frequency domain signal;
performing energy normalization processing on the amplitude spectrum of the first voice frequency domain signal containing residual echo and noise, the amplitude spectrum of the first echo frequency domain signal and the amplitude spectrum of the first far-end reference audio frequency domain signal to obtain a first voice frequency domain signal characteristic containing residual echo and noise, a first echo frequency domain signal characteristic and a first far-end reference audio frequency domain signal characteristic;
splicing the first voice frequency domain signal characteristic containing residual echo and noise with the first far-end reference audio domain signal characteristic to obtain a first splicing characteristic, and splicing the first voice frequency domain signal characteristic containing residual echo and noise with the first echo frequency domain signal characteristic to obtain a second splicing characteristic;
inputting the first splicing feature and the second splicing feature into a feature attention model in a cascade network so as to jointly train the feature attention model and a residual echo and noise elimination model in the cascade network, and obtaining a first weight corresponding to the first far-end reference audio domain signal feature and a second weight corresponding to the first echo frequency domain signal feature;
multiplying the first far-end reference audio domain signal characteristic by a first weight to obtain a first fused characteristic, and multiplying the first echo frequency domain signal characteristic by a second weight to obtain a second fused characteristic;
splicing the first fusion characteristic, the second fusion characteristic and the first voice frequency domain signal characteristic containing residual echo and noise to obtain a first fusion splicing characteristic;
inputting the first fusion splicing characteristic into a residual echo and noise elimination model in the cascade network to obtain a masking estimation value of a second target voice frequency domain signal;
determining a second target voice frequency domain signal according to the masking estimation value of the second target voice frequency domain signal and the first voice frequency domain signal containing residual echo and noise;
determining a multi-domain loss function according to at least two loss functions; wherein the at least two loss functions comprise an energy-independent amplitude spectrum loss function and an objective speech quality assessment score loss function; the energy-independent amplitude spectrum loss function takes the amplitude spectrum of the first target voice frequency domain signal as a training target and is determined according to the second target voice frequency domain signal; and the objective speech quality assessment score loss function takes the improvement of speech audibility quality as a training target;
and iteratively reducing the multi-domain loss function by continuously updating model parameters to obtain the trained cascade network.
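The energy normalization and attention-weighted fusion recited in the steps above can be sketched roughly as follows. This is a minimal, hypothetical illustration, not the patented implementation: the single sigmoid gate in `attention_weights` merely stands in for the trained feature attention model, and all function names and shapes are assumptions.

```python
# Hypothetical sketch of the per-frame feature fusion in claim 6.
import numpy as np

def energy_normalize(mag):
    """Scale a magnitude spectrum to unit energy (the claim's energy normalization)."""
    energy = np.sqrt(np.sum(mag ** 2)) + 1e-8
    return mag / energy

def attention_weights(spliced, w, b):
    """Stand-in for the feature attention model: a single sigmoid gate
    mapping a spliced feature vector to a scalar weight in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-(spliced @ w + b)))

def fuse_features(mic_mag, echo_mag, ref_mag, w1, b1, w2, b2):
    x = energy_normalize(mic_mag)    # speech with residual echo and noise
    e = energy_normalize(echo_mag)   # estimated echo
    r = energy_normalize(ref_mag)    # far-end reference
    first_splice = np.concatenate([x, r])   # mic feature + reference feature
    second_splice = np.concatenate([x, e])  # mic feature + echo feature
    a1 = attention_weights(first_splice, w1, b1)   # weight for the reference feature
    a2 = attention_weights(second_splice, w2, b2)  # weight for the echo feature
    # Weighted features spliced with the mic feature form the fused input
    # to the residual echo and noise elimination model.
    return np.concatenate([a1 * r, a2 * e, x])
```

In practice the gate parameters would be learned jointly with the elimination model, as claim 6 describes; per-frequency (vector-valued) attention weights are an equally plausible reading of the claim.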
7. A residual echo and noise cancellation apparatus, comprising:
a receiving module, configured to receive a voice time domain signal containing echo and noise and a far-end reference sound time domain signal;
a processing module, configured to perform framing, windowing and Fourier transformation on the voice time domain signal containing echo and noise and the far-end reference sound time domain signal, respectively, to obtain a voice frequency domain signal containing echo and noise and a far-end reference audio frequency domain signal;
a determining module, configured to determine an echo frequency domain signal according to the voice frequency domain signal containing echo and noise and the far-end reference audio frequency domain signal;
the determining module is further configured to determine a voice frequency domain signal containing residual echo and noise according to the voice frequency domain signal containing echo and noise and the echo frequency domain signal;
an energy normalization module, configured to perform energy normalization processing on the amplitude spectrum of the voice frequency domain signal containing residual echo and noise, the amplitude spectrum of the echo frequency domain signal and the amplitude spectrum of the far-end reference audio frequency domain signal, to obtain a voice frequency domain signal feature containing residual echo and noise, an echo frequency domain signal feature and a far-end reference audio frequency domain signal feature;
a splicing module, configured to splice the voice frequency domain signal feature containing residual echo and noise with the far-end reference audio frequency domain signal feature to obtain a first splicing result, and to splice the voice frequency domain signal feature containing residual echo and noise with the echo frequency domain signal feature to obtain a second splicing result;
a weight obtaining module, configured to input the first splicing result and the second splicing result into a trained feature attention model in a trained cascade network, to obtain a first attention weight corresponding to the far-end reference audio frequency domain signal feature and a second attention weight corresponding to the echo frequency domain signal feature;
a fused attention mechanism feature obtaining module, configured to multiply the far-end reference audio frequency domain signal feature by the first attention weight to obtain a first fused attention mechanism feature, and to multiply the echo frequency domain signal feature by the second attention weight to obtain a second fused attention mechanism feature;
the splicing module is further configured to splice the first fused attention mechanism feature, the second fused attention mechanism feature and the voice frequency domain signal feature containing residual echo and noise to obtain a first fusion splicing result;
a masking estimation value obtaining module, configured to input the first fusion splicing result into a trained residual echo and noise cancellation model in the trained cascade network, to obtain a masking estimation value of a target voice frequency domain signal;
a target voice frequency domain signal obtaining module, configured to obtain the target voice frequency domain signal according to the masking estimation value of the target voice frequency domain signal and the voice frequency domain signal containing residual echo and noise;
and an inverse Fourier transform module, configured to perform inverse Fourier transformation on the target voice frequency domain signal to obtain a target voice time domain signal.
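The modules above describe a conventional analysis-modify-synthesis signal path. As a rough sketch (not the patented apparatus), the code below performs framing, Hann windowing and FFT, applies a masking estimate, and resynthesizes by inverse FFT with overlap-add; the constant-one mask used in testing stands in for the trained cancellation model's output, and all names and frame parameters are illustrative.

```python
# Illustrative analysis/masking/synthesis path for the claim-7 modules.
import numpy as np

def stft(x, frame_len=256, hop=128):
    """Framing, Hann windowing and Fourier transform (the processing module)."""
    win = np.hanning(frame_len)
    frames = [x[i:i + frame_len] * win
              for i in range(0, len(x) - frame_len + 1, hop)]
    return np.fft.rfft(np.array(frames), axis=1)

def istft(spec, frame_len=256, hop=128):
    """Inverse Fourier transform with windowed overlap-add
    (the inverse Fourier transform module)."""
    win = np.hanning(frame_len)
    frames = np.fft.irfft(spec, n=frame_len, axis=1) * win
    out = np.zeros(hop * (len(frames) - 1) + frame_len)
    norm = np.zeros_like(out)
    for i, f in enumerate(frames):
        out[i * hop:i * hop + frame_len] += f
        norm[i * hop:i * hop + frame_len] += win ** 2
    return out / np.maximum(norm, 1e-8)  # normalize accumulated window energy

def apply_mask(mic, mask_fn, frame_len=256, hop=128):
    """End-to-end path: STFT -> masking estimate -> masked spectrum -> iSTFT."""
    spec = stft(mic, frame_len, hop)
    mask = mask_fn(np.abs(spec))  # stand-in for the cancellation model's output
    return istft(spec * mask, frame_len, hop)
```

With an all-ones mask the interior samples reconstruct the input almost exactly, which is a convenient sanity check before plugging in a real masking model.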
8. A residual echo and noise cancellation apparatus, comprising at least one processor configured to execute a program stored in a memory, wherein the program, when executed, causes the apparatus to perform:
the method of any one of claims 1-6.
9. A non-transitory computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-6.
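The multi-domain loss recited in claim 6 might be composed along the following lines; this is a hedged sketch, not the patent's formulation. The energy-independent term compares unit-energy magnitude spectra, and because objective quality scores such as PESQ are not directly differentiable, the second term here is a simple log-spectral-distance placeholder for the quality assessment loss. The weights `alpha` and `beta` are invented for illustration.

```python
# Hypothetical composition of the claim-6 multi-domain loss.
import numpy as np

def energy_free_mag_loss(est_mag, tgt_mag):
    """MSE between unit-energy magnitude spectra, insensitive to overall level."""
    def unit(m):
        return m / (np.sqrt(np.sum(m ** 2)) + 1e-8)
    return float(np.mean((unit(est_mag) - unit(tgt_mag)) ** 2))

def quality_score_loss(est_mag, tgt_mag):
    """Placeholder for the objective speech quality term: log-spectral distance,
    a common differentiable surrogate for perceptual scores such as PESQ."""
    diff = np.log10(est_mag + 1e-8) - np.log10(tgt_mag + 1e-8)
    return float(np.sqrt(np.mean(diff ** 2)))

def multi_domain_loss(est_mag, tgt_mag, alpha=1.0, beta=0.1):
    """Weighted sum over the two domains; alpha and beta are illustrative weights."""
    return (alpha * energy_free_mag_loss(est_mag, tgt_mag)
            + beta * quality_score_loss(est_mag, tgt_mag))
```

Note the division of labor: scaling an estimate by a constant leaves the energy-independent term at zero, so the quality term is what penalizes level and spectral-shape distortions the first term cannot see.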
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110008502.9A CN112863535B (en) | 2021-01-05 | 2021-01-05 | Residual echo and noise elimination method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112863535A true CN112863535A (en) | 2021-05-28 |
CN112863535B CN112863535B (en) | 2022-04-26 |
Family
ID=76003795
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110008502.9A Active CN112863535B (en) | 2021-01-05 | 2021-01-05 | Residual echo and noise elimination method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112863535B (en) |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107636758A (en) * | 2015-05-15 | 2018-01-26 | 哈曼国际工业有限公司 | Acoustic echo eliminates system and method |
US20200105287A1 (en) * | 2017-04-14 | 2020-04-02 | Industry-University Cooperation Foundation Hanyang University | Deep neural network-based method and apparatus for combining noise and echo removal |
US20200312346A1 (en) * | 2019-03-28 | 2020-10-01 | Samsung Electronics Co., Ltd. | System and method for acoustic echo cancellation using deep multitask recurrent neural networks |
CN111161752A (en) * | 2019-12-31 | 2020-05-15 | 歌尔股份有限公司 | Echo cancellation method and device |
CN111341336A (en) * | 2020-03-16 | 2020-06-26 | 北京字节跳动网络技术有限公司 | Echo cancellation method, device, terminal equipment and medium |
CN111768795A (en) * | 2020-07-09 | 2020-10-13 | 腾讯科技(深圳)有限公司 | Noise suppression method, device, equipment and storage medium for voice signal |
CN111768796A (en) * | 2020-07-14 | 2020-10-13 | 中国科学院声学研究所 | Acoustic echo cancellation and dereverberation method and device |
Non-Patent Citations (1)
Title |
---|
WANG, DONGXIA ET AL.: "Echo and noise suppression algorithm based on BLSTM neural network", Journal of Signal Processing (《信号处理》) * |
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113436636A (en) * | 2021-06-11 | 2021-09-24 | 深圳波洛斯科技有限公司 | Acoustic echo cancellation method and system based on adaptive filter and neural network |
CN113489854A (en) * | 2021-06-30 | 2021-10-08 | 北京小米移动软件有限公司 | Sound processing method, sound processing device, electronic equipment and storage medium |
CN113489854B (en) * | 2021-06-30 | 2024-03-01 | 北京小米移动软件有限公司 | Sound processing method, device, electronic equipment and storage medium |
CN113539291A (en) * | 2021-07-09 | 2021-10-22 | 北京声智科技有限公司 | Method and device for reducing noise of audio signal, electronic equipment and storage medium |
CN113744762B (en) * | 2021-08-09 | 2023-10-27 | 杭州网易智企科技有限公司 | Signal-to-noise ratio determining method and device, electronic equipment and storage medium |
CN113744762A (en) * | 2021-08-09 | 2021-12-03 | 杭州网易智企科技有限公司 | Signal-to-noise ratio determining method and device, electronic equipment and storage medium |
CN114337908A (en) * | 2022-01-05 | 2022-04-12 | 中国科学院声学研究所 | Method and device for generating interference signal of target voice signal |
CN114337908B (en) * | 2022-01-05 | 2024-04-12 | 中国科学院声学研究所 | Method and device for generating interference signal of target voice signal |
CN114974281A (en) * | 2022-05-24 | 2022-08-30 | 云知声智能科技股份有限公司 | Training method and device of voice noise reduction model, storage medium and electronic device |
WO2023226592A1 (en) * | 2022-05-25 | 2023-11-30 | 青岛海尔科技有限公司 | Noise signal processing method and apparatus, and storage medium and electronic apparatus |
CN115294997A (en) * | 2022-06-30 | 2022-11-04 | 北京达佳互联信息技术有限公司 | Voice processing method and device, electronic equipment and storage medium |
CN114974286A (en) * | 2022-06-30 | 2022-08-30 | 北京达佳互联信息技术有限公司 | Signal enhancement method, model training method, device, equipment, sound box and medium |
CN115294997B (en) * | 2022-06-30 | 2024-10-29 | 北京达佳互联信息技术有限公司 | Voice processing method, device, electronic equipment and storage medium |
CN117437929A (en) * | 2023-12-21 | 2024-01-23 | 睿云联(厦门)网络通讯技术有限公司 | Real-time echo cancellation method based on neural network |
CN117437929B (en) * | 2023-12-21 | 2024-03-08 | 睿云联(厦门)网络通讯技术有限公司 | Real-time echo cancellation method based on neural network |
Also Published As
Publication number | Publication date |
---|---|
CN112863535B (en) | 2022-04-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112863535B (en) | Residual echo and noise elimination method and device | |
CN107452389B (en) | Universal single-track real-time noise reduction method | |
CN108172231B (en) | Dereverberation method and system based on Kalman filtering | |
KR101934636B1 (en) | Method and apparatus for integrating and removing acoustic echo and background noise based on deepening neural network | |
Zhao et al. | A two-stage algorithm for noisy and reverberant speech enhancement | |
CN112700786B (en) | Speech enhancement method, device, electronic equipment and storage medium | |
Lee et al. | DNN-based residual echo suppression. | |
Zhao et al. | Late reverberation suppression using recurrent neural networks with long short-term memory | |
CN112581973B (en) | Voice enhancement method and system | |
CN111048061B (en) | Method, device and equipment for obtaining step length of echo cancellation filter | |
CN112037809A (en) | Residual echo suppression method based on multi-feature flow structure deep neural network | |
CN112201273B (en) | Noise power spectral density calculation method, system, equipment and medium | |
CN111883154B (en) | Echo cancellation method and device, computer-readable storage medium, and electronic device | |
CN113838471A (en) | Noise reduction method and system based on neural network, electronic device and storage medium | |
CN113744748A (en) | Network model training method, echo cancellation method and device | |
Schwartz et al. | Nested generalized sidelobe canceller for joint dereverberation and noise reduction | |
CN112997249B (en) | Voice processing method, device, storage medium and electronic equipment | |
CN114302286A (en) | Method, device and equipment for reducing noise of call voice and storage medium | |
CN117219102A (en) | Low-complexity voice enhancement method based on auditory perception | |
CN115620737A (en) | Voice signal processing device, method, electronic equipment and sound amplification system | |
Kamarudin et al. | Acoustic echo cancellation using adaptive filtering algorithms for Quranic accents (Qiraat) identification | |
CN111883155B (en) | Echo cancellation method, device and storage medium | |
Braun et al. | Low complexity online convolutional beamforming | |
CN108074580B (en) | Noise elimination method and device | |
Yoshioka et al. | Speech dereverberation and denoising based on time varying speech model and autoregressive reverberation model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||