WO2023092955A1

WO2023092955A1 - Audio signal processing method and apparatus

Info

Publication number: WO2023092955A1
Application number: PCT/CN2022/091811
Authority: WO
Inventors: 张旭; 郑羲光; 李楠; 韩润强; 张晨
Original assignee: 北京达佳互联信息技术有限公司
Priority date: 2021-11-29
Filing date: 2022-05-09
Publication date: 2023-06-01
Also published as: CN114038476A; CN114038476B

Abstract

The present disclosure relates to an audio signal processing method and apparatus, and relates to the technical field of signal processing. The audio signal processing method comprises: acquiring a proximal collected audio signal, a distal reference audio signal, and a first proximal audio signal, which is obtained after linear acoustic echo cancellation is performed on the proximal collected audio signal; respectively performing time-frequency transformation on the proximal collected audio signal, the distal reference audio signal and the first proximal audio signal, so as to obtain a frequency domain proximal collected audio signal, a frequency domain distal reference audio signal and a first frequency domain proximal audio signal; performing deep learning noise reduction on the amplitude of the first frequency domain proximal audio signal, so as to obtain a second frequency domain proximal audio signal; performing nonlinear acoustic echo cancellation on the second frequency domain proximal audio signal on the basis of the frequency domain distal reference audio signal, the frequency domain proximal collected audio signal, the first frequency domain proximal audio signal and the second frequency domain proximal audio signal, so as to obtain a frequency domain proximal audio enhanced signal; and performing time-frequency inverse transformation on the frequency domain proximal audio enhanced signal, so as to obtain a proximal audio enhanced signal.

Description

Audio signal processing method and device

相关申请的交叉引用Cross References to Related Applications

本申请基于申请号为202111433241.1、申请日为2021年11月29日的中国专利申请提出，并要求该中国专利申请的优先权，该中国专利申请的全部内容在此引入本申请作为参考。This application is based on a Chinese patent application with application number 202111433241.1 and a filing date of November 29, 2021, and claims the priority of this Chinese patent application. The entire content of this Chinese patent application is hereby incorporated by reference into this application.

technical field

本公开涉及信号处理技术领域，尤其涉及一种音频信号处理方法及装置。The present disclosure relates to the technical field of signal processing, and in particular to an audio signal processing method and device.

Background technique

回声消除(Acoustic Echo Cancellation，AEC)是实时通讯中的重要技术之一，是保证音视频体验的关键点。回声消除技术是指将近端麦克风采集到的音频信号中的远端信号消除掉，只保留下近端信号，其中，近端麦克风采集到的音频信号包括近端信号及远端信号通过近端扬声器播放出来的信号。回声消除技术一般包括线性回声消除和非线性回声消除。Acoustic Echo Cancellation (AEC) is one of the important technologies in real-time communication, and it is the key point to ensure audio and video experience. Echo cancellation technology refers to the elimination of the far-end signal in the audio signal collected by the near-end microphone, and only the near-end signal is retained. The audio signal collected by the near-end microphone includes the near-end signal and the far-end signal through the near-end The signal played by the speaker. Echo cancellation techniques generally include linear echo cancellation and nonlinear echo cancellation.

发明内容Contents of the invention

本公开提供一种音频信号处理方法及装置。The disclosure provides an audio signal processing method and device.

根据本公开实施例的第一方面，提供一种音频信号处理方法，包括：获取近端采集音频信号、远端参考音频信号，以及对所述近端采集音频信号进行线性回声消除后得到的第一近端音频信号；对所述近端采集音频信号、所述远端参考音频信号和所述第一近端音频信号分别进行时频变换，得到频域近端采集音频信号、频域远端参考音频信号和第一频域近端音频信号；对所述第一频域近端音频信号的幅度进行深度学习降噪，得到第二频域近端音频信号；基于所述频域远端参考音频信号、所述频域近端采集音频信号、所述第一频域近端音频信号和所述第二频域近端音频信号，对所述第二频域近端音频信号进行非线性回声消除，得到频域近端音频增强信号；对所述频域近端音频增强信号进行时频逆变换，得到近端音频增强信号。According to the first aspect of an embodiment of the present disclosure, an audio signal processing method is provided, including: acquiring a near-end collected audio signal, a far-end reference audio signal, and performing linear echo cancellation on the near-end collected audio signal. A near-end audio signal; time-frequency conversion is performed on the near-end collected audio signal, the far-end reference audio signal and the first near-end audio signal respectively, to obtain the frequency-domain near-end collected audio signal, the frequency-domain far-end A reference audio signal and a first frequency domain near-end audio signal; performing deep learning noise reduction on the magnitude of the first frequency domain near-end audio signal to obtain a second frequency domain near-end audio signal; based on the frequency domain far-end reference audio signal, the near-end audio signal in the frequency domain, the near-end audio signal in the first frequency domain, and the near-end audio signal in the second frequency domain, performing nonlinear echo on the near-end audio signal in the second frequency domain elimination to obtain a frequency-domain near-end audio enhancement signal; performing time-frequency inverse transform on the frequency-domain near-end audio enhancement signal to obtain a near-end audio enhancement signal.

在一些实施例中，所述对所述第一频域近端音频信号的幅度进行深度学习降噪，得到第二频域近端音频信号，包括：通过训练好的降噪神经网络模型，对所述第一频域近端音频信号的幅度进行深度学习降噪，得到所述第二频域近端音频信号的幅度；根据所述第二频域近端音频信号的幅度和所述第一频域近端音频信号的相位，得到所述第二频域近端音频信号。In some embodiments, performing deep learning noise reduction on the amplitude of the first frequency-domain near-end audio signal to obtain the second frequency-domain near-end audio signal includes: using a trained noise reduction neural network model to The magnitude of the near-end audio signal in the first frequency domain is subjected to deep learning noise reduction to obtain the magnitude of the near-end audio signal in the second frequency domain; according to the magnitude of the near-end audio signal in the second frequency domain and the first Phase of the near-end audio signal in the frequency domain to obtain the second near-end audio signal in the frequency domain.

在一些实施例中，所述通过训练好的降噪神经网络模型，对所述第一频域近端音频信号的幅度进行深度学习降噪，得到所述第二频域近端音频信号的幅度，包括：将所述第一频域近端音频信号的幅度输入所述训练好的降噪神经网络模型中，得到第一信号幅度比，其中，所述第一信号幅度比为所述第二频域近端音频信号的幅度和所述第一频域近端音频信号的幅度的比值的预测值；根据所述第一信号幅度比和所述第一频域近端音频信号的幅度，得到所述第二频域近端音频信号的幅度，其中，所述第二频域近端音频信号的幅度是所述第一信号幅度比和所述第一频域近端音频信号的幅度的乘积。In some embodiments, the trained noise reduction neural network model performs deep learning and noise reduction on the amplitude of the first frequency-domain near-end audio signal to obtain the amplitude of the second frequency-domain near-end audio signal , comprising: inputting the magnitude of the first frequency-domain near-end audio signal into the trained noise reduction neural network model to obtain a first signal magnitude ratio, wherein the first signal magnitude ratio is the second The predicted value of the ratio of the amplitude of the near-end audio signal in the frequency domain to the magnitude of the first frequency-domain near-end audio signal; according to the first signal amplitude ratio and the magnitude of the first frequency-domain near-end audio signal, obtain The magnitude of the second frequency-domain near-end audio signal, wherein the magnitude of the second frequency-domain near-end audio signal is the product of the first signal magnitude ratio and the magnitude of the first frequency-domain near-end audio signal .

在一些实施例中，所述基于所述频域远端参考音频信号、所述频域近端采集音频信号、所述第一频域近端音频信号和所述第二频域近端音频信号，对所述第二频域近端音频信号进行非线性回声消除，得到频域近端音频增强信号，包括：将所述频域远端参考音频信号分别与所述频域近端采集音频信号和第二频率近端音频信号在各个频带上进行求相关，得到各个频带的第二信号幅度比；根据所述第二信号幅度比、所述第一频域近端音频信号和所述第二频域近端音频信号，对所述第二频域近端音频信号进行非线性回声消除，得到频域近端音频增强信号。In some embodiments, the frequency-domain far-end reference audio signal, the frequency-domain near-end audio signal, the first frequency-domain near-end audio signal and the second frequency-domain near-end audio signal , performing nonlinear echo cancellation on the second frequency-domain near-end audio signal to obtain a frequency-domain near-end audio enhancement signal, comprising: respectively combining the frequency-domain far-end reference audio signal with the frequency-domain near-end acquisition audio signal Perform correlation with the second frequency near-end audio signal on each frequency band to obtain a second signal amplitude ratio of each frequency band; according to the second signal amplitude ratio, the first frequency domain near-end audio signal and the second For the near-end audio signal in the frequency domain, nonlinear echo cancellation is performed on the second near-end audio signal in the frequency domain to obtain an enhanced near-end audio signal in the frequency domain.

在一些实施例中，所述根据所述第二信号幅度比、所述第一频域近端音频信号和所述第二频域近端音频信号，对所述第二频域近端音频信号进行非线性回声消除，得到频域近端音频增强信号，包括：获取所述第二信号幅度比和所述第一频域近端音频信号的幅度的乘积作为参考幅度；获取所述参考幅度和所述第二频域近端音频信号的幅度中的最小值，作为所述频域近端音频增强信号的幅度；根据所述频域近端音频增强信号的幅度和所述第一频域近端音频信号的相位，得到频域近端音频增强信号。In some embodiments, according to the second signal amplitude ratio, the first frequency-domain near-end audio signal and the second frequency-domain near-end audio signal, the second frequency-domain near-end audio signal Perform nonlinear echo cancellation to obtain a frequency-domain near-end audio enhancement signal, including: obtaining the product of the second signal amplitude ratio and the amplitude of the first frequency-domain near-end audio signal as a reference amplitude; obtaining the reference amplitude and The minimum value of the amplitude of the second frequency-domain near-end audio signal is used as the amplitude of the frequency-domain near-end audio enhancement signal; according to the amplitude of the frequency-domain near-end audio enhancement signal and the first frequency-domain near-end The phase of the end audio signal is obtained to obtain the near-end audio enhancement signal in the frequency domain.

在一些实施例中，所述第二信号幅度比通过下式获取：In some embodiments, the second signal amplitude ratio is obtained by the following formula:

Mask(n,k)＝min{1-RCr(n,k),1-RY _pr(n,k)}； Mask(n,k)=min{1-RCr(n,k),1-RY _p r(n,k)};

其中，Mask(n,k)为所述第二信号幅度比，RCr(n,k)为所述频域远端参考音频信号和所述频域近端采集音频信号的互相关系数，RY _pr(n,k)为所述频域远端参考音频信号和所述第二频域近端音频信号的互相关系数，n为帧序列数，k为中心频率序列数，0<n≤N，0<k≤K，N为总帧数，K为总频带数。 Wherein, Mask(n,k) is the second signal amplitude ratio, RCr(n,k) is the cross-correlation coefficient between the frequency-domain far-end reference audio signal and the frequency-domain near-end acquisition audio signal, _RYp r(n,k) is the cross-correlation coefficient between the frequency domain far-end reference audio signal and the second frequency domain near-end audio signal, n is the number of frame sequences, k is the number of center frequency sequences, 0<n≤N , 0<k≤K, N is the total number of frames, K is the total number of frequency bands.

根据本公开实施例的第二方面，提供一种音频信号处理装置，包括：信号获取单元，被配置为：获取近端采集音频信号、远端参考音频信号，以及对所述近端采集音频信号进行线性回声消除后得到的第一近端音频信号；频域变换单元，被配置为：对所述近端采集音频信号、所述远端参考音频信号和所述第一近端音频信号分别进行时频变换，得到频域近端采集音频信号、频域远端参考音频信号和第一频域近端音频信号；深度降噪单元，被配置为：对所述第一频域近端音频信号的幅度进行深度学习降噪，得到第二频域近端音频信号；非线性消除单元，被配置为：基于所述频域远端参考音频信号、所述频域近端采集音频信号、所述第一频域近端音频信号和所述第二频域近端音频信号，对所述第二频域近端音频信号进行非线性回声消除，得到频域近端音频增强信号；时域变换单元，被配置为：对所述频域近端音频增强信号进行时频逆变换，得到近端音频增强信号。According to a second aspect of an embodiment of the present disclosure, there is provided an audio signal processing device, including: a signal acquisition unit configured to: acquire a near-end collected audio signal, a far-end reference audio signal, and perform an operation on the near-end collected audio signal The first near-end audio signal obtained after performing linear echo cancellation; the frequency domain transformation unit is configured to: respectively perform Time-frequency transformation to obtain the frequency-domain near-end audio signal, the frequency-domain far-end reference audio signal and the first frequency-domain near-end audio signal; the deep noise reduction unit is configured to: the first frequency-domain near-end audio signal The magnitude of the deep learning noise reduction is carried out to obtain the second frequency domain near-end audio signal; the non-linear elimination unit is configured to: based on the frequency domain far-end reference audio signal, the frequency domain near-end acquisition audio signal, the The first frequency-domain near-end audio signal and the second frequency-domain near-end audio signal, performing nonlinear echo cancellation on the second frequency-domain near-end audio signal to obtain a frequency-domain near-end audio enhancement signal; a time-domain transformation unit , is configured to: perform a time-frequency inverse transform on the frequency-domain near-end audio enhancement signal to obtain a near-end audio enhancement signal.

在一些实施例中，深度降噪单元被配置为：通过训练好的降噪神经网络模型，对所述第一频域近端音频信号的幅度进行深度学习降噪，得到所述第二频域近端音频信号的幅度；根据所述第二频域近端音频信号的幅度和所述第一频域近端音频信号的相位，得到所述第二频域近端音频信号。In some embodiments, the deep noise reduction unit is configured to: use a trained noise reduction neural network model to perform deep learning and noise reduction on the amplitude of the near-end audio signal in the first frequency domain to obtain the second frequency domain Amplitude of the near-end audio signal; the second frequency-domain near-end audio signal is obtained according to the magnitude of the second frequency-domain near-end audio signal and the phase of the first frequency-domain near-end audio signal.

在一些实施例中，深度降噪单元被配置为：将所述第一频域近端音频信号的幅度输入所述训练好的降噪神经网络模型中，得到第一信号幅度比，其中，所述第一信号幅度比为所述第二频域近端音频信号的幅度和所述第一频域近端音频信号的幅度的比值的预测值；根据所述第一信号幅度比和所述第一频域近端音频信号的幅度，得到所述第二频域近端音频信号的幅度，其中，所述第二频域近端音频信号的幅度是所述第一信号幅度比和所述第一频域近端音频信号的幅度的乘积。In some embodiments, the deep noise reduction unit is configured to: input the magnitude of the first frequency-domain near-end audio signal into the trained noise reduction neural network model to obtain a first signal magnitude ratio, wherein the The first signal amplitude ratio is a predicted value of the ratio of the amplitude of the second frequency-domain near-end audio signal to the amplitude of the first frequency-domain near-end audio signal; according to the first signal amplitude ratio and the first The magnitude of a near-end audio signal in the frequency domain is obtained by obtaining the magnitude of the second frequency-domain near-end audio signal, wherein the magnitude of the second frequency-domain near-end audio signal is the first signal magnitude ratio and the first A product of the amplitudes of the near-end audio signal in the frequency domain.

在一些实施例中，非线性消除单元被配置为：将所述频域远端参考音频信号分别与所述频域近端采集音频信号和第二频率近端音频信号在各个频带上进行求相关，得到各个频带的第二信号幅度比；根据所述第二信号幅度比、所述第一频域近端音频信号和所述第二频域近端音频信号，对所述第二频域近端音频信号进行非线性回声消除，得到频域近端音频增强信号。In some embodiments, the non-linear elimination unit is configured to: correlate the frequency-domain far-end reference audio signal with the frequency-domain near-end acquisition audio signal and the second-frequency near-end audio signal in respective frequency bands , to obtain the second signal amplitude ratio of each frequency band; according to the second signal amplitude ratio, the first frequency domain near-end audio signal and the second frequency domain near-end audio signal, the second frequency domain near-end audio signal The non-linear echo cancellation is performed on the end audio signal to obtain the near-end audio enhancement signal in the frequency domain.

在一些实施例中，非线性消除单元被配置为：获取所述第二信号幅度比和所述第一频域近端音频信号的幅度的乘积作为参考幅度；获取所述参考幅度和所述第二频域近端音频信号的幅度中的最小值，作为所述频域近端音频增强信号的幅度；根据所述频域近端音频增强信号的幅度和所述第一频域近端音频信号的相位，得到频域近端音频增强信号。In some embodiments, the non-linear elimination unit is configured to: acquire the product of the second signal amplitude ratio and the amplitude of the first frequency-domain near-end audio signal as a reference amplitude; acquire the reference amplitude and the first The minimum value of the amplitudes of the two frequency-domain near-end audio signals is used as the amplitude of the frequency-domain near-end audio enhancement signal; according to the amplitude of the frequency-domain near-end audio enhancement signal and the first frequency-domain near-end audio signal The phase of the near-end audio enhancement signal in the frequency domain is obtained.

根据本公开实施例的第三方面，提供一种电子设备，包括：至少一个处理器；至少一个存储计算机可执行指令的存储器，其中，所述计算机可执行指令在被所述至少一个处理器运行时，促使所述至少一个处理器执行根据本公开的音频信号处理方法。According to a third aspect of the embodiments of the present disclosure, there is provided an electronic device, including: at least one processor; at least one memory storing computer-executable instructions, wherein the computer-executable instructions are executed by the at least one processor , prompting the at least one processor to execute the audio signal processing method according to the present disclosure.

根据本公开实施例的第四方面，提供一种非易失性计算机可读存储介质，当所述计算机可读存储介质中的指令被至少一个处理器运行时，促使所述至少一个处理器执行根据本公开的音频信号处理方法。According to a fourth aspect of the embodiments of the present disclosure, there is provided a non-volatile computer-readable storage medium, when instructions in the computer-readable storage medium are executed by at least one processor, the at least one processor is prompted to execute An audio signal processing method according to the present disclosure.

根据本公开实施例的第五方面，提供一种计算机程序产品，包括计算机指令，所述计算机指令被至少一个处理器执行时实现根据本公开的音频信号处理方法。According to a fifth aspect of the embodiments of the present disclosure, there is provided a computer program product including computer instructions, and when the computer instructions are executed by at least one processor, the audio signal processing method according to the present disclosure is implemented.

根据本公开的音频信号处理方法及装置，先对近端采集音频信号进行线性回声消除，接着对其进行深度学习降噪，然后进行非线性回声消除，通过线性回声消除和非线性回声消除对远端参考音频信号中的回声信号进行消除，获得最终的近端音频增强信号，将深度学习降噪和回声消除结合起来，充分利用了深度学习降噪的良好性能，相较于相关技术中所采用的传统降噪技术和回声消除的结合，能够得到更好的回声消除效果和降噪效果，带来音质的提升。According to the audio signal processing method and device of the present disclosure, linear echo cancellation is first performed on the near-end collected audio signal, and then deep learning noise reduction is performed on it, and then nonlinear echo cancellation is performed. The echo signal in the end reference audio signal is eliminated to obtain the final near-end audio enhancement signal, and the combination of deep learning noise reduction and echo cancellation makes full use of the good performance of deep learning noise reduction. The combination of traditional noise reduction technology and echo cancellation can get better echo cancellation effect and noise reduction effect, and bring about the improvement of sound quality.

此外，根据本公开的音频信号处理方法及装置，可利用深度学习降噪处理传统降噪很难解决的情况，如类稳态或非稳态噪声的消除以及传统降噪消除噪声效果不好的情况。In addition, according to the audio signal processing method and device of the present disclosure, deep learning can be used for noise reduction to deal with situations that are difficult to solve in traditional noise reduction, such as the elimination of quasi-stationary or non-stationary noise and the traditional noise reduction that is not effective in eliminating noise. Condition.

此外，根据本公开的音频信号处理方法及装置，在非线性回声消除的过程中，通过基于互相关系数获取信号幅度比，最终获得频域近端音频增强信号，可同时消除回声和噪声。In addition, according to the audio signal processing method and device of the present disclosure, in the process of nonlinear echo cancellation, by obtaining the signal amplitude ratio based on the cross-correlation coefficient, and finally obtaining the near-end audio enhancement signal in the frequency domain, the echo and noise can be eliminated at the same time.

应当理解的是，以上的一般描述和后文的细节描述仅是示例性和解释性的，并不能限制本公开。It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the present disclosure.

Description of drawings

此处的附图被并入说明书中并构成本说明书的一部分，示出了符合本公开的实施例，并与说明书一起用于解释本公开的原理，并不构成对本公开的不当限定。The accompanying drawings here are incorporated into the specification and constitute a part of the specification, show embodiments consistent with the disclosure, and are used together with the description to explain the principle of the disclosure, and do not constitute an improper limitation of the disclosure.

图1是示出根据本公开的示例性实施例的音频信号处理方法的整体框架图。FIG. 1 is an overall block diagram illustrating an audio signal processing method according to an exemplary embodiment of the present disclosure.

图2是示出根据本公开的示例性实施例的音频信号处理方法的流程图。FIG. 2 is a flowchart illustrating an audio signal processing method according to an exemplary embodiment of the present disclosure.

图3是示出根据本公开的示例性实施例的音频信号处理装置的框图。FIG. 3 is a block diagram illustrating an audio signal processing device according to an exemplary embodiment of the present disclosure.

图4是示出根据本公开的示例性实施例的电子设备400的框图。FIG. 4 is a block diagram illustrating an electronic device 400 according to an exemplary embodiment of the present disclosure.

Detailed ways

为了使本领域普通人员更好地理解本公开的技术方案，下面将结合附图，对本公开实施例中的技术方案进行清楚、完整地描述。In order to enable ordinary persons in the art to better understand the technical solutions of the present disclosure, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below in conjunction with the accompanying drawings.

需要说明的是，本公开的说明书和权利要求书及上述附图中的术语“第一”、“第二”等是用于区别类似的对象，而不必用于描述特定的顺序或先后次序。应该理解这样使用的数据在适当情况下可以互换，以便这里描述的本公开的实施例能够以除了在这里图示或描述的那些以外的顺序实施。以下实施例中所描述的实施方式并不代表与本公开相一致的所有实施方式。相反，它们仅是与如所附权利要求书中所详述的、本公开的一些方面相一致的装置和方法的例子。It should be noted that the terms "first" and "second" in the specification and claims of the present disclosure and the above drawings are used to distinguish similar objects, but not necessarily used to describe a specific sequence or sequence. It is to be understood that the data so used are interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein can be practiced in sequences other than those illustrated or described herein. The implementations described in the following examples do not represent all implementations consistent with this disclosure. Rather, they are merely examples of apparatuses and methods consistent with aspects of the present disclosure as recited in the appended claims.

在此需要说明的是，在本公开中出现的“若干项之中的至少一项”均表示包含“该若干项中的任意一项”、“该若干项中的任意多项的组合”、“该若干项的全体”这三类并列的情况。例如“包括A和B之中的至少一个”即包括如下三种并列的情况：(1)包括A；(2)包括B；(3)包括A和B。又例如“执行步骤一和步骤二之中的至少一个”，即表示如下三种并列的情况：(1)执行步骤一；(2)执行步骤二；(3)执行步骤一和步骤二。What needs to be explained here is that "at least one of several items" appearing in this disclosure all means to include "any one of the several items", "a combination of any of the several items", The three categories of "the whole of the several items" are juxtaposed. For example, "including at least one of A and B" includes the following three parallel situations: (1) including A; (2) including B; (3) including A and B. Another example is "execute at least one of step 1 and step 2", which means the following three parallel situations: (1) execute step 1; (2) execute step 2; (3) execute step 1 and step 2.

在实时通讯中，音频信号包含噪声，也会包含回声，基于此，回声消除(Acoustic Echo Cancellation，AEC)成为了实时通讯中的重要技术之一，是保证音视频体验的关键点。回声消除技术是指将近端麦克风采集到的音频信号中的远端信号消除掉，只保留下近端信号，其中，近端麦克风采集到的音频信号包括近端信号及远端信号通过近端扬声器播放出来的音频信号。需要说明的是，近端可称为本端，远端可称为其他端。回声消除技术一般包括线性回声消除和非线性回声消除。In real-time communication, audio signals contain noise as well as echo. Based on this, Acoustic Echo Cancellation (AEC) has become one of the important technologies in real-time communication, and it is the key point to ensure audio and video experience. Echo cancellation technology refers to the elimination of the far-end signal in the audio signal collected by the near-end microphone, and only the near-end signal is retained. The audio signal collected by the near-end microphone includes the near-end signal and the far-end signal through the near-end The audio signal played by the speaker. It should be noted that the near end may be referred to as the local end, and the far end may be referred to as the other end. Echo cancellation techniques generally include linear echo cancellation and nonlinear echo cancellation.

线性回声消除可以通过自适应滤波的方法消除远端信号，但是往往会有远端信号残留。线性回声消除与非线性回声消除相结合会提升远端信号的消除效果。但是通常情况下，近端采集的信号的信噪比会影响非线性回声消除的效果。相关技术中，将经过线性回声消除的音频信号利用传统方式如维纳滤波进行降噪，得到相对干净的音频信号，再进行非线性回声消除进行相关性处理。但是这种方法处理得到的音频信号受到传统降噪技术效果的限制，降噪效果不好，尤其在噪声为非稳态的情况下，从而影响回声消除效果，造成音质不佳。Linear echo cancellation can eliminate far-end signals through adaptive filtering, but often there will be residual far-end signals. The combination of linear echo cancellation and nonlinear echo cancellation will improve the cancellation effect of the far-end signal. But usually, the signal-to-noise ratio of the signal collected at the near end will affect the effect of nonlinear echo cancellation. In related technologies, the audio signal that has undergone linear echo cancellation is noise-reduced using a traditional method such as Wiener filtering to obtain a relatively clean audio signal, and then non-linear echo cancellation is performed for correlation processing. However, the audio signal processed by this method is limited by the effect of traditional noise reduction technology, and the noise reduction effect is not good, especially in the case of non-stationary noise, which affects the echo cancellation effect and causes poor sound quality.

将回声消除和传统降噪如滤波进行结合，从而达到降噪和回声消除的效果，但是这种方法处理得到的音频信号存在回声消除和降噪效果较差，且音质不佳。Combine echo cancellation with traditional noise reduction such as filtering to achieve the effects of noise reduction and echo cancellation, but the audio signal processed by this method has poor echo cancellation and noise reduction effects and poor sound quality.

本公开提出一种音频信号处理方法及装置，先对近端采集音频信号进行线性回声消除，接着对其进行深度学习降噪，然后进行非线性回声消除，获得最终的近端音频增强信号，将深度学习降噪和回声消除结合起来，相较于相关技术中所采用的传统降噪技术和回声消除的结合，能够得到更好的回声消除效果和降噪效果，带来音质的提升。The present disclosure proposes an audio signal processing method and device, which first performs linear echo cancellation on the near-end audio signal, then performs deep learning noise reduction on it, and then performs nonlinear echo cancellation to obtain the final near-end audio enhancement signal. The combination of deep learning noise reduction and echo cancellation, compared with the combination of traditional noise reduction technology and echo cancellation used in related technologies, can obtain better echo cancellation and noise reduction effects, resulting in improved sound quality.

下面，将参照图1至图4来详细描述根据本公开的音频信号处理方法及装置。Hereinafter, the audio signal processing method and device according to the present disclosure will be described in detail with reference to FIGS. 1 to 4 .

图1是示出根据本公开的示例性实施例的音频信号处理方法的整体框架图。参照图1，可将近端采集音频信号进行线性回声消除，得到第一近端音频信号，可将第一频域近端音频信号进行深度学习降噪，得到第二频域近端音频信号，可对第二频域近端音频信号进行非线性回声消除处理，得到频域近端音频增强信号，需要说明的是，在这个过程中，可以通过线性回声消除和非线性回声消除对远端参考音频信号中的回声信号进行消除。FIG. 1 is an overall block diagram illustrating an audio signal processing method according to an exemplary embodiment of the present disclosure. Referring to Figure 1, the near-end collected audio signal can be subjected to linear echo cancellation to obtain the first near-end audio signal, and the first frequency-domain near-end audio signal can be subjected to deep learning noise reduction to obtain the second frequency-domain near-end audio signal, Non-linear echo cancellation processing can be performed on the second frequency-domain near-end audio signal to obtain a frequency-domain near-end audio enhancement signal. It should be noted that, in this process, the far-end reference The echo signal in the audio signal is eliminated.

在得知本公开的示例性实施例的整体框架后，下面通过具体的方法步骤对于本公开的示例性实施例的音频信号处理方法进行说明。After knowing the overall framework of the exemplary embodiment of the present disclosure, the audio signal processing method of the exemplary embodiment of the present disclosure will be described below through specific method steps.

图2是示出根据本公开的示例性实施例的音频信号处理方法的流程图。本公开实施例中提供的音频信号处理方法，包括：步骤201，获取近端采集音频信号、远端参考音频信号，以及对近端采集音频信号进行线性回声消除后得到的第一近端音频信号；步骤202，对近端采集音频信号、远端参考音频信号和第一近端音频信号分别进行时频变换，得到频域近端采集音频信号、频域远端参考音频信号和第一频域近端音频信号；步骤203，对第一频域近端音频信号的幅度进行深度学习降噪，得到第二频域近端音频信号；步骤204，基于频域远端参考音频信号、频域近端采集音频信号、第一频域近端音频信号和第二频域近端音频信号，对第二频域近端音频信号进行非线性回声消除，得到频域近端音频增强信号；步骤205，对频域近端音频增强信号进行时频逆变换，得到近端音频增强信号。参照图2，在步骤201，可获取近端采集音频信号、远端参考音频信号，以及对近端采集音频信号进行线性回声消除后得到的第一近端音频信号。FIG. 2 is a flowchart illustrating an audio signal processing method according to an exemplary embodiment of the present disclosure. The audio signal processing method provided in the embodiment of the present disclosure includes: Step 201, acquiring a near-end audio signal, a far-end reference audio signal, and a first near-end audio signal obtained by performing linear echo cancellation on the near-end audio signal ; Step 202, time-frequency conversion is performed on the near-end audio signal, the far-end reference audio signal and the first near-end audio signal respectively, to obtain the near-end audio signal in the frequency domain, the far-end reference audio signal in the frequency domain and the first frequency domain Near-end audio signal; Step 203, carry out deep learning noise reduction to the amplitude of the first frequency domain near-end audio signal, obtain the second frequency domain near-end audio signal; Step 204, based on frequency domain far-end reference audio signal, frequency domain near-end The terminal collects the audio signal, the near-end audio signal in the first frequency domain and the near-end audio signal in the second frequency domain, performs nonlinear echo cancellation on the near-end audio signal in the second frequency domain, and obtains the near-end audio enhancement signal in the frequency domain; step 205, Time-frequency inverse transform is performed on the near-end audio enhancement signal in the frequency domain to obtain the near-end audio enhancement signal. Referring to FIG. 2 , in step 201 , a near-end collected audio signal, a far-end reference audio signal, and a first near-end audio signal obtained after performing linear echo cancellation on the near-end collected audio signal can be obtained.

根据本公开的示例性实施例，近端采集音频信号为近端麦克风采集到的音频信号，远端参考音频信号为未经近端扬声器播放的远端传送的音频信号。According to an exemplary embodiment of the present disclosure, the near-end collected audio signal is an audio signal collected by a near-end microphone, and the far-end reference audio signal is an audio signal transmitted from a far end that is not played by a near-end speaker.

近端麦克风采集到的音频信号可包括，但不限于，近端麦克风采集到的近端的用户声音信号和近端麦克风采集到的远端通过网络链路传输过来并经过近端扬声器播放的音频信号。The audio signal collected by the near-end microphone may include, but not limited to, the near-end user voice signal collected by the near-end microphone and the audio collected by the near-end microphone and transmitted from the far end through the network link and played through the near-end speaker Signal.

在步骤202，可对近端采集音频信号、远端参考音频信号和第一近端音频信号分别进行时频变换，得到频域近端采集音频信号、频域远端参考音频信号和第一频域近端音频信号。In step 202, time-frequency transformation may be performed on the near-end collected audio signal, the far-end reference audio signal and the first near-end audio signal respectively, to obtain the frequency-domain near-end collected audio signal, the frequency-domain far-end reference audio signal and the first frequency domain domain near-end audio signal.

根据本公开的示例性实施例，时频变换可为，但不限于，短时傅里叶变换(Short-Time Fourier Transform，STFT)。下面以STFT为例进行描述。According to an exemplary embodiment of the present disclosure, the time-frequency transform may be, but not limited to, Short-Time Fourier Transform (Short-Time Fourier Transform, STFT). The STFT is taken as an example for description below.

根据本公开的示例性实施例，可将时间长度为T的近端采集音频信号、远端参考音频信号和第一近端音频信号在时域上分别表示为c(t)、r(t)和y(t)，其中，t为时间，0<t≤T。经过短时傅里叶变换之后，c(t)、r(t)和y(t)在频域可通过下式(1)-(3)表示：According to an exemplary embodiment of the present disclosure, the near-end acquisition audio signal, the far-end reference audio signal and the first near-end audio signal with a time length of T may be expressed as c(t) and r(t) respectively in the time domain and y(t), where t is time, 0<t≤T. After short-time Fourier transform, c(t), r(t) and y(t) can be expressed in the frequency domain by the following formulas (1)-(3):

C(n,k)＝STFT(c(t))；(1)C(n,k)=STFT(c(t));(1)

R(n,k)＝STFT(r(t))；(2)R(n,k)=STFT(r(t));(2)

Y(n,k)＝STFT(y(t))；(3)Y(n,k)=STFT(y(t));(3)

其中，c(t)为近端采集音频信号，r(t)为远端参考音频信号，y(t)为第一近端音频信号，STFT(x)表示对x进行短时傅里叶变换，C(n,k)为频域近端采集音频信号，R(n,k)为频域远端参考音频信号，Y(n,k)为第一频域近端音频信号，n为帧序列数，k为中心频率序列数，0<n≤N，0<k≤K，N为总帧数，K为总频带数。Among them, c(t) is the near-end acquisition audio signal, r(t) is the far-end reference audio signal, y(t) is the first near-end audio signal, and STFT(x) represents the short-time Fourier transform of x , C(n,k) is the frequency-domain near-end audio signal, R(n,k) is the frequency-domain far-end reference audio signal, Y(n,k) is the first frequency-domain near-end audio signal, and n is the frame Number of sequences, k is the number of center frequency sequences, 0<n≤N, 0<k≤K, N is the total number of frames, K is the total number of frequency bands.

在步骤203，可对第一频域近端音频信号的幅度进行深度学习降噪，得到第二频域近端音频信号。In step 203, the amplitude of the near-end audio signal in the first frequency domain can be denoised by deep learning to obtain the near-end audio signal in the second frequency domain.

根据本公开的示例性实施例，深度学习降噪可通过神经网络模型进行降噪实现，在这种情况下：首先可通过训练好的降噪神经网络模型，对第一频域近端音频信号的幅度进行深度学习降噪，得到第二频域近端音频信号的幅度。然后可根据第二频域近端音频信号的幅度和第一频域近端音频信号的相位，得到第二频域近端音频信号。According to an exemplary embodiment of the present disclosure, deep learning noise reduction can be implemented by using a neural network model to perform noise reduction. In this case: firstly, the first frequency domain near-end audio signal can be processed through a trained noise reduction neural network model The amplitude of the deep learning noise reduction is performed to obtain the amplitude of the near-end audio signal in the second frequency domain. Then, the second frequency-domain near-end audio signal can be obtained according to the amplitude of the second frequency-domain near-end audio signal and the phase of the first frequency-domain near-end audio signal.

根据本公开的示例性实施例，首先可将第一频域近端音频信号的幅度输入训练好的降噪神经网络模型中，得到第一信号幅度比，其中，第一信号幅度比为第二频域近端音频信号的幅度和第一频域近端音频信号的幅度的比值的预测值。然后可根据第一信号幅度比和第一频域近端音频信号的幅度，得到第二频域近端音频信号的幅度，其中，第二频域近端音频信号的幅度是第一信号幅度比和第一频域近端音频信号的幅度的乘积。例如，训练好的降噪神经网络模型可以是基于深度学习(Deep Learning)的神经网络模型，包括，但不限于，卷积神经网络模型。According to an exemplary embodiment of the present disclosure, first, the magnitude of the first frequency-domain near-end audio signal can be input into the trained noise reduction neural network model to obtain the first signal magnitude ratio, wherein the first signal magnitude ratio is the second A predicted value of a ratio of the amplitude of the frequency-domain near-end audio signal to the amplitude of the first frequency-domain near-end audio signal. Then the amplitude of the near-end audio signal in the second frequency domain can be obtained according to the amplitude ratio of the first signal amplitude and the amplitude of the near-end audio signal in the first frequency domain, wherein the amplitude of the near-end audio signal in the second frequency domain is the first signal amplitude ratio and the amplitude of the first frequency-domain near-end audio signal. For example, the trained noise reduction neural network model may be a neural network model based on deep learning (Deep Learning), including, but not limited to, a convolutional neural network model.

根据本公开的示例性实施例，待训练的降噪神经网络模型通过对训练数据集的训练，完成其训练，其中，训练数据集可包括带噪音频信号的幅度数据集。According to an exemplary embodiment of the present disclosure, the training of the noise reduction neural network model to be trained is completed by training the training data set, wherein the training data set may include an amplitude data set with a noisy frequency signal.

根据本公开的示例性实施例，第二频域近端音频信号可通过下式(4)获取：According to an exemplary embodiment of the present disclosure, the second frequency-domain near-end audio signal can be obtained by the following equation (4):

Y(n,k) _p＝PrediCt(MagY(n,k))*Phase(Y(n,k))；(4) Y(n,k) _p =PrediCt(MagY(n,k))*Phase(Y(n,k));(4)

其中，Y(n,k) _p为第二频域近端音频信号，n为帧序列数，k为中心频率序列数，0<n≤N，0<k≤K，N为总帧数，K为总频带数，Y(n,k)为第一频域近端音频信号，MagY(n,k)为第一频域近端音频信号的幅度，Predict(MagY(n,k))代表对第一频域近端音频信号的幅度进行深度学习降噪，Phase(Y(n,k))为第一频域近端音频信号的相位。 Wherein, Y(n,k) _p is the near-end audio signal in the second frequency domain, n is the number of frame sequences, k is the number of center frequency sequences, 0<n≤N, 0<k≤K, N is the total number of frames, K is the total number of frequency bands, Y(n,k) is the near-end audio signal in the first frequency domain, MagY(n,k) is the amplitude of the near-end audio signal in the first frequency domain, Predict(MagY(n,k)) represents Perform deep learning noise reduction on the amplitude of the near-end audio signal in the first frequency domain, and Phase(Y(n,k)) is the phase of the near-end audio signal in the first frequency domain.

回到图2，在步骤204，可基于频域远端参考音频信号、频域近端采集音频信号、第一频域近端音频信号和第二频域近端音频信号，对第二频域近端音频信号进行非线性回声消除，得到频域近端音频增强信号。Returning to Fig. 2, in step 204, based on the far-end reference audio signal in the frequency domain, the near-end audio signal in the frequency domain, the near-end audio signal in the first frequency domain and the near-end audio signal in the second frequency domain, the second frequency domain The near-end audio signal is subjected to nonlinear echo cancellation to obtain a near-end audio enhancement signal in the frequency domain.

根据本公开的示例性实施例，首先可将频域远端参考音频信号分别与频域近端采集音频信号和第二频率近端音频信号在各个频带上进行求相关，得到各个频带的第二信号幅度比。然后可根据第二信号幅度比、第一频域近端音频信号和第二频域近端音频信号，对第二频域近端音频信号进行非线性回声消除，得到频域近端音频增强信号。According to an exemplary embodiment of the present disclosure, firstly, the frequency-domain far-end reference audio signal can be correlated with the frequency-domain near-end collected audio signal and the second-frequency near-end audio signal in each frequency band to obtain the second signal amplitude ratio. Then, according to the second signal amplitude ratio, the first frequency-domain near-end audio signal and the second frequency-domain near-end audio signal, nonlinear echo cancellation is performed on the second frequency-domain near-end audio signal to obtain a frequency-domain near-end audio enhancement signal .

根据本公开的示例性实施例，首先可获取频域远端参考音频信号和频域近端采集音频信号的互相关系数，以及，频域远端参考音频信号和第二频域近端音频信号的互相关系数。然后可基于这两个互相关系数得到各个频带的第二信号幅度比。According to an exemplary embodiment of the present disclosure, first, the cross-correlation coefficient of the frequency-domain far-end reference audio signal and the frequency-domain near-end acquisition audio signal can be obtained, and the frequency-domain far-end reference audio signal and the second frequency-domain near-end audio signal The correlation coefficient of . The second signal amplitude ratios for the respective frequency bands can then be obtained based on these two cross-correlation coefficients.

例如，可通过下式(5)获取频域远端参考音频信号和频域近端采集音频信号的互相关系数：For example, the cross-correlation coefficient of the frequency-domain far-end reference audio signal and the frequency-domain near-end acquisition audio signal can be obtained by the following formula (5):

RCr(n,k)＝Xcorr(R(n,k),C(n,k))；(5)RCr(n,k)=Xcorr(R(n,k),C(n,k));(5)

其中，RCr(n,k)为频域远端参考音频信号和频域近端采集音频信号的互相关系数，Xcorr(a,b)代表通过互相关函数对a和b在各帧和各频带求互相关获得a和b在各个频带上的互相关系数，R(n,k)为频域远端参考音频信号，C(n,k)为频域近端采集音频信号，n为帧序列数，k为中心频率序列数，0<n≤N，0<k≤K，N为总帧数，K为总频带数。Among them, RCr(n,k) is the cross-correlation coefficient of the far-end reference audio signal in the frequency domain and the near-end acquisition audio signal in the frequency domain, and Xcorr(a,b) represents the correlation between a and b in each frame and each frequency band through the cross-correlation function Find the cross-correlation to obtain the cross-correlation coefficients of a and b in each frequency band, R(n,k) is the far-end reference audio signal in the frequency domain, C(n,k) is the near-end audio signal in the frequency domain, and n is the frame sequence number, k is the number of center frequency sequences, 0<n≤N, 0<k≤K, N is the total number of frames, and K is the total number of frequency bands.

例如，可通过下式(6)获取频域远端参考音频信号和第二频域近端音频信号的互相关系数：For example, the cross-correlation coefficient of the far-end reference audio signal in the frequency domain and the near-end audio signal in the second frequency domain can be obtained by the following formula (6):

RY _pr(n,k)＝Xcorr(R(n,k),Y(n,k) _p)；(6) RY _p r(n,k)=Xcorr(R(n,k),Y(n,k) _p );(6)

其中，RY _pr(n,k)为频域远端参考音频信号和第二频域近端音频信号的互相关系数，Xcorr(a,b)代表通过互相关函数对a和b在各帧和各频带求互相关获得a和b在各个频带上的互相关系数，R(n,k)为频域远端参考音频信号，Y(n,k) _p为第二频域近端音频信号，n为帧序列数，k为中心频率序列数，0<n≤N，0<k≤K，N为总帧数，K为总频带数。 Among them, RY _p r(n,k) is the cross-correlation coefficient of the far-end reference audio signal in the frequency domain and the near-end audio signal in the second frequency domain, and Xcorr(a,b) represents the correlation between a and b in each frame through the cross-correlation function Calculate the cross-correlation with each frequency band to obtain the cross-correlation coefficient of a and b in each frequency band, R(n,k) is the far-end reference audio signal in the frequency domain, and Y(n,k) _p is the near-end audio signal in the second frequency domain , n is the number of frame sequences, k is the number of center frequency sequences, 0<n≤N, 0<k≤K, N is the total number of frames, and K is the total number of frequency bands.

例如，第二信号幅度比通过下式(7)获取：For example, the second signal amplitude ratio is obtained by the following formula (7):

Mask(n,k)＝min{1-RCr(n,k),1-RY _pr(n,k)}；(7) Mask(n,k)=min{1-RCr(n,k),1-RY _p r(n,k)}; (7)

其中，Mask(n,k)为第二信号幅度比，RCr(n,k)为频域远端参考音频信号和频域近端采集音频信号的互相关系数，RY _pr(n,k)为频域远端参考音频信号和第二频域近端音频信号的互相关系数，n为帧序列数，k为中心频率序列数，0<n≤N，0<k≤K，N为总帧数，K为总频带数。 Among them, Mask(n,k) is the second signal amplitude ratio, RCr(n,k) is the cross-correlation coefficient between the frequency domain far-end reference audio signal and the frequency domain near-end acquisition audio signal, RY _p r(n,k) is the cross-correlation coefficient between the far-end reference audio signal in the frequency domain and the near-end audio signal in the second frequency domain, n is the number of frame sequences, k is the number of center frequency sequences, 0<n≤N, 0<k≤K, N is the total The number of frames, K is the total number of frequency bands.

根据本公开的示例性实施例，首先可获取第二信号幅度比和第一频域近端音频信号的幅度的乘积作为参考幅度。然后可获取参考幅度和第二频域近端音频信号的幅度中的最小值，作为频域近端音频增强信号的幅度。最后可根据频域近端音频增强信号的幅度和第一频域近端音频信号的相位，得到频域近端音频增强信号。According to an exemplary embodiment of the present disclosure, first, the product of the second signal amplitude ratio and the amplitude of the first frequency-domain near-end audio signal may be acquired as the reference amplitude. Then the minimum value of the reference amplitude and the amplitude of the second frequency-domain near-end audio signal may be obtained as the amplitude of the frequency-domain near-end audio enhancement signal. Finally, the frequency-domain near-end audio enhancement signal can be obtained according to the amplitude of the frequency-domain near-end audio enhancement signal and the phase of the first frequency-domain near-end audio signal.

例如，频域近端音频增强信号可通过下式(8)获取：For example, the near-end audio enhancement signal in the frequency domain can be obtained by the following formula (8):

Y(n,k) _out＝min{MagY(n,k)*Mask(n,k),MagY(n,k) _p}*Phase(Y(n,k))；(8) Y(n,k) _out =min{MagY(n,k)*Mask(n,k),MagY(n,k) _p }*Phase(Y(n,k));(8)

其中，Y(n,k) _out为频域近端音频增强信号，n为帧序列数，k为中心频率序列数，0<n≤N，0<k≤K，N为总帧数，K为总频带数，Y(n,k)为第一频域近端音频信号，MagY(n,k)为第一频域近端音频信号的幅度，Mask(n,k)为第二信号幅度比，MagY(n,k) _p为第二频域近端音频信号的幅度，Y(n,k) _p为第二频域近端音频信号，Phase(Y(n,k))为第一频域近端音频信号的相位。 Among them, Y(n,k) _out is the near-end audio enhancement signal in the frequency domain, n is the number of frame sequences, k is the number of center frequency sequences, 0<n≤N, 0<k≤K, N is the total number of frames, K is the total number of frequency bands, Y(n,k) is the near-end audio signal in the first frequency domain, MagY(n,k) is the amplitude of the near-end audio signal in the first frequency domain, and Mask(n,k) is the second signal amplitude ratio, MagY(n,k) _p is the amplitude of the near-end audio signal in the second frequency domain, Y(n,k) _p is the near-end audio signal in the second frequency domain, and Phase(Y(n,k)) is the first The phase of the near-end audio signal in the frequency domain.

回到图2，在步骤205，可对频域近端音频增强信号进行时频逆变换，得到近端音频增强信号。Returning to FIG. 2 , in step 205 , time-frequency inverse transform may be performed on the near-end audio enhancement signal in the frequency domain to obtain the near-end audio enhancement signal.

根据本公开的示例性实施例，时频逆变换可为，但不限于，短时反傅里叶变换(Inverse Short-Time Fourier Transform，ISTFT)。在这种情况下，近端音频增强信号可通过下式(9)获取：According to an exemplary embodiment of the present disclosure, the inverse time-frequency transform may be, but not limited to, Inverse Short-Time Fourier Transform (ISTFT). In this case, the near-end audio enhancement signal can be obtained by the following formula (9):

y(t) _out＝ISTFT(Y(n,k) _out)；(9) y(t) _out = ISTFT(Y(n,k) _out ); (9)

其中，y(t) _out为近端音频增强信号，ISTFT(x)表示对x进行短时反傅里叶变换，Y(n,k) _out为频域近端音频增强信号，t为时间，0<t≤T，T为时间长度，n为帧序列数，k为中心频率序列数，0<n≤N，0<k≤K，N为总帧数，K为总频带数。 Among them, y(t) _out is the near-end audio enhancement signal, ISTFT(x) represents the short-time inverse Fourier transform of x, Y(n,k) _out is the frequency domain near-end audio enhancement signal, t is time, 0<t≤T, T is the time length, n is the number of frame sequences, k is the number of center frequency sequences, 0<n≤N, 0<k≤K, N is the total number of frames, K is the total number of frequency bands.

图3是示出根据本公开的示例性实施例的音频信号处理装置的框图。参照图3，根据本公开的示例性实施例的音频信号处理装置300可包括信号获取单元301、频域变换单元302、深度降噪单元303、非线性消除单元304和时域变换单元305。FIG. 3 is a block diagram illustrating an audio signal processing device according to an exemplary embodiment of the present disclosure. Referring to FIG. 3 , an audio signal processing device 300 according to an exemplary embodiment of the present disclosure may include a signal acquisition unit 301 , a frequency domain transformation unit 302 , a depth noise reduction unit 303 , a non-linear elimination unit 304 and a time domain transformation unit 305 .

信号获取单元301可获取近端采集音频信号、远端参考音频信号，以及对近端采集音频信号进行线性回声消除后得到的第一近端音频信号。The signal obtaining unit 301 may obtain a near-end collected audio signal, a far-end reference audio signal, and a first near-end audio signal obtained by performing linear echo cancellation on the near-end collected audio signal.

频域变换单元302可对近端采集音频信号、远端参考音频信号和第一近端音频信号分别进行时频变换，得到频域近端采集音频信号、频域远端参考音频信号和第一频域近端音频信号。The frequency domain conversion unit 302 can perform time-frequency conversion on the near-end collected audio signal, the far-end reference audio signal and the first near-end audio signal, respectively, to obtain the frequency-domain near-end collected audio signal, the frequency-domain far-end reference audio signal and the first Frequency domain near-end audio signal.

根据本公开的示例性实施例，可将时间长度为T的近端采集音频信号、远端参考音频信号和第一近端音频信号在时域上分别表示为c(t)、r(t)和y(t)，其中，t为时间，0<t≤T。经过短时傅里叶变换之后，c(t)、r(t)和y(t)在频域可通过上式(1)-(3)表示。According to an exemplary embodiment of the present disclosure, the near-end acquisition audio signal, the far-end reference audio signal and the first near-end audio signal with a time length of T may be expressed as c(t) and r(t) respectively in the time domain and y(t), where t is time, 0<t≤T. After short-time Fourier transform, c(t), r(t) and y(t) can be expressed by the above formulas (1)-(3) in the frequency domain.

深度降噪单元303可对第一频域近端音频信号的幅度进行深度学习降噪，得到第二频域近端音频信号。The deep noise reduction unit 303 may perform deep learning and noise reduction on the magnitude of the first frequency-domain near-end audio signal to obtain a second frequency-domain near-end audio signal.

根据本公开的示例性实施例，深度学习降噪可通过神经网络模型进行降噪实现，在这种情况下：深度降噪单元303首先可通过训练好的降噪神经网络模型，对第一频域近端音频信号的幅度进行深度学习降噪，得到第二频域近端音频信号的幅度。深度降噪单元303然后可根据第二频域近端音频信号的幅度和第一频域近端音频信号的相位，得到第二频域近端音频信号。According to an exemplary embodiment of the present disclosure, deep learning denoising can be implemented by using a neural network model to perform denoising. The amplitude of the near-end audio signal in the frequency domain is used for deep learning noise reduction, and the amplitude of the near-end audio signal in the second frequency domain is obtained. The deep noise reduction unit 303 can then obtain the second frequency-domain near-end audio signal according to the amplitude of the second frequency-domain near-end audio signal and the phase of the first frequency-domain near-end audio signal.

根据本公开的示例性实施例，深度降噪单元303首先可将第一频域近端音频信号的幅度输入训练好的降噪神经网络模型中，得到第一信号幅度比，其中，第一信号幅度比为第二频域近端音频信号的幅度和第一频域近端音频信号的幅度的比值的预测值。深度降噪单元303然后可根据第一信号幅度比和第一频域近端音频信号的幅度，得到第二频域近端音频信号的幅度，其中，第二频域近端音频信号的幅度是第一信号幅度比和第一频域近端音频信号的幅度的乘积。例如，训练好的降噪神经网络模型可以是基于深度学习(Deep Learning)的神经网络模型，包括，但不限于，卷积神经网络模型。According to an exemplary embodiment of the present disclosure, the deep noise reduction unit 303 may first input the amplitude of the first frequency domain near-end audio signal into the trained noise reduction neural network model to obtain the first signal amplitude ratio, wherein the first signal The amplitude ratio is a predicted value of a ratio of the amplitude of the near-end audio signal in the second frequency domain to the amplitude of the near-end audio signal in the first frequency domain. The depth noise reduction unit 303 can then obtain the magnitude of the second frequency domain near-end audio signal according to the first signal magnitude ratio and the magnitude of the first frequency domain near-end audio signal, wherein the magnitude of the second frequency domain near-end audio signal is The product of the first signal amplitude ratio and the amplitude of the first frequency-domain near-end audio signal. For example, the trained noise reduction neural network model may be a neural network model based on deep learning (Deep Learning), including, but not limited to, a convolutional neural network model.

根据本公开的示例性实施例，第二频域近端音频信号可通过上式(4)获取。According to an exemplary embodiment of the present disclosure, the second frequency-domain near-end audio signal can be obtained by the above formula (4).

回到图3，非线性消除单元304可基于频域远端参考音频信号、频域近端采集音频信号、第一频域近端音频信号和第二频域近端音频信号，对第二频域近端音频信号进行非线性回声消除，得到频域近端音频增强信号。Returning to FIG. 3 , the non-linear elimination unit 304 may perform a second-frequency The non-linear echo cancellation is performed on the near-end audio signal in the frequency domain to obtain the near-end audio enhancement signal in the frequency domain.

根据本公开的示例性实施例，非线性消除单元304首先可将频域远端参考音频信号分别与频域近端采集音频信号和第二频率近端音频信号在各个频带上进行求相关，得到各个频带的第二信号幅度比。非线性消除单元304然后可根据第二信号幅度比、第一频域近端音频信号和第二频域近端音频信号，对第二频域近端音频信号进行非线性回声消除，得到频域近端音频增强信号。According to an exemplary embodiment of the present disclosure, the non-linear elimination unit 304 may first correlate the frequency-domain far-end reference audio signal with the frequency-domain near-end collected audio signal and the second-frequency near-end audio signal in each frequency band to obtain Second signal amplitude ratios for respective frequency bands. The nonlinear elimination unit 304 can then perform nonlinear echo cancellation on the second frequency-domain near-end audio signal according to the second signal amplitude ratio, the first frequency-domain near-end audio signal, and the second frequency-domain near-end audio signal to obtain a frequency domain Near-end audio enhancement signal.

根据本公开的示例性实施例，非线性消除单元304首先可获取频域远端参考音频信号和频域近端采集音频信号的互相关系数，以及，频域远端参考音频信号和第二频域近端音频信号的互相关系数。非线性消除单元304然后可基于这两个互相关系数得到各个频带的第二信号幅度比。According to an exemplary embodiment of the present disclosure, the nonlinear elimination unit 304 may first obtain the cross-correlation coefficient between the frequency-domain far-end reference audio signal and the frequency-domain near-end acquisition audio signal, and the frequency-domain far-end reference audio signal and the second frequency The cross-correlation coefficient of the domain near-end audio signal. The non-linear elimination unit 304 can then obtain the second signal amplitude ratio of each frequency band based on the two correlation coefficients.

例如，可通过上式(5)获取频域远端参考音频信号和频域近端采集音频信号的互相关系数。For example, the cross-correlation coefficient between the frequency-domain far-end reference audio signal and the frequency-domain near-end collected audio signal can be obtained through the above formula (5).

例如，可通过上式(6)获取频域远端参考音频信号和第二频域近端音频信号的互相关系数。For example, the cross-correlation coefficient between the far-end reference audio signal in the frequency domain and the near-end audio signal in the second frequency domain may be obtained through the above formula (6).

例如，第二信号幅度比通过上式(7)获取。根据本公开的示例性实施例，非线性消除单元304首先可获取第二信号幅度比和第一频域近端音频信号的幅度的乘积作为参考幅度。非线性消除单元304然后可获取参考幅度和第二频域近端音频信号的幅度中的最小值，作为频域近端音频增强信号的幅度。非线性消除单元304最后可根据频域近端音频增强信号的幅度和第一频域近端音频信号的相位，得到频域近端音频增强信号。For example, the second signal amplitude ratio is obtained through the above formula (7). According to an exemplary embodiment of the present disclosure, the nonlinear elimination unit 304 may first obtain a product of the second signal amplitude ratio and the amplitude of the first frequency-domain near-end audio signal as the reference amplitude. The non-linear elimination unit 304 may then obtain the minimum value of the reference amplitude and the amplitude of the second frequency-domain near-end audio signal as the amplitude of the frequency-domain near-end audio enhancement signal. Finally, the non-linear elimination unit 304 can obtain the frequency-domain near-end audio enhancement signal according to the amplitude of the frequency-domain near-end audio enhancement signal and the phase of the first frequency-domain near-end audio signal.

例如，频域近端音频增强信号可通过上式(8)获取。For example, the near-end audio enhancement signal in the frequency domain can be obtained through the above formula (8).

回到图3，时域变换单元305可对频域近端音频增强信号进行时频逆变换，得到近端音频增强信号。Returning to FIG. 3 , the time-domain transformation unit 305 may perform time-frequency inverse transformation on the near-end audio enhancement signal in the frequency domain to obtain the near-end audio enhancement signal.

根据本公开的示例性实施例，时频逆变换可为，但不限于，短时反傅里叶变换(Inverse Short-Time Fourier Transform，ISTFT)。在这种情况下，近端音频增强信号可通过上式(9)获取。According to an exemplary embodiment of the present disclosure, the inverse time-frequency transform may be, but not limited to, Inverse Short-Time Fourier Transform (ISTFT). In this case, the near-end audio enhancement signal can be obtained through the above formula (9).

参照图4，电子设备400包括至少一个存储器401和至少一个处理器402，所述至少一个存储器401中存储有计算机可执行指令集合，当计算机可执行指令集合被至少一个处理器402执行时，执行根据本公开的示例性实施例的音频信号处理方法。4, the electronic device 400 includes at least one memory 401 and at least one processor 402, the at least one memory 401 stores a set of computer-executable instructions, when the set of computer-executable instructions is executed by the at least one processor 402, the execution An audio signal processing method according to an exemplary embodiment of the present disclosure.

作为示例，电子设备400可以是PC计算机、平板装置、个人数字助理、智能手机、或其他能够执行上述指令集合的装置。这里，电子设备400并非必须是单个的电子设备，还可以是任何能够单独或联合执行上述指令(或指令集)的装置或电路的集合体。电子设备400还可以是集成控制系统或系统管理器的一部分，或者可被配置为与本地或远程(例如，经由无线传输)以接口互联的便携式电子设备。As an example, the electronic device 400 may be a PC computer, a tablet device, a personal digital assistant, a smart phone, or other devices capable of executing the above-mentioned set of instructions. Here, the electronic device 400 is not necessarily a single electronic device, but may also be any assembly of devices or circuits capable of individually or jointly executing the above-mentioned instructions (or instruction sets). The electronic device 400 may also be part of an integrated control system or system manager, or may be configured as a portable electronic device that interfaces locally or remotely (eg, via wireless transmission).

在电子设备400中，处理器402可包括中央处理器(CPU)、图形处理器(GPU)、可编程逻辑装置、专用处理器系统、微控制器或微处理器。作为示例而非限制，处理器还可包括模拟处理器、数字处理器、微处理器、多核处理器、处理器阵列、网络处理器等。In electronic device 400, processor 402 may include a central processing unit (CPU), a graphics processing unit (GPU), a programmable logic device, a special purpose processor system, a microcontroller, or a microprocessor. By way of example and not limitation, a processor may also include analog processors, digital processors, microprocessors, multi-core processors, processor arrays, network processors, and the like.

处理器402可运行存储在存储器401中的指令或代码，其中，存储器401还可以存储数据。指令和数据还可经由网络接口装置而通过网络被发送和接收，其中，网络接口装置可采用任何已知的传输协议。The processor 402 can execute instructions or codes stored in the memory 401, wherein the memory 401 can also store data. Instructions and data may also be sent and received over the network via the network interface device, which may employ any known transmission protocol.

存储器401可与处理器402集成为一体，例如，将RAM或闪存布置在集成电路微处理器等之内。此外，存储器401可包括独立的装置，诸如，外部盘驱动、存储阵列或任何数据库系统可使用的其他存储装置。存储器401和处理器402可在操作上进行耦合，或者可例如通过I/O端口、网络连接等互相通信，使得处理器402能够读取存储在存储器中的文件。The memory 401 can be integrated with the processor 402, for example, RAM or flash memory is arranged in an integrated circuit microprocessor or the like. Additionally, memory 401 may comprise a separate device, such as an external disk drive, storage array, or any other storage device usable by the database system. The memory 401 and the processor 402 may be operatively coupled, or may communicate with each other, such as through an I/O port, network connection, etc., such that the processor 402 can read files stored in the memory.

此外，电子设备400还可包括视频显示器(诸如，液晶显示器)和用户交互接口(诸如，键盘、鼠标、触摸输入装置等)。电子设备400的所有组件可经由总线和/或网络而彼此连接。In addition, the electronic device 400 may further include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, mouse, touch input device, etc.). All components of the electronic device 400 may be connected to each other via a bus and/or a network.

根据本公开的示例性实施例，还可提供一种存储指令的非易失性计算机可读存储介质，其中，当指令被至少一个处理器运行时，促使至少一个处理器执行根据本公开的示例性实施例的音频信号处理方法。这里的计算机可读存储介质的示例包括：只读存储器(ROM)、随机存取可编程只读存储器(PROM)、电可擦除可编程只读存储器(EEPROM)、随机存取存储器(RAM)、动态随机存取存储器(DRAM)、静态随机存取存储器(SRAM)、闪存、非易失性存储器、CD-ROM、CD-R、CD+R、CD-RW、CD+RW、DVD-ROM、DVD-R、DVD+R、DVD-RW、DVD+RW、DVD-RAM、BD-ROM、BD-R、BD-R LTH、BD-RE、蓝光或光盘存储器、硬盘驱动器(HDD)、固态硬盘(SSD)、卡式存储器(诸如，多媒体卡、安全数字(SD)卡或极速数字(XD)卡)、磁带、软盘、磁光数据存储装置、光学数据存储装置、硬盘、固态盘以及任何其他装置，所述任何其他装置被配置为以非暂时性方式存储计算机程序以及任何相关联的数据、数据文件和数据结构并将所述计算机程序以及任何相关联的数据、数据文件和数据结构提供给处理器或计算机使得处理器或计算机能执行所述计算机程序。上述计算机可读存储介质中的计算机程序可在诸如客户端、主机、代理装置、服务器等计算机设备中部署的环境中运行，此外，在一个示例中，计算机程序以及任何相关联的数据、数据文件和数据结构分布在联网的计算机系统上，使得计算机程序以及任何相关联的数据、数据文件和数据结构通过一个或多个处理器或计算机以分布式方式存储、访问和执行。According to an exemplary embodiment of the present disclosure, there may also be provided a non-transitory computer-readable storage medium storing instructions, wherein, when the instructions are executed by at least one processor, at least one processor is caused to execute an example according to the present disclosure. The audio signal processing method of the exemplary embodiment. Examples of computer readable storage media herein include: Read Only Memory (ROM), Random Access Programmable Read Only Memory (PROM), Electrically Erasable Programmable Read Only Memory (EEPROM), Random Access Memory (RAM) , Dynamic Random Access Memory (DRAM), Static Random Access Memory (SRAM), Flash Memory, Non-volatile Memory, CD-ROM, CD-R, CD+R, CD-RW, CD+RW, DVD-ROM , DVD-R, DVD+R, DVD-RW, DVD+RW, DVD-RAM, BD-ROM, BD-R, BD-R LTH, BD-RE, Blu-ray or Optical Memory, Hard Disk Drive (HDD), Solid State Hard disks (SSD), memory cards (such as MultiMediaCards, Secure Digital (SD) or Extreme Digital (XD) cards), magnetic tapes, floppy disks, magneto-optical data storage devices, optical data storage devices, hard disks, solid-state disks, and any other means configured to store a computer program and any associated data, data files and data structures in a non-transitory manner and to provide said computer program and any associated data, data files and data structures to the processor or the computer to enable the processor or the computer to execute the computer program. The computer program in the above-mentioned computer-readable storage medium can run in an environment deployed in computer equipment such as a client, a host, an agent device, a server, etc. In addition, in one example, the computer program and any associated data and data files and data structures are distributed over network-connected computer systems so that the computer programs and any associated data, data files and data structures are stored, accessed and executed in a distributed fashion by one or more processors or computers.

根据本公开的示例性实施例，还可提供一种计算机程序产品，该计算机程序产品中的指令可由计算机设备的处理器执行以完成根据本公开的示例性实施例的音频信号处理方法。According to an exemplary embodiment of the present disclosure, there may also be provided a computer program product, instructions in the computer program product may be executed by a processor of a computer device to implement the audio signal processing method according to the exemplary embodiment of the present disclosure.

根据本公开的音频信号处理方法及装置，先对近端采集音频信号进行线性回声消除，接着对其进行深度学习降噪，然后进行非线性回声消除，通过线性回声消除和非线性回声消除对远端参考音频信号中的回声信号进行消除，获得最终的近端音频增强信号，将深度学习降噪和回声消除结合起来，充分利用了深度学习降噪的良好性能，相较于相关技术中所采用的传统降噪技术和回声消除的结合，能够得到更好的消除回声和降噪，提升音质。According to the audio signal processing method and device of the present disclosure, linear echo cancellation is first performed on the near-end collected audio signal, and then deep learning noise reduction is performed on it, and then nonlinear echo cancellation is performed. The echo signal in the end reference audio signal is eliminated to obtain the final near-end audio enhancement signal, and the combination of deep learning noise reduction and echo cancellation makes full use of the good performance of deep learning noise reduction. The combination of traditional noise reduction technology and echo cancellation can get better echo cancellation and noise reduction, and improve sound quality.

本公开所有实施例均可以单独被执行，也可以与其他实施例相结合被执行，均视为本公开要求的保护范围。All the embodiments of the present disclosure can be implemented independently or in combination with other embodiments, which are all regarded as the scope of protection required by the present disclosure.

Claims

An audio signal processing method, comprising:

Acquiring a near-end audio signal, a far-end reference audio signal, and a first near-end audio signal obtained by performing linear echo cancellation on the near-end audio signal;

performing time-frequency transformation on the near-end collected audio signal, the far-end reference audio signal and the first near-end audio signal respectively, to obtain the frequency-domain near-end collected audio signal, the frequency-domain far-end reference audio signal and the first frequency domain near-end audio signal;

performing deep learning noise reduction on the amplitude of the first frequency-domain near-end audio signal to obtain a second frequency-domain near-end audio signal;

Based on the frequency-domain far-end reference audio signal, the frequency-domain near-end collected audio signal, the first frequency-domain near-end audio signal, and the second frequency-domain near-end audio signal, the second frequency domain Non-linear echo cancellation is performed on the near-end audio signal to obtain a near-end audio enhancement signal in the frequency domain;

Inverse time-frequency transform is performed on the near-end audio enhancement signal in the frequency domain to obtain the near-end audio enhancement signal.

The audio signal processing method according to claim 1, wherein said performing deep learning noise reduction on the magnitude of the first frequency-domain near-end audio signal to obtain a second frequency-domain near-end audio signal, comprising:

Through the trained noise reduction neural network model, the magnitude of the near-end audio signal in the first frequency domain is subjected to deep learning and noise reduction to obtain the magnitude of the near-end audio signal in the second frequency domain;

The second frequency-domain near-end audio signal is obtained according to the amplitude of the second frequency-domain near-end audio signal and the phase of the first frequency-domain near-end audio signal.

The audio signal processing method according to claim 2, wherein, the amplitude of the near-end audio signal in the first frequency domain is subjected to deep learning and noise reduction through the trained noise reduction neural network model to obtain the second The magnitude of the near-end audio signal in the frequency domain, including:

Inputting the magnitude of the near-end audio signal in the first frequency domain into the trained noise reduction neural network model to obtain a first signal magnitude ratio, wherein the first signal magnitude ratio is equal to the second frequency domain near-end audio signal A predicted value of the ratio of the amplitude of the end audio signal to the amplitude of the first frequency domain near end audio signal;

According to the first signal amplitude ratio and the amplitude of the first frequency-domain near-end audio signal, the amplitude of the second frequency-domain near-end audio signal is obtained, wherein the amplitude of the second frequency-domain near-end audio signal is the product of the first signal amplitude ratio and the amplitude of the first frequency-domain near-end audio signal.

The audio signal processing method according to claim 1, wherein said frequency-domain far-end reference audio signal based on said frequency-domain near-end acquisition audio signal, said first frequency-domain near-end audio signal and said A second frequency-domain near-end audio signal, performing nonlinear echo cancellation on the second frequency-domain near-end audio signal to obtain a frequency-domain near-end audio enhancement signal, including:

Correlating the frequency-domain far-end reference audio signal with the frequency-domain near-end acquisition audio signal and the second-frequency near-end audio signal in each frequency band to obtain a second signal amplitude ratio of each frequency band;

According to the second signal amplitude ratio, the first frequency-domain near-end audio signal, and the second frequency-domain near-end audio signal, nonlinear echo cancellation is performed on the second frequency-domain near-end audio signal to obtain a frequency Domain near-end audio enhancement signal.

The audio signal processing method according to claim 4, wherein, according to the second signal amplitude ratio, the first frequency domain near-end audio signal and the second frequency domain near-end audio signal, the The second frequency-domain near-end audio signal is subjected to nonlinear echo cancellation to obtain a frequency-domain near-end audio enhancement signal, including:

Acquiring the product of the second signal amplitude ratio and the amplitude of the first frequency-domain near-end audio signal as a reference amplitude;

Acquiring the minimum value of the reference amplitude and the amplitude of the second frequency-domain near-end audio signal as the amplitude of the frequency-domain near-end audio enhancement signal;

A frequency-domain near-end audio enhancement signal is obtained according to the amplitude of the frequency-domain near-end audio enhancement signal and the phase of the first frequency-domain near-end audio signal.

The audio signal processing method according to claim 4 or 5, wherein the second signal amplitude ratio is obtained by the following formula:

Mask(n,k)=min{1-RCr(n,k),1-RY _p r(n,k)};

Wherein, Mask(n,k) is the second signal amplitude ratio, RCr(n,k) is the cross-correlation coefficient between the frequency-domain far-end reference audio signal and the frequency-domain near-end acquisition audio signal, _RYp r(n,k) is the cross-correlation coefficient between the frequency domain far-end reference audio signal and the second frequency domain near-end audio signal, n is the number of frame sequences, k is the number of center frequency sequences, 0<n≤N , 0<k≤K, N is the total number of frames, K is the total number of frequency bands.

An audio signal processing device, comprising:

The signal acquisition unit is configured to: acquire a near-end audio signal, a far-end reference audio signal, and a first near-end audio signal obtained by performing linear echo cancellation on the near-end audio signal;

The frequency-domain conversion unit is configured to: perform time-frequency conversion on the near-end audio signal, the far-end reference audio signal, and the first near-end audio signal, respectively, to obtain the frequency-domain near-end audio signal, frequency A domain far-end reference audio signal and a first frequency-domain near-end audio signal;

The deep noise reduction unit is configured to: perform deep learning and noise reduction on the amplitude of the first frequency-domain near-end audio signal to obtain a second frequency-domain near-end audio signal;

A non-linear elimination unit configured to: based on the frequency-domain far-end reference audio signal, the frequency-domain near-end collected audio signal, the first frequency-domain near-end audio signal, and the second frequency-domain near-end audio signal, performing nonlinear echo cancellation on the second frequency-domain near-end audio signal to obtain a frequency-domain near-end audio enhancement signal;

The time-domain transformation unit is configured to: perform time-frequency inverse transformation on the frequency-domain near-end audio enhancement signal to obtain the near-end audio enhancement signal.

The audio signal processing device according to claim 7, wherein the deep noise reduction unit is configured to:

The audio signal processing device according to claim 8, wherein the deep noise reduction unit is configured to:

The audio signal processing device according to claim 7, wherein the non-linear cancellation unit is configured to:

The audio signal processing device according to claim 10, wherein the non-linear cancellation unit is configured to:

The audio signal processing device according to claim 10 or 11, wherein the second signal amplitude ratio is obtained by the following formula:

Mask(n,k)=min{1-RCr(n,k),1-RY _p r(n,k)};

An electronic device comprising:

at least one processor;

at least one memory storing computer-executable instructions,

Wherein, the computer-executable instructions, when executed by the at least one processor, cause the at least one processor to perform the following steps:

A non-volatile computer-readable storage medium, wherein, when the instructions in the computer-readable storage medium are executed by at least one processor, the at least one processor is caused to perform the following steps:

A computer program product comprising computer instructions, wherein said computer instructions, when executed by at least one processor, implement the following steps: