CN113763980A - Echo cancellation method - Google Patents

Echo cancellation method

Info

Publication number
CN113763980A
CN113763980A (application CN202111277825.4A)
Authority
CN
China
Prior art keywords
domain signal, frequency domain, ref, signal, vec
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111277825.4A
Other languages
Chinese (zh)
Other versions
CN113763980B (en)
Inventor
刘文通
万东琴
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chipintelli Technology Co Ltd
Original Assignee
Chipintelli Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chipintelli Technology Co Ltd filed Critical Chipintelli Technology Co Ltd
Priority to CN202111277825.4A priority Critical patent/CN113763980B/en
Publication of CN113763980A publication Critical patent/CN113763980A/en
Application granted granted Critical
Publication of CN113763980B publication Critical patent/CN113763980B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0232 Processing in the frequency domain
    • G10L2021/02082 Noise filtering the noise being echo, reverberation of the speech
    • G10L2021/02161 Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02166 Microphone arrays; Beamforming
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/70 Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Cable Transmission Systems, Equalization Of Radio And Reduction Of Echo (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

An echo cancellation method, comprising the steps of: S1, acquiring a digital microphone signal and a digital reference signal through a microphone array; S2, converting the digital time domain signals into frequency domain signals; S3, performing linear prediction caching and nonlinear expansion on the reference frequency domain signal to obtain a reference frequency domain signal matrix; S4, calculating an autocorrelation diagonalization matrix; S5, calculating an echo cancellation gain vector for each frequency point and performing echo cancellation on the microphone frequency domain signal obtained in step S2; and S6, outputting the final output frequency domain signal and converting it into a time domain signal. Compared with traditional echo cancellation methods, the method mitigates the impact of system nonlinear distortion on the processing result, reduces computational resource consumption by design, effectively improves the signal-to-noise ratio of the processed speech signal, and thereby improves the echo cancellation effect.

Description

Echo cancellation method
Technical Field
The invention belongs to the technical field of audio processing, and particularly relates to an echo cancellation method.
Background
Echo cancellation technology is widely applied in audio systems that contain both a loudspeaker and a microphone. With the rapid development of artificial intelligence and the Internet of Things, practical products place ever stricter requirements on echo cancellation quality, computing power and memory.
A common echo cancellation method estimates the echo channel with an adaptive filter and then cancels the echo; the adaptive filter in such methods involves matrix inversion and nonlinear suppression, which increases the hardware cost of a product. In recent years, many echo cancellation methods based on deep neural networks have appeared. These can further improve the echo cancellation effect and handle nonlinear distortion, reverberation and environmental noise to some degree, but in complex application environments the selection of a training set is a challenge that directly affects stability in practice, and the computing power and memory demanded by deep-learning-based echo cancellation limit its wide application.
Disclosure of Invention
In order to overcome the defects in the prior art, the invention discloses an echo cancellation method.
The echo cancellation method of the invention comprises the following steps:
S1, acquiring an analog microphone signal and an analog reference signal through a microphone array, converting them into digital time domain signals, and obtaining a digital microphone signal and a digital reference signal respectively;
wherein the analog microphone signal is the electrical signal output by the microphone after it picks up the sound emitted by the loudspeaker, and the analog reference signal is the electrical signal input into the loudspeaker;
s2, converting the digital microphone signal and the digital reference signal in the form of digital time domain signal into a microphone frequency domain signal and a reference frequency domain signal respectively by adopting short-time Fourier transform technology;
s3, performing linear prediction caching and nonlinear expansion on the reference frequency domain signal to obtain a reference frequency domain signal matrix, wherein the reference frequency domain signal matrix is composed of a plurality of reference frequency domain signal vectors;
The calculation process of the reference frequency domain signal vector REF_VEC_q(k,l) of the kth frequency point of the lth frame of the qth reference channel is:
S31, setting a linear prediction length LP and building the prediction buffer vector for the kth frequency point of the lth frame of the qth reference channel:
REF_VEC_PRE_q(k,l) = [Ref_q(k,l), Ref_q(k,l-1), …, Ref_q(k,l-LP+1)]
where Ref_q(k,l) is the reference frequency domain signal of the kth frequency point of the lth frame of the qth reference channel, and so on;
S32, applying nonlinear expansion to the reference frequency domain signals stored in the prediction buffer vector REF_VEC_PRE_q(k,l), obtaining the nonlinearly expanded reference frequency domain signal vector Ref_VEC_q(k,l):
[equation image not reproduced: Ref_VEC_q(k,l) collects the expanded terms ref_vec_p_{q,p}(k,l) of the buffered signals]
p1, p2, … p_LP are the orders of the nonlinear expansion; ref_vec_p_{q,p}(k,l) denotes the p-order expanded reference frequency domain signal of the kth frequency point of the lth frame of the qth reference channel, and so on. The reference frequency domain signal Ref_q(k,l) of the kth frequency point of the lth frame of the qth reference channel is nonlinearly expanded through an odd power series, specifically:
ref_vec_p_{q,p}(k,l) = Ref_q(k,l)^(2p-1), and so on;
s33, traversing each frame, frequency point and reference channel, and combining all reference frequency domain signal vectors to obtain a reference frequency domain signal matrix REF _ VEC;
s4, calculating an autocorrelation diagonalization matrix R _ IVM of the reference frequency domain signal matrix REF _ VEC;
s5, calculating echo cancellation gain vector W of each frequency point, and performing echo cancellation on the microphone frequency domain signal obtained in the step S2;
S51, calculating the cross-correlation vector R_MIC_REF_q(k,l) between the microphone frequency domain signal Mic_n(k,l) of the kth frequency point of the lth frame of the nth microphone channel and the reference frequency domain signal of the kth frequency point of the lth frame of the qth channel;
S52, traversing all reference frequency domain signal vectors Ref_VEC_q(k,l) multiple times, taking the element of the nonlinearly expanded reference frequency domain signal vector used in the jth traversal as ref_vec_n_q(j),
The specific process of each traversal is as follows:
S521. R_y_ref = y_{n,j-1}(k,l) * conj(ref_vec_n_q(j));
where conj denotes the conjugate, y_{n,j-1}(k,l) is the residual speech frequency domain signal of the kth frequency point of the lth frame from the (j-1)th traversal of the nth microphone channel, and R_y_ref is a traversal intermediate variable;
when j = 1, y_{n,j-1}(k,l) = Mic_n(k,l);
S522. The smoothed cross-correlation signal of the kth frequency point of the lth frame in the jth traversal is r_cm_q(k,l,j) = λ * r_cm_q(k,l-1,j) + (1-λ) * R_y_ref, λ being a smoothing factor;
S523. Echo cancellation gain in the jth traversal:
W(j) = r_cm_q(k,l,j) / [r_ivm_q(k,l,j) + δ], where δ is a small value preventing the denominator from being zero;
r_ivm_q(k,l,j) is the autocorrelation diagonalization signal of the kth frequency point reference frequency domain signal of the lth frame of the qth channel in the jth traversal, taken from the autocorrelation diagonalization matrix R_IVM obtained in step S4;
S524. Perform the echo cancellation processing: the result y_{n,j-1}(k,l) of the previous traversal is used in the current traversal, and the residual speech frequency domain signal computed in the current traversal is
y_{n,j}(k,l) = y_{n,j-1}(k,l) - W(j) * ref_vec_n_q(j);
S6, after all traversals are finished, outputting the final output frequency domain signal obtained from the last traversal and converting it into a time domain signal.
Preferably, the step S4 specifically includes:
S41, the diagonal reduced matrix of the kth frequency point of the lth frame of the qth reference channel in the reference frequency domain signal matrix REF_VEC is R_Ref_q(k,l) = Ref_VEC_q(k,l) * Ref_VEC_q(k,l)^H,
where Ref_VEC_q(k,l) is the reference frequency domain signal vector obtained by nonlinear expansion at the kth frequency point of the lth frame of the qth reference channel, the superscript H denotes the conjugate transpose, and * denotes the dot product;
S42, the autocorrelation diagonalization vector of the reference frequency domain signal of the kth frequency point of the lth frame of the qth reference channel is updated recursively:
R_IVM_q(k,l) = λ * R_IVM_q(k,l-1) + (1-λ) * R_Ref_q(k,l);
λ is the smoothing factor;
s43, traversing each frame, frequency point and reference channel, and combining all the reference frequency domain signal autocorrelation diagonalization vectors to obtain an autocorrelation diagonalization matrix R _ IVM.
Preferably, the smoothing factor λ takes a value of 0.7 to 0.99.
Preferably, in step S6, the frequency domain signal after the echo cancellation is converted into a time domain signal by using an inverse short-time fourier transform module.
Compared with the traditional echo cancellation method, the scheme of the invention utilizes a rapid echo cancellation algorithm, the signal-to-noise ratio of the processed voice signal is higher, and the echo cancellation effect can be effectively improved.
Drawings
Fig. 1 is a flow chart of an embodiment of the echo cancellation method according to the present invention;
FIG. 2 is a schematic flow chart of an echo cancellation method according to the present invention;
FIG. 3 is a waveform diagram of a time domain signal before echo cancellation processing in an embodiment of the present invention;
In fig. 3, the (A1) signal is the microphone signal acquired by the microphone array, and the (A2) signal is the reference signal;
FIG. 4 is a schematic diagram illustrating a comparison of waveforms obtained by performing echo cancellation processing on the signal in FIG. 3 according to the prior art and the present invention;
In fig. 4, (A3) is the output waveform processed by a prior art echo cancellation method, and (A4) is the output waveform processed by the echo cancellation device of the present invention shown in fig. 2;
in fig. 3 and 4, the abscissa represents time, and the ordinate represents voltage amplitude.
Detailed Description
The following provides a more detailed description of the present invention.
The echo cancellation method of the invention comprises the following steps:
S1, acquiring an analog microphone signal and an analog reference signal through a microphone array, converting them into digital time domain signals, and obtaining a digital microphone signal and a digital reference signal respectively;
wherein the analog microphone signal is the electrical signal output by the microphone after it picks up the sound emitted by the loudspeaker, and the analog reference signal is the electrical signal input into the loudspeaker;
s2, converting the digital microphone signal and the digital reference signal in the form of digital time domain signal into a microphone frequency domain signal and a reference frequency domain signal respectively by adopting short-time Fourier transform technology;
s3, performing linear prediction caching and nonlinear expansion on the reference frequency domain signal to obtain a reference frequency domain signal matrix, wherein the reference frequency domain signal matrix is composed of a plurality of reference frequency domain signal vectors;
The calculation process of the reference frequency domain signal vector REF_VEC_q(k,l) of the kth frequency point of the lth frame of the qth reference channel is:
S31, setting a linear prediction length LP and storing the prediction buffer vector:
REF_VEC_PRE_q(k,l) = [Ref_q(k,l), Ref_q(k,l-1), …, Ref_q(k,l-LP+1)]
where Ref_q(k,l) is the reference frequency domain signal of the kth frequency point of the lth frame of the qth reference channel, and so on;
S32, applying nonlinear expansion to the reference frequency domain signals stored in the prediction buffer vector REF_VEC_PRE_q(k,l), obtaining the nonlinearly expanded reference frequency domain signal vector Ref_VEC_q(k,l):
[equation image not reproduced: Ref_VEC_q(k,l) collects the expanded terms ref_vec_p_{q,p}(k,l) of the buffered signals]
p1, p2, … p_LP are the orders of the nonlinear expansion; ref_vec_p_{q,p}(k,l) denotes the p-order expanded reference frequency domain signal of the kth frequency point of the lth frame of the qth reference channel, and so on. The reference frequency domain signal Ref_q(k,l) of the kth frequency point of the lth frame of the qth reference channel is nonlinearly expanded through an odd power series, specifically:
ref_vec_p_{q,p}(k,l) = Ref_q(k,l)^(2p-1), and so on;
s33, traversing each frame, frequency point and reference channel, and combining all reference frequency domain signal vectors to obtain a reference frequency domain signal matrix REF _ VEC;
s4, calculating an autocorrelation diagonalization matrix R _ IVM of the reference frequency domain signal matrix REF _ VEC;
s5, calculating echo cancellation gain vector W of each frequency point, and performing echo cancellation on the microphone frequency domain signal obtained in the step S2;
S51, calculating the cross-correlation vector R_MIC_REF_q(k,l) between the microphone frequency domain signal Mic_n(k,l) of the kth frequency point of the lth frame of the nth microphone channel and the reference frequency domain signal of the kth frequency point of the lth frame of the qth channel;
S52, traversing all reference frequency domain signal vectors Ref_VEC_q(k,l) multiple times, taking the element of the nonlinearly expanded reference frequency domain signal vector used in the jth traversal as ref_vec_n_q(j),
The specific process of each traversal is as follows:
S521. Traversal intermediate variable R_y_ref = y_{n,j-1}(k,l) * conj(ref_vec_n_q(j));
where conj denotes the conjugate and y_{n,j-1}(k,l) is the residual speech frequency domain signal of the kth frequency point of the lth frame from the (j-1)th traversal of the nth microphone channel;
when j = 1, y_{n,j-1}(k,l) = Mic_n(k,l);
ref_vec_n_q(j) is the jth element of the reference frequency domain signal vector Ref_VEC_q(k,l);
S522. The smoothed cross-correlation signal of the kth frequency point of the lth frame in the jth traversal is r_cm_q(k,l,j) = λ * r_cm_q(k,l-1,j) + (1-λ) * R_y_ref, λ being a smoothing factor;
S523. Echo cancellation gain in the jth traversal:
W(j) = r_cm_q(k,l,j) / [r_ivm_q(k,l,j) + δ], where δ is a small value preventing the denominator from being zero;
r_ivm_q(k,l,j) is the autocorrelation diagonalization signal of the kth frequency point reference frequency domain signal of the lth frame of the qth channel in the jth traversal, taken from the autocorrelation diagonalization matrix R_IVM obtained in step S4;
S524. Perform the echo cancellation processing: the result y_{n,j-1}(k,l) of the previous traversal is used in the current traversal, and the residual speech frequency domain signal computed in the current traversal is
y_{n,j}(k,l) = y_{n,j-1}(k,l) - W(j) * ref_vec_n_q(j);
S6, after all traversals are finished, outputting the final output frequency domain signal obtained from the last traversal and converting it into a time domain signal.
One embodiment, as shown in fig. 1, may be implemented by the following steps:
s1, acquiring an analog microphone signal and an analog reference signal through a microphone array, and then converting the analog microphone signal and the analog reference signal into a digital time domain signal by adopting an analog-to-digital converter (ADC) to obtain a digital microphone signal and a digital reference signal;
wherein the analog microphone signal is the electrical signal output by the microphone after it picks up the sound emitted by the loudspeaker, and the analog reference signal is the electrical signal input into the loudspeaker;
S2, converting the digital time domain signals into digital frequency domain signals using the short-time Fourier transform:
the digital microphone signal and the digital reference signal obtained in step S1 are converted into frequency domain signals with K frequency points.
To detail the embodiment, the minimum system is taken as an example: a single-microphone, single-loudspeaker system with the number of microphones N = 1 and the number of reference channels Q = 1. The microphone time domain signal of the current lth frame of the digital microphone signal is converted into a microphone frequency domain signal, and the reference time domain signal of the current lth frame is converted into a reference frequency domain signal.
In a specific embodiment, a 512-point short-time Fourier transform is adopted, so the number of frequency points is K = 257; the microphone frequency domain signal and the reference frequency domain signal of each frame are then each of dimension 1 × K.
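As a sketch only (not the patent's own code), the 512-point short-time Fourier analysis described above can be illustrated in Python/NumPy; the function name, Hann window and hop size of 256 samples are assumptions:

```python
import numpy as np

def stft_frames(x, n_fft=512, hop=256):
    """Split a time-domain signal into windowed frames and return
    K = n_fft // 2 + 1 frequency points per frame (one-sided real FFT)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    spec = np.empty((n_frames, n_fft // 2 + 1), dtype=np.complex128)
    for l in range(n_frames):
        frame = x[l * hop : l * hop + n_fft] * window
        spec[l] = np.fft.rfft(frame)
    return spec

x = np.random.randn(16000)   # 1 s of audio at 16 kHz (illustrative)
Mic = stft_frames(x)
print(Mic.shape[1])          # 257 frequency points, matching K = 257
```

With a 512-point transform the one-sided spectrum has 512/2 + 1 = 257 bins, which is where K = 257 comes from.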
and S3, performing linear prediction caching and nonlinear expansion on the reference frequency domain signal to obtain a reference frequency domain signal matrix, wherein the reference frequency domain signal matrix is composed of a plurality of reference frequency domain signal vectors.
For convenience of description, the calculation process of the reference frequency domain signal vector REF_VEC_q(k,l) of the kth frequency point of the lth frame of the qth reference channel is described in detail; the value of q is at least 1.
Because the loudspeaker signal collected by the microphone has a strong linear correlation with the original reference signal, this correlation can be approximated using linear prediction. Implementing linear prediction requires buffering past frame signals, so the prediction buffer vector REF_VEC_PRE_q(k,l) of the kth frequency point of the lth frame of the qth reference channel is used for buffering; the linear prediction length is set to LP = 4 in this embodiment:
REF_VEC_PRE_q(k,l) = [Ref_q(k,l), Ref_q(k,l-1), …, Ref_q(k,l-4+1)];
Ref_q(k,l) is the reference frequency domain signal of the kth frequency point of the lth frame of the qth reference channel, and so on.
In practical applications, especially in embedded devices using micro-speakers, nonlinearity is inevitable. To weaken the influence of system nonlinear distortion on the echo cancellation effect, the reference frequency domain signal of the kth frequency point of the lth frame of the qth reference channel is nonlinearly expanded through an odd power series, giving the nonlinearly expanded reference frequency domain signal:
ref_vec_p_{q,i}(k,l) = Ref_q(k,l)^(2i-1), i = 1, 2, … m, where m is the expansion order; the vector of nonlinear expansion orders is P = [2,2,1,1].
To account for both linear prediction and nonlinear expansion, the reference frequency domain signals stored in the prediction buffer vector REF_VEC_PRE_q(k,l) are nonlinearly expanded to obtain the nonlinearly expanded reference frequency domain signal vector Ref_VEC_q(k,l):
[equation image not reproduced: Ref_VEC_q(k,l) stacks the odd-power expanded terms of the LP buffered signals]
Since P = [2,2,1,1], the rows are expanded from top to bottom with orders 2, 2, 1 and 1; ref_vec_p_{q,2}(k,l) denotes the 2-order expanded reference frequency domain signal of the kth frequency point of the lth frame of the qth reference channel, and so on;
traversing each frame, frequency point and reference channel, and combining all reference frequency domain signal vectors to obtain a reference frequency domain signal matrix REF _ VEC;
In this embodiment, the resulting Ref_VEC_q(k,l) has dimension 1 × 6.
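Steps S31 and S32 can be sketched as follows, assuming a single reference channel and a single frequency bin; the function name and the example values are illustrative, and the expansion-order vector defaults to the P = [2,2,1,1] used above:

```python
import numpy as np

def build_ref_vec(ref_hist, P=(2, 2, 1, 1)):
    """ref_hist: buffered reference spectra [Ref(k,l), Ref(k,l-1), ...] with
    len(ref_hist) == LP.  Each buffered value is expanded through an odd
    power series, the p-order term being Ref ** (2p - 1).  Returns the
    concatenated nonlinearly expanded vector Ref_VEC(k,l)."""
    out = []
    for ref, order in zip(ref_hist, P):
        for p in range(1, order + 1):
            out.append(ref ** (2 * p - 1))   # Ref, Ref^3, Ref^5, ...
    return np.asarray(out)

# LP = 4 buffered frames at one frequency bin (complex spectra, made up)
hist = [0.5 + 0.1j, 0.4 - 0.2j, 0.3 + 0.0j, 0.2 + 0.3j]
vec = build_ref_vec(hist)
print(len(vec))   # 2 + 2 + 1 + 1 = 6 elements, matching the 1 x 6 dimension
```

The first buffered frame contributes its 1st and 3rd powers, the second likewise, and the last two frames contribute only their linear terms.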
And S4, calculating an autocorrelation diagonalization matrix R _ IVM of the reference frequency domain signal matrix REF _ VEC.
For convenience of description, the calculation process of the autocorrelation diagonalization vector R_IVM_q(k,l) of the kth frequency point of the lth frame of the qth reference channel is described in detail.
Usually, an autocorrelation matrix of the reference signal is calculated in an adaptive filter to analyze the correlation between the microphone signal and the reference signal, which would require computing the autocorrelation matrix R_Ref_VEC_q(k,l) of the kth frequency point prediction buffer vector of the lth frame of the qth reference channel. Since the vector Ref_VEC_q(k,l) has dimension 1 × 6, the matrix R_Ref_VEC_q(k,l) has dimension 6 × 6. When the order is set large to obtain a good echo cancellation effect, this creates a huge burden for the subsequent calculation, and the memory space it requires is prohibitive for practical embedded products. The matrix is therefore approximated by its diagonal sequence: after reduction, the diagonal reduced matrix of the kth frequency point of the lth frame of the qth reference channel is R_Ref_q(k,l) = Ref_VEC_q(k,l) * Ref_VEC_q(k,l)^H,
where the superscript H denotes the conjugate transpose and * denotes the dot product.
The resulting R_Ref_q(k,l) has dimension 1 × 6; compared with the 6 × 6 matrix R_Ref_VEC_q(k,l), the diagonal approximation greatly reduces both memory and the subsequent amount of calculation.
Because the amplitude of the reference signal fluctuates strongly in actual processing, R_Ref_q(k,l) is smoothed for system stability, giving the smoothed autocorrelation diagonalized vector of the reference frequency domain signal:
R_IVM_q(k,l) = λ * R_IVM_q(k,l-1) + (1-λ) * R_Ref_q(k,l)
where λ is a smoothing factor, generally between 0.7 and 0.999; λ = 0.99 in the present embodiment;
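Steps S41 and S42 can be sketched for one frequency bin as follows; the function name is an assumption, the recursion assumes the previous-frame value feeds the smoothing, and λ = 0.99 as in the embodiment:

```python
import numpy as np

def update_r_ivm(r_ivm_prev, ref_vec, lam=0.99):
    """Diagonal approximation of the reference autocorrelation: instead of
    the full outer product (a 6x6 matrix for a 1x6 vector), keep only the
    element-wise power |Ref_VEC|^2 (a 1x6 vector), smoothed across frames."""
    r_ref = ref_vec * np.conj(ref_vec)        # Ref_VEC .* Ref_VEC^H (diagonal)
    return lam * r_ivm_prev + (1.0 - lam) * r_ref

# a made-up expanded reference vector held fixed over a few frames
vec = np.array([0.5 + 0.1j, 0.125, 0.4 - 0.2j, 0.064, 0.3, 0.2 + 0.3j])
r_ivm = np.zeros(6, dtype=np.complex128)
for _ in range(3):                             # three frames of smoothing
    r_ivm = update_r_ivm(r_ivm, vec)
print(bool(np.all(r_ivm.real >= 0)))           # True: power terms stay non-negative
```

Keeping only the diagonal trades some accuracy for a six-fold memory reduction per bin (1 × 6 instead of 6 × 6), which is the point made in the text above.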
s5, calculating echo cancellation gain vector W of each frequency point, and performing echo cancellation on the microphone frequency domain signal obtained in the step S2;
S51, calculating the cross-correlation vector R_MIC_REF_q(k,l) between the microphone frequency domain signal Mic_n(k,l) of the kth frequency point of the lth frame of the nth microphone channel and the reference frequency domain signal of the kth frequency point of the lth frame of the qth channel; for a single microphone system, n = 1;
S52, traversing all reference frequency domain signal vectors Ref_VEC_q(k,l) multiple times, taking the element of the nonlinearly expanded reference frequency domain signal vector used in the jth traversal as ref_vec_n_q(j),
The specific process of each traversal is as follows:
S521. R_y_ref = y_{n,j-1}(k,l) * conj(ref_vec_n_q(j));
where conj denotes the conjugate and y_{n,j-1}(k,l) is the residual speech frequency domain signal of the kth frequency point of the lth frame from the (j-1)th traversal of the nth microphone channel;
when j = 1, y_{n,j-1}(k,l) = Mic_n(k,l);
ref_vec_n_q(j) is the jth element of the reference frequency domain signal vector Ref_VEC_q(k,l);
S522. The smoothed cross-correlation signal of the kth frequency point of the lth frame in the jth traversal:
r_cm_q(k,l,j) = λ * r_cm_q(k,l-1,j) + (1-λ) * R_y_ref, λ being the smoothing factor;
S523. Echo cancellation gain in the jth traversal:
W(j) = r_cm_q(k,l,j) / [r_ivm_q(k,l,j) + δ], where δ is a small value preventing the denominator from being zero; δ = 10^-6 may be taken;
r_ivm_q(k,l,j) is the autocorrelation diagonalization signal of the kth frequency point reference frequency domain signal of the lth frame of the qth channel in the jth traversal, taken from the autocorrelation diagonalization matrix R_IVM obtained in step S4;
S524. Perform the echo cancellation processing: the result y_{n,j-1}(k,l) of the previous traversal is used in the current traversal, and the residual speech frequency domain signal computed in the current traversal is
y_{n,j}(k,l) = y_{n,j-1}(k,l) - W(j) * ref_vec_n_q(j);
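The traversal of steps S521 to S524 for a single frequency bin can be sketched as follows; this is a single-channel illustration with made-up numbers, the function name is an assumption, and the cross-correlation state r_cm is assumed initialized to zero:

```python
import numpy as np

def cancel_bin(mic, ref_vec, r_ivm, r_cm, lam=0.99, delta=1e-6):
    """One frame of echo cancellation at one frequency bin.
    mic:     microphone spectrum Mic_n(k,l)         (complex scalar)
    ref_vec: nonlinearly expanded reference vector  (length J)
    r_ivm:   smoothed autocorrelation diagonal      (length J)
    r_cm:    smoothed cross-correlation state       (length J, updated in place)
    Returns the residual speech spectrum after traversing all J elements."""
    y = mic
    for j in range(len(ref_vec)):
        r_y_ref = y * np.conj(ref_vec[j])                    # S521
        r_cm[j] = lam * r_cm[j] + (1.0 - lam) * r_y_ref      # S522
        w = r_cm[j] / (r_ivm[j] + delta)                     # S523
        y = y - w * ref_vec[j]                               # S524
    return y

ref_vec = np.array([0.5, 0.125, 0.4, 0.064, 0.3, 0.2], dtype=np.complex128)
r_ivm = np.abs(ref_vec) ** 2          # pre-smoothed diagonal, held fixed here
r_cm = np.zeros_like(ref_vec)
for _ in range(50):                   # repeated frames let the gains adapt
    y_out = cancel_bin(0.8 + 0.0j, ref_vec, r_ivm, r_cm)
print(bool(abs(y_out) < 0.8))         # True: the residual shrinks vs. the input
```

Note that each element's subtraction uses the residual left by the previous element, so the traversal acts as a cascade of one-tap cancellers rather than a joint solve, which is what avoids the matrix inversion mentioned in the background.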
S6, after all traversals are finished, outputting the final output frequency domain signal obtained from the last traversal and converting it into a time domain signal.
In step S6, an inverse short-time Fourier transform (ISTFT) module may be used to convert the echo-cancelled frequency domain signal into a time domain signal, which can then be transmitted directly to the next processing module in the system.
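The ISTFT conversion of step S6 can be sketched as inverse real FFTs followed by overlap-add; the function name, synthesis window and hop size are assumptions matching the earlier 512-point analysis sketch:

```python
import numpy as np

def istft_frames(spec, n_fft=512, hop=256):
    """Inverse of a one-sided STFT: inverse real FFT per frame, then
    overlap-add with a synthesis window to rebuild the time-domain signal."""
    n_frames = spec.shape[0]
    out = np.zeros(n_fft + hop * (n_frames - 1))
    window = np.hanning(n_fft)
    for l in range(n_frames):
        out[l * hop : l * hop + n_fft] += np.fft.irfft(spec[l], n_fft) * window
    return out

# one synthetic frame: the spectrum of a Hann-windowed constant signal
spec = np.fft.rfft(np.hanning(512) * np.ones(512))[None, :]
y = istft_frames(spec)
print(y.shape[0])   # 512 samples reconstructed from a single frame
```

A production implementation would also compensate for the analysis-times-synthesis window gain; this sketch omits that normalization for brevity.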
Compared with conventional echo cancellation methods, the scheme of the invention uses a fast echo cancellation algorithm to effectively improve the echo cancellation effect. Figs. 3 and 4 show a specific embodiment of the invention, with the echo cancellation processing performed by the echo cancellation device shown in fig. 2. Fig. 3 shows the time domain signals before echo cancellation: the (A1) signal is the microphone signal acquired by the microphone array, containing ambient noise, sound played by the loudspeaker, and the target human voice, the target human voice being a command word played by a sound source 3 m from the microphone; the (A2) signal is the reference signal, i.e. the signal input to the loudspeaker by the audio source shown in fig. 2.
Fig. 4 shows the waveforms obtained after echo cancellation processing: part (A4) is the output processed by the echo cancellation device of the present invention shown in fig. 2, and part (A3) is the output processed by a prior-art RLS (recursive least squares) echo cancellation method. As can be seen from fig. 4, with the present invention the difference between the target speech (the parts with larger voltage amplitude in the waveforms) and the echo residual (the parts with smaller amplitude) is larger; that is, the signal-to-noise ratio of the processed speech signal is higher, indicating a better echo cancellation effect.
The foregoing describes preferred embodiments of the present invention. These preferred embodiments may be combined in any manner that is not obviously contradictory and does not violate a prerequisite of a particular embodiment. The specific parameters in the examples serve only to clearly illustrate the inventor's verification process and are not intended to limit the scope of patent protection of the present invention, which is defined by the claims; equivalent structural changes based on the content of the description also fall within the protection scope of the present invention.

Claims (4)

1. An echo cancellation method, comprising the steps of:
S1, acquiring an analog microphone signal through a microphone array and an analog reference signal, converting the analog microphone signal and the analog reference signal into digital time domain signals, and obtaining a digital microphone signal and a digital reference signal, respectively;
wherein the analog microphone signal is the electrical signal output by the microphone after receiving the sound emitted by the loudspeaker, and the analog reference signal is the electrical signal input into the loudspeaker;
S2, converting the digital microphone signal and the digital reference signal from digital time domain form into a microphone frequency domain signal and a reference frequency domain signal, respectively, using a short-time Fourier transform;
S3, performing linear prediction caching and nonlinear expansion on the reference frequency domain signal to obtain a reference frequency domain signal matrix, wherein the reference frequency domain signal matrix is composed of a plurality of reference frequency domain signal vectors;
the reference frequency domain signal vector Ref_VEC_q(k,l) of frequency point k of frame l of the qth reference channel is calculated as follows:
S31, setting a linear prediction length LP, and building the prediction buffer for the kth frequency point of the lth frame of the qth reference channel:
REF_VEC_PRE_q(k,l) = [Ref_q(k,l), Ref_q(k,l-1), …, Ref_q(k,l-LP+1)]
where Ref_q(k,l) is the reference frequency domain signal of the kth frequency point of the lth frame of the qth reference channel, and so on for the other elements;
S32, performing nonlinear expansion on the reference frequency domain signals stored in the prediction buffer vector REF_VEC_PRE_q(k,l) to obtain the nonlinearly expanded reference frequency domain signal vector:
Ref_VEC_q(k,l) = [ref_vec_p_{q,p1}(k,l), ref_vec_p_{q,p2}(k,l), …, ref_vec_p_{q,pLP}(k,l)]
where p1, p2, …, pLP are the orders of the nonlinear expansion, LP is the linear prediction length, and ref_vec_p_{q,p}(k,l) denotes the p-order expanded reference frequency domain signal of the kth frequency point of the lth frame of the qth reference channel; the reference frequency domain signal Ref_q(k,l) is nonlinearly expanded through an odd power series, specifically:
ref_vec_p_{q,p}(k,l) = Ref_q(k,l)^(2p-1), and so on for the other elements;
S33, traversing each frame, frequency point, and reference channel, and combining all reference frequency domain signal vectors to obtain a reference frequency domain signal matrix REF_VEC;
S4, calculating an autocorrelation diagonalization matrix R_IVM of the reference frequency domain signal matrix REF_VEC;
S5, calculating an echo cancellation gain vector W for each frequency point, and performing echo cancellation on the microphone frequency domain signal obtained in step S2;
S51, calculating the cross-correlation vector R_MIC_Ref_q(k,l) between the microphone frequency domain signal Mic_n(k,l) of the kth frequency point of the lth frame of the nth microphone channel and the reference frequency domain signal of the kth frequency point of the lth frame of the qth channel;
S52, traversing all the reference frequency domain signal vectors Ref_VEC_q(k,l) multiple times, taking the element of the nonlinearly expanded reference frequency domain signal vector used in the jth traversal as ref_vec_n_q(j).
The specific process of each traversal is as follows:
S521. R_y_ref = y_{n,j-1}(k,l) * conj(ref_vec_n_q(j));
where conj denotes the conjugate, y_{n,j-1}(k,l) is the residual voice frequency domain signal of the kth frequency point of the lth frame of the nth microphone channel after the (j-1)th traversal, and R_y_ref is a traversal intermediate variable;
when j = 1, y_{n,j-1}(k,l) = Mic_n(k,l);
S522, the smoothed cross-correlation signal of the kth frequency point of the lth frame in the jth traversal is r_cm_q(k,l,j) = λ * r_cm_q(k,l-1,j) + (1-λ) * R_y_ref, where λ is a smoothing factor;
S523, the echo cancellation gain in the jth traversal:
W(j) = r_cm_q(k,l,j) / [r_ivm_q(k,l,j) + δ], where δ is a small value that prevents the denominator from being zero;
and r_ivm_q(k,l,j) is the autocorrelation diagonalization signal of the reference frequency domain signal of the kth frequency point of the lth frame of the qth channel in the jth traversal, taken from the autocorrelation diagonalization matrix R_IVM obtained in step S4;
S524, performing the echo cancellation processing: the result y_{n,j-1}(k,l) of the previous traversal of the residual voice frequency domain signal is used in the current traversal, and the residual voice frequency domain signal y_{n,j}(k,l) of the current traversal is calculated as
y_{n,j}(k,l) = y_{n,j-1}(k,l) - W(j) * ref_vec_n_q(j);
S6, after all traversals are finished, outputting the final output frequency domain signal obtained by the last traversal and converting it into a time domain signal.
2. The echo cancellation method according to claim 1, wherein step S4 specifically comprises:
S41, calculating the diagonal reduced matrix of the kth frequency point of the lth frame of the qth reference channel in the reference frequency domain signal matrix REF_VEC:
R_Ref_q(k,l) = Ref_VEC_q(k,l) * Ref_VEC_q(k,l)^H
where Ref_VEC_q(k,l) is the reference frequency domain signal vector obtained by nonlinear expansion of the kth frequency point of the lth frame of the qth reference channel, the superscript H denotes the conjugate transpose, and * denotes a dot product;
S42, computing the autocorrelation diagonalization vector of the reference frequency domain signal of the kth frequency point of the lth frame of the qth reference channel:
R_IVM_q(k,l) = λ * R_IVM_q(k,l-1) + (1-λ) * R_Ref_q(k,l);
where λ is the smoothing factor and R_Ref_q(k,l) is the diagonal reduced matrix of the kth frequency point of the lth frame of the qth reference channel;
S43, traversing each frame, frequency point, and reference channel, and combining all the reference frequency domain signal autocorrelation diagonalization vectors to obtain the autocorrelation diagonalization matrix R_IVM.
3. The echo cancellation method according to claim 1, wherein the smoothing factor λ is in the range of 0.7-0.99.
4. The echo cancellation method according to claim 1, wherein in step S6, the frequency domain signal after echo cancellation processing is converted into a time domain signal using an inverse short-time Fourier transform module.
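The per-frequency-bin processing of claim 1 (prediction buffering, odd-power expansion, recursive smoothing, and the traversal loop S521-S524) can be sketched as follows. This is a minimal single-channel sketch, not the patented implementation: the values of `LP`, `lam` (λ), and `delta` (δ) are assumptions (claim 3 only bounds λ to 0.7-0.99), and taking the expansion orders p1…pLP as 1…LP, one per buffered frame, is one reading of step S32.

```python
import numpy as np

def cancel_bin(mic, ref, LP=4, lam=0.9, delta=1e-8):
    """Echo cancellation for one frequency bin k, following claim 1 (S3-S6).

    mic, ref: complex STFT values of the bin over successive frames l
    (one microphone channel n, one reference channel q).
    """
    orders = np.arange(1, LP + 1)          # assumed orders p1, p2, ..., pLP
    r_ivm = np.zeros(LP)                   # diagonal autocorrelation (S4/S42)
    r_cm = np.zeros(LP, dtype=complex)     # smoothed cross-correlations (S522)
    out = np.empty(len(mic), dtype=complex)
    for l in range(len(mic)):
        # S31: prediction buffer [Ref(k,l), Ref(k,l-1), ..., Ref(k,l-LP+1)]
        buf = np.array([ref[l - i] if l - i >= 0 else 0.0 for i in range(LP)],
                       dtype=complex)
        # S32: odd-power nonlinear expansion, Ref^(2p-1)
        ref_vec = buf ** (2 * orders - 1)
        # S42: recursive smoothing of the diagonal of Ref_VEC * Ref_VEC^H
        r_ivm = lam * r_ivm + (1 - lam) * np.abs(ref_vec) ** 2
        # S52: LP traversals; y_{n,0}(k,l) = Mic_n(k,l)
        y = mic[l]
        for j in range(LP):
            r_y_ref = y * np.conj(ref_vec[j])                # S521
            r_cm[j] = lam * r_cm[j] + (1 - lam) * r_y_ref    # S522
            W = r_cm[j] / (r_ivm[j] + delta)                 # S523
            y = y - W * ref_vec[j]                           # S524
        out[l] = y   # S6: residual after the last traversal
    return out

# toy check: a purely linear echo (mic = 0.8 * ref) should be removed
rng = np.random.default_rng(0)
ref = 0.5 * (rng.standard_normal(2000) + 1j * rng.standard_normal(2000))
mic = 0.8 * ref
res = cancel_bin(mic, ref)
```

With a purely linear echo, the first traversal's gain W converges to the echo path gain, so the residual `res` decays toward zero once the smoothed statistics settle; the higher odd-power terms only contribute when the loudspeaker path is nonlinear.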
CN202111277825.4A 2021-10-30 2021-10-30 Echo cancellation method Active CN113763980B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111277825.4A CN113763980B (en) 2021-10-30 2021-10-30 Echo cancellation method


Publications (2)

Publication Number Publication Date
CN113763980A true CN113763980A (en) 2021-12-07
CN113763980B CN113763980B (en) 2023-05-12

Family

ID=78784583

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111277825.4A Active CN113763980B (en) 2021-10-30 2021-10-30 Echo cancellation method

Country Status (1)

Country Link
CN (1) CN113763980B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170270935A1 (en) * 2016-03-18 2017-09-21 Qualcomm Incorporated Audio signal decoding
WO2018195299A1 (en) * 2017-04-21 2018-10-25 Qualcomm Incorporated Non-harmonic speech detection and bandwidth extension in a multi-source environment
CN112820311A (en) * 2021-04-16 2021-05-18 成都启英泰伦科技有限公司 Echo cancellation method and device based on spatial prediction
CN113114865A (en) * 2021-04-09 2021-07-13 苏州大学 Combined function linkage type kernel self-response nonlinear echo cancellation method


Non-Patent Citations (2)

Title
QIAN ZHANG: "Orthogonal Least Squares Based Incremental Echo State Networks for Nonlinear Time Series Data Analysis", IEEE ACCESS *
YAN TAO: "Research on Adaptive Filtering Methods for Acoustic Echo Cancellation", China Master's Theses Full-text Database (Information Science and Technology) *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant