CN112863535B

CN112863535B - Residual echo and noise elimination method and device

Info

Publication number: CN112863535B
Application number: CN202110008502.9A
Authority: CN
Inventors: 李军锋; 顾建军; 颜永红
Original assignee: Institute of Acoustics CAS
Current assignee: Institute of Acoustics CAS
Priority date: 2021-01-05
Filing date: 2021-01-05
Publication date: 2022-04-26
Anticipated expiration: 2041-01-05
Also published as: CN112863535A

Abstract

The embodiments of the application disclose a method and a device for eliminating residual echo and noise. The method comprises: performing framing, windowing and Fourier transformation on a received voice time domain signal containing echo and noise and on a far-end reference sound time domain signal to obtain the corresponding frequency domain signals, determining an echo frequency domain signal, and further determining a voice frequency domain signal containing residual echo and noise; performing energy normalization processing on the amplitude spectra of the voice frequency domain signal containing residual echo and noise, the echo frequency domain signal and the far-end reference audio frequency domain signal to obtain the corresponding features; determining a target voice frequency domain signal according to those features and a trained cascade network; and performing an inverse Fourier transform on the target voice frequency domain signal to obtain a target voice time domain signal. In the embodiments of the application, a feature attention model assigns different importance to the input features, which reduces redundant information in them, and the cascade network is trained with a multi-domain loss function, which reduces the sensitivity of the model to signal energy.

Description

Residual echo and noise elimination method and device

Technical Field

The present invention relates to the field of echo and noise cancellation, and more particularly to a method and an apparatus for residual echo and noise cancellation.

Background

At present, echo cancellation technology mainly removes the echo that a far-end reference sound signal produces in a speech signal, while speech noise reduction technology mainly removes background noise and directional noise interference from the speech signal. Both aim to improve the quality and intelligibility of speech. In echo cancellation, combining an adaptive filtering method based on traditional signal processing with a residual echo cancellation method based on deep learning can effectively improve the generalization performance of the system.

However, in conventional methods, residual echo cancellation and noise cancellation are performed independently and separately, and the correlation between the two tasks is not considered. Several signal features are available in the residual echo cancellation task; they differ in physical meaning and in importance, but conventional methods do not take these differences into account. When training a residual echo and noise elimination model, the prior art mostly adopts the mean square error between the target amplitude spectrum and the estimated amplitude spectrum as the loss function, but this loss function depends on the energy of the signals, so signals with different energies yield losses on different scales.

Disclosure of Invention

Because the existing method has the above problems, the present application provides a method and an apparatus for eliminating residual echo and noise.

In a first aspect, an embodiment of the present application provides a residual echo and noise cancellation method, including:

receiving a voice time domain signal containing echo and noise and a far-end reference sound time domain signal;

respectively performing framing, windowing and Fourier transformation on the voice time domain signal containing the echo and the noise and the far-end reference sound time domain signal to obtain a voice frequency domain signal containing the echo and the noise and a far-end reference sound frequency domain signal;

determining an echo frequency domain signal according to the voice frequency domain signal containing the echo and the noise and the far-end reference audio frequency domain signal;

determining a voice frequency domain signal containing residual echo and noise according to the voice frequency domain signal containing echo and noise and the echo frequency domain signal;

performing energy normalization processing on the amplitude spectrum of the voice frequency domain signal containing the residual echo and the noise, the amplitude spectrum of the echo frequency domain signal and the amplitude spectrum of the far-end reference audio frequency domain signal to obtain a voice frequency domain signal characteristic containing the residual echo and the noise, an echo frequency domain signal characteristic and a far-end reference audio frequency domain signal characteristic;

splicing the voice frequency domain signal characteristics containing the residual echo and the noise with the far-end reference audio frequency domain signal characteristics to obtain a first splicing result, and splicing the voice frequency domain signal characteristics containing the residual echo and the noise with the echo frequency domain signal characteristics to obtain a second splicing result;

inputting the first splicing result and the second splicing result into a trained feature attention model in the trained cascade network to obtain a first attention weight corresponding to the far-end reference audio frequency domain signal characteristic and a second attention weight corresponding to the echo frequency domain signal characteristic;

multiplying the far-end reference audio frequency domain signal characteristic by the first attention weight to obtain a first fused attention mechanism characteristic, and multiplying the echo frequency domain signal characteristic by the second attention weight to obtain a second fused attention mechanism characteristic;

splicing the first fusion attention mechanism characteristic, the second fusion attention mechanism characteristic and the voice frequency domain signal characteristic containing residual echo and noise to obtain a first fusion splicing result;

inputting the first fusion splicing result into a trained residual echo and noise elimination model in the trained cascade network to obtain a masking estimation value of a target voice frequency domain signal;

obtaining the target voice frequency domain signal according to the masking estimation value of the target voice frequency domain signal and the voice frequency domain signal containing the residual echo and the noise;

and carrying out inverse Fourier transform on the target voice frequency domain signal to obtain a target voice time domain signal.
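The first and last steps of the method can be sketched as a transform pair. The sketch below is illustrative only: the 512-point frame and the Hamming window come from the embodiment, while the frame shift `HOP`, the helper names `stft` and `istft`, and the overlap-add normalisation by the summed window are assumptions, since the text does not say how frames overlap or how they are recombined.

```python
import numpy as np

FRAME_LEN = 512   # frame length used in the embodiment
HOP = 256         # assumption: the source does not state the frame shift


def stft(x, frame_len=FRAME_LEN, hop=HOP):
    """Frame, zero-pad, Hamming-window and Fourier-transform a 1-D signal."""
    n_frames = max(1, int(np.ceil((len(x) - frame_len) / hop)) + 1)
    padded = np.zeros((n_frames - 1) * hop + frame_len)
    padded[:len(x)] = x                      # zero padding at the end
    window = np.hamming(frame_len)
    frames = np.stack([padded[i * hop:i * hop + frame_len]
                       for i in range(n_frames)])
    return np.fft.rfft(frames * window, axis=1)


def istft(spec, length, frame_len=FRAME_LEN, hop=HOP):
    """Inverse FFT per frame, then overlap-add normalised by the summed window."""
    window = np.hamming(frame_len)
    frames = np.fft.irfft(spec, n=frame_len, axis=1)
    total = (spec.shape[0] - 1) * hop + frame_len
    out = np.zeros(total)
    norm = np.zeros(total)
    for i, frame in enumerate(frames):
        out[i * hop:i * hop + frame_len] += frame
        norm[i * hop:i * hop + frame_len] += window
    # The Hamming window is never zero, so the division is safe.
    return (out / norm)[:length]
```

With an unmodified spectrum, `istft(stft(x), len(x))` should return the input signal up to rounding error; in the method, the spectrum passed to the inverse transform would be the target voice frequency domain signal instead.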

In one possible implementation, respectively performing framing, windowing and Fourier transformation on the voice time domain signal containing the echo and the noise and the far-end reference sound time domain signal includes:

for each of the voice time domain signal containing the echo and the noise and the far-end reference sound time domain signal, taking a preset number of sampling points as one frame signal; if a frame contains fewer sampling points than the preset number, first zero-padding it to the preset number;

windowing each frame signal; wherein, the windowing function adopts a Hamming window;

and performing Fourier transform on each windowed frame signal.

In one possible implementation, the determining an echo frequency domain signal according to the echo and noise-containing speech frequency domain signal and the far-end reference audio frequency domain signal includes:

inputting the voice frequency domain signal containing the echo and the noise and the far-end reference audio frequency domain signal into a Kalman filter to obtain a filter coefficient and the echo frequency domain signal;

the echo frequency domain signal is a result of multiplying the filter coefficient and the far-end reference sound frequency domain signal.

In one possible implementation, the determining a voice frequency domain signal containing residual echo and noise according to the voice frequency domain signal containing echo and noise and the echo frequency domain signal includes:

and subtracting the echo frequency domain signal from the voice frequency domain signal containing the echo and the noise to obtain the voice frequency domain signal containing the residual echo and the noise.

In a possible implementation, performing energy normalization processing on the amplitude spectrum of the voice frequency domain signal containing the residual echo and the noise, the amplitude spectrum of the echo frequency domain signal, and the amplitude spectrum of the far-end reference audio frequency domain signal to obtain a voice frequency domain signal feature containing the residual echo and the noise, an echo frequency domain signal feature, and a far-end reference audio frequency domain signal feature includes:

respectively determining a first function, a second function and a third function corresponding to the amplitude spectrum of the voice frequency domain signal containing the residual echo and the noise, the amplitude spectrum of the echo frequency domain signal and the amplitude spectrum of the far-end reference audio frequency domain signal;

determining the voice frequency domain signal characteristics containing the residual echo and the noise according to a first function corresponding to the amplitude spectrum of the voice frequency domain signal containing the residual echo and the noise, and the mean value and the variance of the voice frequency domain signal characteristics containing the residual echo and the noise;

determining the echo frequency domain signal characteristics according to a second function corresponding to the amplitude spectrum of the echo frequency domain signal, and the mean value and the variance of the echo frequency domain signal characteristics;

and determining the characteristics of the far-end reference audio frequency domain signal according to a third function corresponding to the amplitude spectrum of the far-end reference audio frequency domain signal and the mean value and the variance of the characteristics of the far-end reference audio frequency domain signal.

In one possible implementation, the trained cascade network is trained by:

receiving a first voice time domain signal containing echo and noise, a first far-end reference sound time domain signal and a first target voice time domain signal;

performing framing, windowing and Fourier transformation on the first voice time domain signal containing the echo and the noise, the first far-end reference sound time domain signal and the first target voice time domain signal respectively to obtain a first voice frequency domain signal containing the echo and the noise, a first far-end reference audio frequency domain signal and a first target voice frequency domain signal;

determining a first echo frequency domain signal according to the first voice frequency domain signal containing echo and noise and the first far-end reference audio frequency domain signal;

determining a first voice frequency domain signal containing residual echo and noise according to the first voice frequency domain signal containing echo and noise and the first echo frequency domain signal;

performing energy normalization processing on the amplitude spectrum of the first voice frequency domain signal containing residual echo and noise, the amplitude spectrum of the first echo frequency domain signal and the amplitude spectrum of the first far-end reference audio frequency domain signal to obtain a first voice frequency domain signal characteristic containing residual echo and noise, a first echo frequency domain signal characteristic and a first far-end reference audio frequency domain signal characteristic;

splicing the first voice frequency domain signal characteristic containing residual echo and noise with the first far-end reference audio frequency domain signal characteristic to obtain a first splicing characteristic, and splicing the first voice frequency domain signal characteristic containing residual echo and noise with the first echo frequency domain signal characteristic to obtain a second splicing characteristic;

inputting the first splicing characteristic and the second splicing characteristic into a feature attention model in a cascade network so as to jointly train the feature attention model and a residual echo and noise elimination model in the cascade network, and obtaining a first weight corresponding to the first far-end reference audio frequency domain signal characteristic and a second weight corresponding to the first echo frequency domain signal characteristic;

multiplying the first far-end reference audio frequency domain signal characteristic by the first weight to obtain a first fusion characteristic, and multiplying the first echo frequency domain signal characteristic by the second weight to obtain a second fusion characteristic;

splicing the first fusion characteristic, the second fusion characteristic and the first voice frequency domain signal characteristic containing residual echo and noise to obtain a first fusion splicing characteristic;

inputting the first fusion splicing characteristic into a residual echo and noise elimination model in the cascade network to obtain a masking estimation value of a second target voice frequency domain signal;

determining a second target voice frequency domain signal according to the masking estimation value of the second target voice frequency domain signal and the first voice frequency domain signal containing residual echo and noise;

determining a multi-domain loss function according to at least two loss functions; wherein the at least two loss functions comprise an energy-independent amplitude spectrum loss function and an objective voice quality evaluation score loss function; the energy-independent amplitude spectrum loss function takes the amplitude spectrum of the first target voice frequency domain signal as a training target and is determined according to the second target voice frequency domain signal; the objective voice quality evaluation score loss function is determined by taking the improvement of perceived voice quality as a training target;

and iteratively reducing the multi-domain loss function by continuously updating the model parameters to obtain the trained cascade network.

In a second aspect, an embodiment of the present application further provides a residual echo and noise cancellation apparatus, including:

the receiving module is used for receiving a voice time domain signal containing echo and noise and a far-end reference sound time domain signal;

the processing module is used for respectively performing framing, windowing and Fourier transform on the voice time domain signal containing the echo and the noise and the far-end reference sound time domain signal to obtain a voice frequency domain signal containing the echo and the noise and a far-end reference sound frequency domain signal;

the determining module is used for determining an echo frequency domain signal according to the voice frequency domain signal containing the echo and the noise and the far-end reference audio frequency domain signal;

the determining module is further configured to determine a voice frequency domain signal containing residual echo and noise according to the voice frequency domain signal containing echo and noise and the echo frequency domain signal;

the energy normalization module is used for carrying out energy normalization processing on the amplitude spectrum of the voice frequency domain signal containing the residual echo and the noise, the amplitude spectrum of the echo frequency domain signal and the amplitude spectrum of the far-end reference audio frequency domain signal to obtain a voice frequency domain signal characteristic containing the residual echo and the noise, an echo frequency domain signal characteristic and a far-end reference audio frequency domain signal characteristic;

the splicing module is used for splicing the voice frequency domain signal characteristics containing the residual echo and the noise with the far-end reference audio frequency domain signal characteristics to obtain a first splicing result, and splicing the voice frequency domain signal characteristics containing the residual echo and the noise with the echo frequency domain signal characteristics to obtain a second splicing result;

a weight obtaining module, configured to input the first splicing result and the second splicing result into a trained feature attention model in the trained cascade network, and obtain a first attention weight corresponding to the far-end reference audio frequency domain signal feature and a second attention weight corresponding to the echo frequency domain signal feature;

a fused attention mechanism feature obtaining module, configured to multiply the far-end reference audio frequency domain signal feature by the first attention weight to obtain a first fused attention mechanism feature, and multiply the echo frequency domain signal feature by the second attention weight to obtain a second fused attention mechanism feature;

the splicing module is further configured to splice the first fusion attention mechanism feature, the second fusion attention mechanism feature and the voice frequency domain signal feature containing the residual echo and the noise to obtain a first fusion splicing result;

a masking estimation value obtaining module, configured to input the first fusion splicing result into the trained residual echo and noise elimination model in the trained cascade network, so as to obtain a masking estimation value of the target voice frequency domain signal;

a target voice frequency domain signal obtaining module, configured to obtain a target voice frequency domain signal according to the masking estimation value of the target voice frequency domain signal and the voice frequency domain signal containing the residual echo and the noise;

and the inverse Fourier transform module is used for performing inverse Fourier transform on the target voice frequency domain signal to obtain a target voice time domain signal.

In a third aspect, an embodiment of the present application further provides a residual echo and noise cancellation apparatus, including at least one processor, where the processor is configured to execute a program stored in a memory, and when the program is executed, the apparatus is caused to perform the steps as in the first aspect and in various possible implementations.

In a fourth aspect, embodiments of the present application further propose a non-transitory computer-readable storage medium having stored thereon a computer program, which when executed by a processor implements the steps as in the first aspect and various possible implementations.

The beneficial effect of the embodiments of the application is that residual echo and noise need not be eliminated separately; both are eliminated in a single pass through the trained cascade network. The trained feature attention model assigns different importance to the input features and reduces redundant information in them, which improves the cascade network's performance in eliminating residual echo and noise. Training the cascade network with a multi-domain loss function that combines the energy-independent amplitude spectrum loss function with the objective voice quality assessment score loss function reduces the sensitivity of the model to signal energy and improves the perceived quality of the output voice.

Drawings

In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed for describing them are briefly introduced below. The drawings described below show only some embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic process diagram of training a cascade network capable of eliminating residual echo and noise according to an embodiment of the present disclosure;

Fig. 2 is a schematic flowchart illustrating a process of eliminating residual echo and noise by using a trained cascade network according to an embodiment of the present application;

Fig. 3 is a schematic structural diagram of a residual echo and noise cancellation device according to an embodiment of the present disclosure.

Detailed Description

The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application. The following examples are only for illustrating the technical solutions of the present application more clearly, and the protection scope of the present application is not limited thereby.

In the traditional method, residual echo and noise elimination are always independently and separately carried out. The importance of each of the plurality of signal features is not taken into account in the residual echo suppression task. When training a residual echo and noise elimination model, the mean square error of a target amplitude spectrum and an estimated amplitude spectrum is mostly adopted as a loss function. The loss function depends on the magnitude of the signal energy. Therefore, the application provides a method and a device for eliminating residual echo and noise, which can endow different importance to a plurality of signal characteristics, reduce redundant information in the plurality of signal characteristics, and simultaneously, adopt a multi-domain loss function to train a cascade network, thereby reducing the sensitivity of a model to signal energy.

In the embodiment of the present application, a schematic process diagram of training a cascade network capable of eliminating residual echo and noise is shown in fig. 1, and includes: S101-S113; the cascade network comprises a feature attention model and a residual echo and noise elimination model.

The feature attention model consists of a 1-layer gated recurrent unit (GRU) connected to a 1-layer fully-connected neural network. The GRU layer has 200 hidden nodes, the output layer of the fully-connected neural network has 257 nodes, and the activation function of each neuron is a Sigmoid function.

The residual echo and noise elimination model consists of a 2-layer gated recurrent unit (GRU) connected to a 1-layer fully-connected neural network. The 2-layer GRU has 400 hidden nodes, the output layer of the fully-connected neural network has 257 nodes, and the activation function of each neuron is a Sigmoid function.
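To make the two layer layouts concrete, the following NumPy sketch runs a forward pass through untrained networks of the stated sizes. Several points are assumptions rather than statements of the source: the standard GRU gate equations (with a tanh candidate state) are used, the 400 hidden nodes are taken to apply to each of the two layers, the attention model is assumed to receive one 514-dimensional spliced input at a time, and the class names `GRULayer` and `GRUSigmoidNet` are hypothetical.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


class GRULayer:
    """Single GRU layer with randomly initialised (untrained) weights."""

    def __init__(self, in_dim, hid_dim, rng):
        s = 1.0 / np.sqrt(hid_dim)
        # Stacked weights for the update (z), reset (r) and candidate (n) parts.
        self.W = rng.uniform(-s, s, (3, hid_dim, in_dim))
        self.U = rng.uniform(-s, s, (3, hid_dim, hid_dim))
        self.b = np.zeros((3, hid_dim))
        self.hid_dim = hid_dim

    def forward(self, x):
        """x: (frames, in_dim) -> hidden states (frames, hid_dim)."""
        h = np.zeros(self.hid_dim)
        out = []
        for x_t in x:
            z = sigmoid(self.W[0] @ x_t + self.U[0] @ h + self.b[0])
            r = sigmoid(self.W[1] @ x_t + self.U[1] @ h + self.b[1])
            n = np.tanh(self.W[2] @ x_t + r * (self.U[2] @ h) + self.b[2])
            h = (1.0 - z) * n + z * h
            out.append(h)
        return np.stack(out)


class GRUSigmoidNet:
    """Stack of GRU layers followed by a fully connected Sigmoid output layer."""

    def __init__(self, in_dim, hid_dim, n_layers, out_dim=257, seed=0):
        rng = np.random.default_rng(seed)
        dims = [in_dim] + [hid_dim] * n_layers
        self.layers = [GRULayer(dims[i], dims[i + 1], rng)
                       for i in range(n_layers)]
        self.W_out = rng.uniform(-0.1, 0.1, (out_dim, hid_dim))
        self.b_out = np.zeros(out_dim)

    def forward(self, x):
        for layer in self.layers:
            x = layer.forward(x)
        return sigmoid(x @ self.W_out.T + self.b_out)


# Input sizes are assumptions: 2 x 257 for one spliced pair of features,
# 3 x 257 for the fused splice fed to the elimination model.
attention_model = GRUSigmoidNet(in_dim=2 * 257, hid_dim=200, n_layers=1)
elimination_model = GRUSigmoidNet(in_dim=3 * 257, hid_dim=400, n_layers=2)

weights = attention_model.forward(np.random.default_rng(1).standard_normal((10, 514)))
mask = elimination_model.forward(np.random.default_rng(2).standard_normal((10, 771)))
```

Because both networks end in a Sigmoid layer with 257 nodes, each produces one value between 0 and 1 per frame and frequency bin, which fits their use as attention weights and as a masking estimate.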

S101, receiving a first voice time domain signal containing echo and noise, a first far-end reference sound time domain signal and a first target voice time domain signal. The first far-end reference sound time domain signal is subjected to nonlinear transformation and then convolved with a corresponding room transfer function to form an echo time domain signal in the first voice time domain signal containing echo and noise.

S102, performing framing, windowing and Fourier transformation on the received first voice time domain signal containing echo and noise, the first far-end reference sound time domain signal and the first target voice time domain signal. Specifically, for each of these three signals, 512 sampling points are taken as one frame signal; if a frame contains fewer than 512 sampling points, it is first zero-padded to 512 points. Each frame signal is then windowed, with a Hamming window as the windowing function. A Fourier transform is performed on each windowed frame signal to obtain a first voice frequency domain signal containing echo and noise, a first far-end reference audio frequency domain signal and a first target voice frequency domain signal.
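As a small illustration of the per-frame processing in S102, the sketch below zero-pads one short frame to 512 points, applies a Hamming window and takes the Fourier transform. The function name `frame_spectrum` is hypothetical, and using the one-sided transform `np.fft.rfft` is an assumption; it is chosen because a 512-point real frame then yields 257 frequency bins, which would be consistent with the 257 output nodes of the two models.

```python
import numpy as np

FRAME_LEN = 512  # frame length from the embodiment


def frame_spectrum(samples, frame_len=FRAME_LEN):
    """Zero-pad one frame to frame_len, apply a Hamming window, take the FFT."""
    frame = np.zeros(frame_len)
    frame[:len(samples)] = samples[:frame_len]
    return np.fft.rfft(frame * np.hamming(frame_len))


short_frame = np.ones(300)             # shorter than 512, so it is zero-padded
spectrum = frame_spectrum(short_frame)
print(spectrum.shape)                  # (257,)
```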

S103, inputting the first voice frequency domain signal containing echo and noise and the first far-end reference audio frequency domain signal into a Kalman filter, and estimating a first filter coefficient and a first echo frequency domain signal in real time. Wherein, the first echo frequency domain signal estimated by the Kalman filter is:

C(k,f) = W(k,f)*X(k,f)

where W(k,f) is the first filter coefficient, X(k,f) is the first far-end reference audio frequency domain signal, and k and f denote the k-th frame and the frequency f, respectively.

S104, subtracting the first echo frequency domain signal from the first voice frequency domain signal containing echo and noise to obtain a first voice frequency domain signal containing residual echo and noise, and using the first voice frequency domain signal containing residual echo and noise as an output result of the Kalman filter. The first voice frequency domain signal containing residual echo and noise is:

E(k,f) = Y(k,f) - C(k,f)

where Y(k,f) is the first voice frequency domain signal containing echo and noise, C(k,f) is the first echo frequency domain signal, and k and f denote the k-th frame and the frequency f, respectively.
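The two formulas of S103 and S104 can be checked numerically. In the sketch below the filter coefficients `W` are hypothetical fixed values standing in for the Kalman filter estimate (the Kalman update itself is not shown), and the microphone signal is built with an assumed echo path gain of 0.6, so that a mismatch between the estimate and the true path leaves residual echo in E(k,f).

```python
import numpy as np

rng = np.random.default_rng(0)
n_frames, n_bins = 4, 257
shape = (n_frames, n_bins)

# Hypothetical stand-ins; in the method, W(k,f) comes from the Kalman filter.
X = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)  # far-end reference
W = 0.5 * np.ones(shape, dtype=complex)                            # estimated filter
S = 0.1 * (rng.standard_normal(shape) + 0j)                        # near-end speech plus noise
Y = S + 0.6 * X                                                    # assumed true echo path gain 0.6

C = W * X   # S103: estimated echo, C(k,f) = W(k,f) * X(k,f)
E = Y - C   # S104: residual signal, E(k,f) = Y(k,f) - C(k,f)
```

Here `E` equals `S + 0.1 * X`: the near-end component plus the part of the echo that the filter did not remove, which is what the cascade network is then asked to suppress.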

S105, performing energy normalization processing on the amplitude spectrum of the first voice frequency domain signal containing the residual echo and the noise, the amplitude spectrum of the first echo frequency domain signal and the amplitude spectrum of the first far-end reference audio frequency domain signal to obtain a first voice frequency domain signal feature g_FD(f(|E(k,f)|)) containing the residual echo and the noise, a first echo frequency domain signal feature g_FD(f(|C(k,f)|)) and a first far-end reference audio frequency domain signal feature g_FD(f(|X(k,f)|)). Wherein,

the mean and variance of the first voice frequency domain signal feature containing residual echo and noise are respectively defined as:

μ_f(e)(k,f) = c₁μ_f(e)(k-1,f) + (1-c₁)f(|E(k,f)|)

the mean and variance of the first echo frequency domain signal feature are respectively defined as:

μ_f(c)(k,f) = c₁μ_f(c)(k-1,f) + (1-c₁)f(|C(k,f)|)

the mean and variance of the first far-end reference audio frequency domain signal feature are respectively defined as:

μ_f(x)(k,f) = c₁μ_f(x)(k-1,f) + (1-c₁)f(|X(k,f)|)

Here |E(k,f)|, |C(k,f)| and |X(k,f)| respectively represent the amplitude spectrum of the first voice frequency domain signal containing residual echo and noise, the amplitude spectrum of the first echo frequency domain signal and the amplitude spectrum of the first far-end reference audio frequency domain signal, and c₁ is a preset constant. Only the recursions for the means are reproduced above; the variance formulas are missing from this text.
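A recursive normalisation of this kind might be sketched as follows. Only the running-mean update is taken from the text. Everything else is an assumption made for illustration: the choice of a logarithm for f(.), the analogous recursion for the second moment, the form (value minus mean) divided by standard deviation for g_FD, the initialisation with the first frame, and the value of `c1`.

```python
import numpy as np


def energy_normalize(mag, c1=0.99, eps=1e-8):
    """Recursive per-bin normalisation of a magnitude spectrogram (frames x bins)."""
    feat = np.log(np.maximum(mag, eps))   # assumption: f(.) is a log compression
    mu = feat[0].copy()                   # assumption: initialise with frame 0
    second = feat[0] ** 2                 # running second moment (assumed)
    out = np.empty_like(feat)
    for k in range(feat.shape[0]):
        # Running mean from the text: mu(k,f) = c1*mu(k-1,f) + (1-c1)*f(|E(k,f)|)
        mu = c1 * mu + (1.0 - c1) * feat[k]
        # Assumed analogous recursion for the second moment.
        second = c1 * second + (1.0 - c1) * feat[k] ** 2
        var = np.maximum(second - mu ** 2, eps)
        # Assumed form of g_FD: subtract the mean, divide by the std deviation.
        out[k] = (feat[k] - mu) / np.sqrt(var)
    return out
```

Under these assumptions, multiplying the input magnitudes by a constant only shifts the log features and their running mean by the same amount, so the normalised features are essentially unchanged, which is the kind of energy insensitivity the normalisation is meant to provide.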

S106, splicing the first voice frequency domain signal feature containing the residual echo and the noise with the first far-end reference audio frequency domain signal feature to obtain a first splicing feature, and splicing the first voice frequency domain signal feature containing the residual echo and the noise with the first echo frequency domain signal feature to obtain a second splicing feature.

S107, inputting the first splicing feature and the second splicing feature into the feature attention model in the cascade network, so as to jointly train the feature attention model and the residual echo and noise elimination model in the cascade network, and obtaining a first weight α(k,f) corresponding to the first far-end reference audio frequency domain signal feature and a second weight β(k,f) corresponding to the first echo frequency domain signal feature.

S108, multiplying the first far-end reference audio frequency domain signal feature by the first weight to obtain a first fused feature:

X_att(k,f) = g_FD(f(|X(k,f)|))*α(k,f)

and multiplying the first echo frequency domain signal feature by the second weight to obtain a second fused feature:

C_att(k,f) = g_FD(f(|C(k,f)|))*β(k,f).

S109, splicing the first fused feature X_att(k,f), the second fused feature C_att(k,f) and the first voice frequency domain signal feature g_FD(f(|E(k,f)|)) containing residual echo and noise to obtain a first fusion splicing feature.
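Steps S106 to S109 amount to a few array operations once the features and weights are available. The sketch below uses random arrays as hypothetical stand-ins for the normalised features and for the attention weights, and it assumes that splicing means concatenation along the frequency (feature) axis, which the text does not state explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)
n_frames, n_bins = 6, 257

# Hypothetical normalised features g_FD(.) for E, C and X (random stand-ins).
g_E = rng.standard_normal((n_frames, n_bins))
g_C = rng.standard_normal((n_frames, n_bins))
g_X = rng.standard_normal((n_frames, n_bins))

# S106: the two spliced inputs of the feature attention model.
first_splice = np.concatenate([g_E, g_X], axis=1)
second_splice = np.concatenate([g_E, g_C], axis=1)

# Stand-ins for the attention weights; a Sigmoid output lies in (0, 1).
alpha = rng.uniform(0.0, 1.0, (n_frames, n_bins))
beta = rng.uniform(0.0, 1.0, (n_frames, n_bins))

X_att = g_X * alpha                                   # S108: first fused feature
C_att = g_C * beta                                    # S108: second fused feature
fused = np.concatenate([X_att, C_att, g_E], axis=1)   # S109: fusion splice
print(fused.shape)                                    # (6, 771)
```

The feature of the residual signal itself is passed through unweighted; only the two auxiliary features are scaled, so a weight near 0 effectively removes an uninformative bin of the reference or echo feature from the model input.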

S110, inputting the first fusion splicing feature into the residual echo and noise elimination model in the cascade network, the output of which is the masking estimation value G(k,f) of a second target voice frequency domain signal.

S111, enhancing the first voice frequency domain signal E(k,f) containing residual echo and noise with the masking estimation value G(k,f) of the second target voice frequency domain signal to obtain the second target voice frequency domain signal (the formula is missing from this text).
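One common way to apply a masking estimate is element-wise multiplication with the complex spectrum, which scales the magnitude and reuses the phase of the residual signal. The sketch below shows that operation; it is an assumption about how G(k,f) is applied, since the formula is not reproduced in the text, and the arrays are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
shape = (4, 257)

E = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)  # residual spectrum E(k,f)
G = rng.uniform(0.0, 1.0, shape)                                   # masking estimate G(k,f)

# Assumption: the mask scales the magnitude; the phase of E(k,f) is reused.
S_hat = G * E
```

Since the Sigmoid output layer keeps G(k,f) between 0 and 1, a mask applied this way can only attenuate each time-frequency bin, never amplify it.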

S112, determining a multi-domain loss function according to the at least two loss functions. For example, an energy-independent amplitude spectrum loss function is determined from the second target voice frequency domain signal, taking the amplitude spectrum of the first target voice frequency domain signal S(k,f) as the training target, and an objective voice quality evaluation score loss function is determined, taking the improvement of perceived voice quality as the training target. The energy-independent amplitude spectrum loss function and the objective voice quality evaluation score loss function are weighted and added to give the multi-domain loss function, wherein λ is a preset constant. The formulas for the two loss functions and for their weighted sum are missing from this text.
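The motivation for an energy-independent loss can be illustrated numerically. In the sketch below, the plain mean square error grows with the square of the signal scale, while a loss normalised by the target energy does not. The normalised loss and the form of the weighted sum are hypothetical examples, not the formulas of the application, which are not reproduced in the text; `quality_loss` is a placeholder number standing in for the objective voice quality term.

```python
import numpy as np


def mse_magnitude_loss(target_mag, est_mag):
    """Plain mean squared error between magnitude spectra."""
    return np.mean((target_mag - est_mag) ** 2)


def energy_normalized_loss(target_mag, est_mag, eps=1e-12):
    """Hypothetical energy-independent variant: error energy / target energy."""
    return np.sum((target_mag - est_mag) ** 2) / (np.sum(target_mag ** 2) + eps)


def multi_domain_loss(mag_loss, quality_loss, lam=0.5):
    """Assumed form of the weighted sum; lam stands in for the preset constant."""
    return mag_loss + lam * quality_loss


rng = np.random.default_rng(0)
target = np.abs(rng.standard_normal((20, 257)))
estimate = target + 0.1 * rng.standard_normal((20, 257))

mse_quiet = mse_magnitude_loss(target, estimate)
mse_loud = mse_magnitude_loss(10 * target, 10 * estimate)      # 100x larger
norm_quiet = energy_normalized_loss(target, estimate)
norm_loud = energy_normalized_loss(10 * target, 10 * estimate)  # unchanged

total = multi_domain_loss(norm_quiet, quality_loss=0.3)  # 0.3 is a placeholder
```

With a plain mean square error, loud utterances would dominate the gradient during training; a scale-invariant term avoids that, which matches the stated goal of reducing the model's sensitivity to signal energy.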

S113, iteratively reducing the multi-domain loss function by continuously updating the model parameters to obtain the trained cascade network. The trained cascade network comprises a trained feature attention model and a trained residual echo and noise elimination model.

In the embodiment of the present application, a schematic flow chart of using the trained cascade network to eliminate residual echo and noise is shown in fig. 2, and includes: S201-S207;

S201, receiving a voice time domain signal containing echo and noise and a far-end reference sound time domain signal. The far-end reference sound time domain signal is subjected to nonlinear transformation and then convolved with a corresponding room transfer function to form the echo time domain signal in the voice time domain signal containing the echo and the noise.

S202, performing framing, windowing and Fourier transformation on the voice time domain signal containing the echo and the noise and on the far-end reference sound time domain signal. Specifically, for each of the two received signals, 512 sampling points are taken as one frame signal; if a frame contains fewer than 512 sampling points, it is first zero-padded to 512 points. Each frame signal is then windowed, with a Hamming window as the windowing function. A Fourier transform is performed on each windowed frame signal to obtain a voice frequency domain signal containing echo and noise and a far-end reference audio frequency domain signal.

S203, inputting the voice frequency domain signal containing the echo and the noise and the far-end reference audio frequency domain signal into a Kalman filter, and estimating a filter coefficient and an echo frequency domain signal in real time. The echo frequency domain signal estimated by the Kalman filter is as follows:

C₃(k,f) = W₃(k,f)*X₃(k,f)

wherein W₃(k, f) is the first filter coefficient, X₃(k, f) is the far-end reference audio frequency domain signal, and k and f represent the k-th frame and the frequency f, respectively.

S204, subtracting the echo frequency domain signal from the voice frequency domain signal containing echo and noise to obtain a voice frequency domain signal containing residual echo and noise:

E₃(k,f) = Y₃(k,f) - C₃(k,f)

wherein Y₃(k, f) is the voice frequency domain signal containing echo and noise, C₃(k, f) is the echo frequency domain signal, and k and f represent the k-th frame and the frequency f, respectively.
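The two relations of S203 and S204 can be sketched as below. This is only an illustration: the filter coefficient W is assumed to be already estimated (the Kalman update itself is not shown), the "*" in the echo estimate is assumed to denote a per-frame, per-bin product, and the function name is hypothetical.

```python
import numpy as np


def residual_after_linear_echo_estimate(Y, X, W):
    """Return (C, E) with C = W * X and E = Y - C, indexed by (frame k, bin f).

    Assumption: W has already been estimated by the Kalman filter.
    """
    C = W * X  # estimated echo frequency domain signal
    E = Y - C  # voice frequency domain signal containing residual echo and noise
    return C, E
```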

S205, performing energy normalization processing on the magnitude spectrum of the voice frequency domain signal containing the residual echo and the noise, the magnitude spectrum of the echo frequency domain signal and the magnitude spectrum of the far-end reference audio frequency domain signal to obtain the voice frequency domain signal feature containing the residual echo and the noise

the echo frequency domain signal feature

and the far-end reference audio frequency domain signal feature

wherein

|E₃(k, f)|, |C₃(k, f)| and |X₃(k, f)| respectively represent the magnitude spectrum of the voice frequency domain signal containing residual echo and noise, the magnitude spectrum of the echo frequency domain signal and the magnitude spectrum of the far-end reference audio frequency domain signal, and c₂ is a preset constant.

S206, splicing the voice frequency domain signal feature containing the residual echo and the noise with the far-end reference audio frequency domain signal feature to obtain a first splicing result, and splicing the voice frequency domain signal feature containing the residual echo and the noise with the echo frequency domain signal feature to obtain a second splicing result.

S207, inputting the first splicing result and the second splicing result into the trained feature attention model in the trained cascade network to obtain a first attention weight α₃(k, f) corresponding to the far-end reference audio frequency domain signal feature and a second attention weight β₃(k, f) corresponding to the echo frequency domain signal feature.

S208, multiplying the far-end reference audio frequency domain signal feature by the first attention weight α₃(k, f) to obtain a first fused attention mechanism feature:

and multiplying the echo frequency domain signal feature by the second attention weight β₃(k, f) to obtain a second fused attention mechanism feature:

S209, splicing the first fused attention mechanism feature, the second fused attention mechanism feature and the voice frequency domain signal feature containing the residual echo and the noise to obtain a first fusion splicing result.

S210, inputting the first fusion splicing result into the trained residual echo and noise elimination model in the trained cascade network to obtain a masking estimation value G₃(k, f) of the target voice frequency domain signal.

S211, multiplying the masking estimation value G₃(k, f) of the target voice frequency domain signal by the voice frequency domain signal E₃(k, f) containing residual echo and noise to obtain the target voice frequency domain signal:

S212, performing inverse Fourier transform on the target voice frequency domain signal to obtain a target voice time domain signal.
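The data flow of S206 to S212 can be sketched as follows. This is a rough illustration only. The callables attention_model and mask_model are hypothetical stand-ins for the trained feature attention model and the trained residual echo and noise elimination model, the toy versions exist only so that the sketch runs, and splicing is assumed to mean concatenation along the frequency axis.

```python
import numpy as np


def cascade_inference(E, feat_e, feat_c, feat_x, attention_model, mask_model):
    """Data flow of S206-S211 for arrays shaped (frames, bins).

    attention_model and mask_model are hypothetical callables, not the
    networks of the embodiment.
    """
    first_splice = np.concatenate([feat_e, feat_x], axis=-1)     # S206
    second_splice = np.concatenate([feat_e, feat_c], axis=-1)    # S206
    alpha, beta = attention_model(first_splice, second_splice)   # S207
    fused_x = alpha * feat_x                                     # S208
    fused_c = beta * feat_c                                      # S208
    fused = np.concatenate([fused_x, fused_c, feat_e], axis=-1)  # S209
    G = mask_model(fused)                                        # S210
    return G * E                                                 # S211


def to_time_domain(S, frame_len=512):
    """S212: inverse FFT of each frame (still windowed, not yet reassembled)."""
    return np.fft.irfft(S, n=frame_len, axis=-1)


# Toy stand-ins (assumptions): constant attention weights and an all-pass mask.
def toy_attention(first_splice, second_splice):
    shape = first_splice.shape[:-1] + (first_splice.shape[-1] // 2,)
    return np.full(shape, 0.5), np.full(shape, 0.25)


def toy_mask(fused):
    return np.ones(fused.shape[:-1] + (fused.shape[-1] // 3,))
```

With the all-pass toy mask, the output equals the input E, which makes the sketch easy to check; a trained model would instead return a mask that attenuates the bins dominated by residual echo and noise.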

In the embodiment of the application, the residual echo and the noise do not need to be eliminated in separate stages; both are eliminated in a single pass through the trained cascade network. The trained feature attention model assigns different importance to the input features, which reduces redundant information in the input features and improves the residual echo and noise elimination performance of the cascade network. The cascade network is trained with a multi-domain loss function that combines the energy-independent amplitude spectrum loss function and the objective voice quality evaluation score loss function, which reduces the sensitivity of the model to signal energy and improves the auditory perception quality of the output voice.

An embodiment of the present application provides a residual echo and noise cancellation device, a schematic structural diagram of which is shown in fig. 3, including:

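a receiving module 301, a processing module 302, a determining module 303, an energy normalization module 304, a splicing module 305, a weight obtaining module 306, a fused attention mechanism feature obtaining module 307, a masking estimation value obtaining module 308, a target voice frequency domain signal obtaining module 309 and an inverse Fourier transform module 310;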

a receiving module 301, configured to receive a voice time domain signal containing echo and noise and a far-end reference sound time domain signal;

a processing module 302, configured to perform framing, windowing, and Fourier transform on the voice time domain signal containing the echo and the noise and the far-end reference sound time domain signal, respectively, to obtain a voice frequency domain signal containing the echo and the noise and a far-end reference audio frequency domain signal;

a determining module 303, configured to determine an echo frequency domain signal according to the voice frequency domain signal containing echo and noise and the far-end reference audio frequency domain signal;

the determining module 303 is further configured to determine a voice frequency domain signal containing residual echo and noise according to the voice frequency domain signal containing echo and noise and the echo frequency domain signal;

an energy normalization module 304, configured to perform energy normalization processing on the magnitude spectrum of the voice frequency domain signal containing the residual echo and the noise, the magnitude spectrum of the echo frequency domain signal, and the magnitude spectrum of the far-end reference audio frequency domain signal, so as to obtain a voice frequency domain signal feature containing the residual echo and the noise, an echo frequency domain signal feature, and a far-end reference audio frequency domain signal feature;

a splicing module 305, configured to splice the voice frequency domain signal feature containing the residual echo and the noise with the far-end reference audio frequency domain signal feature to obtain a first splicing result, and splice the voice frequency domain signal feature containing the residual echo and the noise with the echo frequency domain signal feature to obtain a second splicing result;

a weight obtaining module 306, configured to input the first splicing result and the second splicing result into a trained feature attention model in the trained cascade network, and obtain a first attention weight corresponding to the far-end reference audio frequency domain signal feature and a second attention weight corresponding to the echo frequency domain signal feature;

a fused attention mechanism feature obtaining module 307, configured to multiply the far-end reference audio frequency domain signal feature by the first attention weight to obtain a first fused attention mechanism feature, and multiply the echo frequency domain signal feature by the second attention weight to obtain a second fused attention mechanism feature;

the splicing module 305 is further configured to splice the first fusion attention mechanism feature, the second fusion attention mechanism feature, and the voice frequency domain signal feature containing the residual echo and the noise to obtain a first fusion splicing result;

a masking estimation value obtaining module 308, configured to input the first fusion splicing result into the trained residual echo and noise elimination model in the trained cascade network, so as to obtain a masking estimation value of the target voice frequency domain signal;

a target voice frequency domain signal obtaining module 309, configured to obtain the target voice frequency domain signal according to the masking estimation value of the target voice frequency domain signal and the voice frequency domain signal containing the residual echo and the noise;

and an inverse Fourier transform module 310, configured to perform inverse Fourier transform on the target voice frequency domain signal to obtain a target voice time domain signal.

An embodiment of the present application provides a residual echo and noise cancellation apparatus, including at least one processor, where the processor is configured to execute a program stored in a memory, and when the program is executed, the apparatus is enabled to perform the following steps:

An embodiment of the application provides a non-transitory computer readable storage medium, on which a computer program is stored, the computer program, when executed by a processor, implementing the steps of:

The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.

It should be noted that: the above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims

1. A method for residual echo and noise cancellation, comprising:

inputting the first splicing result and the second splicing result into a trained feature attention model in a trained cascade network to obtain a first attention weight corresponding to the far-end reference audio domain signal feature and a second attention weight corresponding to the echo frequency domain signal feature;

2. The method of claim 1, wherein the framing, windowing, and Fourier transforming of the voice time domain signal containing echo and noise and the far-end reference sound time domain signal, respectively, comprises:

and performing Fourier transform on each windowed frame signal.

3. The method according to claim 1, wherein determining an echo frequency domain signal from the echo and noise containing speech frequency domain signal and the far-end reference audio frequency domain signal comprises:

4. The method of claim 1, wherein determining the voice frequency domain signal containing the residual echo and the noise according to the voice frequency domain signal containing the echo and the noise and the echo frequency domain signal comprises:

5. The method according to claim 1, wherein the energy normalization processing is performed on the magnitude spectrum of the speech frequency domain signal containing the residual echo and the noise, the magnitude spectrum of the echo frequency domain signal, and the magnitude spectrum of the far-end reference audio domain signal to obtain the speech frequency domain signal characteristic containing the residual echo and the noise, the echo frequency domain signal characteristic, and the far-end reference audio domain signal characteristic, and comprises:

6. The method of claim 1, wherein the trained cascade network is trained by:

7. A residual echo and noise cancellation apparatus, comprising:

a weight obtaining module, configured to input the first splicing result and the second splicing result into a trained feature attention model in a trained cascade network, and obtain a first attention weight corresponding to the far-end reference audio-domain signal feature and a second attention weight corresponding to the echo frequency-domain signal feature;

8. A residual echo and noise cancellation apparatus comprising at least one processor configured to execute a program stored in a memory, the program, when executed, causing the apparatus to perform:

the method of any one of claims 1-6.

9. A non-transitory computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-6.