CN112863535A - Residual echo and noise elimination method and device - Google Patents

Residual echo and noise elimination method and device

Info

Publication number
CN112863535A
Authority
CN
China
Prior art keywords
domain signal
frequency domain
echo
noise
far
Prior art date
Legal status (assumed, not a legal conclusion): Granted
Application number
CN202110008502.9A
Other languages
Chinese (zh)
Other versions
CN112863535B (en)
Inventors
Junfeng Li (李军锋)
Jianjun Gu (顾建军)
Yonghong Yan (颜永红)
Current Assignee
Institute of Acoustics CAS
Original Assignee
Institute of Acoustics CAS
Priority date
Filing date
Publication date
Application filed by Institute of Acoustics CAS filed Critical Institute of Acoustics CAS
Priority to CN202110008502.9A
Publication of CN112863535A
Application granted
Publication of CN112863535B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208: Noise filtering
    • G10L21/0216: Noise filtering characterised by the method used for estimating noise
    • G10L21/0232: Processing in the frequency domain
    • G10L2021/02082: Noise filtering the noise being echo, reverberation of the speech


Abstract

The embodiment of the application discloses a method and a device for eliminating residual echo and noise. The method comprises the following steps: performing framing, windowing and Fourier transformation on a received voice time domain signal containing echo and noise and on a far-end reference sound time domain signal to obtain the corresponding frequency domain signals; determining an echo frequency domain signal, and from it the voice frequency domain signal containing residual echo and noise; performing energy normalization on the amplitude spectra of the voice frequency domain signal containing residual echo and noise, the echo frequency domain signal and the far-end reference audio frequency domain signal to obtain the corresponding features; determining a target voice frequency domain signal from these features with a trained cascade network; and performing an inverse Fourier transform on the target voice frequency domain signal to obtain a target voice time domain signal. In the embodiment of the application, a feature attention model gives the input features different degrees of importance and reduces redundant information in them, and the cascade network is trained with a multi-domain loss function, which reduces the sensitivity of the model to signal energy.

Description

Residual echo and noise elimination method and device
Technical Field
The present invention relates to the field of echo and noise cancellation, and more particularly to a method and apparatus for residual echo and noise cancellation.
Background
At present, echo cancellation technology mainly removes the echo signal formed by a far-end reference sound signal from a speech signal, while speech noise reduction technology mainly removes background noise and directional noise interference from a speech signal. Both aim to improve the quality and intelligibility of speech. In echo cancellation, combining an adaptive filtering method based on traditional signal processing with a residual echo cancellation method based on deep learning can effectively improve the generalization performance of the system.
However, in conventional methods, residual echo cancellation and noise cancellation are performed independently and separately, without considering the correlation between the two tasks. A number of signal features with different physical meanings and different degrees of importance are available in the residual echo cancellation task, but conventional methods do not take the differing importance of these features into account. When training a residual echo and noise elimination model, the prior art mostly adopts the mean square error between the target amplitude spectrum and the estimated amplitude spectrum as the loss function; however, this loss function depends on the signal energy, and signals with different energies have different scales.
Disclosure of Invention
To address the above problems in existing methods, the present application provides a method and an apparatus for eliminating residual echo and noise.
In a first aspect, an embodiment of the present application provides a residual echo and noise cancellation method, including:
receiving a voice time domain signal containing echo and noise and a far-end reference sound time domain signal;
respectively performing framing, windowing and Fourier transformation on the voice time domain signal containing the echo and the noise and the far-end reference sound time domain signal to obtain a voice frequency domain signal containing the echo and the noise and a far-end reference sound frequency domain signal;
determining an echo frequency domain signal according to the voice frequency domain signal containing the echo and the noise and the far-end reference audio frequency domain signal;
determining a voice frequency domain signal containing residual echo and noise according to the voice frequency domain signal containing echo and noise and the echo frequency domain signal;
performing energy normalization processing on the amplitude spectrum of the voice frequency domain signal containing the residual echo and the noise, the amplitude spectrum of the echo frequency domain signal and the amplitude spectrum of the far-end reference audio frequency domain signal to obtain a voice frequency domain signal characteristic containing the residual echo and the noise, an echo frequency domain signal characteristic and a far-end reference audio frequency domain signal characteristic;
splicing the voice frequency domain signal characteristics containing the residual echo and the noise with the far-end reference audio domain signal characteristics to obtain a first splicing result, and splicing the voice frequency domain signal characteristics containing the residual echo and the noise with the echo frequency domain signal characteristics to obtain a second splicing result;
inputting the first splicing result and the second splicing result into a trained feature attention model in the trained cascade network to obtain a first attention weight corresponding to the far-end reference audio domain signal feature and a second attention weight corresponding to the echo frequency domain signal feature;
multiplying the far-end reference audio frequency domain signal characteristic by a first attention weight to obtain a first fused attention mechanism characteristic, and multiplying the echo frequency domain signal characteristic by a second attention weight to obtain a second fused attention mechanism characteristic;
splicing the first fusion attention mechanism characteristic, the second fusion attention mechanism characteristic and the voice frequency domain signal characteristic containing residual echo and noise to obtain a first fusion splicing result;
inputting the first fusion splicing result into a trained residual echo and noise elimination model in the trained cascade network to obtain a masking estimation value of a target voice frequency domain signal;
obtaining the target voice frequency domain signal according to the masking estimation value of the target voice frequency domain signal and the voice frequency domain signal containing the residual echo and the noise;
and carrying out inverse Fourier transform on the target voice frequency domain signal to obtain a target voice time domain signal.
In one possible implementation, the framing, windowing, and fourier transforming the echo and noise containing speech time domain signal and the far-end reference acoustic time domain signal respectively includes:
respectively taking a preset number of sampling points as one frame of signal for the voice time domain signal containing the echo and the noise and the far-end reference sound time domain signal; if a frame is shorter than the preset number, it is first zero-padded to that number;
windowing each frame signal; wherein, the windowing function adopts a Hamming window;
and performing Fourier transform on each windowed frame signal.
In one possible implementation, the determining an echo frequency domain signal according to the echo and noise-containing speech frequency domain signal and the far-end reference audio frequency domain signal includes:
inputting the voice frequency domain signal containing the echo and the noise and the far-end reference audio frequency domain signal into a Kalman filter to obtain a filter coefficient and the echo frequency domain signal;
the echo frequency domain signal is a result of multiplying the filter coefficient and the far-end reference sound frequency domain signal.
In one possible implementation, the determining a voice frequency domain signal containing residual echo and noise according to the voice frequency domain signal containing echo and noise and the echo frequency domain signal includes:
and subtracting the echo frequency domain signal from the voice frequency domain signal containing the echo and the noise to obtain the voice frequency domain signal containing the residual echo and the noise.
In a possible implementation, the performing energy normalization processing on the amplitude spectrum of the voice frequency domain signal containing the residual echo and the noise, the amplitude spectrum of the echo frequency domain signal, and the amplitude spectrum of the far-end reference audio domain signal to obtain a voice frequency domain signal feature containing the residual echo and the noise, an echo frequency domain signal feature, and a far-end reference audio domain signal feature includes:
respectively determining a first function, a second function and a third function corresponding to the amplitude spectrum of the voice frequency domain signal containing the residual echo and the noise, the amplitude spectrum of the echo frequency domain signal and the amplitude spectrum of the far-end reference audio frequency domain signal;
determining the voice frequency domain signal characteristics containing the residual echo and the noise according to a first function corresponding to the amplitude spectrum of the voice frequency domain signal containing the residual echo and the noise, and the mean value and the variance of the voice frequency domain signal characteristics containing the residual echo and the noise;
determining the echo frequency domain signal characteristics according to a second function corresponding to the amplitude spectrum of the echo frequency domain signal, and the mean value and the variance of the echo frequency domain signal characteristics;
and determining the characteristics of the far-end reference audio frequency domain signal according to a third function corresponding to the amplitude spectrum of the far-end reference audio frequency domain signal and the mean value and the variance of the characteristics of the far-end reference audio frequency domain signal.
In one possible implementation, the trained cascade network is trained by:
receiving a first voice time domain signal containing echo and noise, a first far-end reference sound time domain signal and a first target voice time domain signal;
performing framing, windowing and Fourier transformation on the first voice time domain signal containing the echo and the noise, the first far-end reference sound time domain signal and the first target voice time domain signal respectively to obtain a first voice frequency domain signal containing the echo and the noise, a first far-end reference audio frequency domain signal and a first target voice frequency domain signal;
determining a first echo frequency domain signal according to the first voice frequency domain signal containing echo and noise and the first far-end reference audio frequency domain signal;
determining a first voice frequency domain signal containing residual echo and noise according to the first voice frequency domain signal containing echo and noise and the first echo frequency domain signal;
performing energy normalization processing on the amplitude spectrum of the first voice frequency domain signal containing residual echo and noise, the amplitude spectrum of the first echo frequency domain signal and the amplitude spectrum of the first far-end reference audio frequency domain signal to obtain a first voice frequency domain signal characteristic containing residual echo and noise, a first echo frequency domain signal characteristic and a first far-end reference audio frequency domain signal characteristic;
splicing the first voice frequency domain signal characteristic containing residual echo and noise with the first far-end reference audio domain signal characteristic to obtain a first splicing characteristic, and splicing the first voice frequency domain signal characteristic containing residual echo and noise with the first echo frequency domain signal characteristic to obtain a second splicing characteristic;
inputting the first splicing feature and the second splicing feature into a feature attention model in a cascade network so as to jointly train the feature attention model and a residual echo and noise elimination model in the cascade network, and obtaining a first weight corresponding to the first far-end reference audio domain signal feature and a second weight corresponding to the first echo frequency domain signal feature;
multiplying the first far-end reference audio domain signal characteristic by a first weight to obtain a first fused characteristic, and multiplying the first echo frequency domain signal characteristic by a second weight to obtain a second fused characteristic;
splicing the first fusion characteristic, the second fusion characteristic and the first voice frequency domain signal characteristic containing residual echo and noise to obtain a first fusion splicing characteristic;
inputting the first fusion splicing characteristic into a residual echo and noise elimination model in the cascade network to obtain a masking estimation value of a second target voice frequency domain signal;
determining a second target voice frequency domain signal according to the masking estimation value of the second target voice frequency domain signal and the first voice frequency domain signal containing residual echo and noise;
determining a multi-domain loss function according to at least two loss functions, wherein the at least two loss functions comprise an energy-independent amplitude spectrum loss function and an objective speech quality assessment score loss function; the energy-independent amplitude spectrum loss function takes the amplitude spectrum of the first target voice frequency domain signal as the training target and is determined according to the second target voice frequency domain signal; the objective speech quality assessment score loss function takes the improvement of the auditory quality of the speech as its training target;
and iteratively reducing the multi-domain loss function by continuously updating the model parameters to obtain the trained cascade network.
In a second aspect, an embodiment of the present application further provides a residual echo and noise cancellation apparatus, including:
the receiving module is used for receiving a voice time domain signal containing echo and noise and a far-end reference sound time domain signal;
the processing module is used for respectively performing framing, windowing and Fourier transform on the voice time domain signal containing the echo and the noise and the far-end reference sound time domain signal to obtain a voice frequency domain signal containing the echo and the noise and a far-end reference sound frequency domain signal;
the determining module is used for determining an echo frequency domain signal according to the voice frequency domain signal containing the echo and the noise and the far-end reference audio frequency domain signal;
the determining module is further configured to determine a voice frequency domain signal containing residual echo and noise according to the voice frequency domain signal containing echo and noise and the echo frequency domain signal;
the energy normalization module is used for carrying out energy normalization processing on the amplitude spectrum of the voice frequency domain signal containing the residual echo and the noise, the amplitude spectrum of the echo frequency domain signal and the amplitude spectrum of the far-end reference audio frequency domain signal to obtain a voice frequency domain signal characteristic containing the residual echo and the noise, an echo frequency domain signal characteristic and a far-end reference audio frequency domain signal characteristic;
the splicing module is used for splicing the voice frequency domain signal characteristics containing the residual echo and the noise with the far-end reference audio domain signal characteristics to obtain a first splicing result, and splicing the voice frequency domain signal characteristics containing the residual echo and the noise with the echo frequency domain signal characteristics to obtain a second splicing result;
a weight obtaining module, configured to input the first splicing result and the second splicing result into a trained feature attention model in the trained cascade network, and obtain a first attention weight corresponding to the far-end reference audio-domain signal feature and a second attention weight corresponding to the echo frequency-domain signal feature;
a fused attention mechanism feature obtaining module, configured to multiply the far-end reference audio-domain signal feature by a first attention weight to obtain a first fused attention mechanism feature, and multiply the echo frequency-domain signal feature by a second attention weight to obtain a second fused attention mechanism feature;
the splicing module is further configured to splice the first fusion attention mechanism feature, the second fusion attention mechanism feature and the voice frequency domain signal feature containing the residual echo and the noise to obtain a first fusion splicing result;
a masking estimation value obtaining module, configured to input the first fusion splicing result into the trained residual echo and noise cancellation model in the trained cascade network, so as to obtain a masking estimation value of the target voice frequency domain signal;
a target voice frequency domain signal obtaining module, configured to obtain a target voice frequency domain signal according to the masking estimation value of the target voice frequency domain signal and the voice frequency domain signal containing the residual echo and the noise;
and the inverse Fourier transform module is used for performing inverse Fourier transform on the target voice frequency domain signal to obtain a target voice time domain signal.
In a third aspect, an embodiment of the present application further provides a residual echo and noise cancellation apparatus, including at least one processor, where the processor is configured to execute a program stored in a memory, and when the program is executed, the apparatus is caused to perform the steps as in the first aspect and in various possible implementations.
In a fourth aspect, embodiments of the present application further propose a non-transitory computer-readable storage medium having stored thereon a computer program, which when executed by a processor implements the steps as in the first aspect and various possible implementations.
The beneficial effect of the embodiment of the application is that the residual echo and the noise do not need to be eliminated separately; both are eliminated in a single pass through the trained cascade network. The trained feature attention model gives the input features different degrees of importance and reduces redundant information in them, improving the residual echo and noise elimination performance of the cascade network. Training the cascade network with a multi-domain loss function combining an energy-independent amplitude spectrum loss function and an objective speech quality assessment score loss function reduces the sensitivity of the model to signal energy and improves the auditory perception quality of the output speech.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, it is obvious that the drawings in the following description are only some embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
Fig. 1 is a schematic process diagram of training a cascade network capable of eliminating residual echo and noise according to an embodiment of the present disclosure;
fig. 2 is a schematic flowchart illustrating a process of eliminating residual echo and noise by using a trained cascade network according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of a residual echo and noise cancellation device according to an embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application. The following examples are only for illustrating the technical solutions of the present application more clearly, and the protection scope of the present application is not limited thereby.
In the traditional approach, residual echo elimination and noise elimination are carried out independently and separately, and the respective importance of the multiple signal features available in the residual echo suppression task is not taken into account. When training a residual echo and noise elimination model, the mean square error between the target amplitude spectrum and the estimated amplitude spectrum is mostly adopted as the loss function, and this loss function depends on the signal energy. Therefore, the application provides a method and a device for eliminating residual echo and noise, which assign different degrees of importance to the multiple signal features, reduce redundant information in them, and train the cascade network with a multi-domain loss function, thereby reducing the sensitivity of the model to signal energy.
In the embodiment of the present application, a schematic process diagram of training a cascade network capable of eliminating residual echo and noise is shown in fig. 1, and includes: S101-S113; the cascade network comprises a feature attention model and a residual echo and noise elimination model.
The feature attention model is formed by connecting one gated recurrent unit (GRU) layer with one fully-connected neural network layer. The GRU layer has 200 hidden layer nodes, the fully-connected output layer has 257 nodes, and the activation function of each neuron is a Sigmoid function.
The residual echo and noise elimination model is formed by connecting two GRU layers with one fully-connected neural network layer. The GRU layers have 400 hidden layer nodes, the fully-connected output layer has 257 nodes, and the activation function of each neuron is a Sigmoid function.
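The two models described above can be sketched as a small NumPy forward pass. This is an untrained, illustrative skeleton: the GRU gate equations follow the common convention, all weights are random placeholders, and the input dimension is an assumption (the text does not state it); it is not the patent's implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class GruSigmoidNet:
    """One GRU layer followed by a fully-connected sigmoid output layer."""

    def __init__(self, in_dim, hidden=200, out_dim=257, seed=0):
        rng = np.random.default_rng(seed)
        def w(*shape):  # small random placeholder weights (untrained)
            return 0.1 * rng.standard_normal(shape)
        self.Wz, self.Uz, self.bz = w(hidden, in_dim), w(hidden, hidden), np.zeros(hidden)
        self.Wr, self.Ur, self.br = w(hidden, in_dim), w(hidden, hidden), np.zeros(hidden)
        self.Wn, self.Un, self.bn = w(hidden, in_dim), w(hidden, hidden), np.zeros(hidden)
        self.Wo, self.bo = w(out_dim, hidden), np.zeros(out_dim)
        self.hidden = hidden

    def forward(self, frames):
        """frames: (num_frames, in_dim) -> outputs in (0, 1), one row per frame."""
        h = np.zeros(self.hidden)
        out = []
        for x in frames:
            z = sigmoid(self.Wz @ x + self.Uz @ h + self.bz)        # update gate
            r = sigmoid(self.Wr @ x + self.Ur @ h + self.br)        # reset gate
            n = np.tanh(self.Wn @ x + self.Un @ (r * h) + self.bn)  # candidate state
            h = (1.0 - z) * n + z * h                               # GRU state update
            out.append(sigmoid(self.Wo @ h + self.bo))              # sigmoid output layer
        return np.array(out)
```

Under these assumptions the feature attention model corresponds to GruSigmoidNet(in_dim=2*257, hidden=200, out_dim=257); the residual echo and noise elimination model would stack a second GRU layer with 400 hidden nodes in front of the same sigmoid output layer.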
S101, receiving a first voice time domain signal containing echo and noise, a first far-end reference sound time domain signal and a first target voice time domain signal. The first far-end reference sound time domain signal is subjected to nonlinear transformation and then convolved with a corresponding room transfer function to form an echo time domain signal in the first voice time domain signal containing echo and noise.
S102, framing and windowing are carried out on the received first voice time domain signal containing the echo and the noise, the first far-end reference sound time domain signal and the first target voice time domain signal. Specifically, 512 sampling points are taken as one frame of each of the three signals; if a frame is shorter than 512 points, it is first zero-padded to 512 points. Each frame is then windowed, with a Hamming window as the windowing function. Fourier transformation of each windowed frame yields the first voice frequency domain signal containing echo and noise, the first far-end reference audio frequency domain signal and the first target voice frequency domain signal.
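A minimal sketch of this framing, zero-padding, Hamming windowing and Fourier transform step, assuming non-overlapping 512-sample frames (the hop size is not stated in the text):

```python
import numpy as np

def stft_frames(x, frame_len=512):
    """Split a time-domain signal into frames of frame_len samples,
    zero-pad the last frame, apply a Hamming window, and Fourier
    transform each frame (S102).  Frames are non-overlapping here."""
    n_frames = int(np.ceil(len(x) / frame_len))
    padded = np.zeros(n_frames * frame_len)
    padded[:len(x)] = x                          # zero-pad to a whole frame
    frames = padded.reshape(n_frames, frame_len)
    window = np.hamming(frame_len)               # Hamming analysis window
    return np.fft.rfft(frames * window, axis=1)  # 512-point frames -> 257 bins
```

A 512-point real FFT yields 257 frequency bins, which matches the 257-node output layers of the two models.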
S103, inputting the first voice frequency domain signal containing echo and noise and the first far-end reference audio frequency domain signal into a Kalman filter, and estimating a first filter coefficient and a first echo frequency domain signal in real time. Wherein, the first echo frequency domain signal estimated by the Kalman filter is:
C(k,f)=W(k,f)*X(k,f)
where W (k, f) is the first filter coefficient, X (k, f) is the far-end reference audio domain signal, and k and f represent the kth frame and frequency f, respectively.
S104, subtracting the first echo frequency domain signal from the first voice frequency domain signal containing echo and noise to obtain a first voice frequency domain signal containing residual echo and noise, and using the first voice frequency domain signal containing residual echo and noise as an output result of the Kalman filter. The first voice frequency domain signal containing residual echo and noise is:
E(k,f)=Y(k,f)-C(k,f)
wherein, Y (k, f) is the first speech frequency domain signal containing echo and noise, C (k, f) is the first echo frequency domain signal, and k and f represent the k-th frame and frequency f respectively.
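The echo estimation and subtraction of S103 and S104 can be sketched as follows. The patent uses a Kalman filter to track W(k,f); purely for illustration this sketch substitutes a simple per-frequency NLMS-style update, which shares the same structure C(k,f) = W(k,f)*X(k,f) and E(k,f) = Y(k,f) - C(k,f) but is not the Kalman recursion itself:

```python
import numpy as np

def cancel_echo(Y, X, mu=0.5, eps=1e-8):
    """Frame-by-frame echo estimation and subtraction (S103-S104).

    Y, X: (num_frames, num_bins) complex spectra of the microphone signal
    and the far-end reference.  The filter update below is an NLMS-style
    stand-in for the Kalman recursion used in the patent.
    """
    W = np.zeros(X.shape[1], dtype=complex)   # filter coefficients W(k,f)
    E = np.empty_like(Y)
    for k in range(Y.shape[0]):
        C = W * X[k]                          # echo estimate C(k,f) = W(k,f) * X(k,f)
        E[k] = Y[k] - C                       # residual E(k,f) = Y(k,f) - C(k,f)
        # normalized coefficient update toward the true echo path
        W = W + mu * E[k] * np.conj(X[k]) / (np.abs(X[k]) ** 2 + eps)
    return W, E
```

On a purely linear, noiseless echo the per-bin coefficients converge to the true echo path, so the residual goes to zero.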
S105, performing energy normalization processing on the amplitude spectrum of the first voice frequency domain signal containing the residual echo and the noise, the amplitude spectrum of the first echo frequency domain signal and the amplitude spectrum of the first far-end reference audio frequency domain signal to obtain a first voice frequency domain signal feature containing residual echo and noise g_FD(f(|E(k,f)|)), a first echo frequency domain signal feature g_FD(f(|C(k,f)|)) and a first far-end reference audio domain signal feature g_FD(f(|X(k,f)|)). Each normalized feature is the compressed amplitude spectrum shifted by its running mean and scaled by its running standard deviation:

g_FD(f(|·|)) = (f(|·|) - μ(k,f)) / sqrt(σ²(k,f))

The mean and variance of the first voice frequency domain signal feature containing residual echo and noise are respectively defined as:

μ_f(e)(k,f) = c1·μ_f(e)(k-1,f) + (1-c1)·f(|E(k,f)|)
σ²_f(e)(k,f) = c1·σ²_f(e)(k-1,f) + (1-c1)·(f(|E(k,f)|) - μ_f(e)(k,f))²

The mean and variance of the first echo frequency domain signal feature are respectively defined as:

μ_f(c)(k,f) = c1·μ_f(c)(k-1,f) + (1-c1)·f(|C(k,f)|)
σ²_f(c)(k,f) = c1·σ²_f(c)(k-1,f) + (1-c1)·(f(|C(k,f)|) - μ_f(c)(k,f))²

The mean and variance of the first far-end reference audio domain signal feature are respectively defined as:

μ_f(x)(k,f) = c1·μ_f(x)(k-1,f) + (1-c1)·f(|X(k,f)|)
σ²_f(x)(k,f) = c1·σ²_f(x)(k-1,f) + (1-c1)·(f(|X(k,f)|) - μ_f(x)(k,f))²

Here |E(k,f)|, |C(k,f)| and |X(k,f)| respectively represent the amplitude spectrum of the first voice frequency domain signal containing residual echo and noise, the amplitude spectrum of the first echo frequency domain signal and the amplitude spectrum of the first far-end reference audio frequency domain signal, and c1 is a preset constant.
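The recursive energy normalization above can be sketched as follows; log compression for f(.) and an exponential recursion for the variance are assumptions here, since neither is fully reproduced in the source text:

```python
import numpy as np

def normalize_features(mag, c1=0.9, eps=1e-8):
    """Frame-recursive energy normalization of a magnitude spectrum (S105).

    mag: (num_frames, num_bins) magnitude spectrum.  Log compression for
    f(.) and the exponential variance recursion are assumptions."""
    f = np.log(mag + eps)                         # assumed compression f(|.|)
    mu = np.zeros(mag.shape[1])
    var = np.ones(mag.shape[1])
    out = np.empty_like(f)
    for k in range(f.shape[0]):
        mu = c1 * mu + (1 - c1) * f[k]            # running mean
        var = c1 * var + (1 - c1) * (f[k] - mu) ** 2  # running variance (assumed form)
        out[k] = (f[k] - mu) / np.sqrt(var + eps)     # normalized feature g_FD
    return out
```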
And S106, splicing the first voice frequency domain signal characteristic containing the residual echo and the noise with the first far-end reference audio domain signal characteristic to obtain a first splicing characteristic, and splicing the first voice frequency domain signal characteristic containing the residual echo and the noise with the first echo frequency domain signal characteristic to obtain a second splicing characteristic.
S107, inputting the first splicing feature and the second splicing feature into a feature attention model in the cascade network, so as to jointly train the feature attention model and the residual echo and noise cancellation model in the cascade network, and obtain a first weight α (k, f) corresponding to the first far-end reference audio domain signal feature and a second weight β (k, f) corresponding to the first echo domain signal feature.
S108, multiplying the first far-end reference audio domain signal feature by the first weight gives the first fused feature:

X_att(k,f) = g_FD(f(|X(k,f)|)) * α(k,f)

and multiplying the first echo frequency domain signal feature by the second weight gives the second fused feature:

C_att(k,f) = g_FD(f(|C(k,f)|)) * β(k,f)
S109, splicing the first fused feature Xatt(k,f), the second fused feature Catt(k,f) and the first speech frequency domain signal feature gFD(f(|E(k,f)|)) containing residual echo and noise to obtain a first fusion splicing feature.
And S110, inputting the first fusion splicing feature into the residual echo and noise elimination model in the cascade network; the output of the residual echo and noise elimination model is a masking estimation value G(k,f) of the second target voice frequency domain signal.
S111, using the masking estimation value G (k, f) of the second target speech frequency domain signal to enhance the first speech frequency domain signal containing residual echo and noise, and obtaining a second target speech frequency domain signal:
Ŝ(k,f)=G(k,f)*E(k,f)
And S112, determining a multi-domain loss function according to the at least two loss functions. For example, taking the amplitude spectrum of the first target speech frequency domain signal as the training target, an energy-independent amplitude spectrum loss function LMag is determined based on the second target speech frequency domain signal:
[equations defining the energy-independent amplitude spectrum loss function LMag on the amplitude spectra of the first and second target speech frequency domain signals]
An objective speech quality evaluation score loss function LPESQ is determined by taking the improvement of the auditory perception quality of speech as the training target:
[equation defining the objective speech quality evaluation score loss function LPESQ on the first and second target speech frequency domain signals]
wherein S(k,f) is the first target speech frequency domain signal and Ŝ(k,f) is the second target speech frequency domain signal. The energy-independent amplitude spectrum loss function and the objective speech quality evaluation score loss function are weighted and added, and the multi-domain loss function is determined as:
L=LMag+λ·LPESQ
wherein LMag is the energy-independent amplitude spectrum loss function, LPESQ is the objective speech quality evaluation score loss function, and λ is a preset constant.
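For illustration only, the weighted combination of the two losses can be sketched as below. The exact loss equations are not recoverable from the source images, so both component losses here are stand-ins: the magnitude loss scales each amplitude spectrum by its own RMS before the MSE (one plausible way to make it energy-independent), and `pesq_score_loss` is a placeholder for a real differentiable PESQ estimator.

```python
import numpy as np

def magnitude_loss(s_hat_mag, s_mag, eps=1e-8):
    # assumed form: each magnitude spectrum is scaled by its own RMS
    # before the MSE, so the loss ignores overall signal energy
    a = s_hat_mag / (np.sqrt(np.mean(s_hat_mag ** 2)) + eps)
    b = s_mag / (np.sqrt(np.mean(s_mag ** 2)) + eps)
    return float(np.mean((a - b) ** 2))

def pesq_score_loss(s_hat_mag, s_mag):
    # stand-in for the objective speech quality score loss; a real
    # implementation would call a differentiable PESQ approximation
    return float(np.mean(np.abs(s_hat_mag - s_mag)))

def multi_domain_loss(s_hat_mag, s_mag, lam=0.1):
    # weighted sum of the two losses; lam plays the role of λ
    return magnitude_loss(s_hat_mag, s_mag) + lam * pesq_score_loss(s_hat_mag, s_mag)

x = np.linspace(0.1, 1.0, 50)
loss = multi_domain_loss(x, x)   # identical spectra give zero loss
```

The RMS scaling makes `magnitude_loss` invariant to a uniform gain on either input, which is the property the "energy-independent" wording suggests.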
And S113, iteratively reducing the multi-domain loss function by continuously updating the model parameters to obtain the trained cascade network. The trained cascade network comprises a trained feature attention model and a trained residual echo and noise elimination model.
In the embodiment of the present application, a schematic flow chart of using the trained cascade network to eliminate residual echo and noise is shown in fig. 2, and includes S201-S212.
S201, receiving a voice time domain signal containing echo and noise and a far-end reference sound time domain signal. The far-end reference sound time domain signal undergoes a nonlinear transformation and is then convolved with the corresponding room transfer function to form the echo time domain signal within the voice time domain signal containing echo and noise.
S202, framing and windowing the voice time domain signal containing echo and noise and the far-end reference sound time domain signal respectively. Specifically, every 512 sampling points of each received signal are taken as one frame signal; if a frame is shorter than 512 points, it is first zero-padded to 512 points. Each frame signal is then windowed, with a Hamming window as the windowing function, and a Fourier transform is applied to each windowed frame signal to obtain a voice frequency domain signal containing echo and noise and a far-end reference audio domain signal.
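As a minimal sketch of this framing step (the function name is an assumption; note that the literal text describes non-overlapping frames, whereas practical front-ends usually overlap them):

```python
import numpy as np

FRAME_LEN = 512  # samples per frame, as in the text

def frames_to_spectra(x, frame_len=FRAME_LEN):
    """Split a time-domain signal into frames of frame_len samples,
    zero-pad the last frame, apply a Hamming window and FFT each frame."""
    n_frames = int(np.ceil(len(x) / frame_len))
    padded = np.zeros(n_frames * frame_len)
    padded[:len(x)] = x                       # zero-pad to a whole frame
    frames = padded.reshape(n_frames, frame_len)
    win = np.hamming(frame_len)               # Hamming windowing function
    return np.fft.rfft(frames * win, axis=1)  # one spectrum per frame

spec = frames_to_spectra(np.sin(0.05 * np.arange(1000)))
```

A 1000-sample input yields two 512-point frames, each transformed into 257 one-sided frequency bins.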
S203, inputting the voice frequency domain signal containing the echo and the noise and the far-end reference audio frequency domain signal into a Kalman filter, and estimating a filter coefficient and an echo frequency domain signal in real time. The echo frequency domain signal estimated by the Kalman filter is as follows:
C3(k,f)=W3(k,f)*X3(k,f)
wherein W3(k,f) are the filter coefficients, X3(k,f) is the far-end reference audio domain signal, and k and f represent the k-th frame and the frequency f, respectively.
S204, subtracting the echo frequency domain signal from the voice frequency domain signal containing echo and noise to obtain a voice frequency domain signal containing residual echo and noise:
E3(k,f)=Y3(k,f)-C3(k,f)
wherein Y3(k,f) is the voice frequency domain signal containing echo and noise, C3(k,f) is the echo frequency domain signal, and k and f represent the k-th frame and the frequency f, respectively.
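The per-frame echo estimate and subtraction of S203-S204 can be sketched as below; the Kalman filter update that produces W3 is omitted, and the function name and toy values are assumptions:

```python
import numpy as np

def subtract_echo(Y, W, X):
    """Frequency-domain echo cancellation for one frame:
    C = W * X is the estimated echo, E = Y - C is the residual signal.

    Y : microphone spectrum (speech + echo + noise)
    X : far-end reference spectrum
    W : filter coefficients, e.g. from a Kalman filter update
    """
    C = W * X
    E = Y - C
    return E, C

# toy frame: if W matches the true echo path, E recovers the clean part
X = np.array([1.0 + 1.0j, 2.0 + 0.0j, 0.0 + 0.5j])   # far-end reference
W = np.array([0.5, 0.25, 1.0])                       # echo-path estimate
S = np.array([0.1, 0.2, 0.3])                        # near-end speech spectrum
Y = S + W * X                                        # microphone = speech + echo
E, C = subtract_echo(Y, W, X)
```

When the filter coefficients equal the true echo path, the subtraction removes the echo exactly; in practice W is only an estimate, which is why residual echo remains for the cascade network to suppress.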
S205, performing energy normalization processing on the amplitude spectrum of the voice frequency domain signal containing residual echo and noise, the amplitude spectrum of the echo frequency domain signal and the amplitude spectrum of the far-end reference audio domain signal to obtain the voice frequency domain signal feature containing residual echo and noise gFD(f(|E3(k,f)|)), the echo frequency domain signal feature gFD(f(|C3(k,f)|)) and the far-end reference audio domain signal feature gFD(f(|X3(k,f)|)), wherein gFD(·) denotes the normalization determined by the mean and variance of each feature.
The mean and variance of the voice frequency domain signal feature containing residual echo and noise are respectively defined as:
μf(e3)(k,f)=c2μf(e3)(k-1,f)+(1-c2)f(|E3(k,f)|)
σ²f(e3)(k,f)=c2σ²f(e3)(k-1,f)+(1-c2)(f(|E3(k,f)|)-μf(e3)(k,f))²
The mean and variance of the echo frequency domain signal feature are respectively defined as:
μf(c3)(k,f)=c2μf(c3)(k-1,f)+(1-c2)f(|C3(k,f)|)
σ²f(c3)(k,f)=c2σ²f(c3)(k-1,f)+(1-c2)(f(|C3(k,f)|)-μf(c3)(k,f))²
The mean and variance of the far-end reference audio domain signal feature are respectively defined as:
μf(x3)(k,f)=c2μf(x3)(k-1,f)+(1-c2)f(|X3(k,f)|)
σ²f(x3)(k,f)=c2σ²f(x3)(k-1,f)+(1-c2)(f(|X3(k,f)|)-μf(x3)(k,f))²
wherein |E3(k,f)|, |C3(k,f)| and |X3(k,f)| respectively represent the amplitude spectrum of the voice frequency domain signal containing residual echo and noise, the amplitude spectrum of the echo frequency domain signal and the amplitude spectrum of the far-end reference audio domain signal, and c2 is a preset constant.
And S206, splicing the voice frequency domain signal characteristics containing the residual echo and the noise with the far-end reference audio domain signal characteristics to obtain a first splicing result, and splicing the voice frequency domain signal characteristics containing the residual echo and the noise with the echo frequency domain signal characteristics to obtain a second splicing result.
S207, inputting the first splicing result and the second splicing result into the trained feature attention model in the trained cascade network to obtain a first attention weight α3(k,f) corresponding to the far-end reference audio domain signal feature and a second attention weight β3(k,f) corresponding to the echo frequency domain signal feature.
S208, multiplying the far-end reference audio domain signal feature by the first attention weight α3(k,f) to obtain a first fused attention mechanism feature:
X3,att(k,f)=gFD(f(|X3(k,f)|))*α3(k,f)
and multiplying the echo frequency domain signal feature by the second attention weight β3(k,f) to obtain a second fused attention mechanism feature:
C3,att(k,f)=gFD(f(|C3(k,f)|))*β3(k,f)
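The attention-weighted fusion and the subsequent splicing of S208-S209 can be sketched as follows; the concatenation axis and the function name are assumptions:

```python
import numpy as np

def fuse_features(x_feat, c_feat, e_feat, alpha, beta):
    """Weight the far-end reference and echo features with the attention
    weights, then concatenate them with the residual-signal feature
    (concatenation along the last axis is an assumed layout)."""
    x_att = x_feat * alpha   # first fused attention mechanism feature
    c_att = c_feat * beta    # second fused attention mechanism feature
    return np.concatenate([x_att, c_att, e_feat], axis=-1)

F = 257  # frequency bins per frame
x_feat = np.ones(F)
c_feat = 2.0 * np.ones(F)
e_feat = 3.0 * np.ones(F)
alpha = 0.5 * np.ones(F)    # attention weight for the reference feature
beta = 0.25 * np.ones(F)    # attention weight for the echo feature
fused = fuse_features(x_feat, c_feat, e_feat, alpha, beta)
```

The residual-signal feature is passed through unweighted, so the attention model only controls how much of the reference and echo information reaches the elimination model.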
and S209, splicing the first fusion attention mechanism characteristic, the second fusion attention mechanism characteristic and the voice frequency domain signal characteristic containing the residual echo and the noise to obtain a first fusion splicing result.
S210, inputting the first fusion splicing result into the trained residual echo and noise elimination model in the trained cascade network to obtain a masking estimation value G3(k,f) of the target voice frequency domain signal.
S211, multiplying the masking estimation value G3(k,f) of the target voice frequency domain signal by the voice frequency domain signal E3(k,f) containing residual echo and noise to obtain the target voice frequency domain signal:
Ŝ3(k,f)=G3(k,f)*E3(k,f)
and S212, performing inverse Fourier transform on the target voice frequency domain signal to obtain a target voice time domain signal.
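The final masking and inverse-transform steps S211-S212 can be sketched as below. This follows the non-overlapping framing described in S202 (no overlap-add, no window compensation), and the function name and all-pass toy mask are assumptions:

```python
import numpy as np

FRAME_LEN = 512

def apply_mask_and_istft(E, G, frame_len=FRAME_LEN):
    """Multiply the residual spectrum E by the mask estimate G, then
    inverse-FFT each frame and concatenate the frames back into a
    time-domain signal (non-overlapping frames, matching S202)."""
    S_hat = G * E                              # masked target spectrum
    frames = np.fft.irfft(S_hat, n=frame_len, axis=1)
    return frames.reshape(-1)

rng = np.random.default_rng(1)
x = rng.standard_normal((4, FRAME_LEN))        # 4 time-domain frames
E = np.fft.rfft(x, axis=1)                     # residual-signal spectra
G = np.ones(E.shape)                           # all-pass mask as a toy case
y = apply_mask_and_istft(E, G)
```

With an all-pass mask the round trip reconstructs the input frames exactly; a learned mask (typically real-valued in [0, 1]) would instead attenuate the time-frequency bins dominated by residual echo and noise.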
In the embodiments of the present application, the residual echo and the noise do not need to be eliminated separately in two stages; they are eliminated in a single pass through the trained cascade network. The trained feature attention model assigns different importance to the input features, reducing redundant information in the input features and improving the residual echo and noise elimination performance of the cascade network. Training the cascade network with a multi-domain loss function that combines an energy-independent amplitude spectrum loss function and an objective speech quality evaluation score loss function reduces the sensitivity of the model to signal energy and improves the auditory perception quality of the output speech.
An embodiment of the present application provides a residual echo and noise cancellation device, a schematic structural diagram of which is shown in fig. 3, including:
a receiving module 301, a processing module 302, a determining module 303, an energy normalization module 304, a splicing module 305, a weight obtaining module 306, a fused attention mechanism feature obtaining module 307, a masking estimation value obtaining module 308, a target voice frequency domain signal obtaining module 309 and an inverse Fourier transform module 310;
a receiving module 301, configured to receive a voice time domain signal containing echo and noise and a far-end reference sound time domain signal;
a processing module 302, configured to perform framing, windowing, and Fourier transform on the voice time-domain signal containing the echo and the noise and the far-end reference sound time-domain signal, respectively, to obtain a voice frequency-domain signal containing the echo and the noise and a far-end reference audio frequency-domain signal;
a determining module 303, configured to determine an echo frequency domain signal according to the voice frequency domain signal containing echo and noise and the far-end reference audio frequency domain signal;
the determining module 303 is further configured to determine a voice frequency domain signal containing residual echo and noise according to the voice frequency domain signal containing echo and noise and the echo frequency domain signal;
an energy normalization module 304, configured to perform energy normalization processing on the magnitude spectrum of the voice frequency domain signal containing the residual echo and the noise, the magnitude spectrum of the echo frequency domain signal, and the magnitude spectrum of the far-end reference audio domain signal, so as to obtain a voice frequency domain signal feature containing the residual echo and the noise, an echo frequency domain signal feature, and a far-end reference audio domain signal feature;
a splicing module 305, configured to splice the voice frequency domain signal feature containing the residual echo and the noise with the far-end reference audio domain signal feature to obtain a first splicing result, and splice the voice frequency domain signal feature containing the residual echo and the noise with the echo frequency domain signal feature to obtain a second splicing result;
a weight obtaining module 306, configured to input the first splicing result and the second splicing result into a trained feature attention model in the trained cascade network, and obtain a first attention weight corresponding to the far-end reference audio-domain signal feature and a second attention weight corresponding to the echo frequency-domain signal feature;
a fused attention mechanism feature obtaining module 307, configured to multiply the far-end reference audio domain signal feature by a first attention weight to obtain a first fused attention mechanism feature, and multiply the echo frequency domain signal feature by a second attention weight to obtain a second fused attention mechanism feature;
the splicing module 305 is further configured to splice the first fusion attention mechanism feature, the second fusion attention mechanism feature, and the voice frequency domain signal feature containing the residual echo and the noise to obtain a first fusion splicing result;
a masking estimation value obtaining module 308, configured to input the first fusion splicing result into a post-training residual echo and noise cancellation model in the post-training cascade network, so as to obtain a masking estimation value of the target voice frequency domain signal;
a target voice frequency domain signal obtaining module 309, configured to obtain the target voice frequency domain signal according to the masking estimation value of the target voice frequency domain signal and the voice frequency domain signal containing the residual echo and the noise;
and an inverse Fourier transform module 310, configured to perform inverse Fourier transform on the target voice frequency domain signal to obtain a target voice time domain signal.
An embodiment of the present application provides a residual echo and noise cancellation apparatus, including at least one processor, where the processor is configured to execute a program stored in a memory, and when the program is executed, the apparatus is enabled to perform the following steps:
receiving a voice time domain signal containing echo and noise and a far-end reference sound time domain signal;
respectively performing framing, windowing and Fourier transformation on the voice time domain signal containing the echo and the noise and the far-end reference sound time domain signal to obtain a voice frequency domain signal containing the echo and the noise and a far-end reference sound frequency domain signal;
determining an echo frequency domain signal according to the voice frequency domain signal containing the echo and the noise and the far-end reference audio frequency domain signal;
determining a voice frequency domain signal containing residual echo and noise according to the voice frequency domain signal containing echo and noise and the echo frequency domain signal;
performing energy normalization processing on the amplitude spectrum of the voice frequency domain signal containing the residual echo and the noise, the amplitude spectrum of the echo frequency domain signal and the amplitude spectrum of the far-end reference audio frequency domain signal to obtain a voice frequency domain signal characteristic containing the residual echo and the noise, an echo frequency domain signal characteristic and a far-end reference audio frequency domain signal characteristic;
splicing the voice frequency domain signal characteristics containing the residual echo and the noise with the far-end reference audio domain signal characteristics to obtain a first splicing result, and splicing the voice frequency domain signal characteristics containing the residual echo and the noise with the echo frequency domain signal characteristics to obtain a second splicing result;
inputting the first splicing result and the second splicing result into a trained feature attention model in the trained cascade network to obtain a first attention weight corresponding to the far-end reference audio domain signal feature and a second attention weight corresponding to the echo frequency domain signal feature;
multiplying the far-end reference audio frequency domain signal characteristic by a first attention weight to obtain a first fused attention mechanism characteristic, and multiplying the echo frequency domain signal characteristic by a second attention weight to obtain a second fused attention mechanism characteristic;
splicing the first fusion attention mechanism characteristic, the second fusion attention mechanism characteristic and the voice frequency domain signal characteristic containing residual echo and noise to obtain a first fusion splicing result;
inputting the first fusion splicing result into a trained residual echo and noise elimination model in the trained cascade network to obtain a masking estimation value of a target voice frequency domain signal;
obtaining the target voice frequency domain signal according to the masking estimation value of the target voice frequency domain signal and the voice frequency domain signal containing the residual echo and the noise;
and carrying out inverse Fourier transform on the target voice frequency domain signal to obtain a target voice time domain signal.
An embodiment of the application provides a non-transitory computer readable storage medium, on which a computer program is stored, the computer program, when executed by a processor, implementing the steps of:
receiving a voice time domain signal containing echo and noise and a far-end reference sound time domain signal;
respectively performing framing, windowing and Fourier transformation on the voice time domain signal containing the echo and the noise and the far-end reference sound time domain signal to obtain a voice frequency domain signal containing the echo and the noise and a far-end reference sound frequency domain signal;
determining an echo frequency domain signal according to the voice frequency domain signal containing the echo and the noise and the far-end reference audio frequency domain signal;
determining a voice frequency domain signal containing residual echo and noise according to the voice frequency domain signal containing echo and noise and the echo frequency domain signal;
performing energy normalization processing on the amplitude spectrum of the voice frequency domain signal containing the residual echo and the noise, the amplitude spectrum of the echo frequency domain signal and the amplitude spectrum of the far-end reference audio frequency domain signal to obtain a voice frequency domain signal characteristic containing the residual echo and the noise, an echo frequency domain signal characteristic and a far-end reference audio frequency domain signal characteristic;
splicing the voice frequency domain signal characteristics containing the residual echo and the noise with the far-end reference audio domain signal characteristics to obtain a first splicing result, and splicing the voice frequency domain signal characteristics containing the residual echo and the noise with the echo frequency domain signal characteristics to obtain a second splicing result;
inputting the first splicing result and the second splicing result into a trained feature attention model in the trained cascade network to obtain a first attention weight corresponding to the far-end reference audio domain signal feature and a second attention weight corresponding to the echo frequency domain signal feature;
multiplying the far-end reference audio frequency domain signal characteristic by a first attention weight to obtain a first fused attention mechanism characteristic, and multiplying the echo frequency domain signal characteristic by a second attention weight to obtain a second fused attention mechanism characteristic;
splicing the first fusion attention mechanism characteristic, the second fusion attention mechanism characteristic and the voice frequency domain signal characteristic containing residual echo and noise to obtain a first fusion splicing result;
inputting the first fusion splicing result into a trained residual echo and noise elimination model in the trained cascade network to obtain a masking estimation value of a target voice frequency domain signal;
obtaining the target voice frequency domain signal according to the masking estimation value of the target voice frequency domain signal and the voice frequency domain signal containing the residual echo and the noise;
and carrying out inverse Fourier transform on the target voice frequency domain signal to obtain a target voice time domain signal.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
It should be noted that: the above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (9)

1. A method for residual echo and noise cancellation, comprising:
receiving a voice time domain signal containing echo and noise and a far-end reference sound time domain signal;
respectively performing framing, windowing and Fourier transformation on the voice time domain signal containing the echo and the noise and the far-end reference sound time domain signal to obtain a voice frequency domain signal containing the echo and the noise and a far-end reference sound frequency domain signal;
determining an echo frequency domain signal according to the voice frequency domain signal containing the echo and the noise and the far-end reference audio frequency domain signal;
determining a voice frequency domain signal containing residual echo and noise according to the voice frequency domain signal containing echo and noise and the echo frequency domain signal;
performing energy normalization processing on the amplitude spectrum of the voice frequency domain signal containing the residual echo and the noise, the amplitude spectrum of the echo frequency domain signal and the amplitude spectrum of the far-end reference audio frequency domain signal to obtain a voice frequency domain signal characteristic containing the residual echo and the noise, an echo frequency domain signal characteristic and a far-end reference audio frequency domain signal characteristic;
splicing the voice frequency domain signal characteristics containing the residual echo and the noise with the far-end reference audio domain signal characteristics to obtain a first splicing result, and splicing the voice frequency domain signal characteristics containing the residual echo and the noise with the echo frequency domain signal characteristics to obtain a second splicing result;
inputting the first splicing result and the second splicing result into a trained feature attention model in the trained cascade network to obtain a first attention weight corresponding to the far-end reference audio domain signal feature and a second attention weight corresponding to the echo frequency domain signal feature;
multiplying the far-end reference audio frequency domain signal characteristic by a first attention weight to obtain a first fused attention mechanism characteristic, and multiplying the echo frequency domain signal characteristic by a second attention weight to obtain a second fused attention mechanism characteristic;
splicing the first fusion attention mechanism characteristic, the second fusion attention mechanism characteristic and the voice frequency domain signal characteristic containing residual echo and noise to obtain a first fusion splicing result;
inputting the first fusion splicing result into a trained residual echo and noise elimination model in the trained cascade network to obtain a masking estimation value of a target voice frequency domain signal;
obtaining the target voice frequency domain signal according to the masking estimation value of the target voice frequency domain signal and the voice frequency domain signal containing the residual echo and the noise;
and carrying out inverse Fourier transform on the target voice frequency domain signal to obtain a target voice time domain signal.
2. The method of claim 1, wherein the framing, windowing, and fourier transforming the echo and noise containing speech time domain signal and the far-end reference acoustic time domain signal, respectively, comprises:
respectively taking a preset number of sampling points as one frame signal for the voice time domain signal containing the echo and the noise and the far-end reference sound time domain signal; if a frame is shorter than the preset number, it is first zero-padded to the preset number;
windowing each frame signal; wherein, the windowing function adopts a Hamming window;
and performing Fourier transform on each windowed frame signal.
3. The method according to claim 1, wherein determining an echo frequency domain signal from the echo and noise containing speech frequency domain signal and the far-end reference audio frequency domain signal comprises:
inputting the voice frequency domain signal containing the echo and the noise and the far-end reference audio frequency domain signal into a Kalman filter to obtain a filter coefficient and the echo frequency domain signal;
the echo frequency domain signal is a result of multiplying the filter coefficient and the far-end reference sound frequency domain signal.
4. The method of claim 1, wherein determining the voice frequency domain signal containing the residual echo and the noise according to the voice frequency domain signal containing the echo and the noise and the echo frequency domain signal comprises:
and subtracting the echo frequency domain signal from the voice frequency domain signal containing the echo and the noise to obtain the voice frequency domain signal containing the residual echo and the noise.
5. The method according to claim 1, wherein the energy normalization processing is performed on the magnitude spectrum of the speech frequency domain signal containing the residual echo and the noise, the magnitude spectrum of the echo frequency domain signal, and the magnitude spectrum of the far-end reference audio domain signal to obtain the speech frequency domain signal characteristic containing the residual echo and the noise, the echo frequency domain signal characteristic, and the far-end reference audio domain signal characteristic, and comprises:
respectively determining a first function, a second function and a third function corresponding to the amplitude spectrum of the voice frequency domain signal containing the residual echo and the noise, the amplitude spectrum of the echo frequency domain signal and the amplitude spectrum of the far-end reference audio frequency domain signal;
determining the voice frequency domain signal characteristics containing the residual echo and the noise according to a first function corresponding to the amplitude spectrum of the voice frequency domain signal containing the residual echo and the noise, and the mean value and the variance of the voice frequency domain signal characteristics containing the residual echo and the noise;
determining the echo frequency domain signal characteristics according to a second function corresponding to the amplitude spectrum of the echo frequency domain signal, and the mean value and the variance of the echo frequency domain signal characteristics;
and determining the characteristics of the far-end reference audio frequency domain signal according to a third function corresponding to the amplitude spectrum of the far-end reference audio frequency domain signal and the mean value and the variance of the characteristics of the far-end reference audio frequency domain signal.
6. The method of claim 1, wherein the trained cascade network is trained by:
receiving a first voice time domain signal containing echo and noise, a first far-end reference sound time domain signal and a first target voice time domain signal;
performing framing, windowing and Fourier transformation on the first voice time domain signal containing the echo and the noise, the first far-end reference sound time domain signal and the first target voice time domain signal respectively to obtain a first voice frequency domain signal containing the echo and the noise, a first far-end reference audio frequency domain signal and a first target voice frequency domain signal;
determining a first echo frequency domain signal according to the first voice frequency domain signal containing echo and noise and the first far-end reference audio frequency domain signal;
determining a first voice frequency domain signal containing residual echo and noise according to the first voice frequency domain signal containing echo and noise and the first echo frequency domain signal;
performing energy normalization processing on the amplitude spectrum of the first voice frequency domain signal containing residual echo and noise, the amplitude spectrum of the first echo frequency domain signal and the amplitude spectrum of the first far-end reference audio frequency domain signal to obtain a first voice frequency domain signal characteristic containing residual echo and noise, a first echo frequency domain signal characteristic and a first far-end reference audio frequency domain signal characteristic;
splicing the first voice frequency domain signal characteristic containing residual echo and noise with the first far-end reference audio domain signal characteristic to obtain a first splicing characteristic, and splicing the first voice frequency domain signal characteristic containing residual echo and noise with the first echo frequency domain signal characteristic to obtain a second splicing characteristic;
inputting the first splicing feature and the second splicing feature into a feature attention model in a cascade network so as to jointly train the feature attention model and a residual echo and noise elimination model in the cascade network, and obtaining a first weight corresponding to the first far-end reference audio domain signal feature and a second weight corresponding to the first echo frequency domain signal feature;
multiplying the first far-end reference audio domain signal characteristic by a first weight to obtain a first fused characteristic, and multiplying the first echo audio domain signal characteristic by a second weight to obtain a second fused characteristic;
splicing the first fusion characteristic, the second fusion characteristic and the first voice frequency domain signal characteristic containing residual echo and noise to obtain a first fusion splicing characteristic;
inputting the first fusion splicing characteristic into a residual echo and noise elimination model in the cascade network to obtain a masking estimation value of a second target voice frequency domain signal;
determining a second target voice frequency domain signal according to the masking estimation value of the second target voice frequency domain signal and the first voice frequency domain signal containing residual echo and noise;
determining a multi-domain loss function according to at least two loss functions; wherein the at least two loss functions comprise an energy-independent magnitude spectrum loss function and an objective speech quality assessment score loss function; the amplitude spectrum loss function irrelevant to the energy takes the amplitude spectrum of the first target voice frequency domain signal as a training target and is determined according to the second target voice frequency domain signal; the objective voice quality evaluation score loss function is determined by taking the improvement of voice audibility quality as a training target;
and iteratively reducing the multi-domain loss function by continuously updating model parameters to obtain the trained cascade network.
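The attention-weighted feature fusion in the training steps above can be sketched as follows. This is a minimal numpy illustration, not the patented network: the single-layer sigmoid gate (`attention_gate`, with random matrices `w1`, `w2`) is a hypothetical stand-in for the trained feature attention model, and the feature shapes are assumed (257 frequency bins, 100 frames).

```python
import numpy as np

rng = np.random.default_rng(0)
F, T = 257, 100  # assumed: frequency bins, time frames

# Stand-ins for the energy-normalized magnitude-spectrum features
residual_feat = np.abs(rng.standard_normal((F, T)))  # speech with residual echo + noise
far_end_feat = np.abs(rng.standard_normal((F, T)))   # far-end reference
echo_feat = np.abs(rng.standard_normal((F, T)))      # estimated echo

# First splicing: residual-speech feature ++ far-end reference feature
# Second splicing: residual-speech feature ++ echo feature
concat1 = np.concatenate([residual_feat, far_end_feat], axis=0)  # (2F, T)
concat2 = np.concatenate([residual_feat, echo_feat], axis=0)     # (2F, T)

def attention_gate(concat, w):
    # Hypothetical one-layer sigmoid gate per time-frequency bin;
    # the patent's feature attention model is a trained network.
    return 1.0 / (1.0 + np.exp(-(w @ concat)))  # (F, T), values in (0, 1)

w1 = 0.01 * rng.standard_normal((F, 2 * F))
w2 = 0.01 * rng.standard_normal((F, 2 * F))
weight1 = attention_gate(concat1, w1)  # first weight
weight2 = attention_gate(concat2, w2)  # second weight

# Fused characteristics: element-wise product of feature and its weight
fused1 = far_end_feat * weight1
fused2 = echo_feat * weight2

# Fused splicing fed to the residual echo and noise elimination model
fused_concat = np.concatenate([fused1, fused2, residual_feat], axis=0)
```

The weights act as soft gates in (0, 1), so the fusion attenuates reference and echo features per time-frequency bin before the elimination model sees them.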
7. A residual echo and noise cancellation apparatus, comprising:
the receiving module is used for receiving a voice time domain signal containing echo and noise and a far-end reference sound time domain signal;
the processing module is used for respectively performing framing, windowing and Fourier transform on the voice time domain signal containing the echo and the noise and the far-end reference sound time domain signal to obtain a voice frequency domain signal containing the echo and the noise and a far-end reference sound frequency domain signal;
the determining module is used for determining an echo frequency domain signal according to the voice frequency domain signal containing the echo and the noise and the far-end reference audio frequency domain signal;
the determining module is further configured to determine a voice frequency domain signal containing residual echo and noise according to the voice frequency domain signal containing echo and noise and the echo frequency domain signal;
the energy normalization module is used for carrying out energy normalization processing on the amplitude spectrum of the voice frequency domain signal containing the residual echo and the noise, the amplitude spectrum of the echo frequency domain signal and the amplitude spectrum of the far-end reference audio frequency domain signal to obtain a voice frequency domain signal characteristic containing the residual echo and the noise, an echo frequency domain signal characteristic and a far-end reference audio frequency domain signal characteristic;
the splicing module is used for splicing the voice frequency domain signal characteristics containing the residual echo and the noise with the far-end reference audio frequency domain signal characteristics to obtain a first splicing result, and splicing the voice frequency domain signal characteristics containing the residual echo and the noise with the echo frequency domain signal characteristics to obtain a second splicing result;
a weight obtaining module, configured to input the first splicing result and the second splicing result into a trained feature attention model in the trained cascade network, and obtain a first attention weight corresponding to the far-end reference audio frequency domain signal feature and a second attention weight corresponding to the echo frequency domain signal feature;
a fused attention mechanism feature obtaining module, configured to multiply the far-end reference audio frequency domain signal feature by the first attention weight to obtain a first fused attention mechanism feature, and multiply the echo frequency domain signal feature by the second attention weight to obtain a second fused attention mechanism feature;
the splicing module is further configured to splice the first fusion attention mechanism feature, the second fusion attention mechanism feature and the voice frequency domain signal feature containing the residual echo and the noise to obtain a first fusion splicing result;
a masking estimation value obtaining module, configured to input the first fusion splicing result into a trained residual echo and noise cancellation model in the trained cascade network, so as to obtain a masking estimation value of the target voice frequency domain signal;
a target voice frequency domain signal obtaining module, configured to obtain a target voice frequency domain signal according to the masking estimation value of the target voice frequency domain signal and the voice frequency domain signal containing the residual echo and the noise;
and the inverse Fourier transform module is used for performing inverse Fourier transform on the target voice frequency domain signal to obtain a target voice time domain signal.
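The end-to-end signal path of the apparatus (framing, windowing and Fourier transform on the input, masking in the frequency domain, inverse Fourier transform back to the time domain) can be sketched with numpy. This is an illustrative pipeline under assumed parameters (512-sample Hann frames, 256-sample hop); the random `mask` stands in for the network's masking estimate, and no perfect-reconstruction window normalization is attempted.

```python
import numpy as np

def stft(x, frame_len=512, hop=256):
    # Framing, Hann windowing and FFT, as the processing module describes
    win = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] * win
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)  # (n_frames, frame_len // 2 + 1)

def istft(spec, frame_len=512, hop=256):
    # Inverse FFT and overlap-add, as the inverse Fourier transform
    # module describes (no synthesis-window normalization here)
    frames = np.fft.irfft(spec, n=frame_len, axis=1)
    out = np.zeros((spec.shape[0] - 1) * hop + frame_len)
    for i, frame in enumerate(frames):
        out[i * hop:i * hop + frame_len] += frame
    return out

rng = np.random.default_rng(1)
mic = rng.standard_normal(4096)  # stand-in microphone signal (speech + echo + noise)
spec = stft(mic)
# Hypothetical masking estimate in [0, 1]; in the apparatus this comes
# from the trained residual echo and noise cancellation model
mask = np.clip(rng.random(spec.shape), 0.0, 1.0)
target = istft(spec * mask)      # target voice time domain signal
```

Applying the real-valued mask to the complex spectrum scales each time-frequency bin's magnitude while keeping its phase, which is the standard way a masking estimate is used to recover the target signal.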
8. A residual echo and noise cancellation apparatus comprising at least one processor configured to execute a program stored in a memory, the program, when executed, causing the apparatus to perform:
the method of any one of claims 1-6.
9. A non-transitory computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-6.
CN202110008502.9A 2021-01-05 2021-01-05 Residual echo and noise elimination method and device Active CN112863535B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110008502.9A CN112863535B (en) 2021-01-05 2021-01-05 Residual echo and noise elimination method and device


Publications (2)

Publication Number Publication Date
CN112863535A true CN112863535A (en) 2021-05-28
CN112863535B CN112863535B (en) 2022-04-26

Family

ID=76003795

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110008502.9A Active CN112863535B (en) 2021-01-05 2021-01-05 Residual echo and noise elimination method and device

Country Status (1)

Country Link
CN (1) CN112863535B (en)


Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107636758A (en) * 2015-05-15 2018-01-26 哈曼国际工业有限公司 Acoustic echo eliminates system and method
US20200105287A1 (en) * 2017-04-14 2020-04-02 Industry-University Cooperation Foundation Hanyang University Deep neural network-based method and apparatus for combining noise and echo removal
CN111161752A (en) * 2019-12-31 2020-05-15 歌尔股份有限公司 Echo cancellation method and device
CN111341336A (en) * 2020-03-16 2020-06-26 北京字节跳动网络技术有限公司 Echo cancellation method, device, terminal equipment and medium
US20200312346A1 (en) * 2019-03-28 2020-10-01 Samsung Electronics Co., Ltd. System and method for acoustic echo cancellation using deep multitask recurrent neural networks
CN111768796A (en) * 2020-07-14 2020-10-13 中国科学院声学研究所 Acoustic echo cancellation and dereverberation method and device
CN111768795A (en) * 2020-07-09 2020-10-13 腾讯科技(深圳)有限公司 Noise suppression method, device, equipment and storage medium for voice signal


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WANG Dongxia et al., "Echo and Noise Suppression Algorithm Based on BLSTM Neural Network", Journal of Signal Processing *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113436636A (en) * 2021-06-11 2021-09-24 深圳波洛斯科技有限公司 Acoustic echo cancellation method and system based on adaptive filter and neural network
CN113489854A (en) * 2021-06-30 2021-10-08 北京小米移动软件有限公司 Sound processing method, sound processing device, electronic equipment and storage medium
CN113489854B (en) * 2021-06-30 2024-03-01 北京小米移动软件有限公司 Sound processing method, device, electronic equipment and storage medium
CN113539291A (en) * 2021-07-09 2021-10-22 北京声智科技有限公司 Method and device for reducing noise of audio signal, electronic equipment and storage medium
CN113744762B (en) * 2021-08-09 2023-10-27 杭州网易智企科技有限公司 Signal-to-noise ratio determining method and device, electronic equipment and storage medium
CN113744762A (en) * 2021-08-09 2021-12-03 杭州网易智企科技有限公司 Signal-to-noise ratio determining method and device, electronic equipment and storage medium
CN114337908A (en) * 2022-01-05 2022-04-12 中国科学院声学研究所 Method and device for generating interference signal of target voice signal
CN114337908B (en) * 2022-01-05 2024-04-12 中国科学院声学研究所 Method and device for generating interference signal of target voice signal
CN114974281A (en) * 2022-05-24 2022-08-30 云知声智能科技股份有限公司 Training method and device of voice noise reduction model, storage medium and electronic device
WO2023226592A1 (en) * 2022-05-25 2023-11-30 青岛海尔科技有限公司 Noise signal processing method and apparatus, and storage medium and electronic apparatus
CN115294997A (en) * 2022-06-30 2022-11-04 北京达佳互联信息技术有限公司 Voice processing method and device, electronic equipment and storage medium
CN114974286A (en) * 2022-06-30 2022-08-30 北京达佳互联信息技术有限公司 Signal enhancement method, model training method, device, equipment, sound box and medium
CN115294997B (en) * 2022-06-30 2024-10-29 北京达佳互联信息技术有限公司 Voice processing method, device, electronic equipment and storage medium
CN117437929A (en) * 2023-12-21 2024-01-23 睿云联(厦门)网络通讯技术有限公司 Real-time echo cancellation method based on neural network
CN117437929B (en) * 2023-12-21 2024-03-08 睿云联(厦门)网络通讯技术有限公司 Real-time echo cancellation method based on neural network

Also Published As

Publication number Publication date
CN112863535B (en) 2022-04-26

Similar Documents

Publication Publication Date Title
CN112863535B (en) Residual echo and noise elimination method and device
CN107452389B (en) Universal single-track real-time noise reduction method
CN108172231B (en) Dereverberation method and system based on Kalman filtering
KR101934636B1 (en) Method and apparatus for integrating and removing acoustic echo and background noise based on deepening neural network
Zhao et al. A two-stage algorithm for noisy and reverberant speech enhancement
CN112700786B (en) Speech enhancement method, device, electronic equipment and storage medium
Lee et al. DNN-based residual echo suppression.
Zhao et al. Late reverberation suppression using recurrent neural networks with long short-term memory
CN112581973B (en) Voice enhancement method and system
CN111048061B (en) Method, device and equipment for obtaining step length of echo cancellation filter
CN112037809A (en) Residual echo suppression method based on multi-feature flow structure deep neural network
CN112201273B (en) Noise power spectral density calculation method, system, equipment and medium
CN111883154B (en) Echo cancellation method and device, computer-readable storage medium, and electronic device
CN113838471A (en) Noise reduction method and system based on neural network, electronic device and storage medium
CN113744748A (en) Network model training method, echo cancellation method and device
Schwartz et al. Nested generalized sidelobe canceller for joint dereverberation and noise reduction
CN112997249B (en) Voice processing method, device, storage medium and electronic equipment
CN114302286A (en) Method, device and equipment for reducing noise of call voice and storage medium
CN117219102A (en) Low-complexity voice enhancement method based on auditory perception
CN115620737A (en) Voice signal processing device, method, electronic equipment and sound amplification system
Kamarudin et al. Acoustic echo cancellation using adaptive filtering algorithms for Quranic accents (Qiraat) identification
CN111883155B (en) Echo cancellation method, device and storage medium
Braun et al. Low complexity online convolutional beamforming
CN108074580B (en) Noise elimination method and device
Yoshioka et al. Speech dereverberation and denoising based on time varying speech model and autoregressive reverberation model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant