CN112863535A - Residual echo and noise elimination method and device - Google Patents
- Publication number
- CN112863535A (application CN202110008502.9A)
- Authority
- CN
- China
- Prior art keywords
- domain signal
- frequency domain
- echo
- noise
- far
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L21/0232—Processing in the frequency domain
- G10L2021/02082—Noise filtering the noise being echo, reverberation of the speech
Abstract
The embodiments of the application disclose a method and a device for eliminating residual echo and noise. The method comprises the following steps: performing framing, windowing and Fourier transformation on a received speech time-domain signal containing echo and noise and on a far-end reference time-domain signal to obtain the corresponding frequency-domain signals; determining an echo frequency-domain signal, and from it the speech frequency-domain signal containing residual echo and noise; performing energy normalization on the magnitude spectra of the speech frequency-domain signal containing residual echo and noise, the echo frequency-domain signal and the far-end reference frequency-domain signal to obtain the corresponding features; determining a target speech frequency-domain signal from these features with a trained cascade network; and applying an inverse Fourier transform to the target speech frequency-domain signal to obtain the target speech time-domain signal. In the embodiments of the application, a feature attention model assigns different importance to the input features and reduces redundant information in them, and the cascade network is trained with a multi-domain loss function, which reduces the model's sensitivity to signal energy.
Description
Technical Field
The present invention relates to the field of echo and noise cancellation, and more particularly to a method and an apparatus for residual echo and noise cancellation.
Background
At present, echo cancellation technology mainly removes the echo formed by the far-end reference signal from the speech signal, while speech noise reduction mainly removes background noise and directional interference. Both aim to improve the quality and intelligibility of speech. In echo cancellation, combining adaptive filtering based on traditional signal processing with residual echo cancellation based on deep learning can effectively improve the generalization performance of the system.
However, conventional methods perform residual echo cancellation and noise cancellation independently and separately, ignoring the correlation between the two tasks. The residual echo cancellation task also has many available signal features with different physical meanings and different importance, which conventional methods likewise do not take into account. Moreover, when training a residual echo and noise elimination model, the prior art mostly uses the mean square error between the target and estimated magnitude spectra as the loss function; this loss depends on signal energy, and signals with different energies have different scales.
Disclosure of Invention
To address the above problems in existing methods, the present application provides a method and an apparatus for eliminating residual echo and noise.
In a first aspect, an embodiment of the present application provides a residual echo and noise cancellation method, including:
receiving a voice time domain signal containing echo and noise and a far-end reference sound time domain signal;
respectively performing framing, windowing and Fourier transformation on the voice time domain signal containing the echo and the noise and the far-end reference sound time domain signal to obtain a voice frequency domain signal containing the echo and the noise and a far-end reference sound frequency domain signal;
determining an echo frequency domain signal according to the voice frequency domain signal containing the echo and the noise and the far-end reference audio frequency domain signal;
determining a voice frequency domain signal containing residual echo and noise according to the voice frequency domain signal containing echo and noise and the echo frequency domain signal;
performing energy normalization processing on the amplitude spectrum of the voice frequency domain signal containing the residual echo and the noise, the amplitude spectrum of the echo frequency domain signal and the amplitude spectrum of the far-end reference audio frequency domain signal to obtain a voice frequency domain signal characteristic containing the residual echo and the noise, an echo frequency domain signal characteristic and a far-end reference audio frequency domain signal characteristic;
splicing the voice frequency domain signal characteristics containing the residual echo and the noise with the far-end reference audio domain signal characteristics to obtain a first splicing result, and splicing the voice frequency domain signal characteristics containing the residual echo and the noise with the echo frequency domain signal characteristics to obtain a second splicing result;
inputting the first splicing result and the second splicing result into a trained feature attention model in the trained cascade network to obtain a first attention weight corresponding to the far-end reference audio domain signal feature and a second attention weight corresponding to the echo frequency domain signal feature;
multiplying the far-end reference audio frequency domain signal characteristic by a first attention weight to obtain a first fused attention mechanism characteristic, and multiplying the echo frequency domain signal characteristic by a second attention weight to obtain a second fused attention mechanism characteristic;
splicing the first fusion attention mechanism characteristic, the second fusion attention mechanism characteristic and the voice frequency domain signal characteristic containing residual echo and noise to obtain a first fusion splicing result;
inputting the first fusion splicing result into a trained residual echo and noise elimination model in the trained cascade network to obtain a masking estimation value of a target voice frequency domain signal;
obtaining the target voice frequency domain signal according to the masking estimation value of the target voice frequency domain signal and the voice frequency domain signal containing the residual echo and the noise;
and carrying out inverse Fourier transform on the target voice frequency domain signal to obtain a target voice time domain signal.
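The claimed steps form a single signal path. The following is a minimal numpy sketch of that path, not the patent's implementation: it assumes non-overlapping 512-sample frames, and `est_filter` / `mask_model` are hypothetical placeholder callables standing in for the Kalman filter and the trained cascade network.

```python
import numpy as np

FRAME = 512  # sampling points per frame, as specified in the text

def stft(x, frame=FRAME):
    """Frame the signal, zero-pad the last frame, apply a Hamming window, FFT."""
    n = int(np.ceil(len(x) / frame)) * frame
    x = np.pad(x, (0, n - len(x)))
    frames = x.reshape(-1, frame) * np.hamming(frame)
    return np.fft.rfft(frames, axis=1)  # shape (num_frames, 257)

def istft(X, frame=FRAME):
    """Inverse FFT per frame; window compensation is omitted in this sketch."""
    return np.fft.irfft(X, n=frame, axis=1).reshape(-1)

def cancel(y, x, est_filter, mask_model):
    """y: near-end microphone signal (echo + noise), x: far-end reference."""
    Y, X = stft(y), stft(x)
    W = est_filter(Y, X)       # e.g. Kalman-filter coefficients (placeholder)
    C = W * X                  # echo frequency-domain signal
    E = Y - C                  # speech with residual echo and noise
    G = mask_model(E, C, X)    # masking estimate from the cascade network
    return istft(G * E)        # target speech time-domain signal
```

As a sanity check: if the microphone picks up a pure copy of the reference and the filter is exact (W = 1), the output is silence.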
In one possible implementation, the framing, windowing, and fourier transforming the echo and noise containing speech time domain signal and the far-end reference acoustic time domain signal respectively includes:
dividing the voice time domain signal containing the echo and the noise and the far-end reference sound time domain signal into frames of a preset number of sampling points, zero-padding any short frame to the preset number;
windowing each frame signal, using a Hamming window as the window function;
and performing Fourier transform on each windowed frame signal.
In one possible implementation, the determining an echo frequency domain signal according to the echo and noise-containing speech frequency domain signal and the far-end reference audio frequency domain signal includes:
inputting the voice frequency domain signal containing the echo and the noise and the far-end reference audio frequency domain signal into a Kalman filter to obtain a filter coefficient and the echo frequency domain signal;
the echo frequency domain signal is a result of multiplying the filter coefficient and the far-end reference sound frequency domain signal.
In one possible implementation, the determining a voice frequency domain signal containing residual echo and noise according to the voice frequency domain signal containing echo and noise and the echo frequency domain signal includes:
and subtracting the echo frequency domain signal from the voice frequency domain signal containing the echo and the noise to obtain the voice frequency domain signal containing the residual echo and the noise.
In a possible implementation, performing the energy normalization processing on the amplitude spectrum of the voice frequency domain signal containing the residual echo and the noise, the amplitude spectrum of the echo frequency domain signal and the amplitude spectrum of the far-end reference audio domain signal, to obtain the voice frequency domain signal feature containing the residual echo and the noise, the echo frequency domain signal feature and the far-end reference audio domain signal feature, includes:
respectively determining a first function, a second function and a third function corresponding to the amplitude spectrum of the voice frequency domain signal containing the residual echo and the noise, the amplitude spectrum of the echo frequency domain signal and the amplitude spectrum of the far-end reference audio frequency domain signal;
determining the voice frequency domain signal characteristics containing the residual echo and the noise according to a first function corresponding to the amplitude spectrum of the voice frequency domain signal containing the residual echo and the noise, and the mean value and the variance of the voice frequency domain signal characteristics containing the residual echo and the noise;
determining the echo frequency domain signal characteristics according to a second function corresponding to the amplitude spectrum of the echo frequency domain signal, and the mean value and the variance of the echo frequency domain signal characteristics;
and determining the characteristics of the far-end reference audio frequency domain signal according to a third function corresponding to the amplitude spectrum of the far-end reference audio frequency domain signal and the mean value and the variance of the characteristics of the far-end reference audio frequency domain signal.
In one possible implementation, the trained cascade network is trained by:
receiving a first voice time domain signal containing echo and noise, a first far-end reference sound time domain signal and a first target voice time domain signal;
performing framing, windowing and Fourier transformation on the first voice time domain signal containing the echo and the noise, the first far-end reference sound time domain signal and the first target voice time domain signal respectively to obtain a first voice frequency domain signal containing the echo and the noise, a first far-end reference audio frequency domain signal and a first target voice frequency domain signal;
determining a first echo frequency domain signal according to the first voice frequency domain signal containing echo and noise and the first far-end reference audio frequency domain signal;
determining a first voice frequency domain signal containing residual echo and noise according to the first voice frequency domain signal containing echo and noise and the first echo frequency domain signal;
performing energy normalization processing on the amplitude spectrum of the first voice frequency domain signal containing residual echo and noise, the amplitude spectrum of the first echo frequency domain signal and the amplitude spectrum of the first far-end reference audio frequency domain signal to obtain a first voice frequency domain signal characteristic containing residual echo and noise, a first echo frequency domain signal characteristic and a first far-end reference audio frequency domain signal characteristic;
splicing the first voice frequency domain signal characteristic containing residual echo and noise with the first far-end reference audio domain signal characteristic to obtain a first splicing characteristic, and splicing the first voice frequency domain signal characteristic containing residual echo and noise with the first echo frequency domain signal characteristic to obtain a second splicing characteristic;
inputting the first splicing feature and the second splicing feature into a feature attention model in a cascade network so as to jointly train the feature attention model and a residual echo and noise elimination model in the cascade network, and obtaining a first weight corresponding to the first far-end reference audio domain signal feature and a second weight corresponding to the first echo frequency domain signal feature;
multiplying the first far-end reference audio domain signal characteristic by the first weight to obtain a first fused characteristic, and multiplying the first echo frequency domain signal characteristic by the second weight to obtain a second fused characteristic;
splicing the first fusion characteristic, the second fusion characteristic and the first voice frequency domain signal characteristic containing residual echo and noise to obtain a first fusion splicing characteristic;
inputting the first fusion splicing characteristic into a residual echo and noise elimination model in the cascade network to obtain a masking estimation value of a second target voice frequency domain signal;
determining a second target voice frequency domain signal according to the masking estimation value of the second target voice frequency domain signal and the first voice frequency domain signal containing residual echo and noise;
determining a multi-domain loss function from at least two loss functions; wherein the at least two loss functions comprise an energy-independent magnitude spectrum loss function and an objective speech quality assessment score loss function; the energy-independent magnitude spectrum loss function takes the magnitude spectrum of the first target voice frequency domain signal as its training target and is determined from the second target voice frequency domain signal; the objective speech quality assessment score loss function takes the improvement of perceptual speech quality as its training target;
and continuously updating the model parameters to iteratively reduce the multi-domain loss function, obtaining the trained cascade network.
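The multi-domain loss above can be sketched in numpy. This is an illustrative assumption, not the patent's formula: the energy-independent term is realized here as a log-magnitude MSE (one common way to reduce energy sensitivity), and `quality_score_fn` is a hypothetical stand-in for an objective speech quality scorer such as a PESQ-style metric.

```python
import numpy as np

def energy_independent_mag_loss(est_mag, target_mag, eps=1e-8):
    """MSE between log-magnitude spectra. A global gain shifts both log
    spectra by a constant, so this is far less sensitive to absolute signal
    energy than raw-magnitude MSE (this normalization choice is an assumption)."""
    return np.mean((np.log(est_mag + eps) - np.log(target_mag + eps)) ** 2)

def multi_domain_loss(est_mag, target_mag, quality_score_fn, w=(1.0, 0.1)):
    """Combine the magnitude loss with an objective-quality penalty.
    quality_score_fn (hypothetical) returns a score where higher is better,
    so it enters the loss with a negative sign; w holds assumed weights."""
    mag = energy_independent_mag_loss(est_mag, target_mag)
    quality = -quality_score_fn(est_mag, target_mag)
    return w[0] * mag + w[1] * quality
```

With a perfect estimate the magnitude term vanishes, and scaling the estimate's energy produces a nonzero but bounded penalty.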
In a second aspect, an embodiment of the present application further provides a residual echo and noise cancellation apparatus, including:
the receiving module is used for receiving a voice time domain signal containing echo and noise and a far-end reference sound time domain signal;
the processing module is used for respectively performing framing, windowing and Fourier transform on the voice time domain signal containing the echo and the noise and the far-end reference sound time domain signal to obtain a voice frequency domain signal containing the echo and the noise and a far-end reference sound frequency domain signal;
the determining module is used for determining an echo frequency domain signal according to the voice frequency domain signal containing the echo and the noise and the far-end reference audio frequency domain signal;
the determining module is further configured to determine a voice frequency domain signal containing residual echo and noise according to the voice frequency domain signal containing echo and noise and the echo frequency domain signal;
the energy normalization module is used for carrying out energy normalization processing on the amplitude spectrum of the voice frequency domain signal containing the residual echo and the noise, the amplitude spectrum of the echo frequency domain signal and the amplitude spectrum of the far-end reference audio frequency domain signal to obtain a voice frequency domain signal characteristic containing the residual echo and the noise, an echo frequency domain signal characteristic and a far-end reference audio frequency domain signal characteristic;
the splicing module is used for splicing the voice frequency domain signal characteristics containing the residual echo and the noise with the far-end reference audio domain signal characteristics to obtain a first splicing result, and splicing the voice frequency domain signal characteristics containing the residual echo and the noise with the echo frequency domain signal characteristics to obtain a second splicing result;
a weight obtaining module, configured to input the first splicing result and the second splicing result into a trained feature attention model in the trained cascade network, and obtain a first attention weight corresponding to the far-end reference audio-domain signal feature and a second attention weight corresponding to the echo frequency-domain signal feature;
a fused attention mechanism feature obtaining module, configured to multiply the far-end reference audio-domain signal feature by a first attention weight to obtain a first fused attention mechanism feature, and multiply the echo frequency-domain signal feature by a second attention weight to obtain a second fused attention mechanism feature;
the splicing module is further configured to splice the first fusion attention mechanism feature, the second fusion attention mechanism feature and the voice frequency domain signal feature containing the residual echo and the noise to obtain a first fusion splicing result;
a masking estimation value obtaining module, configured to input the first fusion splicing result into the trained residual echo and noise cancellation model in the trained cascade network to obtain the masking estimation value of the target voice frequency domain signal;
a target voice frequency domain signal obtaining module, configured to obtain a target voice frequency domain signal according to the masking estimation value of the target voice frequency domain signal and the voice frequency domain signal containing the residual echo and the noise;
and the inverse Fourier transform module is used for performing inverse Fourier transform on the target voice frequency domain signal to obtain a target voice time domain signal.
In a third aspect, an embodiment of the present application further provides a residual echo and noise cancellation apparatus, including at least one processor, where the processor is configured to execute a program stored in a memory, and when the program is executed, the apparatus is caused to perform the steps as in the first aspect and in various possible implementations.
In a fourth aspect, embodiments of the present application further propose a non-transitory computer-readable storage medium having stored thereon a computer program, which when executed by a processor implements the steps as in the first aspect and various possible implementations.
The beneficial effect of the embodiment of the application is that the residual echo and the noise do not need to be eliminated separately, but are eliminated once through the trained cascade network. The trained feature attention model is used for endowing the input features with different importance, redundant information in the input features is reduced, and the performance of eliminating residual echo and noise of the cascade network is improved. The multi-domain loss function combining the energy-independent amplitude spectrum loss function and the objective voice quality assessment score loss function is used for training the cascade network, the sensitivity of the model to signal energy is reduced, and the auditory perception quality of output voice is improved.
Drawings
To illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below are obviously only some embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic process diagram of training a cascade network capable of eliminating residual echo and noise according to an embodiment of the present disclosure;
fig. 2 is a schematic flowchart illustrating a process of eliminating residual echo and noise by using a post-training cascade network according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of a residual echo and noise cancellation device according to an embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application. The following examples are only for illustrating the technical solutions of the present application more clearly, and the protection scope of the present application is not limited thereby.
In the traditional method, residual echo and noise elimination are always independently and separately carried out. The importance of each of the plurality of signal features is not taken into account in the residual echo suppression task. When training a residual echo and noise elimination model, the mean square error of a target amplitude spectrum and an estimated amplitude spectrum is mostly adopted as a loss function. The loss function depends on the magnitude of the signal energy. Therefore, the application provides a method and a device for eliminating residual echo and noise, which can endow different importance to a plurality of signal characteristics, reduce redundant information in the plurality of signal characteristics, and simultaneously, adopt a multi-domain loss function to train a cascade network, thereby reducing the sensitivity of a model to signal energy.
In the embodiment of the present application, a schematic process diagram of training a cascade network capable of eliminating residual echo and noise is shown in fig. 1, and includes: S101-S113; the cascade network comprises a feature attention model and a residual echo and noise elimination model.
The feature attention model consists of one gated recurrent unit (GRU) layer connected to one fully connected layer. The GRU layer has 200 hidden nodes, the fully connected output layer has 257 nodes, and each neuron uses the Sigmoid activation function.
The residual echo and noise elimination model consists of two GRU layers connected to one fully connected layer. The GRU layers have 400 hidden nodes, the fully connected output layer has 257 nodes, and each neuron uses the Sigmoid activation function.
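Both models share the same building blocks. Below is a minimal numpy sketch of one standard GRU step followed by a Sigmoid output layer, using the feature attention model's sizes from the text (200 hidden units, 257 outputs); the input width of 514 (two spliced 257-bin features) and all weights are illustrative assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x, h, W, U, b):
    """One GRU time step (standard GRU equations); W, U, b hold the
    update ('z'), reset ('r') and candidate ('h') parameters."""
    z = sigmoid(W["z"] @ x + U["z"] @ h + b["z"])        # update gate
    r = sigmoid(W["r"] @ x + U["r"] @ h + b["r"])        # reset gate
    h_tilde = np.tanh(W["h"] @ x + U["h"] @ (r * h) + b["h"])
    return (1.0 - z) * h + z * h_tilde                   # new hidden state

rng = np.random.default_rng(0)
IN, HID, OUT = 514, 200, 257  # input width 514 is an assumption (two 257-bin features)
W = {k: 0.1 * rng.standard_normal((HID, IN)) for k in "zrh"}
U = {k: 0.1 * rng.standard_normal((HID, HID)) for k in "zrh"}
b = {k: np.zeros(HID) for k in "zrh"}
Wo, bo = 0.1 * rng.standard_normal((OUT, HID)), np.zeros(OUT)

h = gru_step(rng.standard_normal(IN), np.zeros(HID), W, U, b)
weights = sigmoid(Wo @ h + bo)  # per-frequency-bin outputs in (0, 1)
```

The Sigmoid output keeps every value strictly between 0 and 1, which is what makes the outputs usable as attention weights or masking values.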
S101, receiving a first voice time domain signal containing echo and noise, a first far-end reference sound time domain signal and a first target voice time domain signal. The first far-end reference sound time domain signal is subjected to nonlinear transformation and then convolved with a corresponding room transfer function to form an echo time domain signal in the first voice time domain signal containing echo and noise.
S102, framing and windowing the received first voice time domain signal containing echo and noise, the first far-end reference sound time domain signal and the first target voice time domain signal. Specifically, each signal is divided into frames of 512 sampling points, zero-padding to 512 points when a frame is short; each frame signal is then windowed, using a Hamming window as the window function. Fourier transforming each windowed frame signal yields the first voice frequency domain signal containing echo and noise, the first far-end reference audio domain signal and the first target voice frequency domain signal.
S103, inputting the first voice frequency domain signal containing echo and noise and the first far-end reference audio frequency domain signal into a Kalman filter, and estimating a first filter coefficient and a first echo frequency domain signal in real time. Wherein, the first echo frequency domain signal estimated by the Kalman filter is:
C(k,f)=W(k,f)*X(k,f)
where W(k,f) is the first filter coefficient, X(k,f) is the first far-end reference audio domain signal, and k and f denote the k-th frame and frequency f, respectively.
S104, subtracting the first echo frequency domain signal from the first voice frequency domain signal containing echo and noise to obtain a first voice frequency domain signal containing residual echo and noise, and using the first voice frequency domain signal containing residual echo and noise as an output result of the Kalman filter. The first voice frequency domain signal containing residual echo and noise is:
E(k,f)=Y(k,f)-C(k,f)
wherein, Y (k, f) is the first speech frequency domain signal containing echo and noise, C (k, f) is the first echo frequency domain signal, and k and f represent the k-th frame and frequency f respectively.
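The two equations above (echo estimate C = W·X and error E = Y − C) are the core of the adaptive stage. The sketch below uses a per-bin NLMS update as a simplified stand-in for the Kalman filter named in the text — it shows the same estimate/subtract/adapt structure but is not the Kalman recursion itself.

```python
import numpy as np

def nlms_step(W, X, Y, mu=0.5, eps=1e-8):
    """One frequency-domain adaptive-filter step per bin (NLMS, shown as a
    simplified stand-in for the Kalman filter in the text).
    X, Y: current far-end and microphone spectra; W: filter coefficients."""
    C = W * X                 # echo estimate   C(k,f) = W(k,f) * X(k,f)
    E = Y - C                 # error signal    E(k,f) = Y(k,f) - C(k,f)
    W = W + mu * np.conj(X) * E / (np.abs(X) ** 2 + eps)  # normalized update
    return W, C, E
```

If the microphone signal is a pure scaled echo, Y = g·X, the coefficients converge to g and the error signal vanishes.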
S105, performing energy normalization processing on the magnitude spectrum of the first voice frequency domain signal containing residual echo and noise, the magnitude spectrum of the first echo frequency domain signal and the magnitude spectrum of the first far-end reference audio domain signal, to obtain the first voice frequency domain signal feature g_FD(f(|E(k,f)|)) containing residual echo and noise, the first echo frequency domain signal feature g_FD(f(|C(k,f)|)) and the first far-end reference audio domain signal feature g_FD(f(|X(k,f)|)). Wherein,
the mean of the first voice frequency domain signal feature containing residual echo and noise is defined recursively as:
μ_f(e)(k,f) = c₁·μ_f(e)(k−1,f) + (1−c₁)·f(|E(k,f)|)
the mean of the first echo frequency domain signal feature is defined recursively as:
μ_f(c)(k,f) = c₁·μ_f(c)(k−1,f) + (1−c₁)·f(|C(k,f)|)
the mean of the first far-end reference audio domain signal feature is defined recursively as:
μ_f(x)(k,f) = c₁·μ_f(x)(k−1,f) + (1−c₁)·f(|X(k,f)|)
|E(k,f)|, |C(k,f)| and |X(k,f)| denote the magnitude spectrum of the first voice frequency domain signal containing residual echo and noise, the magnitude spectrum of the first echo frequency domain signal and the magnitude spectrum of the first far-end reference audio domain signal, respectively; c₁ is a preset constant. The variance of each feature is maintained with a similar recursive update.
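The recursive-mean update can be sketched directly in numpy. The variance recursion is not reproduced in the text, so the analogous exponential update below is an assumption, as is the final (feature − mean)/std normalization step.

```python
import numpy as np

def energy_normalize(feat, mu, var, c1=0.99, eps=1e-8):
    """One-frame energy normalization of a magnitude-spectrum feature f(|.|).
    mu follows the recursion given in the text; the variance update and the
    (feat - mu) / std output are assumed forms for illustration."""
    mu = c1 * mu + (1.0 - c1) * feat          # mu(k,f) = c1*mu(k-1,f) + (1-c1)*feat
    var = c1 * var + (1.0 - c1) * (feat - mu) ** 2   # assumed variance recursion
    return (feat - mu) / np.sqrt(var + eps), mu, var
```

The statistics `mu` and `var` are carried across frames, so each of the three features keeps its own running pair.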
And S106, splicing the first voice frequency domain signal characteristic containing the residual echo and the noise with the first far-end reference audio domain signal characteristic to obtain a first splicing characteristic, and splicing the first voice frequency domain signal characteristic containing the residual echo and the noise with the first echo frequency domain signal characteristic to obtain a second splicing characteristic.
S107, inputting the first splicing feature and the second splicing feature into a feature attention model in the cascade network, so as to jointly train the feature attention model and the residual echo and noise elimination model in the cascade network, and obtain a first weight α(k,f) corresponding to the first far-end reference audio domain signal feature and a second weight β(k,f) corresponding to the first echo frequency domain signal feature.
S108, multiplying the first far-end reference audio domain signal feature by the first weight to obtain a first fused feature:
X_att(k,f) = g_FD(f(|X(k,f)|))·α(k,f)
and multiplying the first echo frequency domain signal feature by the second weight to obtain a second fused feature:
C_att(k,f) = g_FD(f(|C(k,f)|))·β(k,f).
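The weighting of S108 and the splicing of S109 amount to element-wise multiplication followed by concatenation. A minimal sketch (concatenation along the last feature axis is an assumption about how "splicing" is realized):

```python
import numpy as np

def fuse_and_splice(e_feat, x_feat, c_feat, alpha, beta):
    """S108-S109: weight the far-end reference and echo features by the
    attention weights, then splice with the residual-echo-and-noise
    speech feature. Last-axis concatenation is an assumption."""
    x_att = x_feat * alpha   # X_att(k,f) = g_FD(f(|X(k,f)|)) * alpha(k,f)
    c_att = c_feat * beta    # C_att(k,f) = g_FD(f(|C(k,f)|)) * beta(k,f)
    return np.concatenate([x_att, c_att, e_feat], axis=-1)
```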
S109, splicing the first fused feature X_att(k,f), the second fused feature C_att(k,f) and the first voice frequency domain signal feature g_FD(f(|E(k,f)|)) containing residual echo and noise to obtain a first fusion splicing feature.
S110, inputting the first fusion splicing feature into the residual echo and noise elimination model in the cascade network; the output of the residual echo and noise elimination model is a masking estimation value G(k,f) of a second target voice frequency domain signal.
S111, using the masking estimation value G(k,f) of the second target voice frequency domain signal to enhance the first voice frequency domain signal containing residual echo and noise, and obtaining a second target voice frequency domain signal:
Ŝ(k,f) = G(k,f)·E(k,f)
S112, determining a multi-domain loss function according to the at least two loss functions. For example, using the amplitude spectrum of the first target voice frequency domain signal as a training target, an energy-independent amplitude spectrum loss function L_mag is determined according to the second target voice frequency domain signal;
and an objective speech quality evaluation score loss function L_pesq is determined by taking the improvement of speech audibility quality as a training target. Wherein S(k,f) is the first target voice frequency domain signal and Ŝ(k,f) is the second target voice frequency domain signal. Weighting and adding the energy-independent amplitude spectrum loss function and the objective voice quality evaluation score loss function, the multi-domain loss function is determined as:
L = L_mag + λ·L_pesq
wherein λ is a preset constant.
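The weighted combination of the two losses can be sketched as follows. The text does not give the exact form of either loss, so the norm-normalized squared error below (which removes overall energy from the comparison) and the perceptual_score_loss placeholder are illustrative assumptions.

```python
import numpy as np

def magnitude_loss(S, S_hat, eps=1e-8):
    """Energy-independent magnitude spectrum loss: compare magnitude
    spectra after normalizing each to unit norm, so overall signal
    energy does not affect the loss (exact form is an assumption)."""
    a = np.abs(S) / (np.linalg.norm(np.abs(S)) + eps)
    b = np.abs(S_hat) / (np.linalg.norm(np.abs(S_hat)) + eps)
    return np.mean((a - b) ** 2)

def multi_domain_loss(S, S_hat, perceptual_score_loss, lam=0.5):
    """Weighted sum L = L_mag + lambda * L_pesq, with lambda a preset
    constant; perceptual_score_loss stands in for the objective speech
    quality evaluation score loss term."""
    return magnitude_loss(S, S_hat) + lam * perceptual_score_loss
```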
S113, iteratively reducing the multi-domain loss function by continuously updating model parameters to obtain the trained cascade network. The trained cascade network comprises a trained feature attention model and a trained residual echo and noise elimination model.
In the embodiment of the present application, a schematic flow chart of using the trained cascade network to eliminate residual echo and noise is shown in fig. 2, and includes S201-S212:
S201, receiving a voice time domain signal containing echo and noise and a far-end reference sound time domain signal. The far-end reference sound time domain signal is subjected to nonlinear transformation and then convolved with a corresponding room transfer function to form the echo time domain signal in the voice time domain signal containing the echo and the noise.
S202, framing and windowing are respectively carried out on the voice time domain signal containing the echo and the noise and the far-end reference sound time domain signal. Specifically, 512 sampling points are respectively taken as one frame signal for the received voice time domain signal containing echo and noise and the far-end reference sound time domain signal; if a frame is shorter than 512 points, it is first zero-padded to 512 points. Each frame signal is then windowed, with a Hamming window as the windowing function. Fourier transform is performed on each windowed frame signal to obtain a voice frequency domain signal containing echo and noise and a far-end reference audio frequency domain signal.
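The framing, zero-padding, windowing and Fourier transform of S202 can be sketched as follows. The 512-point frame length and Hamming window follow the text; the hop length and the helper name are illustrative assumptions.

```python
import numpy as np

FRAME_LEN = 512  # sampling points per frame, as in S202

def frames_to_spectra(signal, hop=256):
    """Split a time-domain signal into frames, zero-pad short frames to
    FRAME_LEN, apply a Hamming window, and Fourier-transform each frame."""
    window = np.hamming(FRAME_LEN)
    spectra = []
    for start in range(0, len(signal), hop):
        frame = signal[start:start + FRAME_LEN]
        if len(frame) < FRAME_LEN:  # pad a short tail frame with zeros
            frame = np.pad(frame, (0, FRAME_LEN - len(frame)))
        spectra.append(np.fft.rfft(frame * window))
    return np.array(spectra)        # shape: (num_frames, FRAME_LEN // 2 + 1)
```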
S203, inputting the voice frequency domain signal containing the echo and the noise and the far-end reference audio frequency domain signal into a Kalman filter, and estimating a filter coefficient and an echo frequency domain signal in real time. The echo frequency domain signal estimated by the Kalman filter is as follows:
C3(k,f)=W3(k,f)*X3(k,f)
wherein W3(k,f) is the filter coefficient, X3(k,f) is the far-end reference audio frequency domain signal, and k and f represent the k-th frame and the frequency f, respectively.
S204, subtracting the echo frequency domain signal from the voice frequency domain signal containing echo and noise to obtain a voice frequency domain signal containing residual echo and noise:
E3(k,f)=Y3(k,f)-C3(k,f)
wherein Y3(k,f) is the voice frequency domain signal containing echo and noise, C3(k,f) is the echo frequency domain signal, and k and f respectively represent the k-th frame and the frequency f.
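Steps S203-S204 estimate the echo spectrum as C3(k,f) = W3(k,f)·X3(k,f) and subtract it from Y3(k,f). The Kalman recursion for W3 is not detailed in the text, so the sketch below substitutes a per-frequency NLMS coefficient update as a stand-in, keeping the same filtering and subtraction structure.

```python
import numpy as np

def cancel_echo(Y, X, mu=0.5, eps=1e-8):
    """Per-frame, per-frequency adaptive echo cancellation:
    C3(k,f) = W3(k,f) * X3(k,f),  E3(k,f) = Y3(k,f) - C3(k,f).
    The NLMS update of W is an illustrative stand-in for the
    Kalman filter recursion referenced in the text."""
    num_frames, num_bins = Y.shape
    W = np.zeros(num_bins, dtype=complex)
    E = np.empty_like(Y)
    for k in range(num_frames):
        C = W * X[k]        # estimated echo spectrum
        E[k] = Y[k] - C     # residual echo + noise (+ near-end speech)
        # normalized LMS step toward reducing |E| (stand-in update)
        W += mu * np.conj(X[k]) * E[k] / (np.abs(X[k]) ** 2 + eps)
    return E, W
```

On a pure echo (Y exactly W0·X), the coefficients converge to W0 and the residual E shrinks toward zero.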
S205, performing energy normalization processing on the amplitude spectrum of the voice frequency domain signal containing the residual echo and the noise, the amplitude spectrum of the echo frequency domain signal and the amplitude spectrum of the far-end reference audio frequency domain signal to obtain a voice frequency domain signal feature g_FD(f(|E3(k,f)|)) containing the residual echo and the noise, an echo frequency domain signal feature g_FD(f(|C3(k,f)|)) and a far-end reference audio domain signal feature g_FD(f(|X3(k,f)|)). Wherein,
the mean and variance of the voice frequency domain signal feature containing residual echo and noise are respectively defined as:
μ_f(e3)(k,f) = c2·μ_f(e3)(k-1,f) + (1-c2)·f(|E3(k,f)|)
the mean and variance of the echo frequency domain signal feature are respectively defined as:
μ_f(c3)(k,f) = c2·μ_f(c3)(k-1,f) + (1-c2)·f(|C3(k,f)|)
the mean and variance of the far-end reference audio domain signal feature are respectively defined as:
μ_f(x3)(k,f) = c2·μ_f(x3)(k-1,f) + (1-c2)·f(|X3(k,f)|)
|E3(k,f)|, |C3(k,f)| and |X3(k,f)| respectively represent the amplitude spectrum of the voice frequency domain signal containing residual echo and noise, the amplitude spectrum of the echo frequency domain signal and the amplitude spectrum of the far-end reference audio frequency domain signal, and c2 is a preset constant.
And S206, splicing the voice frequency domain signal characteristics containing the residual echo and the noise with the far-end reference audio domain signal characteristics to obtain a first splicing result, and splicing the voice frequency domain signal characteristics containing the residual echo and the noise with the echo frequency domain signal characteristics to obtain a second splicing result.
S207, inputting the first splicing result and the second splicing result into the trained feature attention model in the trained cascade network to obtain a first attention weight α3(k,f) corresponding to the far-end reference audio domain signal feature and a second attention weight β3(k,f) corresponding to the echo frequency domain signal feature.
S208, multiplying the far-end reference audio domain signal feature by the first attention weight α3(k,f) to obtain a first fused attention mechanism feature:
X3_att(k,f) = g_FD(f(|X3(k,f)|))·α3(k,f)
and multiplying the echo frequency domain signal feature by the second attention weight β3(k,f) to obtain a second fused attention mechanism feature:
C3_att(k,f) = g_FD(f(|C3(k,f)|))·β3(k,f).
and S209, splicing the first fusion attention mechanism characteristic, the second fusion attention mechanism characteristic and the voice frequency domain signal characteristic containing the residual echo and the noise to obtain a first fusion splicing result.
S210, inputting the first fusion splicing result into the trained residual echo and noise elimination model in the trained cascade network to obtain a masking estimation value G3(k,f) of the target voice frequency domain signal.
S211, multiplying the masking estimation value G3(k,f) of the target voice frequency domain signal by the voice frequency domain signal E3(k,f) containing residual echo and noise to obtain the target voice frequency domain signal:
Ŝ3(k,f) = G3(k,f)·E3(k,f)
and S212, performing inverse Fourier transform on the target voice frequency domain signal to obtain a target voice time domain signal.
In the embodiment of the application, the residual echo and the noise do not need to be eliminated separately in two independent stages; they are eliminated in one pass through the trained cascade network. The trained feature attention model assigns different importance to the input features, reduces redundant information in the input features, and improves the residual echo and noise elimination performance of the cascade network. Training the cascade network with a multi-domain loss function that combines the energy-independent amplitude spectrum loss function and the objective voice quality assessment score loss function reduces the sensitivity of the model to signal energy and improves the auditory perception quality of the output voice.
An embodiment of the present application provides a residual echo and noise cancellation device, a schematic structural diagram of which is shown in fig. 3, including:
a receiving module 301, a processing module 302, a determining module 303, an energy normalization module 304, a splicing module 305, a weight obtaining module 306, a fused attention mechanism feature obtaining module 307, a masking estimation value obtaining module 308, a target voice frequency domain signal obtaining module 309 and an inverse Fourier transform module 310;
a receiving module 301, configured to receive a voice time domain signal containing echo and noise and a far-end reference sound time domain signal;
a processing module 302, configured to perform framing, windowing, and Fourier transform on the voice time-domain signal containing the echo and the noise and the far-end reference sound time-domain signal, respectively, to obtain a voice frequency-domain signal containing the echo and the noise and a far-end reference audio frequency-domain signal;
a determining module 303, configured to determine an echo frequency domain signal according to the voice frequency domain signal containing echo and noise and the far-end reference audio frequency domain signal;
the determining module 303 is further configured to determine a voice frequency domain signal containing residual echo and noise according to the voice frequency domain signal containing echo and noise and the echo frequency domain signal;
an energy normalization module 304, configured to perform energy normalization processing on the magnitude spectrum of the voice frequency domain signal containing the residual echo and the noise, the magnitude spectrum of the echo frequency domain signal, and the magnitude spectrum of the far-end reference audio domain signal, so as to obtain a voice frequency domain signal feature containing the residual echo and the noise, an echo frequency domain signal feature, and a far-end reference audio domain signal feature;
a splicing module 305, configured to splice the voice frequency domain signal feature containing the residual echo and the noise with the far-end reference audio domain signal feature to obtain a first splicing result, and splice the voice frequency domain signal feature containing the residual echo and the noise with the echo frequency domain signal feature to obtain a second splicing result;
a weight obtaining module 306, configured to input the first splicing result and the second splicing result into a trained feature attention model in the trained cascade network, and obtain a first attention weight corresponding to the far-end reference audio-domain signal feature and a second attention weight corresponding to the echo frequency-domain signal feature;
a fused attention mechanism feature obtaining module 307, configured to multiply the far-end reference audio domain signal feature by a first attention weight to obtain a first fused attention mechanism feature, and multiply the echo frequency domain signal feature by a second attention weight to obtain a second fused attention mechanism feature;
the splicing module 305 is further configured to splice the first fusion attention mechanism feature, the second fusion attention mechanism feature, and the voice frequency domain signal feature containing the residual echo and the noise to obtain a first fusion splicing result;
a masking estimation value obtaining module 308, configured to input the first fusion splicing result into a post-training residual echo and noise cancellation model in the post-training cascade network, so as to obtain a masking estimation value of the target voice frequency domain signal;
a target voice frequency domain signal obtaining module 309, configured to obtain the target voice frequency domain signal according to the masking estimation value of the target voice frequency domain signal and the voice frequency domain signal containing the residual echo and the noise;
and an inverse Fourier transform module 310, configured to perform inverse Fourier transform on the target voice frequency domain signal to obtain a target voice time domain signal.
An embodiment of the present application provides a residual echo and noise cancellation apparatus, including at least one processor, where the processor is configured to execute a program stored in a memory, and when the program is executed, the apparatus is enabled to perform the following steps:
receiving a voice time domain signal containing echo and noise and a far-end reference sound time domain signal;
respectively performing framing, windowing and Fourier transformation on the voice time domain signal containing the echo and the noise and the far-end reference sound time domain signal to obtain a voice frequency domain signal containing the echo and the noise and a far-end reference sound frequency domain signal;
determining an echo frequency domain signal according to the voice frequency domain signal containing the echo and the noise and the far-end reference audio frequency domain signal;
determining a voice frequency domain signal containing residual echo and noise according to the voice frequency domain signal containing echo and noise and the echo frequency domain signal;
performing energy normalization processing on the amplitude spectrum of the voice frequency domain signal containing the residual echo and the noise, the amplitude spectrum of the echo frequency domain signal and the amplitude spectrum of the far-end reference audio frequency domain signal to obtain a voice frequency domain signal characteristic containing the residual echo and the noise, an echo frequency domain signal characteristic and a far-end reference audio frequency domain signal characteristic;
splicing the voice frequency domain signal characteristics containing the residual echo and the noise with the far-end reference audio domain signal characteristics to obtain a first splicing result, and splicing the voice frequency domain signal characteristics containing the residual echo and the noise with the echo frequency domain signal characteristics to obtain a second splicing result;
inputting the first splicing result and the second splicing result into a trained feature attention model in the trained cascade network to obtain a first attention weight corresponding to the far-end reference audio domain signal feature and a second attention weight corresponding to the echo frequency domain signal feature;
multiplying the far-end reference audio frequency domain signal characteristic by a first attention weight to obtain a first fused attention mechanism characteristic, and multiplying the echo frequency domain signal characteristic by a second attention weight to obtain a second fused attention mechanism characteristic;
splicing the first fusion attention mechanism characteristic, the second fusion attention mechanism characteristic and the voice frequency domain signal characteristic containing residual echo and noise to obtain a first fusion splicing result;
inputting the first fusion splicing result into a trained residual echo and noise elimination model in the trained cascade network to obtain a masking estimation value of a target voice frequency domain signal;
obtaining the target voice frequency domain signal according to the masking estimation value of the target voice frequency domain signal and the voice frequency domain signal containing the residual echo and the noise;
and carrying out inverse Fourier transform on the target voice frequency domain signal to obtain a target voice time domain signal.
An embodiment of the application provides a non-transitory computer readable storage medium, on which a computer program is stored, the computer program, when executed by a processor, implementing the steps of:
receiving a voice time domain signal containing echo and noise and a far-end reference sound time domain signal;
respectively performing framing, windowing and Fourier transformation on the voice time domain signal containing the echo and the noise and the far-end reference sound time domain signal to obtain a voice frequency domain signal containing the echo and the noise and a far-end reference sound frequency domain signal;
determining an echo frequency domain signal according to the voice frequency domain signal containing the echo and the noise and the far-end reference audio frequency domain signal;
determining a voice frequency domain signal containing residual echo and noise according to the voice frequency domain signal containing echo and noise and the echo frequency domain signal;
performing energy normalization processing on the amplitude spectrum of the voice frequency domain signal containing the residual echo and the noise, the amplitude spectrum of the echo frequency domain signal and the amplitude spectrum of the far-end reference audio frequency domain signal to obtain a voice frequency domain signal characteristic containing the residual echo and the noise, an echo frequency domain signal characteristic and a far-end reference audio frequency domain signal characteristic;
splicing the voice frequency domain signal characteristics containing the residual echo and the noise with the far-end reference audio domain signal characteristics to obtain a first splicing result, and splicing the voice frequency domain signal characteristics containing the residual echo and the noise with the echo frequency domain signal characteristics to obtain a second splicing result;
inputting the first splicing result and the second splicing result into a trained feature attention model in the trained cascade network to obtain a first attention weight corresponding to the far-end reference audio domain signal feature and a second attention weight corresponding to the echo frequency domain signal feature;
multiplying the far-end reference audio frequency domain signal characteristic by a first attention weight to obtain a first fused attention mechanism characteristic, and multiplying the echo frequency domain signal characteristic by a second attention weight to obtain a second fused attention mechanism characteristic;
splicing the first fusion attention mechanism characteristic, the second fusion attention mechanism characteristic and the voice frequency domain signal characteristic containing residual echo and noise to obtain a first fusion splicing result;
inputting the first fusion splicing result into a trained residual echo and noise elimination model in the trained cascade network to obtain a masking estimation value of a target voice frequency domain signal;
obtaining the target voice frequency domain signal according to the masking estimation value of the target voice frequency domain signal and the voice frequency domain signal containing the residual echo and the noise;
and carrying out inverse Fourier transform on the target voice frequency domain signal to obtain a target voice time domain signal.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
It should be noted that: the above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.
Claims (9)
1. A method for residual echo and noise cancellation, comprising:
receiving a voice time domain signal containing echo and noise and a far-end reference sound time domain signal;
respectively performing framing, windowing and Fourier transformation on the voice time domain signal containing the echo and the noise and the far-end reference sound time domain signal to obtain a voice frequency domain signal containing the echo and the noise and a far-end reference sound frequency domain signal;
determining an echo frequency domain signal according to the voice frequency domain signal containing the echo and the noise and the far-end reference audio frequency domain signal;
determining a voice frequency domain signal containing residual echo and noise according to the voice frequency domain signal containing echo and noise and the echo frequency domain signal;
performing energy normalization processing on the amplitude spectrum of the voice frequency domain signal containing the residual echo and the noise, the amplitude spectrum of the echo frequency domain signal and the amplitude spectrum of the far-end reference audio frequency domain signal to obtain a voice frequency domain signal characteristic containing the residual echo and the noise, an echo frequency domain signal characteristic and a far-end reference audio frequency domain signal characteristic;
splicing the voice frequency domain signal characteristics containing the residual echo and the noise with the far-end reference audio domain signal characteristics to obtain a first splicing result, and splicing the voice frequency domain signal characteristics containing the residual echo and the noise with the echo frequency domain signal characteristics to obtain a second splicing result;
inputting the first splicing result and the second splicing result into a trained feature attention model in the trained cascade network to obtain a first attention weight corresponding to the far-end reference audio domain signal feature and a second attention weight corresponding to the echo frequency domain signal feature;
multiplying the far-end reference audio frequency domain signal characteristic by a first attention weight to obtain a first fused attention mechanism characteristic, and multiplying the echo frequency domain signal characteristic by a second attention weight to obtain a second fused attention mechanism characteristic;
splicing the first fusion attention mechanism characteristic, the second fusion attention mechanism characteristic and the voice frequency domain signal characteristic containing residual echo and noise to obtain a first fusion splicing result;
inputting the first fusion splicing result into a trained residual echo and noise elimination model in the trained cascade network to obtain a masking estimation value of a target voice frequency domain signal;
obtaining the target voice frequency domain signal according to the masking estimation value of the target voice frequency domain signal and the voice frequency domain signal containing the residual echo and the noise;
and carrying out inverse Fourier transform on the target voice frequency domain signal to obtain a target voice time domain signal.
2. The method of claim 1, wherein the framing, windowing, and fourier transforming the echo and noise containing speech time domain signal and the far-end reference acoustic time domain signal, respectively, comprises:
respectively taking a preset number of sampling points as one frame signal for the voice time domain signal containing the echo and the noise and the far-end reference sound time domain signal; if a frame is shorter than the preset number, it is first zero-padded to the preset number;
windowing each frame signal; wherein, the windowing function adopts a Hamming window;
and performing Fourier transform on each windowed frame signal.
3. The method according to claim 1, wherein determining an echo frequency domain signal from the echo and noise containing speech frequency domain signal and the far-end reference audio frequency domain signal comprises:
inputting the voice frequency domain signal containing the echo and the noise and the far-end reference audio frequency domain signal into a Kalman filter to obtain a filter coefficient and the echo frequency domain signal;
the echo frequency domain signal is a result of multiplying the filter coefficient and the far-end reference sound frequency domain signal.
4. The method of claim 1, wherein determining the voice frequency domain signal containing the residual echo and the noise according to the voice frequency domain signal containing the echo and the noise and the echo frequency domain signal comprises:
and subtracting the echo frequency domain signal from the voice frequency domain signal containing the echo and the noise to obtain the voice frequency domain signal containing the residual echo and the noise.
5. The method according to claim 1, wherein the energy normalization processing is performed on the magnitude spectrum of the speech frequency domain signal containing the residual echo and the noise, the magnitude spectrum of the echo frequency domain signal, and the magnitude spectrum of the far-end reference audio domain signal to obtain the speech frequency domain signal characteristic containing the residual echo and the noise, the echo frequency domain signal characteristic, and the far-end reference audio domain signal characteristic, and comprises:
respectively determining a first function, a second function and a third function corresponding to the amplitude spectrum of the voice frequency domain signal containing the residual echo and the noise, the amplitude spectrum of the echo frequency domain signal and the amplitude spectrum of the far-end reference audio frequency domain signal;
determining the voice frequency domain signal characteristics containing the residual echo and the noise according to a first function corresponding to the amplitude spectrum of the voice frequency domain signal containing the residual echo and the noise, and the mean value and the variance of the voice frequency domain signal characteristics containing the residual echo and the noise;
determining the echo frequency domain signal characteristics according to a second function corresponding to the amplitude spectrum of the echo frequency domain signal, and the mean value and the variance of the echo frequency domain signal characteristics;
and determining the characteristics of the far-end reference audio frequency domain signal according to a third function corresponding to the amplitude spectrum of the far-end reference audio frequency domain signal and the mean value and the variance of the characteristics of the far-end reference audio frequency domain signal.
6. The method of claim 1, wherein the trained cascade network is trained by:
receiving a first voice time domain signal containing echo and noise, a first far-end reference sound time domain signal and a first target voice time domain signal;
performing framing, windowing and Fourier transformation on the first voice time domain signal containing the echo and the noise, the first far-end reference sound time domain signal and the first target voice time domain signal respectively to obtain a first voice frequency domain signal containing the echo and the noise, a first far-end reference audio frequency domain signal and a first target voice frequency domain signal;
determining a first echo frequency domain signal according to the first voice frequency domain signal containing echo and noise and the first far-end reference audio frequency domain signal;
determining a first voice frequency domain signal containing residual echo and noise according to the first voice frequency domain signal containing echo and noise and the first echo frequency domain signal;
performing energy normalization processing on the amplitude spectrum of the first voice frequency domain signal containing residual echo and noise, the amplitude spectrum of the first echo frequency domain signal and the amplitude spectrum of the first far-end reference audio frequency domain signal to obtain a first voice frequency domain signal characteristic containing residual echo and noise, a first echo frequency domain signal characteristic and a first far-end reference audio frequency domain signal characteristic;
splicing the first voice frequency domain signal characteristic containing residual echo and noise with the first far-end reference audio domain signal characteristic to obtain a first splicing characteristic, and splicing the first voice frequency domain signal characteristic containing residual echo and noise with the first echo frequency domain signal characteristic to obtain a second splicing characteristic;
inputting the first splicing feature and the second splicing feature into a feature attention model in a cascade network so as to jointly train the feature attention model and a residual echo and noise elimination model in the cascade network, and obtaining a first weight corresponding to the first far-end reference audio domain signal feature and a second weight corresponding to the first echo frequency domain signal feature;
multiplying the first far-end reference audio domain signal characteristic by a first weight to obtain a first fused characteristic, and multiplying the first echo frequency domain signal characteristic by a second weight to obtain a second fused characteristic;
splicing the first fusion characteristic, the second fusion characteristic and the first voice frequency domain signal characteristic containing residual echo and noise to obtain a first fusion splicing characteristic;
inputting the first fusion splicing characteristic into a residual echo and noise elimination model in the cascade network to obtain a masking estimation value of a second target voice frequency domain signal;
determining a second target voice frequency domain signal according to the masking estimation value of the second target voice frequency domain signal and the first voice frequency domain signal containing residual echo and noise;
determining a multi-domain loss function according to at least two loss functions; wherein the at least two loss functions comprise an energy-independent amplitude spectrum loss function and an objective speech quality assessment score loss function; the energy-independent amplitude spectrum loss function takes the amplitude spectrum of the first target voice frequency domain signal as a training target and is determined according to the second target voice frequency domain signal; and the objective speech quality assessment score loss function takes the improvement of speech audibility quality as a training target;
and iteratively reducing the multi-domain loss function by continuously updating model parameters to obtain the trained cascade network.
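The energy normalization and attention-weighted fusion recited in the steps above can be sketched roughly as follows. This is a minimal, hypothetical illustration, not the patented implementation: the single sigmoid gate in `attention_weights` merely stands in for the trained feature attention model, and all function names and shapes are assumptions.

```python
# Hypothetical sketch of the per-frame feature fusion in claim 6.
import numpy as np

def energy_normalize(mag):
    """Scale a magnitude spectrum to unit energy (the claim's energy normalization)."""
    energy = np.sqrt(np.sum(mag ** 2)) + 1e-8
    return mag / energy

def attention_weights(spliced, w, b):
    """Stand-in for the feature attention model: a single sigmoid gate
    mapping a spliced feature vector to a scalar weight in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-(spliced @ w + b)))

def fuse_features(mic_mag, echo_mag, ref_mag, w1, b1, w2, b2):
    x = energy_normalize(mic_mag)    # speech with residual echo and noise
    e = energy_normalize(echo_mag)   # estimated echo
    r = energy_normalize(ref_mag)    # far-end reference
    first_splice = np.concatenate([x, r])   # mic feature + reference feature
    second_splice = np.concatenate([x, e])  # mic feature + echo feature
    a1 = attention_weights(first_splice, w1, b1)   # weight for the reference feature
    a2 = attention_weights(second_splice, w2, b2)  # weight for the echo feature
    # Weighted features spliced with the mic feature form the fused input
    # to the residual echo and noise elimination model.
    return np.concatenate([a1 * r, a2 * e, x])
```

In practice the gate parameters would be learned jointly with the elimination model, as claim 6 describes; per-frequency (vector-valued) attention weights are an equally plausible reading of the claim.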
7. A residual echo and noise cancellation apparatus, comprising:
a receiving module, configured to receive a voice time domain signal containing echo and noise and a far-end reference sound time domain signal;
a processing module, configured to perform framing, windowing and Fourier transformation on the voice time domain signal containing echo and noise and the far-end reference sound time domain signal, respectively, to obtain a voice frequency domain signal containing echo and noise and a far-end reference audio frequency domain signal;
a determining module, configured to determine an echo frequency domain signal according to the voice frequency domain signal containing echo and noise and the far-end reference audio frequency domain signal;
the determining module is further configured to determine a voice frequency domain signal containing residual echo and noise according to the voice frequency domain signal containing echo and noise and the echo frequency domain signal;
an energy normalization module, configured to perform energy normalization processing on the amplitude spectrum of the voice frequency domain signal containing residual echo and noise, the amplitude spectrum of the echo frequency domain signal and the amplitude spectrum of the far-end reference audio frequency domain signal, to obtain a voice frequency domain signal feature containing residual echo and noise, an echo frequency domain signal feature and a far-end reference audio frequency domain signal feature;
a splicing module, configured to splice the voice frequency domain signal feature containing residual echo and noise with the far-end reference audio frequency domain signal feature to obtain a first splicing result, and to splice the voice frequency domain signal feature containing residual echo and noise with the echo frequency domain signal feature to obtain a second splicing result;
a weight obtaining module, configured to input the first splicing result and the second splicing result into a trained feature attention model in a trained cascade network, to obtain a first attention weight corresponding to the far-end reference audio frequency domain signal feature and a second attention weight corresponding to the echo frequency domain signal feature;
a fused attention mechanism feature obtaining module, configured to multiply the far-end reference audio frequency domain signal feature by the first attention weight to obtain a first fused attention mechanism feature, and to multiply the echo frequency domain signal feature by the second attention weight to obtain a second fused attention mechanism feature;
the splicing module is further configured to splice the first fused attention mechanism feature, the second fused attention mechanism feature and the voice frequency domain signal feature containing residual echo and noise to obtain a first fusion splicing result;
a masking estimation value obtaining module, configured to input the first fusion splicing result into a trained residual echo and noise cancellation model in the trained cascade network, to obtain a masking estimation value of a target voice frequency domain signal;
a target voice frequency domain signal obtaining module, configured to obtain the target voice frequency domain signal according to the masking estimation value of the target voice frequency domain signal and the voice frequency domain signal containing residual echo and noise;
and an inverse Fourier transform module, configured to perform inverse Fourier transformation on the target voice frequency domain signal to obtain a target voice time domain signal.
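The modules above describe a conventional analysis-modify-synthesis signal path. As a rough sketch (not the patented apparatus), the code below performs framing, Hann windowing and FFT, applies a masking estimate, and resynthesizes by inverse FFT with overlap-add; the constant-one mask used in testing stands in for the trained cancellation model's output, and all names and frame parameters are illustrative.

```python
# Illustrative analysis/masking/synthesis path for the claim-7 modules.
import numpy as np

def stft(x, frame_len=256, hop=128):
    """Framing, Hann windowing and Fourier transform (the processing module)."""
    win = np.hanning(frame_len)
    frames = [x[i:i + frame_len] * win
              for i in range(0, len(x) - frame_len + 1, hop)]
    return np.fft.rfft(np.array(frames), axis=1)

def istft(spec, frame_len=256, hop=128):
    """Inverse Fourier transform with windowed overlap-add
    (the inverse Fourier transform module)."""
    win = np.hanning(frame_len)
    frames = np.fft.irfft(spec, n=frame_len, axis=1) * win
    out = np.zeros(hop * (len(frames) - 1) + frame_len)
    norm = np.zeros_like(out)
    for i, f in enumerate(frames):
        out[i * hop:i * hop + frame_len] += f
        norm[i * hop:i * hop + frame_len] += win ** 2
    return out / np.maximum(norm, 1e-8)  # normalize accumulated window energy

def apply_mask(mic, mask_fn, frame_len=256, hop=128):
    """End-to-end path: STFT -> masking estimate -> masked spectrum -> iSTFT."""
    spec = stft(mic, frame_len, hop)
    mask = mask_fn(np.abs(spec))  # stand-in for the cancellation model's output
    return istft(spec * mask, frame_len, hop)
```

With an all-ones mask the interior samples reconstruct the input almost exactly, which is a convenient sanity check before plugging in a real masking model.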
8. A residual echo and noise cancellation apparatus, comprising at least one processor configured to execute a program stored in a memory, wherein the program, when executed, causes the apparatus to perform:
the method of any one of claims 1-6.
9. A non-transitory computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-6.
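The multi-domain loss recited in claim 6 might be composed along the following lines; this is a hedged sketch, not the patent's formulation. The energy-independent term compares unit-energy magnitude spectra, and because objective quality scores such as PESQ are not directly differentiable, the second term here is a simple log-spectral-distance placeholder for the quality assessment loss. The weights `alpha` and `beta` are invented for illustration.

```python
# Hypothetical composition of the claim-6 multi-domain loss.
import numpy as np

def energy_free_mag_loss(est_mag, tgt_mag):
    """MSE between unit-energy magnitude spectra, insensitive to overall level."""
    def unit(m):
        return m / (np.sqrt(np.sum(m ** 2)) + 1e-8)
    return float(np.mean((unit(est_mag) - unit(tgt_mag)) ** 2))

def quality_score_loss(est_mag, tgt_mag):
    """Placeholder for the objective speech quality term: log-spectral distance,
    a common differentiable surrogate for perceptual scores such as PESQ."""
    diff = np.log10(est_mag + 1e-8) - np.log10(tgt_mag + 1e-8)
    return float(np.sqrt(np.mean(diff ** 2)))

def multi_domain_loss(est_mag, tgt_mag, alpha=1.0, beta=0.1):
    """Weighted sum over the two domains; alpha and beta are illustrative weights."""
    return (alpha * energy_free_mag_loss(est_mag, tgt_mag)
            + beta * quality_score_loss(est_mag, tgt_mag))
```

Note the division of labor: scaling an estimate by a constant leaves the energy-independent term at zero, so the quality term is what penalizes level and spectral-shape distortions the first term cannot see.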
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110008502.9A CN112863535B (en) | 2021-01-05 | 2021-01-05 | Residual echo and noise elimination method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112863535A true CN112863535A (en) | 2021-05-28 |
CN112863535B CN112863535B (en) | 2022-04-26 |
Family
ID=76003795
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110008502.9A Active CN112863535B (en) | 2021-01-05 | 2021-01-05 | Residual echo and noise elimination method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112863535B (en) |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107636758A (en) * | 2015-05-15 | 2018-01-26 | 哈曼国际工业有限公司 | Acoustic echo eliminates system and method |
US20200105287A1 (en) * | 2017-04-14 | 2020-04-02 | Industry-University Cooperation Foundation Hanyang University | Deep neural network-based method and apparatus for combining noise and echo removal |
US20200312346A1 (en) * | 2019-03-28 | 2020-10-01 | Samsung Electronics Co., Ltd. | System and method for acoustic echo cancellation using deep multitask recurrent neural networks |
CN111161752A (en) * | 2019-12-31 | 2020-05-15 | 歌尔股份有限公司 | Echo cancellation method and device |
CN111341336A (en) * | 2020-03-16 | 2020-06-26 | 北京字节跳动网络技术有限公司 | Echo cancellation method, device, terminal equipment and medium |
CN111768795A (en) * | 2020-07-09 | 2020-10-13 | 腾讯科技(深圳)有限公司 | Noise suppression method, device, equipment and storage medium for voice signal |
CN111768796A (en) * | 2020-07-14 | 2020-10-13 | 中国科学院声学研究所 | Acoustic echo cancellation and dereverberation method and device |
Non-Patent Citations (1)
Title |
---|
WANG, DONGXIA ET AL.: "Echo and noise suppression algorithm based on BLSTM neural network", Journal of Signal Processing (《信号处理》) * |
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113436636A (en) * | 2021-06-11 | 2021-09-24 | 深圳波洛斯科技有限公司 | Acoustic echo cancellation method and system based on adaptive filter and neural network |
CN113489854A (en) * | 2021-06-30 | 2021-10-08 | 北京小米移动软件有限公司 | Sound processing method, sound processing device, electronic equipment and storage medium |
CN113489854B (en) * | 2021-06-30 | 2024-03-01 | 北京小米移动软件有限公司 | Sound processing method, device, electronic equipment and storage medium |
CN113539291A (en) * | 2021-07-09 | 2021-10-22 | 北京声智科技有限公司 | Method and device for reducing noise of audio signal, electronic equipment and storage medium |
CN113744762B (en) * | 2021-08-09 | 2023-10-27 | 杭州网易智企科技有限公司 | Signal-to-noise ratio determining method and device, electronic equipment and storage medium |
CN113744762A (en) * | 2021-08-09 | 2021-12-03 | 杭州网易智企科技有限公司 | Signal-to-noise ratio determining method and device, electronic equipment and storage medium |
CN114337908A (en) * | 2022-01-05 | 2022-04-12 | 中国科学院声学研究所 | Method and device for generating interference signal of target voice signal |
CN114337908B (en) * | 2022-01-05 | 2024-04-12 | 中国科学院声学研究所 | Method and device for generating interference signal of target voice signal |
CN114974281A (en) * | 2022-05-24 | 2022-08-30 | 云知声智能科技股份有限公司 | Training method and device of voice noise reduction model, storage medium and electronic device |
WO2023226592A1 (en) * | 2022-05-25 | 2023-11-30 | 青岛海尔科技有限公司 | Noise signal processing method and apparatus, and storage medium and electronic apparatus |
CN115294997A (en) * | 2022-06-30 | 2022-11-04 | 北京达佳互联信息技术有限公司 | Voice processing method and device, electronic equipment and storage medium |
CN114974286A (en) * | 2022-06-30 | 2022-08-30 | 北京达佳互联信息技术有限公司 | Signal enhancement method, model training method, device, equipment, sound box and medium |
CN115294997B (en) * | 2022-06-30 | 2024-10-29 | 北京达佳互联信息技术有限公司 | Voice processing method, device, electronic equipment and storage medium |
CN117437929A (en) * | 2023-12-21 | 2024-01-23 | 睿云联(厦门)网络通讯技术有限公司 | Real-time echo cancellation method based on neural network |
CN117437929B (en) * | 2023-12-21 | 2024-03-08 | 睿云联(厦门)网络通讯技术有限公司 | Real-time echo cancellation method based on neural network |
Also Published As
Publication number | Publication date |
---|---|
CN112863535B (en) | 2022-04-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112863535B (en) | Residual echo and noise elimination method and device | |
CN107452389B (en) | Universal single-track real-time noise reduction method | |
CN108172231B (en) | Dereverberation method and system based on Kalman filtering | |
KR101934636B1 (en) | Method and apparatus for integrating and removing acoustic echo and background noise based on deepening neural network | |
Zhao et al. | A two-stage algorithm for noisy and reverberant speech enhancement | |
CN112700786B (en) | Speech enhancement method, device, electronic equipment and storage medium | |
Lee et al. | DNN-based residual echo suppression. | |
Zhao et al. | Late reverberation suppression using recurrent neural networks with long short-term memory | |
CN112581973B (en) | Voice enhancement method and system | |
CN111048061B (en) | Method, device and equipment for obtaining step length of echo cancellation filter | |
CN112037809A (en) | Residual echo suppression method based on multi-feature flow structure deep neural network | |
CN112201273B (en) | Noise power spectral density calculation method, system, equipment and medium | |
CN111883154B (en) | Echo cancellation method and device, computer-readable storage medium, and electronic device | |
CN113838471A (en) | Noise reduction method and system based on neural network, electronic device and storage medium | |
CN113744748A (en) | Network model training method, echo cancellation method and device | |
Schwartz et al. | Nested generalized sidelobe canceller for joint dereverberation and noise reduction | |
CN112997249B (en) | Voice processing method, device, storage medium and electronic equipment | |
CN114302286A (en) | Method, device and equipment for reducing noise of call voice and storage medium | |
CN117219102A (en) | Low-complexity voice enhancement method based on auditory perception | |
CN115620737A (en) | Voice signal processing device, method, electronic equipment and sound amplification system | |
Kamarudin et al. | Acoustic echo cancellation using adaptive filtering algorithms for Quranic accents (Qiraat) identification | |
CN111883155B (en) | Echo cancellation method, device and storage medium | |
Braun et al. | Low complexity online convolutional beamforming | |
CN108074580B (en) | Noise elimination method and device | |
Yoshioka et al. | Speech dereverberation and denoising based on time varying speech model and autoregressive reverberation model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||