CN117437929B - Real-time echo cancellation method based on neural network - Google Patents


Publication number
CN117437929B
Authority: CN (China)
Legal status: Active
Application number: CN202311768706.8A
Other languages: Chinese (zh)
Other versions: CN117437929A
Inventors: 阮炜玄, 兰泽华, 蔡如意
Current Assignee: Ringslink Xiamen Network Communication Technologies Co ltd
Original Assignee: Ringslink Xiamen Network Communication Technologies Co ltd
Application filed by Ringslink Xiamen Network Communication Technologies Co ltd
Priority to CN202311768706.8A
Publication of CN117437929A
Application granted; publication of CN117437929B
Legal status: Active

Classifications

    • G10L21/0208 — Noise filtering (speech enhancement, e.g. noise reduction or echo cancellation)
    • G10L2021/02082 — Noise filtering where the noise is echo or reverberation of the speech
    • G10L25/30 — Speech or voice analysis characterised by the analysis technique, using neural networks
    • G06N3/0442 — Recurrent networks characterised by memory or gating, e.g. LSTM or GRU
    • G06N3/0464 — Convolutional networks [CNN, ConvNet]
    • G06N3/048 — Activation functions
    • G06N3/08 — Learning methods


Abstract

The invention discloses a neural-network-based real-time echo cancellation method comprising the following steps: step 1, constructing an echo cancellation model; step 2, training the echo cancellation model; step 3, constructing an echo detection model; step 4, training the echo detection model; step 5, taking the near-end audio signal and the far-end audio signal as the input of the trained echo cancellation model to obtain its output; step 6, taking the output of the trained echo cancellation model and the far-end audio signal as the input of the trained echo detection model to obtain its output as an echo detection label; and step 7, judging the state of the current output frame of the echo cancellation model according to the echo detection label to obtain the final target audio. By combining the echo cancellation model and the echo detection model, the invention improves the effect of the echo cancellation model and reduces the deployment difficulty of the model while requiring only a small amount of collected data.

Description

Real-time echo cancellation method based on neural network
Technical Field
The invention relates to the technical field of echo cancellation, in particular to a neural network-based real-time echo cancellation method.
Background
Echo cancellation is an important part of audio processing. Traditional echo cancellation algorithms based on adaptive filters, such as those in WebRTC and Speex, suffer from complex debugging, slow convergence, and difficulty adapting to complex environments.
With the development of neural networks, they have been widely applied to echo cancellation. There are two main neural-network-based echo cancellation schemes:
first, linear echo is cancelled and time delay estimated by an adaptive filter, while nonlinear echo and noise are removed by a neural network model. Because both an adaptive filter and a neural network model are used, this scheme places high demands on the adaptive filter, especially on delay estimation, and the diversity of scenes in practical applications makes the filter difficult to tune, so performance is hard to guarantee. In addition, since the adaptive filter's output is usually required during model training, the original near-end signal and the far-end audio signal are fed into the model together as features, making the model large, training slow, and a large amount of data necessary. Therefore, although this scheme works well, it is difficult to implement, the model is large, and deployment on device ends is difficult;
second, echo cancellation and noise reduction are realized directly by a single neural network model. This scheme is simple to implement and deploy, but because its parameters cannot be updated per device the way an adaptive filter's can, its effect is hard to guarantee across different devices and scenes.
The Chinese invention with publication number CN116887160A discloses a neural-network-based digital hearing aid howling suppression method and system, the method comprising: acquiring the voice signal received by a digital hearing aid; obtaining the state at each moment in the voice signal based on a neural network; determining a convergence stability coefficient for each moment according to the state at each moment and the distribution characteristics of the amplitude-frequency peaks at each moment; determining non-stationary moments based on the convergence stability coefficients; determining a step-size adjustment ratio based on the time interval between the non-stationary moments and the howling moments, and obtaining the convergence step size for each moment from the step-size adjustment ratio and the convergence stability coefficient; and performing echo cancellation on the voice signal based on an NLMS algorithm and the convergence step size so as to suppress howling in the digital hearing aid. That patent cancels linear echo and estimates time delay through an adaptive filter (the NLMS algorithm); such a scheme is difficult to implement, the model is large, and deployment on device ends is difficult.
Disclosure of Invention
Therefore, the object of the invention is to provide a neural-network-based real-time echo cancellation method that combines an echo cancellation model with an echo detection model, improving the effect of the echo cancellation model and reducing the deployment difficulty of the model while requiring only a small amount of collected data.
In order to achieve the technical purpose, the invention adopts the following technical scheme:
the invention provides a neural network-based real-time echo cancellation method, which comprises the following steps:
step 1, constructing an echo cancellation model; the method specifically comprises the following steps:
step 11, the structure of the echo cancellation model uses an open source DTLN-aec model;
step 12, the DTLN-aec model consists of two framing layers, two Fourier transform layers, two core modules, an inverse Fourier transform layer, two first linear layers, one second linear layer and an overlap-add layer; each core module includes two normalization layers, one connection layer, two LSTM layers, one third linear layer and one sigmoid activation function;
step 2, training the echo cancellation model;
step 3, constructing an echo detection model; the method specifically comprises the following steps:
Step 31, the echo detection model consists of two framing layers, two Fourier transformation layers, two normalization layers, a splicing layer, a fourth linear layer, two GRU layers, a fifth linear layer and a sigmoid activation function;
step 32, the inputs of the echo detection model are the output ŝ(n) of the echo cancellation model and the far-end audio signal y(n); the two framing layers receive ŝ(n) and y(n) respectively, the two framing layers enter two Fourier transform layers respectively, the two Fourier transform layers enter two normalization layers respectively, the two normalization layers enter the splicing layer, and the splicing layer sequentially enters the fourth linear layer, the two GRU layers, the fifth linear layer and the sigmoid activation function to obtain the output out of the echo detection model;
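As an illustration of steps 31-32, the detection model can be sketched as below. This is a minimal PyTorch sketch, not the patent's exact implementation: the hidden width (64), frame length 512 and hop 256 are assumptions (the frame sizes are taken from the deployment description later in the text).

```python
import torch
import torch.nn as nn

class EchoDetector(nn.Module):
    """Sketch of the detection model: two framing + Fourier transform layers
    (realized here with torch.stft), two normalization layers, a splicing
    (concatenation) layer, a fourth linear layer, two GRU layers, a fifth
    linear layer and a sigmoid, giving one score per frame."""
    def __init__(self, frame_len=512, hop=256, hidden=64):
        super().__init__()
        bins = frame_len // 2 + 1
        self.frame_len, self.hop = frame_len, hop
        self.norm_s = nn.LayerNorm(bins)            # normalization, model-output branch
        self.norm_y = nn.LayerNorm(bins)            # normalization, far-end branch
        self.fc_in = nn.Linear(2 * bins, hidden)    # "fourth linear layer"
        self.gru = nn.GRU(hidden, hidden, num_layers=2, batch_first=True)
        self.fc_out = nn.Linear(hidden, 1)          # "fifth linear layer"
        self.window = torch.hann_window(frame_len)

    def forward(self, s_hat, far, state=None):
        # Framing + Fourier transform of both inputs; magnitudes only
        S = torch.stft(s_hat, self.frame_len, self.hop,
                       window=self.window, return_complex=True).abs()
        Y = torch.stft(far, self.frame_len, self.hop,
                       window=self.window, return_complex=True).abs()
        # Normalize each branch, then splice (concatenate) along the feature axis
        x = torch.cat([self.norm_s(S.transpose(1, 2)),
                       self.norm_y(Y.transpose(1, 2))], dim=-1)
        h, state = self.gru(self.fc_in(x), state)
        out = torch.sigmoid(self.fc_out(h)).squeeze(-1)  # (batch, frames) in [0, 1]
        return out, state
```

Returning `state` lets the previous frame's GRU state be fed back in during streaming use, matching the frame-by-frame operation described later for step 63.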
Step 4, training the echo detection model;
step 5, taking the near-end audio signal and the far-end audio signal as the input of the trained echo cancellation model to obtain the output of the trained echo cancellation model;
step 6, taking the output of the trained echo cancellation model and the far-end audio signal as the input of the trained echo detection model, and obtaining the output of the trained echo detection model as an echo detection tag;
step 7, judging the state of the current output frame of the echo cancellation model according to the echo detection label to obtain the final target audio.
Further, the step 12 further includes:
step 13, the echo cancellation model takes as input a near-end audio signal x(n) and a far-end audio signal y(n); the two framing layers receive x(n) and y(n), the two framing layers enter two Fourier transform layers respectively, and the two Fourier transform layers enter the first core module: there they enter two normalization layers respectively, the two normalization layers enter a connection layer, and the connection layer sequentially enters two LSTM layers, a third linear layer and a sigmoid activation function; the near-end audio signal, after passing through its framing layer and Fourier transform layer, is multiplied by the output of the sigmoid activation function of the first core module, and the product enters the inverse Fourier transform layer;
step 14, the output of the inverse Fourier transform layer and the far-end audio signal enter the two first linear layers respectively, and the two first linear layers enter the second core module: there they enter two normalization layers respectively, the two normalization layers enter a connection layer, and the connection layer sequentially enters two LSTM layers, a third linear layer and a sigmoid activation function; the result output after the inverse Fourier transform layer and the first linear layer is multiplied by the output of the sigmoid activation function of the second core module, and the product sequentially enters the second linear layer and the overlap-add layer to obtain the output ŝ(n) of the echo cancellation model.
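The two-stage structure of steps 12-14 can be sketched as follows. This is a simplified PyTorch rendition of the open-source DTLN-aec idea (the published model is, to our knowledge, a TensorFlow/Keras implementation); the hidden width and the use of `torch.stft`/`unfold` for the framing and Fourier layers are assumptions, not the patent's exact code.

```python
import torch
import torch.nn as nn

FRAME_LEN, HOP = 512, 128

class CoreModule(nn.Module):
    """Two normalization layers -> connection (concat) -> two LSTM layers
    -> third linear layer -> sigmoid mask."""
    def __init__(self, feat_dim, hidden=128):
        super().__init__()
        self.norm_a = nn.LayerNorm(feat_dim)
        self.norm_b = nn.LayerNorm(feat_dim)
        self.rnn = nn.LSTM(2 * feat_dim, hidden, num_layers=2, batch_first=True)
        self.fc = nn.Linear(hidden, feat_dim)

    def forward(self, a, b, state=None):
        x = torch.cat([self.norm_a(a), self.norm_b(b)], dim=-1)
        y, state = self.rnn(x, state)
        return torch.sigmoid(self.fc(y)), state

class DTLNAec(nn.Module):
    def __init__(self, units=128):
        super().__init__()
        bins = FRAME_LEN // 2 + 1
        self.core1 = CoreModule(bins, units)
        self.lin_mic = nn.Linear(FRAME_LEN, units)   # the "first linear layers"
        self.lin_far = nn.Linear(FRAME_LEN, units)
        self.core2 = CoreModule(units, units)
        self.lin_out = nn.Linear(units, FRAME_LEN)   # the "second linear layer"
        self.window = torch.hann_window(FRAME_LEN)

    def forward(self, near, far):
        # Stage 1: spectral mask from near-end and far-end magnitudes
        X = torch.stft(near, FRAME_LEN, HOP, window=self.window, return_complex=True)
        Y = torch.stft(far, FRAME_LEN, HOP, window=self.window, return_complex=True)
        mask, _ = self.core1(X.abs().transpose(1, 2), Y.abs().transpose(1, 2))
        stage1 = torch.istft(X * mask.transpose(1, 2), FRAME_LEN, HOP,
                             window=self.window, length=near.shape[-1])
        # Stage 2: mask in a learned time-domain feature space
        f1 = stage1.unfold(-1, FRAME_LEN, HOP)       # frames of stage-1 output
        f2 = far.unfold(-1, FRAME_LEN, HOP)
        a = self.lin_mic(f1)
        mask2, _ = self.core2(a, self.lin_far(f2))
        frames = self.lin_out(a * mask2)
        # Overlap-add layer: reassemble frames into a waveform
        out = torch.zeros_like(near)
        for i in range(frames.shape[1]):
            out[:, i * HOP:i * HOP + FRAME_LEN] += frames[:, i]
        return out
```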
Further, the step 2 specifically includes:
step 21, acquiring a plurality of audio signals from the three open-source data sets dns-challenge, aec-challenge and openslr;
step 22, the inputs of the echo cancellation model are a near-end audio signal x(n) and a far-end audio signal y(n), where x(n) = s(n) + e(n) + n; s(n) represents the clean speech signal that needs to be stripped from the near-end audio signal x(n), e(n) represents the echo generated after the far-end audio signal y(n) is played and re-collected by the near-end microphone, and n represents environmental noise;
step 23, selecting a plurality of synthesized echo pairs and a plurality of real echo pairs from the aec-challenge data set as y(n) and e(n) in the training data, selecting a plurality of different noisy speech signal and clean speech signal pairs from the dns-challenge data set as s(n)+n and s(n) in the training data respectively, and selecting a plurality of audio signals from the dns-challenge data set as y(n) in the training data;
step 24, selecting a plurality of audio signals from the rir_noise data in the openslr data set, generating simulated echoes in real time during training, and using the generated simulated echoes as e(n) of the training data to participate in training;
step 25, using a plurality of near-end audio signals and a plurality of far-end audio signals as training data, and selecting 10% of the data from the training data as a test set;
step 26, training the echo cancellation model by using the training data as its input, wherein the loss function is: loss = (1/N) Σₙ (ŝ(n) − s(n))², where ŝ(n) represents the output of the echo cancellation model and s(n) the clean speech signal, i.e. the target signal; when the value of ŝ(n) is stable and approaches s(n), training is complete;
and step 27, after training, taking the test set as the input of the echo cancellation model, and testing the echo cancellation model to finally obtain the trained echo cancellation model.
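The signal model x(n) = s(n) + e(n) + n of step 22 and the loss of step 26 can be written out directly in NumPy. The SNR-scaling convention below is an assumption (the embodiment only states that signal-to-noise ratios are distributed over 0-20 dB):

```python
import numpy as np

rng = np.random.default_rng(0)

def mix_near_end(s, e, noise, snr_db=10.0):
    """Near-end microphone signal x(n) = s(n) + e(n) + noise, with the
    noise scaled so the speech-to-noise ratio equals snr_db (assumed
    convention)."""
    p_s = np.mean(s ** 2)
    p_v = np.mean(noise ** 2) + 1e-12
    noise = noise * np.sqrt(p_s / (p_v * 10.0 ** (snr_db / 10.0)))
    return s + e + noise

def ec_loss(s_hat, s):
    # Time-domain mean squared error against the clean target s(n)
    return float(np.mean((s_hat - s) ** 2))

s, e, v = (rng.standard_normal(16000) for _ in range(3))
x = mix_near_end(s, e, v)   # one second of near-end training input at 16 kHz
```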
Further, the step 24 specifically includes the following steps:
step 241, randomly selecting 1 far-end reference signal from the plurality of far-end audio signals y(n); if the selected far-end reference signal is an audio signal from the aec-challenge data set, directly use the e(n) corresponding to it; if the selected far-end reference signal is an audio signal from the dns-challenge data set, go to step 242;
step 242, randomly selecting 1 piece of rir_noise data from the openslr data set as an impulse response signal, and convolving it with the far-end reference signal;
step 243, filtering the convolution result with a band-pass filter with random parameters to obtain 1 simulated echo signal;
step 244, repeating steps 241-243 to obtain a plurality of simulated echo signals as e(n) of the training data.
Further, the lowest frequency of the band-pass filter is a random number between 100 and 400 Hz, and the highest frequency is a random number between 6000 and 7500 Hz; a random number in the range 0-100 ms is selected as the time delay of the echo signal.
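Steps 241-243, with the random band-pass and delay parameters above, can be sketched with NumPy alone. The FFT-mask band-pass is a stand-in for whatever filter design the patent intends, and the decaying-noise RIR is a synthetic stand-in for real rir_noise data:

```python
import numpy as np

rng = np.random.default_rng(1)
FS = 16000  # assumed sampling rate

def simulate_echo(far_ref, rir):
    """Simulated echo e(n): convolve the far-end reference with a room
    impulse response, band-limit with random cutoffs (100-400 Hz low,
    6000-7500 Hz high), and apply a random 0-100 ms delay."""
    echo = np.convolve(far_ref, rir)[: len(far_ref)]
    lo = rng.uniform(100, 400)                 # random lowest frequency
    hi = rng.uniform(6000, 7500)               # random highest frequency
    spec = np.fft.rfft(echo)
    freqs = np.fft.rfftfreq(len(echo), 1 / FS)
    spec[(freqs < lo) | (freqs > hi)] = 0.0    # crude FFT-mask band-pass
    echo = np.fft.irfft(spec, n=len(echo))
    delay = rng.integers(0, int(0.1 * FS))     # random delay, 0-100 ms
    return np.concatenate([np.zeros(delay), echo])[: len(far_ref)]

far = rng.standard_normal(FS)                  # 1 s of far-end audio
rir = np.exp(-np.arange(2000) / 300.0) * rng.standard_normal(2000)
e = simulate_echo(far, rir)
```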
Further, the step 4 specifically includes:
step 41, obtaining the output ŝ(n) of the echo cancellation model; selecting a plurality of synthesized echo pairs and a plurality of real echo pairs from the aec-challenge data set as far-end audio signals y(n) in the training data, and selecting a plurality of audio signals from the dns-challenge data set as far-end audio signals y(n) in the training data;
step 42, taking the output ŝ(n) of the echo cancellation model and the far-end audio signal y(n) as the input of the echo detection model; the energy of each frame, i.e. the sum of squares of the sample amplitudes in the frame, is computed from the clean speech signal s(n) provided by the dns-challenge data set to obtain the label of each frame: if the sum of squares of the corresponding clean speech signal s(n) in the current frame is greater than 0.001, the voice label of the current frame is set to 1, otherwise it is set to 0;
step 43, using the average mean squared error as the loss function, namely loss = (1/N) Σₙ (out(n) − label(n))², where out(n) represents the output of the echo detection model at the current frame, label(n) is the corresponding label, and N is the total number of frames; when out(n) is stable and approaches label(n), training is complete.
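The per-frame labelling and loss of steps 42-43 can be sketched as:

```python
import numpy as np

def frame_labels(s, frame_len=512, hop=256, thresh=1e-3):
    """Voice label per frame from the clean signal s(n): 1 if the frame's
    sum of squared sample amplitudes exceeds 0.001, else 0 (step 42)."""
    n_frames = 1 + (len(s) - frame_len) // hop
    labels = np.zeros(n_frames, dtype=np.int64)
    for i in range(n_frames):
        frame = s[i * hop: i * hop + frame_len]
        labels[i] = 1 if float(np.sum(frame ** 2)) > thresh else 0
    return labels

def detection_loss(out, labels):
    """Average mean squared error between detector outputs out(n) and
    labels label(n) over the N frames (step 43)."""
    return float(np.mean((np.asarray(out, dtype=float) - labels) ** 2))
```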
Further, the step 5 specifically includes:
step 51, inputting a target far-end audio signal and a target near-end audio signal into the echo cancellation model;
step 52, framing the target far-end audio signal and the target near-end audio signal and calculating the Fourier coefficient of each frame;
step 53, normalizing the Fourier coefficients of each frame, splicing the Fourier coefficients of each frame of the far-end audio signal and the near-end audio signal, and performing the operations of the LSTM layers, the third linear layer and the sigmoid activation function, where in the LSTM layer the state of the previous frame is used as input for the next frame; the near-end audio signal, after framing and Fourier transformation, is multiplied by the output of the sigmoid activation function of the first core module and then undergoes the inverse Fourier transform;
step 54, performing the first linear layer operations on the inverse Fourier transform result and the far-end audio signal respectively, normalizing and splicing the results of the first linear layers, and performing the operations of the LSTM layers, the third linear layer and the sigmoid activation function, where in the LSTM layer the state of the previous frame is used as input for the next frame; the result output after the inverse Fourier transform and the first linear layer operation is multiplied by the output of the sigmoid activation function of the second core module;
step 55, restoring the number of coefficients of each frame to the same number as the Fourier coefficients through the second linear layer, and overlap-adding the results to obtain an output signal ŝ(n) with the same length as the input signal.
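The phrase "the output of the previous frame is used as the input of the next frame" in steps 53-54 refers to carrying the recurrent state across frames during streaming inference. A minimal PyTorch sketch (the feature width 514 = 2 × 257 spliced Fourier magnitudes for a 512-point FFT, and the hidden size, are assumptions):

```python
import torch
import torch.nn as nn

# One LSTM stack processing one frame at a time; the hidden and cell
# states from the previous frame are fed back in for the next frame.
lstm = nn.LSTM(input_size=514, hidden_size=128, num_layers=2, batch_first=True)
state = None  # PyTorch initializes (h, c) to zeros when state is None
outputs = []
for _ in range(4):  # e.g. the 4 hops of one 512-point input block
    feat = torch.randn(1, 1, 514)   # one frame of spliced, normalized coefficients
    out, state = lstm(feat, state)  # carry (h, c) forward to the next frame
    outputs.append(out)
```

Feeding the same frames as one batched sequence gives the same result as this loop; the loop form is what a real-time deployment uses.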
Further, the step 6 specifically includes:
step 61, inputting target far-end audio signals and the output of an echo cancellation model into the echo detection model;
step 62, framing the target far-end audio signal and the output of the echo cancellation model and calculating the Fourier coefficient of each frame;
step 63, normalizing the Fourier coefficients of each frame, splicing the Fourier coefficients of each frame of the far-end audio signal and of the echo cancellation model output, and obtaining the output out of the echo detection model through the operations of the fourth linear layer, the GRU layers, the fifth linear layer and the sigmoid activation function; in the GRU layer, the state of the previous frame is used as input for the next frame.
Further, the step 7 specifically includes:
step 71, the input frame length of the echo cancellation model is 512 points and the frame shift is 128 points; in actual deployment, a 512-point input register is needed to store the previous input and a 512-point output register to store the previous output;
step 72, assuming 512 points are input at a time, each input requires 512/128 = 4 forward inferences; the first inference uses the last 384 points of the previous frame and the first 128 points of the current frame from the input register, the second uses the last 256 points of the previous frame and the first 256 points of the current frame, the third uses the last 128 points of the previous frame and the first 384 points of the current frame, and the fourth uses all 512 points of the current frame; when all 4 forward inferences are completed, the current input is loaded into the input register for use when the next frame arrives;
step 73, for each forward inference, the output register is first shifted forward by 128 points and the freed 128 positions are set to zero; the 512 points output by each inference are added to the 512 points in the output register, and the first 128 points are emitted, so each forward inference outputs 128 points; with 512 points input at a time, the 512/128 = 4 forward inferences output 128 × 4 = 512 points, so the output has the same length as the input;
step 74, the input frame length of the echo detection model is 512 and its frame shift is 256, so a 512-point register is needed; each time the echo cancellation model outputs 128 points, this register is shifted forward by 128 points and the 128 output points are loaded in; since the frame shift of the echo detection model is twice that of the echo cancellation model, one echo detection is performed for every 2 forward inferences of the echo cancellation model;
step 75, setting a decision threshold: if the output of the echo detection model stays below the threshold for several consecutive inferences, the output of the echo cancellation model is considered to contain significant residual echo and is set to zero; otherwise it is treated as normal speech.
By adopting the technical scheme, compared with the prior art, the invention has the beneficial effects that:
With this method, only one echo cancellation model needs to be trained on open-source data sets, and adapting to a different device requires collecting only a small amount of data to train a small echo detection model, without spending large amounts of time tuning an adaptive filter, which greatly reduces training difficulty and cost.
This method adds an echo detection model on top of a conventional echo cancellation model: when the echo cancellation model performs poorly, residual echo can still be removed through the echo detection model, and setting a threshold allows the requirements of different scenes to be met, alleviating the problem that end-to-end echo cancellation models are unstable across scenes. Echo cancellation performance is improved with only a small increase in complexity.
Most of the training data of this method comes from open-source data sets, and only a small amount of data needs to be collected per device to train the echo detection model, so adapting to different devices is more convenient. The quantized echo cancellation model is about 1.5 MB and the quantized echo detection model about 350 KB, so the method can be used on mobile devices.
Meanwhile, in practical applications it is often difficult to balance echo cancellation strength against listening quality, and this method can adapt to different scenes by setting the echo detection threshold: the threshold of the echo detection model can be lowered if listening quality is to be guaranteed, and raised if the echo cancellation effect is to be guaranteed.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart of an implementation of a neural network-based real-time echo cancellation method according to an embodiment of the present invention.
Fig. 2 is a schematic structural diagram of an echo cancellation model according to an embodiment of the present invention.
Fig. 3 is a schematic structural diagram of a core module in an echo cancellation model according to an embodiment of the present invention.
Fig. 4 is a schematic structural diagram of an echo detection model according to an embodiment of the present invention.
Fig. 5 is a schematic diagram of LSTM layer input and output in a core module according to an embodiment of the present invention.
Fig. 6 is a schematic diagram of input and output of a GRU layer in an echo detection model according to an embodiment of the present invention.
Detailed Description
The invention is described in further detail below with reference to the drawings and examples. It is specifically noted that the following examples are only for illustrating the present invention, but do not limit the scope of the present invention. Likewise, the following examples are only some, but not all, of the examples of the present invention, and all other examples, which a person of ordinary skill in the art would obtain without making any inventive effort, are within the scope of the present invention.
Referring to fig. 1-6, the present invention provides a neural network-based real-time echo cancellation method, which includes the following steps:
step 1, constructing an echo cancellation model;
in this embodiment, the step 1 specifically includes:
step 11, the structure of the echo cancellation model uses an open source DTLN-aec model;
step 12, the DTLN-aec model consists of two framing layers, two Fourier transform layers, two core modules, an inverse Fourier transform layer, two first linear layers, one second linear layer and an overlap-add layer; each core module includes two normalization layers, one connection layer, two LSTM layers, one third linear layer and one sigmoid activation function;
step 13, the echo cancellation model takes as input a near-end audio signal x(n) and a far-end audio signal y(n); the two framing layers receive x(n) and y(n), the two framing layers enter two Fourier transform layers respectively, and the two Fourier transform layers enter the first core module: there they enter two normalization layers respectively, the two normalization layers enter a connection layer, and the connection layer sequentially enters two LSTM layers, a third linear layer and a sigmoid activation function; the near-end audio signal, after passing through its framing layer and Fourier transform layer, is multiplied by the output of the sigmoid activation function of the first core module, and the product enters the inverse Fourier transform layer;
step 14, the output of the inverse Fourier transform layer and the far-end audio signal enter the two first linear layers respectively, and the two first linear layers enter the second core module: there they enter two normalization layers respectively, the two normalization layers enter a connection layer, and the connection layer sequentially enters two LSTM layers, a third linear layer and a sigmoid activation function; the result output after the inverse Fourier transform layer and the first linear layer is multiplied by the output of the sigmoid activation function of the second core module, and the product sequentially enters the second linear layer and the overlap-add layer to obtain the output ŝ(n) of the echo cancellation model.
Step 2, training the echo cancellation model;
in this embodiment, the step 2 specifically includes:
step 21, acquiring a plurality of audio signals from the three open-source data sets dns-challenge, aec-challenge and openslr; dns-challenge, aec-challenge and openslr are names of data sets, each a source of multiple audio data: the dns-challenge data set is the data set of the deep noise suppression challenge, the aec-challenge data set is the data set of the acoustic echo cancellation challenge, and openslr is a comprehensive site dedicated to hosting speech and language resources, including data sets of speech, noise, reverberation and the like; the rir_noise data represent room impulse responses and are the data needed to simulate echo.
Step 22, the input of the echo cancellation model is a near-end audio signal x (n) and a far-end audio signal y (n), where x (n) =s (n) +e (n) +n, s (n) represents a clean speech signal that needs to be stripped from the near-end audio signal x (n), e (n) represents an echo generated after the far-end audio signal y (n) is played and re-collected by the near-end microphone, and n represents environmental noise;
step 23, selecting a plurality of synthesized echo pairs and a plurality of real echo pairs from a aec-challenge data set as y (n) and e (n) in training data, selecting a plurality of different noisy speech signals and clean speech signal pairs from a dns-challenge data set as s (n) +n and s (n) in training data respectively, and selecting a plurality of audio signals from the dns-challenge data set as y (n) in training data; specifically, 10000 synthesized echo pairs and 10000 real echo pairs can be selected from the aec-challenge data set as y (n) and e (n) in the training data, where y (n) has 20000 pieces and e (n) has 20000 pieces. 50000 different noisy language signals of 10S are selected from the dns-challenge data set as S (n) +n,50000 clean voice signals are selected from the dns-challenge data set as S (n), and the signal to noise ratio distribution is 0-20db. Since the echo situation is very diverse, only 20000 echo signals are far from sufficient, 30000 audio signals are selected in the dns-change dataset as far-end audio signals y (n). There are a total of 50000 s (n), 50000 s (n) +n,50000 far-end tones y (n), e (n) in 20000 aec-challenge.
Step 24, selecting a plurality of audio signals (60000 room impulse response audio signals) from the rir_noise data in the openslr data set, generating simulated echoes in real time during training, and using the generated simulated echoes as e(n) of the training data to participate in training; this guarantees the diversity of the training data.
Step 25, using a plurality of near-end audio signals and a plurality of far-end audio signals as training data, and selecting 10% of the data from the training data as the test set;
step 26, training the echo cancellation model by using the training data as the input of the echo cancellation model, wherein the loss function is: loss = Σ(ŝ(n) - s(n))², where ŝ(n) represents the output of the echo cancellation model and s(n) represents the clean speech signal, i.e. the target signal; the goal of model training is to make the value of this loss as small as possible; when the value of ŝ(n) is stable and approaches s(n), training is complete;
and step 27, after training, taking the test set as the input of the echo cancellation model, and testing the echo cancellation model to finally obtain the trained echo cancellation model.
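The near-end mixture x(n) = s(n) + e(n) + n of step 22 and the 0-20 dB signal-to-noise range of step 23 can be sketched as below. This is a minimal illustration, not the patent's implementation: `mix_near_end` is a hypothetical helper, and the random arrays merely stand in for real audio from the data sets.

```python
import numpy as np

def mix_near_end(s, e, noise, snr_db):
    """Build a near-end mixture x(n) = s(n) + e(n) + n (step 22).

    The additive noise is scaled so that the speech-to-noise ratio
    matches the requested snr_db, mirroring the 0-20 dB range used
    for the training data in step 23.
    """
    p_s = np.mean(s ** 2)
    p_n = np.mean(noise ** 2) + 1e-12
    gain = np.sqrt(p_s / (p_n * 10.0 ** (snr_db / 10.0)))
    return s + e + gain * noise

# Toy usage with random signals standing in for real audio.
rng = np.random.default_rng(0)
s = rng.standard_normal(16000)        # clean speech s(n)
e = 0.5 * rng.standard_normal(16000)  # echo e(n)
noise = rng.standard_normal(16000)    # environmental noise n
x = mix_near_end(s, e, noise, snr_db=10.0)
```

In practice the scaling would be applied per utterance so the training corpus covers the whole 0-20 dB span.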
In this embodiment, the step 24 is specifically as follows:
step 241, randomly selecting 1 far-end reference signal from the plurality of far-end audio signals y(n); if the selected far-end reference signal is an audio signal in the aec-challenge data set, directly using the e(n) corresponding to the selected far-end reference signal; if the selected far-end reference signal is an audio signal in the dns-challenge data set, going to step 242; specifically, 1 of the 50000 far-end audio signals is randomly selected as the far-end reference signal; if it is one of the 20000 audio signals selected from aec-challenge, the corresponding e(n) is used directly, and if it is one of the 30000 audio signals selected from dns-challenge, an echo is generated by the methods of step 242 and step 243;
Step 242, randomly selecting 1 piece of rir_noise data in the openslr data set as a response signal, and convolving the response signal with the far-end reference signal; specifically, 1 of the 60000 room impulse response audio signals is randomly selected and convolved with the far-end reference signal;
step 243, filtering the convolved result by a bandpass filter with random parameters to obtain 1 analog echo signal;
step 244, repeating steps 241-243 to obtain a plurality of simulated echo signals as e(n) of the training data. Each time 40000 audio signals have been generated by this method to train the network, the parameters are updated once; the input frame length is 32 ms (at a 16 kHz sampling rate) and the frame shift is 8 ms.
In this embodiment, the lowest cut-off frequency of the band-pass filter is a random number in 100-400 Hz, and the highest cut-off frequency is a random number in 6000-7500 Hz; a random number in the range 0-100 ms is selected as the time delay of the echo signal. By constructing and training the echo cancellation model, most echoes can be cancelled and the audio quality is guaranteed.
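Steps 242-243 together with the random filter parameters above can be sketched as follows. This is a sketch under stated assumptions: `simulate_echo` is a hypothetical name, and a brick-wall FFT filter stands in for the patent's random-parameter band-pass filter, whose exact design is not specified.

```python
import numpy as np

def simulate_echo(far_end, rir, fs=16000, rng=None):
    """Sketch of steps 242-243: convolve the far-end reference with a
    room impulse response, band-limit it with random cut-offs
    (100-400 Hz low edge, 6000-7500 Hz high edge), and apply a
    random 0-100 ms delay."""
    rng = rng or np.random.default_rng()
    echo = np.convolve(far_end, rir)[: len(far_end)]
    # Random band edges as described in the embodiment.
    lo = rng.uniform(100, 400)
    hi = rng.uniform(6000, 7500)
    spec = np.fft.rfft(echo)
    freqs = np.fft.rfftfreq(len(echo), d=1.0 / fs)
    spec[(freqs < lo) | (freqs > hi)] = 0.0   # brick-wall band-pass
    echo = np.fft.irfft(spec, n=len(echo))
    # Random delay in the 0-100 ms range.
    delay = int(rng.uniform(0.0, 0.1) * fs)
    return np.concatenate([np.zeros(delay), echo])[: len(far_end)]

rng = np.random.default_rng(1)
far = rng.standard_normal(16000)                       # far-end y(n)
rir = np.exp(-np.arange(800) / 100.0) * rng.standard_normal(800)  # toy RIR
e = simulate_echo(far, rir, rng=rng)
```

A production system would use a proper IIR/FIR band-pass with random parameters rather than the brick-wall filter assumed here.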
Step 3, constructing an echo detection model;
in this embodiment, the step 3 specifically includes:
step 31, the echo detection model consists of two framing layers, two Fourier transform layers, two normalization layers, a splicing layer, a fourth linear layer, two GRU layers, a fifth linear layer and a sigmoid activation function; since the echo detection model should be as small and fast as possible, the faster GRU is chosen to build the model;
Step 32, the inputs of the echo detection model are the output of the echo cancellation model ŝ(n) and the far-end audio signal y(n); the two framing layers are configured to receive the output of the echo cancellation model ŝ(n) and the far-end audio signal y(n) respectively; the two framing layers respectively enter the two Fourier transform layers, the two Fourier transform layers respectively enter the two normalization layers, the two normalization layers enter the splicing layer, and the splicing layer sequentially enters the two GRU layers, the fifth linear layer and the sigmoid activation function to obtain the output out of the echo detection model. To make the model easier to train, the output range is limited to (0, 1) by the sigmoid function, preventing the loss function from becoming too large.
Step 4, training the echo detection model;
in this embodiment, the step 4 specifically includes:
step 41, obtaining the output of the echo cancellation model ŝ(n); selecting a plurality of synthesized echo pairs and a plurality of real echo pairs from the aec-challenge data set as far-end audio signals y(n) in the training data, and selecting a plurality of audio signals from the dns-challenge data set as far-end audio signals y(n) in the training data;
step 42, taking the output of the echo cancellation model ŝ(n) and the far-end audio signal y(n) as the input of the echo detection model; the energy of each frame, i.e. the sum of the squared sample amplitudes of the frame, is calculated from the clean speech signal s(n) provided by the dns-challenge data set to obtain the label of each frame: it is judged whether the sum of squares of the corresponding clean speech signal s(n) in the current frame is larger than 0.001; if yes, the speech label of the current frame is set to 1, otherwise it is set to 0; because the echo detection model mainly safeguards the cancellation result when the echo cancellation model performs poorly, 70% of the data in the training data set have a signal-to-noise ratio relative to clean speech below 0 dB. 80000 pieces of 10-second audio are generated as the training set by the above method. Meanwhile, because echoes differ under different microphones, about 10 hours of echo pairs are collected for each different device; about 8000 near-end and far-end signals are generated by the same method together with the clean speech signals of the dns-challenge data, fed to the trained model to obtain outputs, and these audio clips are set as the test set.
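The frame-labeling rule of step 42 (label 1 when the sum of squared sample amplitudes of the clean-speech frame exceeds 0.001, else 0) can be sketched as below; `frame_labels` is a hypothetical helper name, with the 32 ms / 16 ms framing of the detection model at 16 kHz.

```python
import numpy as np

def frame_labels(s, frame_len=512, hop=256, threshold=1e-3):
    """Step 42: per-frame speech labels from the clean signal s(n).

    A frame gets label 1 when its energy (sum of squared sample
    amplitudes) exceeds the 0.001 threshold, otherwise label 0.
    """
    labels = []
    for start in range(0, len(s) - frame_len + 1, hop):
        frame = s[start:start + frame_len]
        labels.append(1 if np.sum(frame ** 2) > threshold else 0)
    return np.array(labels)

s = np.zeros(2048)
s[512:1024] = 0.1          # a burst of speech-like energy
labels = frame_labels(s)   # frames overlapping the burst get label 1
```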
Step 43, using the average mean square error as the loss function, namely: L = (1/N) Σ (out(n) - label(n))², where out(n) represents the output of the echo detection model at the current frame, label(n) is the corresponding label, and N is the total number of frames; when out(n) is stable and approaches label(n), training is complete. The training goal of the model is to make the output as close as possible to label(n). The model that performs best on the test set is kept as the final model. The training frame length of the model is 32 ms and the frame shift is 16 ms.
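The average mean square error of step 43 can be computed as below; `detection_loss` is a hypothetical helper name for illustration only.

```python
import numpy as np

def detection_loss(out, label):
    """Step 43: average mean square error between the per-frame
    detector output out(n) and its label label(n)."""
    out = np.asarray(out, dtype=float)
    label = np.asarray(label, dtype=float)
    return np.mean((out - label) ** 2)

# Three frames: outputs near the labels give a small loss.
loss = detection_loss([0.9, 0.1, 0.8], [1, 0, 1])
```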
By constructing and training the echo detection model, residual echo can still be removed when the echo cancellation model performs poorly; at the same time, by allowing the detection threshold to be adapted to different scenarios, the problem that an end-to-end echo cancellation model behaves inconsistently across scenarios can be mitigated.
Deployment in practical application: the final echo cancellation and echo detection models are exported. Unlike training, where multi-frame audio can be input at once, only one frame of audio is available at a time in actual deployment. Because of the LSTM and GRU structures described above, the output state of the previous frame must be carried over as input to the next frame, so the exported echo cancellation model needs its core modules converted to streaming form.
Step 5, taking the near-end audio signal and the far-end audio signal as the input of the trained echo cancellation model to obtain the output of the trained echo cancellation model; in this embodiment, the step 5 includes:
Step 51, inputting a target far-end audio signal and a target near-end audio signal into the echo cancellation model;
step 52, framing the target far-end audio signal and the target near-end audio signal and calculating the Fourier coefficient of each frame;
step 53, normalizing the Fourier coefficients of each frame so that they conform to a Gaussian distribution as closely as possible, giving the model better generalization; splicing the Fourier coefficients of each frame of the far-end audio signal and the near-end audio signal, i.e. if each frame of the far-end and near-end audio signals has 257 Fourier coefficients, each spliced frame has 257 x 2 = 514 Fourier coefficients; then performing the operations of the LSTM layers, the third linear layer and the sigmoid activation function, where in the LSTM layers the output of the previous frame is used as the input of the next frame; after framing and Fourier transformation, the near-end audio signal is multiplied by the result output by the sigmoid activation function in the first core module and then subjected to inverse Fourier transformation;
step 54, performing the operation of a first linear layer on the inverse Fourier transform result and the far-end audio signal respectively, normalizing and splicing the operation results of the first linear layers, and performing the operations of the LSTM layers, the third linear layer and the sigmoid activation function, where in the LSTM layers the output of the previous frame is used as the input of the next frame; multiplying the result output after the inverse Fourier transform and first linear layer operation with the result output by the sigmoid activation function in the second core module;
Step 55, restoring the number of coefficients per frame to the number of Fourier coefficients through the second linear layer, and overlap-adding the results to obtain an output signal of the same length as the input signal. The object of the whole model is to make the output signal ŝ(n) as close as possible to s(n).
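The framing, Fourier analysis and splicing of steps 52-53 can be sketched as below. This is a minimal sketch under stated assumptions: `frame_features` is a hypothetical name, and the log-magnitude compression is an assumed stand-in for the patent's normalization layers, whose exact form is not specified; a 512-sample frame yields 257 rfft coefficients, so the spliced frame has 514.

```python
import numpy as np

def frame_features(near, far, frame_len=512, hop=128):
    """Steps 52-53 sketch: frame both signals (32 ms frames, 8 ms hop
    at 16 kHz), take per-frame Fourier magnitudes (257 rfft bins for
    512 samples) and splice near- and far-end features into
    514-dimensional vectors."""
    feats = []
    for start in range(0, min(len(near), len(far)) - frame_len + 1, hop):
        n_mag = np.abs(np.fft.rfft(near[start:start + frame_len]))
        f_mag = np.abs(np.fft.rfft(far[start:start + frame_len]))
        # Log compression keeps the features closer to Gaussian.
        n_mag = np.log1p(n_mag)
        f_mag = np.log1p(f_mag)
        feats.append(np.concatenate([n_mag, f_mag]))
    return np.stack(feats)

rng = np.random.default_rng(2)
feats = frame_features(rng.standard_normal(4096), rng.standard_normal(4096))
```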
Step 6, taking the output of the trained echo cancellation model and the far-end audio signal as the input of the trained echo detection model, and obtaining the output of the trained echo detection model as an echo detection tag;
in this embodiment, the step 6 specifically includes:
step 61, inputting target far-end audio signals and the output of an echo cancellation model into the echo detection model;
step 62, framing the target far-end audio signal and the output of the echo cancellation model and calculating the Fourier coefficient of each frame;
step 63, normalizing the Fourier coefficients of each frame, splicing the Fourier coefficients of each frame of the far-end audio signal and of the echo cancellation model output, and obtaining the output out of the echo detection model through the operations of the fourth linear layer, the GRU layers, the fifth linear layer and the sigmoid activation function; in the GRU layers, the output of the previous frame is taken as the input of the next frame.
And step 7, judging the state of the output frames of the current echo cancellation model according to the echo detection label to obtain the final target audio.
In this embodiment, the step 7 specifically includes:
step 71, the input frame length of the echo cancellation model is 512 points and the frame shift is 128 points; in actual deployment, a 512-point input register is needed to store the input of the previous frame, and a 512-point output register stores the output of the previous frame;
step 72, assuming 512 points are input at a time, each input needs to run 512/128 = 4 times; the first forward pass takes the last 384 points of the previous frame and the first 128 points of the current frame from the input register, the second takes the last 256 points of the previous frame and the first 256 points of the current frame, the third takes the last 128 points of the previous frame and the first 384 points of the current frame, and the fourth takes all 512 points of the current frame; when all 4 forward inferences are completed, the input of the current frame is loaded into the input register for use when the next frame arrives;
step 73, for each forward inference output, the output register is first shifted forward by 128 positions, i.e. the original 128th to 512th sample points in the output register become the 0th to 384th points of the register, and then the freed 128 positions are zeroed (positions 384 to 512 of the output register are set to 0); the 512 points output each time are added to the 512 points in the output register and the first 128 points are emitted, so each forward inference outputs 128 points; assuming 512 points are input at a time, after 512/128 = 4 forward inferences, 128 x 4 = 512 points are output, so the output has the same length as the input;
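The output-register logic of step 73 can be sketched as an overlap-add buffer; `OverlapAddBuffer` is a hypothetical class name for illustration.

```python
import numpy as np

class OverlapAddBuffer:
    """Step 73 sketch: 512-point output register with a 128-point
    frame shift.  Each forward pass shifts the register forward by
    128 samples, zeroes the freed tail, adds the 512-point model
    output, and emits the first 128 samples."""
    def __init__(self, frame_len=512, hop=128):
        self.hop = hop
        self.register = np.zeros(frame_len)

    def push(self, model_out):
        # Shift forward by `hop` and zero the freed tail positions.
        self.register = np.roll(self.register, -self.hop)
        self.register[-self.hop:] = 0.0
        # Accumulate the new 512-point output and emit the first 128.
        self.register += model_out
        return self.register[: self.hop].copy()

buf = OverlapAddBuffer()
out1 = buf.push(np.ones(512))   # first pass: only one frame contributes
out2 = buf.push(np.ones(512))   # second pass: two frames overlap-add
```

The second push emits doubled values because two overlapping 512-point outputs now accumulate in the same register positions, which is exactly the overlap-add behaviour the register implements.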
Step 74, the input frame length of the echo detection model is 512 and the frame shift is 256, so a 512-point register is needed; each time the echo cancellation model outputs 128 points, the register is shifted forward by 128 points and the 128 points output by the echo cancellation model are loaded in; since the frame shift of the echo detection model is twice that of the echo cancellation model, one echo detection is performed for every 2 forward inferences of the echo cancellation model;
step 75, setting a decision threshold; if the output of the echo detection model is smaller than the threshold several consecutive times, the output of the echo cancellation model is considered to contain a large echo at that moment and the output is set to 0; otherwise it is treated as normal speech. When the output of the echo detection model is smaller than the threshold 5 consecutive times, the current frame is judged to be echo; this prevents distortion and improves the accuracy of the result. By tuning the echo detection threshold, the method adapts to different scenarios: the threshold may be lowered if listening quality is to be preserved, and raised if the echo cancellation effect is to be guaranteed.
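The hangover decision of step 75 (zero the output only after 5 consecutive sub-threshold detections) can be sketched as below; `gate_frames` is a hypothetical name, and the 0.5 threshold is an assumed value since the patent leaves it tunable per deployment.

```python
import numpy as np

def gate_frames(det_outputs, frames, threshold=0.5, hangover=5):
    """Step 75 sketch: zero an echo-cancellation output frame only
    when the detector output has stayed below the threshold for
    `hangover` consecutive decisions; otherwise pass it through."""
    below = 0
    gated = []
    for out, frame in zip(det_outputs, frames):
        below = below + 1 if out < threshold else 0
        if below >= hangover:
            gated.append(np.zeros_like(frame))   # judged to be echo
        else:
            gated.append(frame)                  # normal speech
    return gated

frames = [np.ones(4)] * 7
dets = [0.9, 0.1, 0.1, 0.1, 0.1, 0.1, 0.8]
out = gate_frames(dets, frames)   # only the 5th consecutive low frame is zeroed
```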
The foregoing description is only a partial embodiment of the present invention, and is not intended to limit the scope of the present invention, and all equivalent devices or equivalent processes using the descriptions and the drawings of the present invention or directly or indirectly applied to other related technical fields are included in the scope of the present invention.

Claims (8)

1. The real-time echo cancellation method based on the neural network is characterized by comprising the following steps:
step 1, constructing an echo cancellation model; the method specifically comprises the following steps:
step 11, the structure of the echo cancellation model uses an open source DTLN-aec model;
step 12, the DTLN-aec model is composed of two framing layers, two Fourier transform layers, two core modules, an inverse Fourier transform layer, two first linear layers, one second linear layer and one overlap-add layer, each core module including two normalization layers, one connection layer, two LSTM layers, one third linear layer and one sigmoid activation function;
step 2, training the echo cancellation model;
step 3, constructing an echo detection model; the method specifically comprises the following steps:
step 31, the echo detection model consists of two framing layers, two Fourier transformation layers, two normalization layers, a splicing layer, a fourth linear layer, two GRU layers, a fifth linear layer and a sigmoid activation function;
step 32, the inputs of the echo detection model are the output of the echo cancellation model ŝ(n) and the far-end audio signal y(n), the two framing layers being adapted to receive the output of the echo cancellation model ŝ(n) and the far-end audio signal y(n) respectively; the two framing layers respectively enter the two Fourier transform layers, the two Fourier transform layers respectively enter the two normalization layers, the two normalization layers enter the splicing layer, and the splicing layer sequentially enters the two GRU layers, the fifth linear layer and the sigmoid activation function to obtain the output out of the echo detection model;
Step 4, training the echo detection model; the method specifically comprises the following steps:
step 41, obtaining the output of the echo cancellation model ŝ(n); selecting a plurality of synthesized echo pairs and a plurality of real echo pairs from the aec-challenge data set as far-end audio signals y(n) in the training data, and selecting a plurality of audio signals from the dns-challenge data set as far-end audio signals y(n) in the training data;
step 42, taking the output of the echo cancellation model ŝ(n) and the far-end audio signal y(n) as the input of the echo detection model; calculating the energy of each frame, i.e. the sum of the squared sample amplitudes of the frame, from the clean speech signal s(n) provided by the dns-challenge data set to obtain the label of each frame: judging whether the sum of squares of the corresponding clean speech signal s(n) in the current frame is larger than 0.001; if yes, the speech label of the current frame is set to 1, otherwise it is set to 0;
Step 43, using the average mean square error as the loss function, namely: L = (1/N) Σ (out(n) - label(n))², where out(n) represents the output of the echo detection model at the current frame, label(n) is the corresponding label, and N is the total number of frames; when out(n) is stable and approaches label(n), training is complete;
step 5, taking the near-end audio signal and the far-end audio signal as the input of the trained echo cancellation model to obtain the output of the trained echo cancellation model;
step 6, taking the output of the trained echo cancellation model and the far-end audio signal as the input of the trained echo detection model, and obtaining the output of the trained echo detection model as an echo detection tag;
and 7, judging the state of the output frame of the current echo cancellation model according to the echo detection label to obtain final target audio.
2. The method of neural network-based real-time echo cancellation according to claim 1, wherein said step 12 further comprises:
step 13, inputting a near-end audio signal x (n) and a far-end audio signal y (n) by the echo cancellation model, wherein the two framing layers are used for receiving the near-end audio signal x (n) and the far-end audio signal y (n), the two framing layers respectively enter two Fourier transform layers, the two Fourier transform layers enter a first core module, the two Fourier transform layers respectively enter two normalization layers, the two normalization layers enter a connection layer, and the connection layer sequentially enters two LSTM layers, a third linear layer and a sigmoid activation function; the near-end audio signal sequentially passes through a framing layer and a Fourier transform layer, and then the result output by the near-end audio signal is multiplied by the result output by the sigmoid activation function in the first core module and then enters an inverse Fourier transform layer;
Step 14, the inverse Fourier transform layer and the far-end audio signal enter two first linear layers respectively, the two first linear layers enter a second core module, the two first linear layers enter two normalization layers respectively, the two normalization layers enter a connection layer, and the connection layer sequentially enters two LSTM layers, a third linear layer and a sigmoid activation function; the result output after passing through the inverse Fourier transform layer and the first linear layer is multiplied by the result output by the sigmoid activation function in the second core module, and then sequentially enters the second linear layer and the overlap-add layer to obtain the output of the echo cancellation model ŝ(n).
3. The method for real-time echo cancellation based on neural network as claimed in claim 1, wherein said step 2 specifically comprises:
step 21, acquiring a plurality of audio signals from three open-source data sets: dns-challenge, aec-challenge and openslr;
step 22, the input of the echo cancellation model is a near-end audio signal x (n) and a far-end audio signal y (n), where x (n) =s (n) +e (n) +n, s (n) represents a clean speech signal that needs to be stripped from the near-end audio signal x (n), e (n) represents an echo generated after the far-end audio signal y (n) is played and re-collected by the near-end microphone, and n represents environmental noise;
Step 23, selecting a plurality of synthesized echo pairs and a plurality of real echo pairs from a aec-challenge data set as y (n) and e (n) in training data, selecting a plurality of different noisy speech signals and clean speech signal pairs from a dns-challenge data set as s (n) +n and s (n) in training data respectively, and selecting a plurality of audio signals from the dns-challenge data set as y (n) in training data;
step 24, selecting a plurality of audio signals from the rir_noise data in the openslr data set, generating simulated echoes in real time during training, and using the generated simulated echoes as e(n) of the training data to participate in training;
step 25, using a plurality of near-end audio signals and a plurality of far-end audio signals as training data, and selecting 10% of the data from the training data as a training test set;
step 26, training the echo cancellation model by using the training data as the input of the echo cancellation model, wherein the loss function is: loss = Σ(ŝ(n) - s(n))², where ŝ(n) represents the output of the echo cancellation model and s(n) represents the clean speech signal, i.e. the target signal; when the value of ŝ(n) is stable and approaches s(n), training is complete;
and step 27, after training, taking the test set as the input of the echo cancellation model, and testing the echo cancellation model to finally obtain the trained echo cancellation model.
4. A neural network-based real-time echo cancellation method according to claim 3, wherein said step 24 is specifically as follows:
step 241, randomly selecting 1 far-end reference signal from the plurality of far-end audio signals y(n); if the selected far-end reference signal is an audio signal in the aec-challenge data set, directly using the e(n) corresponding to the selected far-end reference signal; if the selected far-end reference signal is an audio signal in the dns-challenge data set, going to step 242;
step 242, randomly selecting 1 piece of rir_noise data in the openslr data set as a response signal, and convolving the response signal with the far-end reference signal;
step 243, filtering the convolved result by a bandpass filter with random parameters to obtain 1 analog echo signal;
step 244, repeating steps 241-243 to obtain a plurality of analog echo signals as e (n) of training data.
5. The neural network-based real-time echo cancellation method of claim 4, wherein the lowest cut-off frequency of the band-pass filter is a random number in 100-400 Hz, and the highest cut-off frequency is a random number in 6000-7500 Hz; a random number in the range 0-100 ms is selected as the time delay of the echo signal.
6. The method of neural network-based real-time echo cancellation according to claim 1, wherein said step 5 specifically comprises:
step 51, inputting a target far-end audio signal and a target near-end audio signal into the echo cancellation model;
step 52, framing the target far-end audio signal and the target near-end audio signal and calculating the Fourier coefficient of each frame;
step 53, normalizing the Fourier coefficients of each frame, splicing the Fourier coefficients of each frame of the far-end audio signal and the near-end audio signal, and performing the operations of the LSTM layers, the third linear layer and the sigmoid activation function, wherein in the LSTM layers the output of the previous frame is used as the input of the next frame; after framing and Fourier transformation, the near-end audio signal is multiplied by the result output by the sigmoid activation function in the first core module and then subjected to inverse Fourier transformation;
step 54, respectively performing operation of a first linear layer through an inverse Fourier transform result and a far-end audio signal, performing normalization and splicing processing on an operation result of the first linear layer, and performing operation of an LSTM layer, a third linear layer and a sigmoid activation function, wherein in the LSTM layer, the output of the previous frame is used as the input of the next frame; multiplying the result output after the inverse Fourier transform and the first linear layer operation with the result output by the sigmoid activation function in the second core module;
Step 55, restoring the number of coefficients per frame to the number of Fourier coefficients through the second linear layer, and overlap-adding the results to obtain an output signal ŝ(n) of the same length as the input signal.
7. The method for real-time echo cancellation based on neural network according to claim 1, wherein the step 6 specifically comprises:
step 61, inputting target far-end audio signals and the output of an echo cancellation model into the echo detection model;
step 62, framing the target far-end audio signal and the output of the echo cancellation model and calculating the Fourier coefficient of each frame;
step 63, normalizing the Fourier coefficients of each frame, splicing the Fourier coefficients of each frame of the far-end audio signal and of the echo cancellation model output, and obtaining the output out of the echo detection model through the operations of the fourth linear layer, the GRU layers, the fifth linear layer and the sigmoid activation function; in the GRU layers, the output of the previous frame is taken as the input of the next frame.
8. The method for real-time echo cancellation based on neural network according to claim 1, wherein the step 7 specifically comprises:
Step 71, the input frame length of the echo cancellation model is 512 points and the frame shift is 128 points; in actual deployment, a 512-point input register is needed to store the input of the previous frame, and a 512-point output register stores the output of the previous frame;
step 72, assuming 512 points are input at a time, each input needs to run 512/128 = 4 times; the first forward pass takes the last 384 points of the previous frame and the first 128 points of the current frame from the input register, the second takes the last 256 points of the previous frame and the first 256 points of the current frame, the third takes the last 128 points of the previous frame and the first 384 points of the current frame, and the fourth takes all 512 points of the current frame; when all 4 forward inferences are completed, the input of the current frame is loaded into the input register for use when the next frame arrives;
step 73, for each forward inference output, first shifting the output register forward by 128 positions and then zeroing the freed 128 positions; adding the 512 points output each time to the 512 points in the output register and emitting the first 128 points, so that each forward inference outputs 128 points; assuming 512 points are input at a time, after 512/128 = 4 forward inferences, 128 x 4 = 512 points are output, so the output has the same length as the input;
Step 74, the input frame length of the echo detection model is 512 and the frame shift is 256, so a 512-point register is needed; each time the echo cancellation model outputs 128 points, the register is shifted forward by 128 points and the 128 points output by the echo cancellation model are loaded in; since the frame shift of the echo detection model is twice that of the echo cancellation model, one echo detection is performed for every 2 forward inferences of the echo cancellation model;
step 75, setting a decision threshold; if the output of the echo detection model is smaller than the threshold several consecutive times, the output of the echo cancellation model is considered to contain a large echo at that moment and the output is set to 0; otherwise it is treated as normal speech.
CN202311768706.8A 2023-12-21 2023-12-21 Real-time echo cancellation method based on neural network Active CN117437929B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311768706.8A CN117437929B (en) 2023-12-21 2023-12-21 Real-time echo cancellation method based on neural network

Publications (2)

Publication Number Publication Date
CN117437929A CN117437929A (en) 2024-01-23
CN117437929B true CN117437929B (en) 2024-03-08

Family

ID=89546554

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311768706.8A Active CN117437929B (en) 2023-12-21 2023-12-21 Real-time echo cancellation method based on neural network

Country Status (1)

Country Link
CN (1) CN117437929B (en)

Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105957520A (en) * 2016-07-04 2016-09-21 北京邮电大学 Voice state detection method suitable for echo cancellation system
CN111292759A (en) * 2020-05-11 2020-06-16 上海亮牛半导体科技有限公司 Stereo echo cancellation method and system based on neural network
CN111885275A (en) * 2020-07-23 2020-11-03 海尔优家智能科技(北京)有限公司 Echo cancellation method and device for voice signal, storage medium and electronic device
CN112863535A (en) * 2021-01-05 2021-05-28 中国科学院声学研究所 Residual echo and noise elimination method and device
CN113763977A (en) * 2021-04-16 2021-12-07 腾讯科技(深圳)有限公司 Method, apparatus, computing device and storage medium for eliminating echo signal
CN113870874A (en) * 2021-09-23 2021-12-31 武汉大学 Multi-feature fusion echo cancellation method and system based on self-attention transformation network
CN114283830A (en) * 2021-12-17 2022-04-05 南京工程学院 Deep learning network-based microphone signal echo cancellation model construction method
CN114530160A (en) * 2022-02-25 2022-05-24 携程旅游信息技术(上海)有限公司 Model training method, echo cancellation method, system, device and storage medium
CN114566176A (en) * 2022-02-23 2022-05-31 苏州蛙声科技有限公司 Residual echo cancellation method and system based on deep neural network
CN114827363A (en) * 2022-04-13 2022-07-29 随锐科技集团股份有限公司 Method, device and readable storage medium for eliminating echo in call process
CN115132215A (en) * 2022-06-07 2022-09-30 上海声瀚信息科技有限公司 Single-channel speech enhancement method
CN115312073A (en) * 2022-06-28 2022-11-08 上海声瀚信息科技有限公司 Low-complexity residual echo suppression method combining signal processing and deep neural network
CN115457928A (en) * 2022-07-27 2022-12-09 杭州芯声智能科技有限公司 Echo cancellation method and system based on neural network double-talk detection
CN116453532A (en) * 2023-04-28 2023-07-18 济南大学 Echo cancellation method of acoustic echo
CN116781829A (en) * 2022-03-07 2023-09-19 联发科技(新加坡)私人有限公司 Apparatus and method for performing acoustic echo cancellation

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11393487B2 (en) * 2019-03-28 2022-07-19 Samsung Electronics Co., Ltd. System and method for acoustic echo cancelation using deep multitask recurrent neural networks
US10803881B1 (en) * 2019-03-28 2020-10-13 Samsung Electronics Co., Ltd. System and method for acoustic echo cancelation using deep multitask recurrent neural networks
US11902757B2 (en) * 2022-06-14 2024-02-13 Tencent America LLC Techniques for unified acoustic echo suppression using a recurrent neural network

Patent Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105957520A (en) * 2016-07-04 2016-09-21 北京邮电大学 Voice state detection method suitable for echo cancellation system
CN111292759A (en) * 2020-05-11 2020-06-16 上海亮牛半导体科技有限公司 Stereo echo cancellation method and system based on neural network
CN111885275A (en) * 2020-07-23 2020-11-03 海尔优家智能科技(北京)有限公司 Echo cancellation method and device for voice signal, storage medium and electronic device
CN112863535A (en) * 2021-01-05 2021-05-28 中国科学院声学研究所 Residual echo and noise elimination method and device
CN113763977A (en) * 2021-04-16 2021-12-07 腾讯科技(深圳)有限公司 Method, apparatus, computing device and storage medium for eliminating echo signal
CN113870874A (en) * 2021-09-23 2021-12-31 武汉大学 Multi-feature fusion echo cancellation method and system based on self-attention transformation network
WO2023044961A1 (en) * 2021-09-23 2023-03-30 武汉大学 Multi-feature fusion echo cancellation method and system based on self-attention transform network
CN114283830A (en) * 2021-12-17 2022-04-05 南京工程学院 Deep learning network-based microphone signal echo cancellation model construction method
CN114566176A (en) * 2022-02-23 2022-05-31 苏州蛙声科技有限公司 Residual echo cancellation method and system based on deep neural network
CN114530160A (en) * 2022-02-25 2022-05-24 携程旅游信息技术(上海)有限公司 Model training method, echo cancellation method, system, device and storage medium
CN116781829A (en) * 2022-03-07 2023-09-19 联发科技(新加坡)私人有限公司 Apparatus and method for performing acoustic echo cancellation
CN114827363A (en) * 2022-04-13 2022-07-29 随锐科技集团股份有限公司 Method, device and readable storage medium for eliminating echo in call process
CN115132215A (en) * 2022-06-07 2022-09-30 上海声瀚信息科技有限公司 Single-channel speech enhancement method
CN115312073A (en) * 2022-06-28 2022-11-08 上海声瀚信息科技有限公司 Low-complexity residual echo suppression method combining signal processing and deep neural network
CN115457928A (en) * 2022-07-27 2022-12-09 杭州芯声智能科技有限公司 Echo cancellation method and system based on neural network double-talk detection
CN116453532A (en) * 2023-04-28 2023-07-18 济南大学 Echo cancellation method of acoustic echo

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
A Simplified Deep Learning model for Acoustic Feedback Cancellation in Digital Hearing Aid;K. Posnaik等;2021 International Symposium of Asian Control Association on Intelligent Robotics and Industrial Automation (IRIA);20211104;432-436 *
Acoustic Echo Cancellation with the Dual-Signal Transformation LSTM Network;N. L. Westhausen等;ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP);20210513;7138-7142 *
An Input Residual Connection for Simplifying Gated Recurrent Neural Networks;N. I. H. Kuo等;2020 International Joint Conference on Neural Networks (IJCNN);20200928;1-8 *
Nonlinear Residual Echo Suppression Based on Gated Dual Signal Transformation LSTM Network;K. Xie等;2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC);20221110;1696-1701 *
Research on Microphone Array Speech Enhancement Technology Based on Multi-task Networks;Lai Zhipeng;China Master's Theses Full-text Database, Information Science and Technology Volume;20231115;I135-7 *

Also Published As

Publication number Publication date
CN117437929A (en) 2024-01-23

Similar Documents

Publication Publication Date Title
Sridhar et al. ICASSP 2021 acoustic echo cancellation challenge: Datasets, testing framework, and results
CN111292759B (en) Stereo echo cancellation method and system based on neural network
Cutler et al. INTERSPEECH 2021 Acoustic Echo Cancellation Challenge.
Lee et al. DNN-based residual echo suppression.
US20220301577A1 (en) Echo cancellation method and apparatus
CN111031448B (en) Echo cancellation method, echo cancellation device, electronic equipment and storage medium
CN111768796A (en) Acoustic echo cancellation and dereverberation method and device
CN111883154B (en) Echo cancellation method and device, computer-readable storage medium, and electronic device
CN106161820B (en) A kind of interchannel decorrelation method for stereo acoustic echo canceler
CN115132215A (en) Single-channel speech enhancement method
Pfeifenberger et al. Acoustic Echo Cancellation with Cross-Domain Learning.
CN114283830A (en) Deep learning network-based microphone signal echo cancellation model construction method
CN117437929B (en) Real-time echo cancellation method based on neural network
CN115424627A (en) Voice enhancement hybrid processing method based on convolution cycle network and WPE algorithm
CN115579016B (en) Method and system for eliminating acoustic echo
CN115620737A (en) Voice signal processing device, method, electronic equipment and sound amplification system
CN114792524B (en) Audio data processing method, apparatus, program product, computer device and medium
CN115083431A (en) Echo cancellation method and device, electronic equipment and computer readable medium
CN116453532A (en) Echo cancellation method of acoustic echo
Zhang et al. Hybrid AHS: A hybrid of Kalman filter and deep learning for acoustic howling suppression
JP2002223182A (en) Echo canceling method, its device, its program and its recording medium
CN113990337A (en) Audio optimization method and related device, electronic equipment and storage medium
CN110246516B (en) Method for processing small space echo signal in voice communication
CN116233697B (en) Acoustic feedback suppression method and system based on deep learning
Chen An Implementation Research on Acoustic Echo Cancellation in Double-Talk Scenarios

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant