CN117437929B - Real-time echo cancellation method based on neural network - Google Patents


Publication number
CN117437929B
Authority: CN (China)
Legal status: Active
Application number: CN202311768706.8A
Other languages: Chinese (zh)
Other versions: CN117437929A
Inventors: 阮炜玄, 兰泽华, 蔡如意
Current Assignee: Ringslink Xiamen Network Communication Technologies Co ltd
Original Assignee: Ringslink Xiamen Network Communication Technologies Co ltd
Application filed by Ringslink Xiamen Network Communication Technologies Co ltd
Priority to CN202311768706.8A
Publication of CN117437929A
Application granted; publication of CN117437929B
Legal status: Active

Classifications

    • G10L21/0208 — Noise filtering (speech enhancement, e.g. noise reduction or echo cancellation)
    • G10L2021/02082 — Noise filtering where the noise is echo or reverberation of the speech
    • G10L25/30 — Speech or voice analysis characterised by the analysis technique, using neural networks
    • G06N3/0442 — Recurrent networks characterised by memory or gating, e.g. LSTM or GRU
    • G06N3/0464 — Convolutional networks [CNN, ConvNet]
    • G06N3/048 — Activation functions
    • G06N3/08 — Learning methods


Abstract

The invention discloses a neural-network-based real-time echo cancellation method comprising the following steps: step 1, constructing an echo cancellation model; step 2, training the echo cancellation model; step 3, constructing an echo detection model; step 4, training the echo detection model; step 5, taking the near-end audio signal and the far-end audio signal as the input of the trained echo cancellation model to obtain its output; step 6, taking the output of the trained echo cancellation model and the far-end audio signal as the input of the trained echo detection model to obtain its output as an echo detection label; and step 7, judging the state of the current output frame of the echo cancellation model according to the echo detection label to obtain the final target audio. By combining the echo cancellation model and the echo detection model, the invention improves the effect of the echo cancellation model and reduces the deployment difficulty of the model while requiring only a small amount of collected data.

Description

Real-time echo cancellation method based on neural network
Technical Field
The invention relates to the technical field of echo cancellation, in particular to a neural network-based real-time echo cancellation method.
Background
Echo cancellation is an important part of audio processing. Traditional echo cancellation algorithms based on adaptive filters, such as those in WebRTC and Speex, suffer from complex debugging, slow convergence, and difficulty adapting to complex environments.
With the development of neural networks, they have been widely applied to echo cancellation. There are two main neural-network-based echo cancellation schemes:
first, linear echo is cancelled and time delay estimated by an adaptive filter, while nonlinear echo and noise are removed by a neural network model. Because both an adaptive filter and a neural network model are used, this scheme places high demands on the adaptive filter, especially on delay estimation, and the diversity of scenes in practical applications makes the filter difficult to tune, so performance is hard to guarantee. In addition, since the adaptive filter's output is usually required during model training, the original near-end signal and the far-end audio signal are fed into the model together as features, making the model large, training slow, and a large amount of data necessary. Therefore, although this scheme works well, it is difficult to implement, the model is large, and deployment on device ends is difficult;
second, echo cancellation and noise reduction are realized directly by a single neural network model. This scheme is simple to implement and deploy, but because its parameters cannot be updated per device the way an adaptive filter's can, its effect is hard to guarantee across different devices and scenes.
The Chinese invention with publication number CN116887160A discloses a neural-network-based digital hearing aid howling suppression method and system, the method comprising: acquiring the voice signal received by a digital hearing aid; obtaining the state at each moment in the voice signal based on a neural network; determining a convergence stability coefficient for each moment according to the state at each moment and the distribution characteristics of the amplitude-frequency peaks at each moment; determining non-stationary moments based on the convergence stability coefficients; determining a step-size adjustment ratio based on the time interval between the non-stationary moments and the howling moments, and obtaining the convergence step size for each moment from the step-size adjustment ratio and the convergence stability coefficient; and performing echo cancellation on the voice signal based on an NLMS algorithm and the convergence step size so as to suppress howling in the digital hearing aid. That patent cancels linear echo and estimates time delay through an adaptive filter (the NLMS algorithm); such a scheme is difficult to implement, the model is large, and deployment on device ends is difficult.
Disclosure of Invention
Therefore, the object of the invention is to provide a neural-network-based real-time echo cancellation method that combines an echo cancellation model with an echo detection model, improving the effect of the echo cancellation model and reducing the deployment difficulty of the model while requiring only a small amount of collected data.
In order to achieve the technical purpose, the invention adopts the following technical scheme:
the invention provides a neural network-based real-time echo cancellation method, which comprises the following steps:
step 1, constructing an echo cancellation model; the method specifically comprises the following steps:
step 11, the structure of the echo cancellation model uses an open source DTLN-aec model;
step 12, the DTLN-aec model consists of two framing layers, two Fourier transform layers, two core modules, an inverse Fourier transform layer, two first linear layers, one second linear layer and an overlap-add layer; each core module includes two normalization layers, one connection layer, two LSTM layers, one third linear layer and one sigmoid activation function;
step 2, training the echo cancellation model;
step 3, constructing an echo detection model; the method specifically comprises the following steps:
Step 31, the echo detection model consists of two framing layers, two Fourier transformation layers, two normalization layers, a splicing layer, a fourth linear layer, two GRU layers, a fifth linear layer and a sigmoid activation function;
step 32, the inputs of the echo detection model are the output ŝ(n) of the echo cancellation model and the far-end audio signal y(n); the two framing layers receive ŝ(n) and y(n) respectively, the two framing layers enter two Fourier transform layers respectively, the two Fourier transform layers enter two normalization layers respectively, the two normalization layers enter the splicing layer, and the splicing layer sequentially enters the fourth linear layer, the two GRU layers, the fifth linear layer and the sigmoid activation function to obtain the output out of the echo detection model;
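As an illustration of steps 31-32, the detection model can be sketched as below. This is a minimal PyTorch sketch, not the patent's exact implementation: the hidden width (64), frame length 512 and hop 256 are assumptions (the frame sizes are taken from the deployment description later in the text).

```python
import torch
import torch.nn as nn

class EchoDetector(nn.Module):
    """Sketch of the detection model: two framing + Fourier transform layers
    (realized here with torch.stft), two normalization layers, a splicing
    (concatenation) layer, a fourth linear layer, two GRU layers, a fifth
    linear layer and a sigmoid, giving one score per frame."""
    def __init__(self, frame_len=512, hop=256, hidden=64):
        super().__init__()
        bins = frame_len // 2 + 1
        self.frame_len, self.hop = frame_len, hop
        self.norm_s = nn.LayerNorm(bins)            # normalization, model-output branch
        self.norm_y = nn.LayerNorm(bins)            # normalization, far-end branch
        self.fc_in = nn.Linear(2 * bins, hidden)    # "fourth linear layer"
        self.gru = nn.GRU(hidden, hidden, num_layers=2, batch_first=True)
        self.fc_out = nn.Linear(hidden, 1)          # "fifth linear layer"
        self.window = torch.hann_window(frame_len)

    def forward(self, s_hat, far, state=None):
        # Framing + Fourier transform of both inputs; magnitudes only
        S = torch.stft(s_hat, self.frame_len, self.hop,
                       window=self.window, return_complex=True).abs()
        Y = torch.stft(far, self.frame_len, self.hop,
                       window=self.window, return_complex=True).abs()
        # Normalize each branch, then splice (concatenate) along the feature axis
        x = torch.cat([self.norm_s(S.transpose(1, 2)),
                       self.norm_y(Y.transpose(1, 2))], dim=-1)
        h, state = self.gru(self.fc_in(x), state)
        out = torch.sigmoid(self.fc_out(h)).squeeze(-1)  # (batch, frames) in [0, 1]
        return out, state
```

Returning `state` lets the previous frame's GRU state be fed back in during streaming use, matching the frame-by-frame operation described later for step 63.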
Step 4, training the echo detection model;
step 5, taking the near-end audio signal and the far-end audio signal as the input of the trained echo cancellation model to obtain the output of the trained echo cancellation model;
step 6, taking the output of the trained echo cancellation model and the far-end audio signal as the input of the trained echo detection model, and obtaining the output of the trained echo detection model as an echo detection tag;
step 7, judging the state of the current output frame of the echo cancellation model according to the echo detection label to obtain the final target audio.
Further, the step 12 further includes:
step 13, the echo cancellation model takes as input a near-end audio signal x(n) and a far-end audio signal y(n); the two framing layers receive x(n) and y(n), the two framing layers enter two Fourier transform layers respectively, and the two Fourier transform layers enter the first core module: there they enter two normalization layers respectively, the two normalization layers enter a connection layer, and the connection layer sequentially enters two LSTM layers, a third linear layer and a sigmoid activation function; the near-end audio signal, after passing through its framing layer and Fourier transform layer, is multiplied by the output of the sigmoid activation function of the first core module, and the product enters the inverse Fourier transform layer;
step 14, the output of the inverse Fourier transform layer and the far-end audio signal enter the two first linear layers respectively, and the two first linear layers enter the second core module: there they enter two normalization layers respectively, the two normalization layers enter a connection layer, and the connection layer sequentially enters two LSTM layers, a third linear layer and a sigmoid activation function; the result output after the inverse Fourier transform layer and the first linear layer is multiplied by the output of the sigmoid activation function of the second core module, and the product sequentially enters the second linear layer and the overlap-add layer to obtain the output ŝ(n) of the echo cancellation model.
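The two-stage structure of steps 12-14 can be sketched as follows. This is a simplified PyTorch rendition of the open-source DTLN-aec idea (the published model is, to our knowledge, a TensorFlow/Keras implementation); the hidden width and the use of `torch.stft`/`unfold` for the framing and Fourier layers are assumptions, not the patent's exact code.

```python
import torch
import torch.nn as nn

FRAME_LEN, HOP = 512, 128

class CoreModule(nn.Module):
    """Two normalization layers -> connection (concat) -> two LSTM layers
    -> third linear layer -> sigmoid mask."""
    def __init__(self, feat_dim, hidden=128):
        super().__init__()
        self.norm_a = nn.LayerNorm(feat_dim)
        self.norm_b = nn.LayerNorm(feat_dim)
        self.rnn = nn.LSTM(2 * feat_dim, hidden, num_layers=2, batch_first=True)
        self.fc = nn.Linear(hidden, feat_dim)

    def forward(self, a, b, state=None):
        x = torch.cat([self.norm_a(a), self.norm_b(b)], dim=-1)
        y, state = self.rnn(x, state)
        return torch.sigmoid(self.fc(y)), state

class DTLNAec(nn.Module):
    def __init__(self, units=128):
        super().__init__()
        bins = FRAME_LEN // 2 + 1
        self.core1 = CoreModule(bins, units)
        self.lin_mic = nn.Linear(FRAME_LEN, units)   # the "first linear layers"
        self.lin_far = nn.Linear(FRAME_LEN, units)
        self.core2 = CoreModule(units, units)
        self.lin_out = nn.Linear(units, FRAME_LEN)   # the "second linear layer"
        self.window = torch.hann_window(FRAME_LEN)

    def forward(self, near, far):
        # Stage 1: spectral mask from near-end and far-end magnitudes
        X = torch.stft(near, FRAME_LEN, HOP, window=self.window, return_complex=True)
        Y = torch.stft(far, FRAME_LEN, HOP, window=self.window, return_complex=True)
        mask, _ = self.core1(X.abs().transpose(1, 2), Y.abs().transpose(1, 2))
        stage1 = torch.istft(X * mask.transpose(1, 2), FRAME_LEN, HOP,
                             window=self.window, length=near.shape[-1])
        # Stage 2: mask in a learned time-domain feature space
        f1 = stage1.unfold(-1, FRAME_LEN, HOP)       # frames of stage-1 output
        f2 = far.unfold(-1, FRAME_LEN, HOP)
        a = self.lin_mic(f1)
        mask2, _ = self.core2(a, self.lin_far(f2))
        frames = self.lin_out(a * mask2)
        # Overlap-add layer: reassemble frames into a waveform
        out = torch.zeros_like(near)
        for i in range(frames.shape[1]):
            out[:, i * HOP:i * HOP + FRAME_LEN] += frames[:, i]
        return out
```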
Further, the step 2 specifically includes:
step 21, acquiring a plurality of audio signals from the three open-source data sets dns-challenge, aec-challenge and openslr;
step 22, the inputs of the echo cancellation model are a near-end audio signal x(n) and a far-end audio signal y(n), where x(n) = s(n) + e(n) + n; s(n) represents the clean speech signal that needs to be stripped from the near-end audio signal x(n), e(n) represents the echo generated after the far-end audio signal y(n) is played and re-collected by the near-end microphone, and n represents environmental noise;
step 23, selecting a plurality of synthesized echo pairs and a plurality of real echo pairs from the aec-challenge data set as y(n) and e(n) in the training data, selecting a plurality of different noisy speech signal and clean speech signal pairs from the dns-challenge data set as s(n)+n and s(n) in the training data respectively, and selecting a plurality of audio signals from the dns-challenge data set as y(n) in the training data;
step 24, selecting a plurality of audio signals from the rir_noise data in the openslr data set, generating simulated echoes in real time during training, and using the generated simulated echoes as e(n) of the training data to participate in training;
step 25, using a plurality of near-end audio signals and a plurality of far-end audio signals as training data, and selecting 10% of the data from the training data as a test set;
step 26, training the echo cancellation model by using the training data as its input, wherein the loss function is: loss = (1/N) Σₙ (ŝ(n) − s(n))², where ŝ(n) represents the output of the echo cancellation model and s(n) the clean speech signal, i.e. the target signal; when the value of ŝ(n) is stable and approaches s(n), training is complete;
and step 27, after training, taking the test set as the input of the echo cancellation model, and testing the echo cancellation model to finally obtain the trained echo cancellation model.
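The signal model x(n) = s(n) + e(n) + n of step 22 and the loss of step 26 can be written out directly in NumPy. The SNR-scaling convention below is an assumption (the embodiment only states that signal-to-noise ratios are distributed over 0-20 dB):

```python
import numpy as np

rng = np.random.default_rng(0)

def mix_near_end(s, e, noise, snr_db=10.0):
    """Near-end microphone signal x(n) = s(n) + e(n) + noise, with the
    noise scaled so the speech-to-noise ratio equals snr_db (assumed
    convention)."""
    p_s = np.mean(s ** 2)
    p_v = np.mean(noise ** 2) + 1e-12
    noise = noise * np.sqrt(p_s / (p_v * 10.0 ** (snr_db / 10.0)))
    return s + e + noise

def ec_loss(s_hat, s):
    # Time-domain mean squared error against the clean target s(n)
    return float(np.mean((s_hat - s) ** 2))

s, e, v = (rng.standard_normal(16000) for _ in range(3))
x = mix_near_end(s, e, v)   # one second of near-end training input at 16 kHz
```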
Further, the step 24 specifically includes the following steps:
step 241, randomly selecting 1 far-end reference signal from the plurality of far-end audio signals y(n); if the selected far-end reference signal is an audio signal from the aec-challenge data set, directly use the e(n) corresponding to it; if the selected far-end reference signal is an audio signal from the dns-challenge data set, go to step 242;
step 242, randomly selecting 1 piece of rir_noise data from the openslr data set as an impulse response signal, and convolving it with the far-end reference signal;
step 243, filtering the convolution result with a band-pass filter with random parameters to obtain 1 simulated echo signal;
step 244, repeating steps 241-243 to obtain a plurality of simulated echo signals as e(n) of the training data.
Further, the lowest frequency of the band-pass filter is a random number between 100 and 400 Hz, and the highest frequency is a random number between 6000 and 7500 Hz; a random number in the range 0-100 ms is selected as the time delay of the echo signal.
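Steps 241-243, with the random band-pass and delay parameters above, can be sketched with NumPy alone. The FFT-mask band-pass is a stand-in for whatever filter design the patent intends, and the decaying-noise RIR is a synthetic stand-in for real rir_noise data:

```python
import numpy as np

rng = np.random.default_rng(1)
FS = 16000  # assumed sampling rate

def simulate_echo(far_ref, rir):
    """Simulated echo e(n): convolve the far-end reference with a room
    impulse response, band-limit with random cutoffs (100-400 Hz low,
    6000-7500 Hz high), and apply a random 0-100 ms delay."""
    echo = np.convolve(far_ref, rir)[: len(far_ref)]
    lo = rng.uniform(100, 400)                 # random lowest frequency
    hi = rng.uniform(6000, 7500)               # random highest frequency
    spec = np.fft.rfft(echo)
    freqs = np.fft.rfftfreq(len(echo), 1 / FS)
    spec[(freqs < lo) | (freqs > hi)] = 0.0    # crude FFT-mask band-pass
    echo = np.fft.irfft(spec, n=len(echo))
    delay = rng.integers(0, int(0.1 * FS))     # random delay, 0-100 ms
    return np.concatenate([np.zeros(delay), echo])[: len(far_ref)]

far = rng.standard_normal(FS)                  # 1 s of far-end audio
rir = np.exp(-np.arange(2000) / 300.0) * rng.standard_normal(2000)
e = simulate_echo(far, rir)
```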
Further, the step 4 specifically includes:
step 41, obtaining the output ŝ(n) of the echo cancellation model; selecting a plurality of synthesized echo pairs and a plurality of real echo pairs from the aec-challenge data set as far-end audio signals y(n) in the training data, and selecting a plurality of audio signals from the dns-challenge data set as far-end audio signals y(n) in the training data;
step 42, taking the output ŝ(n) of the echo cancellation model and the far-end audio signal y(n) as the input of the echo detection model; the energy of each frame, i.e. the sum of squares of the sample amplitudes in the frame, is computed from the clean speech signal s(n) provided by the dns-challenge data set to obtain the label of each frame: if the sum of squares of the corresponding clean speech signal s(n) in the current frame is greater than 0.001, the voice label of the current frame is set to 1, otherwise it is set to 0;
step 43, using the average mean squared error as the loss function, namely loss = (1/N) Σₙ (out(n) − label(n))², where out(n) represents the output of the echo detection model at the current frame, label(n) is the corresponding label, and N is the total number of frames; when out(n) is stable and approaches label(n), training is complete.
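The per-frame labelling and loss of steps 42-43 can be sketched as:

```python
import numpy as np

def frame_labels(s, frame_len=512, hop=256, thresh=1e-3):
    """Voice label per frame from the clean signal s(n): 1 if the frame's
    sum of squared sample amplitudes exceeds 0.001, else 0 (step 42)."""
    n_frames = 1 + (len(s) - frame_len) // hop
    labels = np.zeros(n_frames, dtype=np.int64)
    for i in range(n_frames):
        frame = s[i * hop: i * hop + frame_len]
        labels[i] = 1 if float(np.sum(frame ** 2)) > thresh else 0
    return labels

def detection_loss(out, labels):
    """Average mean squared error between detector outputs out(n) and
    labels label(n) over the N frames (step 43)."""
    return float(np.mean((np.asarray(out, dtype=float) - labels) ** 2))
```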
Further, the step 5 specifically includes:
step 51, inputting a target far-end audio signal and a target near-end audio signal into the echo cancellation model;
step 52, framing the target far-end audio signal and the target near-end audio signal and calculating the Fourier coefficient of each frame;
step 53, normalizing the Fourier coefficients of each frame, splicing the Fourier coefficients of each frame of the far-end audio signal and the near-end audio signal, and performing the operations of the LSTM layers, the third linear layer and the sigmoid activation function, where in the LSTM layer the state of the previous frame is used as input for the next frame; the near-end audio signal, after framing and Fourier transformation, is multiplied by the output of the sigmoid activation function of the first core module and then undergoes the inverse Fourier transform;
step 54, performing the first linear layer operations on the inverse Fourier transform result and the far-end audio signal respectively, normalizing and splicing the results of the first linear layers, and performing the operations of the LSTM layers, the third linear layer and the sigmoid activation function, where in the LSTM layer the state of the previous frame is used as input for the next frame; the result output after the inverse Fourier transform and the first linear layer operation is multiplied by the output of the sigmoid activation function of the second core module;
step 55, restoring the number of coefficients of each frame to the same number as the Fourier coefficients through the second linear layer, and overlap-adding the results to obtain an output signal ŝ(n) with the same length as the input signal.
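The phrase "the output of the previous frame is used as the input of the next frame" in steps 53-54 refers to carrying the recurrent state across frames during streaming inference. A minimal PyTorch sketch (the feature width 514 = 2 × 257 spliced Fourier magnitudes for a 512-point FFT, and the hidden size, are assumptions):

```python
import torch
import torch.nn as nn

# One LSTM stack processing one frame at a time; the hidden and cell
# states from the previous frame are fed back in for the next frame.
lstm = nn.LSTM(input_size=514, hidden_size=128, num_layers=2, batch_first=True)
state = None  # PyTorch initializes (h, c) to zeros when state is None
outputs = []
for _ in range(4):  # e.g. the 4 hops of one 512-point input block
    feat = torch.randn(1, 1, 514)   # one frame of spliced, normalized coefficients
    out, state = lstm(feat, state)  # carry (h, c) forward to the next frame
    outputs.append(out)
```

Feeding the same frames as one batched sequence gives the same result as this loop; the loop form is what a real-time deployment uses.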
Further, the step 6 specifically includes:
step 61, inputting target far-end audio signals and the output of an echo cancellation model into the echo detection model;
step 62, framing the target far-end audio signal and the output of the echo cancellation model and calculating the Fourier coefficient of each frame;
step 63, normalizing the Fourier coefficients of each frame, splicing the Fourier coefficients of each frame of the far-end audio signal and of the echo cancellation model output, and obtaining the output out of the echo detection model through the operations of the fourth linear layer, the GRU layers, the fifth linear layer and the sigmoid activation function; in the GRU layer, the state of the previous frame is used as input for the next frame.
Further, the step 7 specifically includes:
step 71, the input frame length of the echo cancellation model is 512 points and the frame shift is 128 points; in actual deployment, a 512-point input register is needed to store the previous input and a 512-point output register to store the previous output;
step 72, assuming 512 points are input at a time, each input requires 512/128 = 4 forward inferences; the first inference uses the last 384 points of the previous frame and the first 128 points of the current frame from the input register, the second uses the last 256 points of the previous frame and the first 256 points of the current frame, the third uses the last 128 points of the previous frame and the first 384 points of the current frame, and the fourth uses all 512 points of the current frame; when all 4 forward inferences are completed, the current input is loaded into the input register for use when the next frame arrives;
step 73, for each forward inference, the output register is first shifted forward by 128 points and the freed 128 positions are set to zero; the 512 points output by each inference are added to the 512 points in the output register, and the first 128 points are emitted, so each forward inference outputs 128 points; with 512 points input at a time, the 512/128 = 4 forward inferences output 128 × 4 = 512 points, so the output has the same length as the input;
step 74, the input frame length of the echo detection model is 512 and its frame shift is 256, so a 512-point register is needed; each time the echo cancellation model outputs 128 points, this register is shifted forward by 128 points and the 128 output points are loaded in; since the frame shift of the echo detection model is twice that of the echo cancellation model, one echo detection is performed for every 2 forward inferences of the echo cancellation model;
step 75, setting a decision threshold: if the output of the echo detection model stays below the threshold for several consecutive inferences, the output of the echo cancellation model is considered to contain significant residual echo and is set to zero; otherwise it is treated as normal speech.
By adopting the technical scheme, compared with the prior art, the invention has the beneficial effects that:
With this method, only one echo cancellation model needs to be trained on open-source data sets, and adapting to a different device requires collecting only a small amount of data to train a small echo detection model, without spending large amounts of time tuning an adaptive filter, which greatly reduces training difficulty and cost.
This method adds an echo detection model on top of a conventional echo cancellation model: when the echo cancellation model performs poorly, residual echo can still be removed through the echo detection model, and setting a threshold allows the requirements of different scenes to be met, alleviating the problem that end-to-end echo cancellation models are unstable across scenes. Echo cancellation performance is improved with only a small increase in complexity.
Most of the training data of this method comes from open-source data sets, and only a small amount of data needs to be collected per device to train the echo detection model, so adapting to different devices is more convenient. The quantized echo cancellation model is about 1.5 MB and the quantized echo detection model about 350 KB, so the method can be used on mobile devices.
Meanwhile, in practical applications it is often difficult to balance echo cancellation strength against listening quality, and this method can adapt to different scenes by setting the echo detection threshold: the threshold of the echo detection model can be lowered if listening quality is to be guaranteed, and raised if the echo cancellation effect is to be guaranteed.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart of an implementation of a neural network-based real-time echo cancellation method according to an embodiment of the present invention.
Fig. 2 is a schematic structural diagram of an echo cancellation model according to an embodiment of the present invention.
Fig. 3 is a schematic structural diagram of a core module in an echo cancellation model according to an embodiment of the present invention.
Fig. 4 is a schematic structural diagram of an echo detection model according to an embodiment of the present invention.
Fig. 5 is a schematic diagram of LSTM layer input and output in a core module according to an embodiment of the present invention.
Fig. 6 is a schematic diagram of input and output of a GRU layer in an echo detection model according to an embodiment of the present invention.
Detailed Description
The invention is described in further detail below with reference to the drawings and examples. It is specifically noted that the following examples are only for illustrating the present invention, but do not limit the scope of the present invention. Likewise, the following examples are only some, but not all, of the examples of the present invention, and all other examples, which a person of ordinary skill in the art would obtain without making any inventive effort, are within the scope of the present invention.
Referring to fig. 1-6, the present invention provides a neural network-based real-time echo cancellation method, which includes the following steps:
step 1, constructing an echo cancellation model;
in this embodiment, the step 1 specifically includes:
step 11, the structure of the echo cancellation model uses an open source DTLN-aec model;
step 12, the DTLN-aec model consists of two framing layers, two Fourier transform layers, two core modules, an inverse Fourier transform layer, two first linear layers, one second linear layer and an overlap-add layer; each core module includes two normalization layers, one connection layer, two LSTM layers, one third linear layer and one sigmoid activation function;
step 13, the echo cancellation model takes as input a near-end audio signal x(n) and a far-end audio signal y(n); the two framing layers receive x(n) and y(n), the two framing layers enter two Fourier transform layers respectively, and the two Fourier transform layers enter the first core module: there they enter two normalization layers respectively, the two normalization layers enter a connection layer, and the connection layer sequentially enters two LSTM layers, a third linear layer and a sigmoid activation function; the near-end audio signal, after passing through its framing layer and Fourier transform layer, is multiplied by the output of the sigmoid activation function of the first core module, and the product enters the inverse Fourier transform layer;
step 14, the output of the inverse Fourier transform layer and the far-end audio signal enter the two first linear layers respectively, and the two first linear layers enter the second core module: there they enter two normalization layers respectively, the two normalization layers enter a connection layer, and the connection layer sequentially enters two LSTM layers, a third linear layer and a sigmoid activation function; the result output after the inverse Fourier transform layer and the first linear layer is multiplied by the output of the sigmoid activation function of the second core module, and the product sequentially enters the second linear layer and the overlap-add layer to obtain the output ŝ(n) of the echo cancellation model.
Step 2, training the echo cancellation model;
in this embodiment, the step 2 specifically includes:
step 21, acquiring a plurality of audio signals from the three open-source data sets dns-challenge, aec-challenge and openslr; dns-challenge, aec-challenge and openslr are names of data sets, each a source of multiple audio data: the dns-challenge data set is the data set of the deep noise suppression challenge, the aec-challenge data set is the data set of the acoustic echo cancellation challenge, and openslr is a comprehensive site dedicated to hosting speech and language resources, including data sets of speech, noise, reverberation and the like; the rir_noise data represent room impulse responses and are the data needed to simulate echo.
Step 22, the input of the echo cancellation model is a near-end audio signal x (n) and a far-end audio signal y (n), where x (n) =s (n) +e (n) +n, s (n) represents a clean speech signal that needs to be stripped from the near-end audio signal x (n), e (n) represents an echo generated after the far-end audio signal y (n) is played and re-collected by the near-end microphone, and n represents environmental noise;
step 23, selecting a plurality of synthesized echo pairs and a plurality of real echo pairs from a aec-challenge data set as y (n) and e (n) in training data, selecting a plurality of different noisy speech signals and clean speech signal pairs from a dns-challenge data set as s (n) +n and s (n) in training data respectively, and selecting a plurality of audio signals from the dns-challenge data set as y (n) in training data; specifically, 10000 synthesized echo pairs and 10000 real echo pairs can be selected from the aec-challenge data set as y (n) and e (n) in the training data, where y (n) has 20000 pieces and e (n) has 20000 pieces. 50000 different noisy language signals of 10S are selected from the dns-challenge data set as S (n) +n,50000 clean voice signals are selected from the dns-challenge data set as S (n), and the signal to noise ratio distribution is 0-20db. Since the echo situation is very diverse, only 20000 echo signals are far from sufficient, 30000 audio signals are selected in the dns-change dataset as far-end audio signals y (n). There are a total of 50000 s (n), 50000 s (n) +n,50000 far-end tones y (n), e (n) in 20000 aec-challenge.
Step 24, selecting a plurality of audio signals (60000 room impulse response audio signals) from the rir_noise data in the openslr data set, generating simulated echoes in real time during training, and using the generated simulated echoes as e(n) of the training data to participate in training; this guarantees the diversity of the training data.
Step 25, using a plurality of near-end audio signals and a plurality of far-end audio signals as training data, and selecting 10% of the data from the training data as the test set;
step 26, training the echo cancellation model by using the training data as the input of the echo cancellation model, wherein the loss function is: loss = Σ(ŝ(n) - s(n))², where ŝ(n) represents the output of the echo cancellation model and s(n) represents the clean speech signal, i.e. the target signal; the goal of model training is to make the value of this loss as small as possible; when the value of ŝ(n) is stable and approaches s(n), training is complete;
and step 27, after training, taking the test set as the input of the echo cancellation model, and testing the echo cancellation model to finally obtain the trained echo cancellation model.
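The near-end mixture x(n) = s(n) + e(n) + n of step 22 and the 0-20 dB signal-to-noise range of step 23 can be sketched as below. This is a minimal illustration, not the patent's implementation: `mix_near_end` is a hypothetical helper, and the random arrays merely stand in for real audio from the data sets.

```python
import numpy as np

def mix_near_end(s, e, noise, snr_db):
    """Build a near-end mixture x(n) = s(n) + e(n) + n (step 22).

    The additive noise is scaled so that the speech-to-noise ratio
    matches the requested snr_db, mirroring the 0-20 dB range used
    for the training data in step 23.
    """
    p_s = np.mean(s ** 2)
    p_n = np.mean(noise ** 2) + 1e-12
    gain = np.sqrt(p_s / (p_n * 10.0 ** (snr_db / 10.0)))
    return s + e + gain * noise

# Toy usage with random signals standing in for real audio.
rng = np.random.default_rng(0)
s = rng.standard_normal(16000)        # clean speech s(n)
e = 0.5 * rng.standard_normal(16000)  # echo e(n)
noise = rng.standard_normal(16000)    # environmental noise n
x = mix_near_end(s, e, noise, snr_db=10.0)
```

In practice the scaling would be applied per utterance so the training corpus covers the whole 0-20 dB span.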
In this embodiment, the step 24 is specifically as follows:
step 241, randomly selecting 1 far-end reference signal from the plurality of far-end audio signals y(n); if the selected far-end reference signal is an audio signal in the aec-challenge data set, directly using the e(n) corresponding to the selected far-end reference signal; if the selected far-end reference signal is an audio signal in the dns-challenge data set, going to step 242; specifically, 1 of the 50000 far-end audio signals is randomly selected as the far-end reference signal; if it is one of the 20000 audio signals selected from aec-challenge, the corresponding e(n) is used directly, and if it is one of the 30000 audio signals selected from dns-challenge, an echo is generated by the methods of step 242 and step 243;
Step 242, randomly selecting 1 piece of rir_noise data in the openslr data set as a response signal, and convolving the response signal with the far-end reference signal; specifically, 1 of the 60000 room impulse response audio signals is randomly selected and convolved with the far-end reference signal;
step 243, filtering the convolved result by a bandpass filter with random parameters to obtain 1 analog echo signal;
step 244, repeating steps 241-243 to obtain a plurality of simulated echo signals as e(n) of the training data. Each time 40000 audio signals have been generated by this method to train the network, the parameters are updated once; the input frame length is 32 ms (at a 16 kHz sampling rate) and the frame shift is 8 ms.
In this embodiment, the lowest cut-off frequency of the band-pass filter is a random number in 100-400 Hz, and the highest cut-off frequency is a random number in 6000-7500 Hz; a random number in the range 0-100 ms is selected as the time delay of the echo signal. By constructing and training the echo cancellation model, most echoes can be cancelled and the audio quality is guaranteed.
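Steps 242-243 together with the random filter parameters above can be sketched as follows. This is a sketch under stated assumptions: `simulate_echo` is a hypothetical name, and a brick-wall FFT filter stands in for the patent's random-parameter band-pass filter, whose exact design is not specified.

```python
import numpy as np

def simulate_echo(far_end, rir, fs=16000, rng=None):
    """Sketch of steps 242-243: convolve the far-end reference with a
    room impulse response, band-limit it with random cut-offs
    (100-400 Hz low edge, 6000-7500 Hz high edge), and apply a
    random 0-100 ms delay."""
    rng = rng or np.random.default_rng()
    echo = np.convolve(far_end, rir)[: len(far_end)]
    # Random band edges as described in the embodiment.
    lo = rng.uniform(100, 400)
    hi = rng.uniform(6000, 7500)
    spec = np.fft.rfft(echo)
    freqs = np.fft.rfftfreq(len(echo), d=1.0 / fs)
    spec[(freqs < lo) | (freqs > hi)] = 0.0   # brick-wall band-pass
    echo = np.fft.irfft(spec, n=len(echo))
    # Random delay in the 0-100 ms range.
    delay = int(rng.uniform(0.0, 0.1) * fs)
    return np.concatenate([np.zeros(delay), echo])[: len(far_end)]

rng = np.random.default_rng(1)
far = rng.standard_normal(16000)                       # far-end y(n)
rir = np.exp(-np.arange(800) / 100.0) * rng.standard_normal(800)  # toy RIR
e = simulate_echo(far, rir, rng=rng)
```

A production system would use a proper IIR/FIR band-pass with random parameters rather than the brick-wall filter assumed here.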
Step 3, constructing an echo detection model;
in this embodiment, the step 3 specifically includes:
step 31, the echo detection model consists of two framing layers, two Fourier transform layers, two normalization layers, a splicing layer, a fourth linear layer, two GRU layers, a fifth linear layer and a sigmoid activation function; since the echo detection model should be as small and fast as possible, the faster GRU is chosen to build the model;
Step 32, the inputs of the echo detection model are the output of the echo cancellation model ŝ(n) and the far-end audio signal y(n); the two framing layers are configured to receive the output of the echo cancellation model ŝ(n) and the far-end audio signal y(n) respectively; the two framing layers respectively enter the two Fourier transform layers, the two Fourier transform layers respectively enter the two normalization layers, the two normalization layers enter the splicing layer, and the splicing layer sequentially enters the two GRU layers, the fifth linear layer and the sigmoid activation function to obtain the output out of the echo detection model. To make the model easier to train, the output range is limited to (0, 1) by the sigmoid function, preventing the loss function from becoming too large.
Step 4, training the echo detection model;
in this embodiment, the step 4 specifically includes:
step 41, obtaining the output of the echo cancellation model ŝ(n); selecting a plurality of synthesized echo pairs and a plurality of real echo pairs from the aec-challenge data set as far-end audio signals y(n) in the training data, and selecting a plurality of audio signals from the dns-challenge data set as far-end audio signals y(n) in the training data;
step 42, taking the output of the echo cancellation model ŝ(n) and the far-end audio signal y(n) as the input of the echo detection model; the energy of each frame, i.e. the sum of the squared sample amplitudes of the frame, is calculated from the clean speech signal s(n) provided by the dns-challenge data set to obtain the label of each frame: it is judged whether the sum of squares of the corresponding clean speech signal s(n) in the current frame is larger than 0.001; if yes, the speech label of the current frame is set to 1, otherwise it is set to 0; because the echo detection model mainly safeguards the cancellation result when the echo cancellation model performs poorly, 70% of the data in the training data set have a signal-to-noise ratio relative to clean speech below 0 dB. 80000 pieces of 10-second audio are generated as the training set by the above method. Meanwhile, because echoes differ under different microphones, about 10 hours of echo pairs are collected for each different device; about 8000 near-end and far-end signals are generated by the same method together with the clean speech signals of the dns-challenge data, fed to the trained model to obtain outputs, and these audio clips are set as the test set.
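The frame-labeling rule of step 42 (label 1 when the sum of squared sample amplitudes of the clean-speech frame exceeds 0.001, else 0) can be sketched as below; `frame_labels` is a hypothetical helper name, with the 32 ms / 16 ms framing of the detection model at 16 kHz.

```python
import numpy as np

def frame_labels(s, frame_len=512, hop=256, threshold=1e-3):
    """Step 42: per-frame speech labels from the clean signal s(n).

    A frame gets label 1 when its energy (sum of squared sample
    amplitudes) exceeds the 0.001 threshold, otherwise label 0.
    """
    labels = []
    for start in range(0, len(s) - frame_len + 1, hop):
        frame = s[start:start + frame_len]
        labels.append(1 if np.sum(frame ** 2) > threshold else 0)
    return np.array(labels)

s = np.zeros(2048)
s[512:1024] = 0.1          # a burst of speech-like energy
labels = frame_labels(s)   # frames overlapping the burst get label 1
```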
Step 43, using the average mean square error as the loss function, namely: L = (1/N) Σ (out(n) - label(n))², where out(n) represents the output of the echo detection model at the current frame, label(n) is the corresponding label, and N is the total number of frames; when out(n) is stable and approaches label(n), training is complete. The training goal of the model is to make the output as close as possible to label(n). The model that performs best on the test set is kept as the final model. The training frame length of the model is 32 ms and the frame shift is 16 ms.
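The average mean square error of step 43 can be computed as below; `detection_loss` is a hypothetical helper name for illustration only.

```python
import numpy as np

def detection_loss(out, label):
    """Step 43: average mean square error between the per-frame
    detector output out(n) and its label label(n)."""
    out = np.asarray(out, dtype=float)
    label = np.asarray(label, dtype=float)
    return np.mean((out - label) ** 2)

# Three frames: outputs near the labels give a small loss.
loss = detection_loss([0.9, 0.1, 0.8], [1, 0, 1])
```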
By constructing and training the echo detection model, residual echo can still be removed when the echo cancellation model performs poorly; at the same time, by allowing the detection threshold to be adapted to different scenarios, the problem that an end-to-end echo cancellation model behaves inconsistently across scenarios can be mitigated.
Deployment in practical application: the final echo cancellation and echo detection models are exported. Unlike training, where multi-frame audio can be input at once, only one frame of audio is available at a time in actual deployment. Because of the LSTM and GRU structures described above, the output state of the previous frame must be carried over as input to the next frame, so the exported echo cancellation model needs its core modules converted to streaming form.
Step 5, taking the near-end audio signal and the far-end audio signal as the input of the trained echo cancellation model to obtain the output of the trained echo cancellation model; in this embodiment, the step 5 includes:
Step 51, inputting a target far-end audio signal and a target near-end audio signal into the echo cancellation model;
step 52, framing the target far-end audio signal and the target near-end audio signal and calculating the Fourier coefficient of each frame;
step 53, normalizing the Fourier coefficients of each frame so that they conform to a Gaussian distribution as closely as possible, giving the model better generalization; splicing the Fourier coefficients of each frame of the far-end audio signal and the near-end audio signal, i.e. if each frame of the far-end and near-end audio signals has 257 Fourier coefficients, each spliced frame has 257 x 2 = 514 Fourier coefficients; then performing the operations of the LSTM layers, the third linear layer and the sigmoid activation function, where in the LSTM layers the output of the previous frame is used as the input of the next frame; after framing and Fourier transformation, the near-end audio signal is multiplied by the result output by the sigmoid activation function in the first core module and then subjected to inverse Fourier transformation;
step 54, performing the operation of a first linear layer on the inverse Fourier transform result and the far-end audio signal respectively, normalizing and splicing the operation results of the first linear layers, and performing the operations of the LSTM layers, the third linear layer and the sigmoid activation function, where in the LSTM layers the output of the previous frame is used as the input of the next frame; multiplying the result output after the inverse Fourier transform and first linear layer operation with the result output by the sigmoid activation function in the second core module;
Step 55, restoring the number of coefficients per frame to the number of Fourier coefficients through the second linear layer, and overlap-adding the results to obtain an output signal of the same length as the input signal. The object of the whole model is to make the output signal ŝ(n) as close as possible to s(n).
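The framing, Fourier analysis and splicing of steps 52-53 can be sketched as below. This is a minimal sketch under stated assumptions: `frame_features` is a hypothetical name, and the log-magnitude compression is an assumed stand-in for the patent's normalization layers, whose exact form is not specified; a 512-sample frame yields 257 rfft coefficients, so the spliced frame has 514.

```python
import numpy as np

def frame_features(near, far, frame_len=512, hop=128):
    """Steps 52-53 sketch: frame both signals (32 ms frames, 8 ms hop
    at 16 kHz), take per-frame Fourier magnitudes (257 rfft bins for
    512 samples) and splice near- and far-end features into
    514-dimensional vectors."""
    feats = []
    for start in range(0, min(len(near), len(far)) - frame_len + 1, hop):
        n_mag = np.abs(np.fft.rfft(near[start:start + frame_len]))
        f_mag = np.abs(np.fft.rfft(far[start:start + frame_len]))
        # Log compression keeps the features closer to Gaussian.
        n_mag = np.log1p(n_mag)
        f_mag = np.log1p(f_mag)
        feats.append(np.concatenate([n_mag, f_mag]))
    return np.stack(feats)

rng = np.random.default_rng(2)
feats = frame_features(rng.standard_normal(4096), rng.standard_normal(4096))
```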
Step 6, taking the output of the trained echo cancellation model and the far-end audio signal as the input of the trained echo detection model, and obtaining the output of the trained echo detection model as an echo detection tag;
in this embodiment, the step 6 specifically includes:
step 61, inputting target far-end audio signals and the output of an echo cancellation model into the echo detection model;
step 62, framing the target far-end audio signal and the output of the echo cancellation model and calculating the Fourier coefficient of each frame;
step 63, normalizing the Fourier coefficients of each frame, splicing the Fourier coefficients of each frame of the far-end audio signal and of the echo cancellation model output, and obtaining the output out of the echo detection model through the operations of the fourth linear layer, the GRU layers, the fifth linear layer and the sigmoid activation function; in the GRU layers, the output of the previous frame is taken as the input of the next frame.
And step 7, judging the state of the output frames of the current echo cancellation model according to the echo detection label to obtain the final target audio.
In this embodiment, the step 7 specifically includes:
step 71, the input frame length of the echo cancellation model is 512 points and the frame shift is 128 points; in actual deployment, a 512-point input register is needed to store the input of the previous frame, and a 512-point output register stores the output of the previous frame;
step 72, assuming 512 points are input at a time, each input needs to run 512/128 = 4 times; the first forward pass takes the last 384 points of the previous frame and the first 128 points of the current frame from the input register, the second takes the last 256 points of the previous frame and the first 256 points of the current frame, the third takes the last 128 points of the previous frame and the first 384 points of the current frame, and the fourth takes all 512 points of the current frame; when all 4 forward inferences are completed, the input of the current frame is loaded into the input register for use when the next frame arrives;
step 73, for each forward inference output, the output register is first shifted forward by 128 positions, i.e. the original 128th to 512th sample points in the output register become the 0th to 384th points of the register, and then the freed 128 positions are zeroed (positions 384 to 512 of the output register are set to 0); the 512 points output each time are added to the 512 points in the output register and the first 128 points are emitted, so each forward inference outputs 128 points; assuming 512 points are input at a time, after 512/128 = 4 forward inferences, 128 x 4 = 512 points are output, so the output has the same length as the input;
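The output-register logic of step 73 can be sketched as an overlap-add buffer; `OverlapAddBuffer` is a hypothetical class name for illustration.

```python
import numpy as np

class OverlapAddBuffer:
    """Step 73 sketch: 512-point output register with a 128-point
    frame shift.  Each forward pass shifts the register forward by
    128 samples, zeroes the freed tail, adds the 512-point model
    output, and emits the first 128 samples."""
    def __init__(self, frame_len=512, hop=128):
        self.hop = hop
        self.register = np.zeros(frame_len)

    def push(self, model_out):
        # Shift forward by `hop` and zero the freed tail positions.
        self.register = np.roll(self.register, -self.hop)
        self.register[-self.hop:] = 0.0
        # Accumulate the new 512-point output and emit the first 128.
        self.register += model_out
        return self.register[: self.hop].copy()

buf = OverlapAddBuffer()
out1 = buf.push(np.ones(512))   # first pass: only one frame contributes
out2 = buf.push(np.ones(512))   # second pass: two frames overlap-add
```

The second push emits doubled values because two overlapping 512-point outputs now accumulate in the same register positions, which is exactly the overlap-add behaviour the register implements.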
Step 74, the input frame length of the echo detection model is 512 and the frame shift is 256, so a 512-point register is needed; each time the echo cancellation model outputs 128 points, the register is shifted forward by 128 points and the 128 points output by the echo cancellation model are loaded in; since the frame shift of the echo detection model is twice that of the echo cancellation model, one echo detection is performed for every 2 forward inferences of the echo cancellation model;
step 75, setting a decision threshold; if the output of the echo detection model is smaller than the threshold several consecutive times, the output of the echo cancellation model is considered to contain a large echo at that moment and the output is set to 0; otherwise it is treated as normal speech. When the output of the echo detection model is smaller than the threshold 5 consecutive times, the current frame is judged to be echo; this prevents distortion and improves the accuracy of the result. By tuning the echo detection threshold, the method adapts to different scenarios: the threshold may be lowered if listening quality is to be preserved, and raised if the echo cancellation effect is to be guaranteed.
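The hangover decision of step 75 (zero the output only after 5 consecutive sub-threshold detections) can be sketched as below; `gate_frames` is a hypothetical name, and the 0.5 threshold is an assumed value since the patent leaves it tunable per deployment.

```python
import numpy as np

def gate_frames(det_outputs, frames, threshold=0.5, hangover=5):
    """Step 75 sketch: zero an echo-cancellation output frame only
    when the detector output has stayed below the threshold for
    `hangover` consecutive decisions; otherwise pass it through."""
    below = 0
    gated = []
    for out, frame in zip(det_outputs, frames):
        below = below + 1 if out < threshold else 0
        if below >= hangover:
            gated.append(np.zeros_like(frame))   # judged to be echo
        else:
            gated.append(frame)                  # normal speech
    return gated

frames = [np.ones(4)] * 7
dets = [0.9, 0.1, 0.1, 0.1, 0.1, 0.1, 0.8]
out = gate_frames(dets, frames)   # only the 5th consecutive low frame is zeroed
```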
The foregoing description is only a partial embodiment of the present invention, and is not intended to limit the scope of the present invention, and all equivalent devices or equivalent processes using the descriptions and the drawings of the present invention or directly or indirectly applied to other related technical fields are included in the scope of the present invention.

Claims (8)

1. The real-time echo cancellation method based on the neural network is characterized by comprising the following steps:
step 1, constructing an echo cancellation model; the method specifically comprises the following steps:
step 11, the structure of the echo cancellation model uses an open source DTLN-aec model;
step 12, the DTLN-aec model is composed of two framing layers, two Fourier transform layers, two core modules, an inverse Fourier transform layer, two first linear layers, one second linear layer and one overlap-add layer, each core module including two normalization layers, one connection layer, two LSTM layers, one third linear layer and one sigmoid activation function;
step 2, training the echo cancellation model;
step 3, constructing an echo detection model; the method specifically comprises the following steps:
step 31, the echo detection model consists of two framing layers, two Fourier transformation layers, two normalization layers, a splicing layer, a fourth linear layer, two GRU layers, a fifth linear layer and a sigmoid activation function;
step 32, the inputs of the echo detection model are the output of the echo cancellation model ŝ(n) and the far-end audio signal y(n), the two framing layers being adapted to receive the output of the echo cancellation model ŝ(n) and the far-end audio signal y(n) respectively; the two framing layers respectively enter the two Fourier transform layers, the two Fourier transform layers respectively enter the two normalization layers, the two normalization layers enter the splicing layer, and the splicing layer sequentially enters the two GRU layers, the fifth linear layer and the sigmoid activation function to obtain the output out of the echo detection model;
Step 4, training the echo detection model; the method specifically comprises the following steps:
step 41, obtaining the output of the echo cancellation model ŝ(n); selecting a plurality of synthesized echo pairs and a plurality of real echo pairs from the aec-challenge data set as far-end audio signals y(n) in the training data, and selecting a plurality of audio signals from the dns-challenge data set as far-end audio signals y(n) in the training data;
step 42, taking the output of the echo cancellation model ŝ(n) and the far-end audio signal y(n) as the input of the echo detection model; calculating the energy of each frame, i.e. the sum of the squared sample amplitudes of the frame, from the clean speech signal s(n) provided by the dns-challenge data set to obtain the label of each frame: judging whether the sum of squares of the corresponding clean speech signal s(n) in the current frame is larger than 0.001; if yes, the speech label of the current frame is set to 1, otherwise it is set to 0;
Step 43, using the average mean square error as the loss function, namely: L = (1/N) Σ (out(n) - label(n))², where out(n) represents the output of the echo detection model at the current frame, label(n) is the corresponding label, and N is the total number of frames; when out(n) is stable and approaches label(n), training is complete;
step 5, taking the near-end audio signal and the far-end audio signal as the input of the trained echo cancellation model to obtain the output of the trained echo cancellation model;
step 6, taking the output of the trained echo cancellation model and the far-end audio signal as the input of the trained echo detection model, and obtaining the output of the trained echo detection model as an echo detection tag;
and 7, judging the state of the output frame of the current echo cancellation model according to the echo detection label to obtain final target audio.
2. The method of neural network-based real-time echo cancellation according to claim 1, wherein said step 12 further comprises:
step 13, inputting a near-end audio signal x (n) and a far-end audio signal y (n) by the echo cancellation model, wherein the two framing layers are used for receiving the near-end audio signal x (n) and the far-end audio signal y (n), the two framing layers respectively enter two Fourier transform layers, the two Fourier transform layers enter a first core module, the two Fourier transform layers respectively enter two normalization layers, the two normalization layers enter a connection layer, and the connection layer sequentially enters two LSTM layers, a third linear layer and a sigmoid activation function; the near-end audio signal sequentially passes through a framing layer and a Fourier transform layer, and then the result output by the near-end audio signal is multiplied by the result output by the sigmoid activation function in the first core module and then enters an inverse Fourier transform layer;
Step 14, the inverse Fourier transform layer and the far-end audio signal enter two first linear layers respectively, the two first linear layers enter a second core module, the two first linear layers enter two normalization layers respectively, the two normalization layers enter a connection layer, and the connection layer sequentially enters two LSTM layers, a third linear layer and a sigmoid activation function; the result output after passing through the inverse Fourier transform layer and the first linear layer is multiplied by the result output by the sigmoid activation function in the second core module, and then sequentially enters the second linear layer and the overlap-add layer to obtain the output of the echo cancellation model ŝ(n).
3. The method for real-time echo cancellation based on neural network as claimed in claim 1, wherein said step 2 specifically comprises:
step 21, acquiring a plurality of audio signals from three open-source data sets: dns-challenge, aec-challenge and openslr;
step 22, the input of the echo cancellation model is a near-end audio signal x (n) and a far-end audio signal y (n), where x (n) =s (n) +e (n) +n, s (n) represents a clean speech signal that needs to be stripped from the near-end audio signal x (n), e (n) represents an echo generated after the far-end audio signal y (n) is played and re-collected by the near-end microphone, and n represents environmental noise;
Step 23, selecting a plurality of synthesized echo pairs and a plurality of real echo pairs from a aec-challenge data set as y (n) and e (n) in training data, selecting a plurality of different noisy speech signals and clean speech signal pairs from a dns-challenge data set as s (n) +n and s (n) in training data respectively, and selecting a plurality of audio signals from the dns-challenge data set as y (n) in training data;
step 24, selecting a plurality of audio signals from the rir_noise data in the openslr data set, generating simulated echoes in real time during training, and using the generated simulated echoes as e(n) of the training data to participate in training;
step 25, using a plurality of near-end audio signals and a plurality of far-end audio signals as training data, and selecting 10% of the data from the training data as a training test set;
step 26, training the echo cancellation model by using the training data as the input of the echo cancellation model, wherein the loss function is: loss = Σ(ŝ(n) - s(n))², where ŝ(n) represents the output of the echo cancellation model and s(n) represents the clean speech signal, i.e. the target signal; when the value of ŝ(n) is stable and approaches s(n), training is complete;
and step 27, after training, taking the test set as the input of the echo cancellation model, and testing the echo cancellation model to finally obtain the trained echo cancellation model.
4. A neural network-based real-time echo cancellation method according to claim 3, wherein said step 24 is specifically as follows:
step 241, randomly selecting 1 far-end reference signal from the plurality of far-end audio signals y(n); if the selected far-end reference signal is an audio signal in the aec-challenge data set, directly using the e(n) corresponding to the selected far-end reference signal; if the selected far-end reference signal is an audio signal in the dns-challenge data set, going to step 242;
step 242, randomly selecting 1 piece of rir_noise data in the openslr data set as a response signal, and convolving the response signal with the far-end reference signal;
step 243, filtering the convolved result by a bandpass filter with random parameters to obtain 1 analog echo signal;
step 244, repeating steps 241-243 to obtain a plurality of analog echo signals as e (n) of training data.
5. The neural network-based real-time echo cancellation method of claim 4, wherein the lowest cut-off frequency of the band-pass filter is a random number in 100-400 Hz, and the highest cut-off frequency is a random number in 6000-7500 Hz; a random number in the range 0-100 ms is selected as the time delay of the echo signal.
6. The method of neural network-based real-time echo cancellation according to claim 1, wherein said step 5 specifically comprises:
step 51, inputting a target far-end audio signal and a target near-end audio signal into the echo cancellation model;
step 52, framing the target far-end audio signal and the target near-end audio signal and calculating the Fourier coefficient of each frame;
step 53, normalizing the Fourier coefficients of each frame, splicing the Fourier coefficients of each frame of the far-end audio signal and the near-end audio signal, and performing the operations of the LSTM layers, the third linear layer and the sigmoid activation function, wherein in the LSTM layers the output of the previous frame is used as the input of the next frame; after framing and Fourier transformation, the near-end audio signal is multiplied by the result output by the sigmoid activation function in the first core module and then subjected to inverse Fourier transformation;
step 54, respectively performing operation of a first linear layer through an inverse Fourier transform result and a far-end audio signal, performing normalization and splicing processing on an operation result of the first linear layer, and performing operation of an LSTM layer, a third linear layer and a sigmoid activation function, wherein in the LSTM layer, the output of the previous frame is used as the input of the next frame; multiplying the result output after the inverse Fourier transform and the first linear layer operation with the result output by the sigmoid activation function in the second core module;
Step 55, restoring the number of coefficients per frame to the number of Fourier coefficients through the second linear layer, and overlap-adding the results to obtain an output signal ŝ(n) of the same length as the input signal.
7. The method for real-time echo cancellation based on neural network according to claim 1, wherein the step 6 specifically comprises:
step 61, inputting target far-end audio signals and the output of an echo cancellation model into the echo detection model;
step 62, framing the target far-end audio signal and the output of the echo cancellation model and calculating the Fourier coefficient of each frame;
step 63, normalizing the Fourier coefficients of each frame, splicing the Fourier coefficients of each frame of the far-end audio signal and of the echo cancellation model output, and obtaining the output out of the echo detection model through the operations of the fourth linear layer, the GRU layers, the fifth linear layer and the sigmoid activation function; in the GRU layers, the output of the previous frame is taken as the input of the next frame.
8. The method for real-time echo cancellation based on neural network according to claim 1, wherein the step 7 specifically comprises:
Step 71, the input frame length of the echo cancellation model is 512 points and the frame shift is 128 points; in actual deployment, a 512-point input register is needed to store the input of the previous frame, and a 512-point output register stores the output of the previous frame;
step 72, assuming 512 points are input at a time, each input needs to run 512/128 = 4 times; the first forward pass takes the last 384 points of the previous frame and the first 128 points of the current frame from the input register, the second takes the last 256 points of the previous frame and the first 256 points of the current frame, the third takes the last 128 points of the previous frame and the first 384 points of the current frame, and the fourth takes all 512 points of the current frame; when all 4 forward inferences are completed, the input of the current frame is loaded into the input register for use when the next frame arrives;
step 73, for each forward inference output, first shifting the output register forward by 128 positions and then zeroing the freed 128 positions; adding the 512 points output each time to the 512 points in the output register and emitting the first 128 points, so that each forward inference outputs 128 points; assuming 512 points are input at a time, after 512/128 = 4 forward inferences, 128 x 4 = 512 points are output, so the output has the same length as the input;
Step 74, the input frame length of the echo detection model is 512 and the frame shift is 256, so a 512-point register is needed; each time the echo cancellation model outputs 128 points, the register is shifted forward by 128 points and the 128 points output by the echo cancellation model are loaded in; since the frame shift of the echo detection model is twice that of the echo cancellation model, one echo detection is performed for every 2 forward inferences of the echo cancellation model;
step 75, setting a decision threshold; if the output of the echo detection model is smaller than the threshold several consecutive times, the output of the echo cancellation model is considered to contain a large echo at that moment and the output is set to 0; otherwise it is treated as normal speech.
CN202311768706.8A 2023-12-21 2023-12-21 Real-time echo cancellation method based on neural network Active CN117437929B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311768706.8A CN117437929B (en) 2023-12-21 2023-12-21 Real-time echo cancellation method based on neural network

Publications (2)

Publication Number Publication Date
CN117437929A CN117437929A (en) 2024-01-23
CN117437929B true CN117437929B (en) 2024-03-08

Family

ID=89546554

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311768706.8A Active CN117437929B (en) 2023-12-21 2023-12-21 Real-time echo cancellation method based on neural network

Country Status (1)

Country Link
CN (1) CN117437929B (en)

Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105957520A (en) * 2016-07-04 2016-09-21 北京邮电大学 Voice state detection method suitable for echo cancellation system
CN111292759A (en) * 2020-05-11 2020-06-16 上海亮牛半导体科技有限公司 Stereo echo cancellation method and system based on neural network
CN111885275A (en) * 2020-07-23 2020-11-03 海尔优家智能科技(北京)有限公司 Echo cancellation method and device for voice signal, storage medium and electronic device
CN112863535A (en) * 2021-01-05 2021-05-28 中国科学院声学研究所 Residual echo and noise elimination method and device
CN113763977A (en) * 2021-04-16 2021-12-07 腾讯科技(深圳)有限公司 Method, apparatus, computing device and storage medium for eliminating echo signal
CN113870874A (en) * 2021-09-23 2021-12-31 武汉大学 Multi-feature fusion echo cancellation method and system based on self-attention transformation network
CN114283830A (en) * 2021-12-17 2022-04-05 南京工程学院 Deep learning network-based microphone signal echo cancellation model construction method
CN114530160A (en) * 2022-02-25 2022-05-24 携程旅游信息技术(上海)有限公司 Model training method, echo cancellation method, system, device and storage medium
CN114566176A (en) * 2022-02-23 2022-05-31 苏州蛙声科技有限公司 Residual echo cancellation method and system based on deep neural network
CN114827363A (en) * 2022-04-13 2022-07-29 随锐科技集团股份有限公司 Method, device and readable storage medium for eliminating echo in call process
CN115132215A (en) * 2022-06-07 2022-09-30 上海声瀚信息科技有限公司 Single-channel speech enhancement method
CN115312073A (en) * 2022-06-28 2022-11-08 上海声瀚信息科技有限公司 Low-complexity residual echo suppression method combining signal processing and deep neural network
CN115457928A (en) * 2022-07-27 2022-12-09 杭州芯声智能科技有限公司 Echo cancellation method and system based on neural network double-talk detection
CN116453532A (en) * 2023-04-28 2023-07-18 济南大学 Echo cancellation method of acoustic echo
CN116781829A (en) * 2022-03-07 2023-09-19 联发科技(新加坡)私人有限公司 Apparatus and method for performing acoustic echo cancellation

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11393487B2 (en) * 2019-03-28 2022-07-19 Samsung Electronics Co., Ltd. System and method for acoustic echo cancelation using deep multitask recurrent neural networks
US10803881B1 (en) * 2019-03-28 2020-10-13 Samsung Electronics Co., Ltd. System and method for acoustic echo cancelation using deep multitask recurrent neural networks
US11902757B2 (en) * 2022-06-14 2024-02-13 Tencent America LLC Techniques for unified acoustic echo suppression using a recurrent neural network

Patent Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105957520A (en) * 2016-07-04 2016-09-21 北京邮电大学 Voice state detection method suitable for echo cancellation system
CN111292759A (en) * 2020-05-11 2020-06-16 上海亮牛半导体科技有限公司 Stereo echo cancellation method and system based on neural network
CN111885275A (en) * 2020-07-23 2020-11-03 海尔优家智能科技(北京)有限公司 Echo cancellation method and device for voice signal, storage medium and electronic device
CN112863535A (en) * 2021-01-05 2021-05-28 中国科学院声学研究所 Residual echo and noise elimination method and device
CN113763977A (en) * 2021-04-16 2021-12-07 腾讯科技(深圳)有限公司 Method, apparatus, computing device and storage medium for eliminating echo signal
CN113870874A (en) * 2021-09-23 2021-12-31 武汉大学 Multi-feature fusion echo cancellation method and system based on self-attention transformation network
WO2023044961A1 (en) * 2021-09-23 2023-03-30 武汉大学 Multi-feature fusion echo cancellation method and system based on self-attention transform network
CN114283830A (en) * 2021-12-17 2022-04-05 南京工程学院 Deep learning network-based microphone signal echo cancellation model construction method
CN114566176A (en) * 2022-02-23 2022-05-31 苏州蛙声科技有限公司 Residual echo cancellation method and system based on deep neural network
CN114530160A (en) * 2022-02-25 2022-05-24 携程旅游信息技术(上海)有限公司 Model training method, echo cancellation method, system, device and storage medium
CN116781829A (en) * 2022-03-07 2023-09-19 联发科技(新加坡)私人有限公司 Apparatus and method for performing acoustic echo cancellation
CN114827363A (en) * 2022-04-13 2022-07-29 随锐科技集团股份有限公司 Method, device and readable storage medium for eliminating echo in call process
CN115132215A (en) * 2022-06-07 2022-09-30 上海声瀚信息科技有限公司 Single-channel speech enhancement method
CN115312073A (en) * 2022-06-28 2022-11-08 上海声瀚信息科技有限公司 Low-complexity residual echo suppression method combining signal processing and deep neural network
CN115457928A (en) * 2022-07-27 2022-12-09 杭州芯声智能科技有限公司 Echo cancellation method and system based on neural network double-talk detection
CN116453532A (en) * 2023-04-28 2023-07-18 济南大学 Echo cancellation method of acoustic echo

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
A Simplified Deep Learning model for Acoustic Feedback Cancellation in Digital Hearing Aid;K. Posnaik等;2021 International Symposium of Asian Control Association on Intelligent Robotics and Industrial Automation (IRIA);20211104;432-436 *
Acoustic Echo Cancellation with the Dual-Signal Transformation LSTM Network;N. L. Westhausen等;ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP);20210513;7138-7142 *
An Input Residual Connection for Simplifying Gated Recurrent Neural Networks;N. I. H. Kuo等;2020 International Joint Conference on Neural Networks (IJCNN);20200928;1-8 *
Nonlinear Residual Echo Suppression Based on Gated Dual Signal Transformation LSTM Network;K. Xie等;2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC);20221110;1696-1701 *
Research on Microphone Array Speech Enhancement Technology Based on Multi-task Networks;Lai Zhipeng;China Master's Theses Full-text Database, Information Science and Technology Volume;20231115;I135-7 *

Also Published As

Publication number Publication date
CN117437929A (en) 2024-01-23

Similar Documents

Publication Publication Date Title
Sridhar et al. ICASSP 2021 acoustic echo cancellation challenge: Datasets, testing framework, and results
CN111292759B (en) Stereo echo cancellation method and system based on neural network
Cutler et al. INTERSPEECH 2021 Acoustic Echo Cancellation Challenge.
Lee et al. DNN-based residual echo suppression.
US20220301577A1 (en) Echo cancellation method and apparatus
CN111031448B (en) Echo cancellation method, echo cancellation device, electronic equipment and storage medium
CN111768796A (en) Acoustic echo cancellation and dereverberation method and device
CN111883154B (en) Echo cancellation method and device, computer-readable storage medium, and electronic device
CN106161820B (en) A kind of interchannel decorrelation method for stereo acoustic echo canceler
CN115132215A (en) Single-channel speech enhancement method
Pfeifenberger et al. Acoustic Echo Cancellation with Cross-Domain Learning.
CN114283830A (en) Deep learning network-based microphone signal echo cancellation model construction method
CN117437929B (en) Real-time echo cancellation method based on neural network
CN115424627A (en) Voice enhancement hybrid processing method based on convolution cycle network and WPE algorithm
CN115579016B (en) Method and system for eliminating acoustic echo
CN115620737A (en) Voice signal processing device, method, electronic equipment and sound amplification system
CN114792524B (en) Audio data processing method, apparatus, program product, computer device and medium
CN115083431A (en) Echo cancellation method and device, electronic equipment and computer readable medium
CN116453532A (en) Echo cancellation method of acoustic echo
Zhang et al. Hybrid AHS: A hybrid of Kalman filter and deep learning for acoustic howling suppression
JP2002223182A (en) Echo canceling method, its device, its program and its recording medium
CN113990337A (en) Audio optimization method and related device, electronic equipment and storage medium
CN110246516B (en) Method for processing small space echo signal in voice communication
CN116233697B (en) Acoustic feedback suppression method and system based on deep learning
Chen An Implementation Research on Acoustic Echo Cancellation in Double-Talk Scenarios

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant