CN109841206A

CN109841206A - Echo cancellation method based on deep learning

Info

Publication number: CN109841206A
Application number: CN201811013935.8A
Authority: CN
Inventors: Inventor not disclosed
Original assignee: Elephant Acoustical (shenzhen) Technology Co Ltd
Current assignee: Elephant Acoustical (shenzhen) Technology Co Ltd
Priority date: 2018-08-31
Filing date: 2018-08-31
Publication date: 2019-06-04
Anticipated expiration: 2038-08-31
Also published as: WO2020042706A1; CN109841206B

Abstract

The disclosure provides an echo cancellation method based on deep learning, together with a device, electronic equipment and a storage medium, and belongs to the field of computer technology. The method includes: extracting acoustic features from a received microphone signal, the microphone signal including a near-end signal and a far-end signal; running the acoustic features iteratively through a pre-trained recurrent neural network model with long short-term memory to compute a ratio mask for the acoustic features; masking the acoustic features with the ratio mask; and synthesizing the masked acoustic features with the phase of the microphone signal to obtain the near-end signal after echo cancellation. The method and device achieve echo cancellation in the presence of background noise, double-talk and nonlinear distortion, which greatly improves the effectiveness of echo cancellation and widens the range of applicable scenarios. No postfilter is needed, which simplifies the electronic equipment and reduces its cost.

Description

Echo cancellation method based on deep learning

Technical field

The disclosure relates to the field of computer application technology, and in particular to an echo cancellation method based on deep learning, a device, electronic equipment and a storage medium.

Background

In a communication system, when a loudspeaker and a microphone are acoustically coupled, the microphone picks up the signal emitted by the loudspeaker together with its reverberation, which produces echo. Echo is a problem in teleconferencing, hands-free telephony and mobile communication, for example.

Echo cancellation faces many challenges, such as double-talk, background noise and nonlinear distortion. First, double-talk is a typical conversational mode in communication systems, in which the speakers at both ends talk at the same time. The near-end speech signal, however, severely affects the convergence of adaptive algorithms and may cause them to diverge. In addition, the signal received at the microphone contains not only the echo and the near-end speech but also background noise. Traditionally, echo is cancelled by adaptively estimating the acoustic impulse response between the loudspeaker and the microphone with a finite impulse response (FIR) filter, and a postfilter then suppresses the background noise and the residual echo left after echo cancellation.

The ultimate goal of AEC (acoustic echo cancellation) is to remove the far-end signal completely and transmit only the near-end signal. Traditional echo cancellation methods, however, model the echo path as a linear system. Because of the nonlinear limitations of components such as power amplifiers and loudspeakers, the far-end signal may in practice undergo nonlinear distortion, which seriously degrades echo cancellation.

Summary of the invention

To solve the technical problems in the related art that echo cancellation performs poorly and requires a postfilter, the disclosure provides an echo cancellation method based on deep learning, a device, electronic equipment and a storage medium.

In a first aspect, an echo cancellation method based on deep learning is provided, comprising:

extracting acoustic features from a received microphone signal, the microphone signal including a near-end signal and a far-end signal;

running the acoustic features iteratively through a pre-trained recurrent neural network model with long short-term memory to compute a ratio mask for the acoustic features;

masking the acoustic features with the ratio mask;

synthesizing the masked acoustic features with the phase of the microphone signal to obtain the near-end signal after echo cancellation.

Optionally, the step of extracting acoustic features from the received microphone signal includes:

dividing the received microphone signal into time frames according to a preset time period, the microphone signal including a near-end signal and a far-end signal;

extracting a spectral magnitude vector from each time frame;

normalizing the spectral magnitude vector to form the acoustic features.

Optionally, the step of normalizing the spectral magnitude vector to form the acoustic features includes:

merging the spectral magnitude vectors of the current time frame and of past time frames and normalizing them to form the acoustic features.

Optionally, the method of constructing the pre-trained recurrent neural network model with long short-term memory includes:

determining the speakers whose speech serves as the near-end and far-end (reference) signals during training;

collecting the far-end and near-end signals produced with those speakers at the far end and the near end, and building a speech training set from them, where the far-end signal is the echo signal, and the near-end signal and the echo signal together form the microphone signal;

training a recurrent neural network with long short-term memory on the speech training set to construct the recurrent neural network model with long short-term memory.

Optionally, the step of training the recurrent neural network with long short-term memory on the speech training set to construct the recurrent neural network model with long short-term memory includes:

extracting the acoustic features of the microphone signal and of the far-end (echo) signal, respectively;

estimating, from the acoustic features of the microphone signal and the far-end signal, the ideal ratio mask for echo cancellation with the recurrent neural network with long short-term memory, thereby constructing the recurrent neural network model with long short-term memory.

Optionally, the step of training the recurrent neural network with long short-term memory on the speech training set to construct the recurrent neural network model with long short-term memory may also include:

performing linear echo cancellation on the microphone signal with a traditional AEC algorithm;

extracting acoustic features from the far-end signal and from the linear AEC output produced by the traditional AEC algorithm, respectively;

estimating, from the acoustic features of the far-end signal and the linear AEC output, the ideal ratio mask for echo cancellation with the recurrent neural network with long short-term memory, thereby constructing the recurrent neural network model with long short-term memory.

Optionally, the method may also include:

extracting acoustic features from the far-end signal, the microphone signal and the linear AEC output, respectively;

estimating, from the acoustic features of the far-end signal, the microphone signal and the linear AEC output, the ideal ratio mask for echo cancellation with the recurrent neural network with long short-term memory, thereby constructing the recurrent neural network model with long short-term memory.

In a second aspect, an echo cancellation device based on deep learning is provided, comprising:

an acoustic feature extraction module, configured to extract acoustic features from a received input signal, the input signal including a microphone signal and a far-end signal;

a ratio mask computation module, configured to run the acoustic features iteratively through a pre-trained recurrent neural network model with long short-term memory to compute a ratio mask for the acoustic features;

a masking module, configured to mask the acoustic features with the ratio mask;

a speech synthesis module, configured to synthesize the masked acoustic features with the phase of the microphone signal to obtain the near-end signal after echo cancellation.

Optionally, the ideal ratio mask is used as the training target of the recurrent neural network model with long short-term memory.

In a third aspect, electronic equipment is provided, comprising:

at least one processor; and

a memory communicatively connected to the at least one processor; wherein

the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, enable the at least one processor to perform the method of the first aspect.

In a fourth aspect, a computer-readable storage medium is provided for storing a program which, when executed, causes electronic equipment to perform the method of the first aspect.

The technical solutions provided by the embodiments of the disclosure can have the following beneficial effects:

When echo cancellation is performed, acoustic features are extracted from the received microphone signal and run iteratively through the pre-trained recurrent neural network model with long short-term memory to compute a ratio mask, and the ratio mask is then used to mask the acoustic features. The masked acoustic features are then synthesized with the phase of the microphone signal to achieve echo cancellation. Because the scheme uses a pre-trained recurrent neural network model with long short-term memory, echo cancellation can be achieved in the presence of background noise, double-talk and nonlinear distortion, which greatly improves the effectiveness of echo cancellation and widens the range of applicable scenarios. No postfilter is needed, which simplifies the electronic equipment and reduces its cost.

It should be understood that the general description above and the detailed description below are merely exemplary and do not limit the scope of the disclosure.

Brief description of the drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and, together with the specification, serve to explain the principles of the invention.

Fig. 1 is a flowchart of an echo cancellation method based on deep learning according to an exemplary embodiment.

Fig. 2 is a flowchart of one specific implementation of step S110 in the echo cancellation method based on deep learning of the embodiment corresponding to Fig. 1.

Fig. 3 is a flowchart of one specific implementation of the method of constructing the recurrent neural network model with long short-term memory, according to the embodiment corresponding to Fig. 1.

Fig. 4 is a schematic flow diagram of echo cancellation according to an exemplary embodiment.

Fig. 5 is a flowchart of one specific implementation of step S123 in the method of constructing the recurrent neural network model with long short-term memory, according to the embodiment corresponding to Fig. 4.

Fig. 6 is a flowchart of another specific implementation of step S123 in the method of constructing the recurrent neural network model with long short-term memory, according to the embodiment corresponding to Fig. 4.

Fig. 7 is a flowchart of yet another specific implementation of step S123 in the method of constructing the recurrent neural network model with long short-term memory, according to the embodiment corresponding to Fig. 6.

Fig. 8 shows, according to an exemplary embodiment, spectrograms of the microphone signal captured with a smartphone (a), the far-end (reference) signal (b), the linear echo cancellation output of the traditional AEC algorithm (c), and the LSTM3 output signal (d).

Fig. 9 is a block diagram of an echo cancellation device based on deep learning according to an exemplary embodiment.

Fig. 10 is a block diagram of the acoustic feature extraction module 110 in the echo cancellation device based on deep learning, according to the embodiment corresponding to Fig. 9.

Fig. 11 is a block diagram of the ratio mask computation module 120 according to the embodiment corresponding to Fig. 9.

Fig. 12 is a block diagram of the model construction submodule 123 according to the embodiment corresponding to Fig. 11.

Fig. 13 is another block diagram of the model construction submodule 123 according to the embodiment corresponding to Fig. 11.

Fig. 14 is yet another block diagram of the model construction submodule 123 according to the embodiment corresponding to Fig. 11.

Detailed description

Exemplary embodiments are described in detail here, with examples illustrated in the accompanying drawings. Where the following description refers to the drawings, the same numerals in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the invention. Rather, they are merely examples of devices and methods consistent with some aspects of the invention, as detailed in the appended claims.

Fig. 1 is a flowchart of an echo cancellation method based on deep learning according to an exemplary embodiment. The method can be used in electronic equipment such as smartphones and computers. As shown in Fig. 1, the method may include steps S110, S120, S130 and S140.

Step S110: extract acoustic features from the received microphone signal, where the microphone signal includes a near-end signal and a far-end signal (i.e. the echo signal).

The microphone signal is the sound signal received when echo cancellation is performed. A recording device such as a microphone captures both the near-end signal and the echo signal, so the microphone signal includes a near-end signal and a far-end signal (i.e. the echo signal).

When performing echo cancellation, the electronic equipment may receive a sound signal captured by a recording device such as a microphone, may receive a sound signal sent by other electronic equipment, or may receive a sound signal in some other way; these are not enumerated here.

For example, in a teleconference a recording device such as a microphone captures the sound signal. The captured signal includes not only the near-end signal in the room where the microphone is located but also the far-end signal that is transmitted from the far end and played through the loudspeaker.

Optionally, the recording device such as a microphone samples the input signal at 16 kHz.

Acoustic features are data features that characterize the sound signal.

Acoustic features can be extracted from the received sound signal with the STFT (short-time Fourier transform), with a wavelet transform, or in other ways.

Optionally, as shown in Fig. 2, step S110 may include steps S111, S112 and S113.

Step S111: divide the received microphone signal into time frames according to a preset time period.

The preset time period is a preconfigured time interval; the sound signal is divided into multiple time frames according to it.

Optionally, when the received microphone signal is divided into time frames according to the preset time period, every two adjacent time frames overlap by half of the preset time period.

In a specific exemplary embodiment, the received sound signal is divided into time frames of 20 milliseconds each, with an overlap of 10 milliseconds between adjacent frames. A 320-point STFT is then applied to each time frame of the input signal, which yields 161 frequency bins.

Step S112: extract a spectral magnitude vector from each time frame.

Step S113: normalize the spectral magnitude vector to form the acoustic features.

In one exemplary embodiment, the STFT is applied to each time frame to extract a spectral magnitude vector, and each spectral magnitude vector is normalized to form the acoustic features.
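The framing and STFT of steps S111 and S112 can be illustrated with a short NumPy sketch. It assumes 16 kHz audio, so that a 20 ms frame is 320 samples and a 10 ms hop is 160 samples. The Hann window and the function name `stft_magnitude_phase` are assumptions made for illustration; the text does not specify a window.

```python
import numpy as np

def stft_magnitude_phase(signal, frame_len=320, hop=160):
    """Split a 1-D signal into overlapping frames and return magnitude and phase per frame."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)  # assumed window; not specified in the text
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    spectrum = np.fft.rfft(frames, n=frame_len, axis=1)  # 320 points -> 161 bins
    return np.abs(spectrum), np.angle(spectrum)

fs = 16000
mic = np.random.default_rng(0).standard_normal(fs)  # 1 s of stand-in audio
mag, phase = stft_magnitude_phase(mic)
print(mag.shape)  # (99, 161)
```

The 161 bins follow from the real-input FFT: a 320-point transform keeps 320 / 2 + 1 non-redundant frequencies.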

Optionally, multiple consecutive frames centered on the current time frame are concatenated into a larger vector to form the acoustic features, which improves echo cancellation.

For example, when the spectral magnitude vectors are normalized, the spectral magnitude vectors of the current time frame and of past time frames are merged and normalized to form the acoustic features. Specifically, the previous 5 frames and the current time frame are spliced into one unified feature vector, which serves as the input of the invention. The number of past frames may also be smaller than 5, which improves the real-time performance of the application.

Thus, when acoustic features are extracted from the sound signal, the signal is divided into time frames according to a preset time period. Choosing a reasonable period lets the features extracted from each frame serve as input to the echo cancellation process, and selectively merging the spectral magnitude vectors of the current and past time frames to form the acoustic features can improve echo cancellation performance.
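The splicing of the previous 5 frames with the current frame might look as follows. This is a sketch under stated assumptions: the text does not say how the vectors are normalized, so per-bin zero-mean, unit-variance normalization is assumed here, and the first frame is repeated to pad the start of the signal. Both function names are hypothetical.

```python
import numpy as np

def normalize(mag, eps=1e-8):
    """Assumed normalization: zero mean and unit variance per frequency bin."""
    mean = mag.mean(axis=0, keepdims=True)
    std = mag.std(axis=0, keepdims=True)
    return (mag - mean) / (std + eps)

def splice_past_frames(features, n_past=5):
    """Concatenate each frame with its n_past predecessors (oldest first, current last)."""
    n_frames = features.shape[0]
    # Repeat the first frame so that early frames also have n_past predecessors.
    padded = np.vstack([np.repeat(features[:1], n_past, axis=0), features])
    return np.hstack([padded[k : k + n_frames] for k in range(n_past + 1)])

mag = np.abs(np.random.default_rng(0).standard_normal((99, 161)))
spliced = splice_past_frames(normalize(mag))
print(spliced.shape)  # (99, 966)
```

Only past frames are used, so the feature for frame t is available as soon as frame t arrives; lowering `n_past` shrinks the input vector.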

Step S120: run the acoustic features iteratively through the pre-trained recurrent neural network model with long short-term memory to compute the ratio mask for the acoustic features.

The ratio mask characterizes the relationship between the input signal and the near-end signal, and expresses the trade-off between suppressing the echo and preserving the near-end signal.

Ideally, masking the input signal with the ratio mask cancels the echo in the input signal and restores the near-end signal.

The recurrent neural network (RNN) with long short-term memory (LSTM), referred to below simply as the "LSTM", is trained in advance.

The acoustic features obtained in step S110 are fed to the LSTM model as input and processed iteratively in it to compute the ratio mask for those features.

In this step, the IRM (ideal ratio mask) is the target of the iterative computation. The IRM of each T-F (time-frequency) unit in the spectrogram can be expressed by the following equation:

where S_STFT(t, f) and Y_STFT(t, f) are the magnitudes of the near-end signal and of the microphone signal, respectively, in that time-frequency unit.
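The equation itself is not reproduced in this text. As a rough illustration only, the sketch below assumes the mask is the ratio of the near-end magnitude to the microphone magnitude in each T-F unit, clipped to the range [0, 1]. The clipping, the small constant `eps` and the function name `ideal_ratio_mask` are assumptions, and the exact form in the original may differ.

```python
import numpy as np

def ideal_ratio_mask(near_mag, mic_mag, eps=1e-8):
    """Assumed IRM: near-end magnitude over microphone magnitude, clipped to [0, 1]."""
    return np.clip(near_mag / (mic_mag + eps), 0.0, 1.0)

# Toy magnitudes for a 2 x 2 block of T-F units.
near_mag = np.array([[1.0, 0.5], [0.0, 2.0]])  # S_STFT(t, f)
mic_mag = np.array([[2.0, 0.5], [1.0, 1.0]])   # Y_STFT(t, f)
irm = ideal_ratio_mask(near_mag, mic_mag)
```

A value near 1 marks a unit dominated by near-end speech; a value near 0 marks a unit dominated by echo or noise.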

The ideal ratio mask is predicted during supervised training, and the ratio mask is then used to mask the acoustic features, yielding the near-end signal after echo cancellation.

Step S130: mask the acoustic features with the ratio mask.

Step S140: synthesize the masked acoustic features with the phase of the microphone signal to obtain the near-end signal after echo cancellation.

After training is complete, the trained LSTM model is used directly to suppress echo and background noise during inference or operation. Specifically, the trained LSTM model processes an input waveform to produce the estimated ratio mask. The corresponding acoustic features are then weighted (masked) with this ratio mask to produce the near-end signal with the echo removed.

In one exemplary embodiment, the masked spectral magnitude vector is passed together with the phase of the microphone signal to an inverse Fourier transform to produce the corresponding near-end signal in the time domain.
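Steps S130 and S140 can be sketched as a mask multiplication followed by an inverse transform and overlap-add. The square-root Hann analysis and synthesis windows are an assumption chosen so that an all-ones mask reproduces the input; the text only states that the masked magnitude and the microphone phase go to an inverse Fourier transform. Function names are hypothetical.

```python
import numpy as np

FRAME, HOP = 320, 160
# Periodic square-root Hann: analysis window times synthesis window sums to 1 at 50% overlap.
WIN = np.sqrt(0.5 - 0.5 * np.cos(2 * np.pi * np.arange(FRAME) / FRAME))

def stft(x):
    n = 1 + (len(x) - FRAME) // HOP
    frames = np.stack([x[i * HOP : i * HOP + FRAME] * WIN for i in range(n)])
    return np.fft.rfft(frames, axis=1)

def mask_and_resynthesize(mic, mask):
    """Scale the microphone magnitude by the mask, keep the microphone phase, overlap-add."""
    spec = stft(mic)
    masked_mag = mask * np.abs(spec)                      # step S130
    est_spec = masked_mag * np.exp(1j * np.angle(spec))   # reuse the microphone phase
    frames = np.fft.irfft(est_spec, n=FRAME, axis=1) * WIN
    out = np.zeros(HOP * (len(frames) - 1) + FRAME)
    for i, frame in enumerate(frames):                    # step S140: overlap-add
        out[i * HOP : i * HOP + FRAME] += frame
    return out

mic = np.random.default_rng(0).standard_normal(16000)
all_pass = np.ones((99, 161))  # a mask of ones should leave the signal unchanged
restored = mask_and_resynthesize(mic, all_pass)
```

In practice the mask would come from the trained LSTM rather than being all ones; the all-ones case is only a convenient sanity check of the resynthesis path.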

With the method described above, when echo cancellation is performed, acoustic features are extracted from the received input signal and run iteratively through the pre-trained recurrent neural network model with long short-term memory to compute a ratio mask, and the ratio mask is then used to mask the acoustic features. The masked acoustic features are then synthesized with the phase of the microphone signal to achieve echo cancellation. Because the scheme uses a pre-trained recurrent neural network model with long short-term memory, echo cancellation can be achieved in the presence of background noise, double-talk and nonlinear distortion, which greatly improves the effectiveness of echo cancellation and widens the range of applicable scenarios. No postfilter is needed, which simplifies the electronic equipment and reduces its cost.

Fig. 3 is a flowchart of one specific implementation of the method of constructing the recurrent neural network model with long short-term memory, according to the embodiment corresponding to Fig. 1. As shown in Fig. 3, the method of constructing the recurrent neural network model with long short-term memory may include steps S121, S122 and S123.

Step S121: determine the speakers whose speech serves as the near-end and far-end (reference) signals during training.

The speakers used for training can be chosen in many ways: specific speakers may be designated in advance, or the training speakers may be selected at random.

So that echo cancellation is not restricted to the training speakers, a variety of male and female voices are used for training.

In one exemplary embodiment, a preset number of speakers are randomly selected from the TIMIT data set (The DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus, an acoustic-phonetic continuous speech corpus built jointly by Texas Instruments, the Massachusetts Institute of Technology and SRI International).

The speech in the TIMIT data set is sampled at 16 kHz. The data set contains 6300 sentences in total: 630 people from the eight major dialect regions of the United States each read 10 given sentences, and all sentences are manually segmented and labelled at the phone level. Seventy percent of the speakers are male, and most speakers are white adults.

Step S122: collect the speakers' speech as near-end and far-end reference signals, and build the speech training set from it.

The echo signal is obtained from the far-end signal either by actual recording through a microphone or by artificial synthesis. The speech training set consists of near-end signals, far-end reference signals and microphone signals, where each microphone signal is a mixture of a near-end signal and an echo signal.

Optionally, 100 pairs of speakers are randomly chosen from the 630 speakers in the TIMIT data set as near-end and far-end speakers (40 male-female pairs, 30 male-male pairs and 30 female-female pairs). Each speaker has 10 utterances sampled at 16 kHz. Seven utterances of each of these speakers are used to generate the training microphone signals; each microphone signal is a mixture of a randomly chosen near-end utterance and the echo signal of a randomly chosen far-end utterance. The remaining 3 utterances are used to generate 300 test microphone signals. The entire training set lasts about 50 hours. To further examine generalization across speakers, another 10 pairs of speakers (4 male-female pairs, 3 male-male pairs and 3 female-female pairs) are randomly selected from the remaining 430 speakers in the TIMIT data set to generate 100 test mixtures from untrained speakers. The echo signals are recorded with a smartphone in a room of 2.7 × 3 × 4.5 meters, and each recorded echo signal is then added to a near-end signal to form a microphone signal.

Step S123: train a recurrent neural network with long short-term memory on the speech training set to construct the recurrent neural network model with long short-term memory.

LSTM is a type of recurrent neural network that was first published in 1997. Owing to its distinctive design, LSTM is well suited to processing and predicting important events in time series that are separated by very long intervals and delays.

LSTM usually performs better than other recurrent neural networks and hidden Markov models (HMMs), for example in unsegmented continuous handwriting recognition. In 2009, an artificial neural network model built with LSTM won first place in the ICDAR handwriting recognition competition. LSTM is also widely used for automatic speech recognition; in 2013 it set a record error rate of 17.7% on the TIMIT natural speech database. As a nonlinear model, LSTM can serve as a complex nonlinear unit for building larger deep neural networks.

LSTM is a specific type of RNN that can capture long-term context effectively. Compared with a traditional RNN, LSTM alleviates the vanishing and exploding gradient problems that arise over time during training. The memory cell of an LSTM block has three gates: an input gate, a forget gate and an output gate. The input gate controls how much current information is added to the memory cell, the forget gate controls how much previous information is retained, and the output gate controls whether information is output. Specifically, LSTM can be described by the following equations.

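$i_t = \sigma(W_{ix} x_t + W_{ih} h_{t-1} + b_i)$

$f_t = \sigma(W_{fx} x_t + W_{fh} h_{t-1} + b_f)$

$o_t = \sigma(W_{ox} x_t + W_{oh} h_{t-1} + b_o)$

$z_t = g(W_{zx} x_t + W_{zh} h_{t-1} + b_z)$

$c_t = f_t \odot c_{t-1} + i_t \odot z_t$

$h_t = o_t \odot g(c_t)$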

where i_t, f_t and o_t are the outputs of the input gate, the forget gate and the output gate, respectively. x_t and h_t denote the input feature and the hidden activation at time t. z_t and c_t denote the block input and the memory cell. σ denotes the sigmoid function, σ(x) = 1/(1 + e^(-x)), and g denotes the hyperbolic tangent function, g(x) = (e^x - e^(-x))/(e^x + e^(-x)). b_i, b_f, b_o and b_z are the biases of the input gate, the forget gate, the output gate and the block input, respectively. The symbol ⊙ denotes element-wise multiplication. The input gate and the forget gate are computed from the previous activation and the current input, and perform context-dependent updates of the memory cell.
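The six equations translate directly into a single time step in NumPy. This is a minimal sketch for illustration: the dictionary layout of the weights, the random initialization and the tiny layer sizes are assumptions and have nothing to do with the trained model.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W[k] = (W_kx, W_kh) and b[k] for k in 'i', 'f', 'o', 'z'."""
    pre = {k: W[k][0] @ x_t + W[k][1] @ h_prev + b[k] for k in "ifoz"}
    i_t = sigmoid(pre["i"])          # input gate
    f_t = sigmoid(pre["f"])          # forget gate
    o_t = sigmoid(pre["o"])          # output gate
    z_t = np.tanh(pre["z"])          # block input, g = tanh
    c_t = f_t * c_prev + i_t * z_t   # memory cell update (element-wise products)
    h_t = o_t * np.tanh(c_t)         # hidden activation
    return h_t, c_t

rng = np.random.default_rng(0)
n_in, n_hid = 4, 3  # toy sizes for illustration
W = {k: (rng.standard_normal((n_hid, n_in)), rng.standard_normal((n_hid, n_hid)))
     for k in "ifoz"}
b = {k: np.zeros(n_hid) for k in "ifoz"}
h, c = np.zeros(n_hid), np.zeros(n_hid)
for x_t in rng.standard_normal((5, n_in)):  # run five time steps
    h, c = lstm_step(x_t, h, c, W, b)
```

Because the forget gate multiplies the previous cell state rather than passing it through a squashing function, information can persist in `c` across many steps.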

Fig. 4 is a schematic flow diagram of echo cancellation according to an exemplary embodiment. As shown in Fig. 4, the input is the received input signal and the output is the near-end signal after echo cancellation. In the figure, "1" marks steps involved during training, "2" marks steps of the prediction (inference) stage, and "3" marks steps shared by training and prediction. As a supervised learning method, the invention uses the ideal ratio mask (IRM) as the training target. The IRM is obtained by comparing the STFT of the microphone signal with the STFT of its corresponding near-end signal. In the training stage, the RNN with LSTM estimates the IRM for each input signal (comprising the microphone signal and the far-end signal), and the MSE (mean squared error) between the estimated mask and the IRM is then computed. The MSE over the entire training set is minimized through repeated rounds of iteration, with each training sample used exactly once per round. After training is complete, the trained LSTM is used directly to suppress echo and background noise during inference or operation. Specifically, the trained LSTM processes the input signal and computes the ratio mask, the computed ratio mask is applied to the input signal, and the result is finally resynthesized to obtain the near-end signal after echo cancellation.

The output of the top layer passes through a sigmoid function (see Fig. 4) to produce the predicted ratio mask, which is then compared with the IRM. The comparison yields an MSE error that is used to adjust the LSTM weights.
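The sigmoid output and MSE comparison described above can be sketched as a loss function together with its gradient with respect to the top-layer pre-activations. The function name `mask_mse` and the random stand-in data are assumptions; the text does not specify the optimizer or how the error is propagated.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mask_mse(logits, irm):
    """MSE between the sigmoid-squashed top-layer output and the IRM, plus d(loss)/d(logits)."""
    pred = sigmoid(logits)                # predicted ratio mask in (0, 1)
    loss = np.mean((pred - irm) ** 2)
    # Chain rule: d(mean sq. error)/d(pred) times d(sigmoid)/d(logits).
    grad = 2.0 * (pred - irm) * pred * (1.0 - pred) / logits.size
    return loss, grad

rng = np.random.default_rng(0)
logits = rng.standard_normal((4, 161))       # stand-in top-layer output, 4 frames
irm = rng.uniform(0.0, 1.0, size=(4, 161))   # stand-in training target
loss, grad = mask_mse(logits, irm)
```

The sigmoid keeps every predicted mask value between 0 and 1, which matches the range of a ratio mask.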

Optionally, Fig. 5 is a flowchart of one specific implementation of step S123 in the method of constructing the recurrent neural network model with long short-term memory, according to the embodiment corresponding to Fig. 3. As shown in Fig. 5, step S123 may include steps S1231 and S1232.

Step S1231: extract the acoustic features of the microphone signal and of the far-end signal, respectively.

Step S1232: from the acoustic features of the microphone signal and the far-end signal, estimate the ideal ratio mask for echo cancellation with the recurrent neural network with long short-term memory, thereby constructing the recurrent neural network model with long short-term memory.

Optionally, Fig. 6 is a flowchart of another specific implementation of step S123 in the method of constructing the recurrent neural network model with long short-term memory, according to the embodiment corresponding to Fig. 3. As shown in Fig. 6, step S123 may include steps S1233, S1234 and S1235.

Step S1233: perform linear echo cancellation on the microphone signal with a traditional AEC algorithm.

The microphone signal is preprocessed by a traditional linear AEC algorithm, and the AEC output is used as an input signal of the LSTM to construct the recurrent neural network model with long short-term memory.

Step S1234: extract acoustic features from the far-end signal and from the linear AEC output, respectively.

Step S1235: from the acoustic features of the far-end signal and the linear AEC output, estimate the ideal ratio mask for echo cancellation with the recurrent neural network with long short-term memory, thereby constructing the recurrent neural network model with long short-term memory.
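The text does not name the traditional AEC algorithm used in step S1233. As one common stand-in, the sketch below uses a normalized least-mean-squares (NLMS) adaptive FIR filter; the choice of NLMS, the filter length, the step size and the synthetic three-tap echo path are all assumptions made for illustration.

```python
import numpy as np

def nlms_echo_canceller(far_end, mic, n_taps=64, mu=0.5, eps=1e-6):
    """Adaptive FIR echo canceller (NLMS); returns the residual after subtracting the echo estimate."""
    w = np.zeros(n_taps)      # estimated echo path
    buf = np.zeros(n_taps)    # most recent far-end samples, newest first
    out = np.zeros(len(mic))
    for n in range(len(mic)):
        buf = np.roll(buf, 1)
        buf[0] = far_end[n]
        err = mic[n] - w @ buf                   # microphone minus estimated echo
        w += mu * err * buf / (buf @ buf + eps)  # normalized LMS update
        out[n] = err
    return out

rng = np.random.default_rng(0)
far_end = rng.standard_normal(8000)
echo_path = np.array([0.6, -0.3, 0.1])  # hypothetical linear echo path
mic = np.convolve(far_end, echo_path)[: len(far_end)]  # echo only, no near-end speech
residual = nlms_echo_canceller(far_end, mic)
```

A linear filter of this kind cannot model loudspeaker nonlinearity and can diverge during double-talk, which is why its output is treated here as one more input feature rather than the final result.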

Optionally, Fig. 7 is according to the recurrent neural networks model with shot and long term memory shown in Fig. 3 corresponding embodiment Another specific implementation flow chart of step S123 in construction method.As shown in fig. 7, step S123 remove including step S1233, It can also include step S1236, step S1237 outside step S1234 and step S1235.

Step S1236 carries out the extraction of acoustic feature to remote signaling, microphone signal, linear AEC output respectively.

Step S1237, according to the acoustic feature that remote signaling, microphone signal, linear AEC are exported, by with length The recurrent neural network of phase memory carries out the estimation of ideal ratio film when echo cancellor, constructs the recurrence mind with shot and long term memory Through network model.

It will be by step S1231 and step S1232, using microphone signal, remote signaling as input signal, using having The recurrent neural network of shot and long term memory carries out the estimation of ideal ratio film when echo cancellor, and constructing has passing for shot and long term memory Neural network model is returned to be known as LSTM1.

By step S1233, step S1234 and step S1235, first pass through in advance traditional AEC algorithm to microphone signal into Row processing obtains AEC output.And using linear AEC output, remote signaling as input signal, passed using what is remembered with shot and long term The estimation of ideal ratio film when neural network being returned to carry out echo cancellor, constructs the recurrent neural networks model with shot and long term memory Referred to as LSTM2.

By step S1233, step S1236 and step S1237, remote signaling, microphone signal, linear AEC are exported As input signal, the estimation of ideal ratio film when carrying out echo cancellor using the recurrent neural network remembered with shot and long term, Constructing, there is the recurrent neural networks model of shot and long term memory to be known as LSTM3.

Compared with LSTM1, LSTM3 uses the output of the traditional AEC algorithm as an additional feature, which further improves the echo cancellation performed on the received input signal.
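As a non-limiting illustration, the three input configurations may be sketched as follows. The sketch assumes that the per-frame acoustic features of each signal are simply concatenated into one input vector; the description only states which signals serve as input, so the concatenation order and the function name `build_inputs` are hypothetical.

```python
from typing import Dict, List

Frame = List[float]  # acoustic feature vector of one time frame


def build_inputs(mic: List[Frame], far: List[Frame],
                 aec: List[Frame]) -> Dict[str, List[Frame]]:
    """Assemble per-frame input vectors for LSTM1, LSTM2 and LSTM3.

    Assumption: features are combined by concatenation (not stated in the source).
    """
    if not (len(mic) == len(far) == len(aec)):
        raise ValueError("all feature streams must have the same number of frames")
    return {
        # LSTM1: microphone signal + far-end signal
        "LSTM1": [m + f for m, f in zip(mic, far)],
        # LSTM2: linear AEC output + far-end signal
        "LSTM2": [a + f for a, f in zip(aec, far)],
        # LSTM3: far-end signal + microphone signal + linear AEC output
        "LSTM3": [f + m + a for f, m, a in zip(far, mic, aec)],
    }
```

Under this assumption, the input dimension of LSTM3 is one and a half times that of LSTM1 and LSTM2, since it carries one additional feature stream.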

Table 1 shows the results of three performance metrics, STOI (Short-Time Objective Intelligibility), PESQ (Perceptual Evaluation of Speech Quality) and ERLE (Echo Return Loss Enhancement), when echo cancellation is performed with the three models LSTM1, LSTM2 and LSTM3. All three models used here have two hidden layers with 512 units each. "None" denotes the result for the unprocessed signal; "Ideal" denotes the result obtained with the ideal ratio mask, which can be regarded as an upper bound on the achievable result.
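For reference, ERLE is commonly defined as the ratio, in decibels, of the microphone signal power to the power of the signal remaining after echo cancellation, usually measured over segments in which only the far-end talker is active. The following minimal sketch assumes that common definition; the description does not give the formula, and the function name `erle_db` and the small constant `eps` are hypothetical.

```python
import math
from typing import Sequence


def erle_db(mic: Sequence[float], output: Sequence[float],
            eps: float = 1e-12) -> float:
    """Echo Return Loss Enhancement in dB (assumed common definition).

    mic    -- microphone samples from a far-end-only segment
    output -- samples of the same segment after echo cancellation
    eps    -- small constant guarding against division by zero
    """
    mic_power = sum(x * x for x in mic) / len(mic)
    out_power = sum(x * x for x in output) / len(output)
    return 10.0 * math.log10((mic_power + eps) / (out_power + eps))
```

For example, attenuating the echo amplitude by a factor of ten corresponds to an ERLE of about 20 dB; a higher ERLE indicates stronger echo suppression.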

Table 1: STOI, PESQ and ERLE results of the tested AEC systems

As shown in Table 1, all three models LSTM1, LSTM2 and LSTM3 achieve better echo cancellation than the traditional AEC algorithm. Combining the traditional AEC algorithm with deep learning further improves system performance. LSTM3 improves STOI more significantly than LSTM2.

To further illustrate the results, Fig. 8 shows, according to an exemplary embodiment, spectrograms of a microphone signal recorded with a smartphone and of the near-end signal. Fig. 8(a) shows the spectrogram of the microphone signal; Fig. 8(b) shows the spectrogram of the corresponding near-end signal; Fig. 8(c) and Fig. 8(d) compare the spectra obtained after echo cancellation with the traditional linear AEC algorithm and with the LSTM3 model, where Fig. 8(c) shows the spectrogram of the linear AEC output and Fig. 8(d) shows the spectrogram of the near-end signal obtained after echo cancellation by LSTM3. It can be seen that the output after echo cancellation by LSTM3 closely resembles the clean near-end signal. This indicates that the proposed method preserves the near-end signal well while suppressing echo with nonlinear distortion as well as background noise.

With the method described above, echo cancellation performance can be effectively improved when echo cancellation is performed on the input signal by means of the constructed recurrent neural network model with long short-term memory.

The following are apparatus embodiments of the present disclosure, which can be used to carry out the above embodiments of the deep-learning-based echo cancellation method. For details not disclosed in the apparatus embodiments, refer to the embodiments of the deep-learning-based echo cancellation method of the present disclosure.

Fig. 9 is a block diagram of a deep-learning-based echo cancellation apparatus according to an exemplary embodiment. The apparatus includes, but is not limited to: an acoustic feature extraction module 110, a ratio mask calculation module 120, a masking module 130 and a speech synthesis module 140.

The acoustic feature extraction module 110 is configured to extract an acoustic feature from a received input signal, the input signal including a microphone signal and a far-end signal;

The ratio mask calculation module 120 is configured to iteratively process the acoustic feature in a pre-trained recurrent neural network model with long short-term memory and to calculate the ratio mask of the acoustic feature;

The masking module 130 is configured to mask the acoustic feature using the ratio mask;

The speech synthesis module 140 is configured to synthesize the masked acoustic feature with the phase of the microphone signal to obtain the near-end signal after echo cancellation.

For the implementation of the functions and effects of the modules in the above apparatus, refer to the implementation of the corresponding steps in the above deep-learning-based echo cancellation method; details are not repeated here.
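As a non-limiting illustration of the masking and synthesis stages, the following sketch processes a single time frame. It assumes that the ratio mask is applied to the spectral magnitudes of the microphone signal and that the masked magnitudes are recombined with the microphone phase before an inverse transform. A naive discrete Fourier transform from the standard library keeps the example self-contained, and overlap-add across frames is omitted. The function names `dft`, `idft` and `mask_and_resynthesize` are hypothetical.

```python
import cmath
from typing import List


def dft(frame: List[float]) -> List[complex]:
    """Naive discrete Fourier transform of one time frame."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum: List[complex]) -> List[float]:
    """Naive inverse transform; returns the real part of each sample."""
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                 for k in range(n)) / n).real
            for t in range(n)]


def mask_and_resynthesize(mic_frame: List[float], mask: List[float]) -> List[float]:
    """Apply a ratio mask to the magnitudes and reuse the microphone phase."""
    spectrum = dft(mic_frame)
    magnitudes = [abs(c) for c in spectrum]
    phases = [cmath.phase(c) for c in spectrum]           # microphone phase
    masked = [m * g for m, g in zip(magnitudes, mask)]    # masking stage
    estimate = [cmath.rect(r, p) for r, p in zip(masked, phases)]
    return idft(estimate)                                 # synthesis stage
```

A mask of all ones leaves the frame unchanged, and a mask of all zeros silences it; an estimated mask between these extremes attenuates the time-frequency bins dominated by echo.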

Optionally, as shown in Fig. 10, the acoustic feature extraction module 110 of Fig. 9 includes, but is not limited to: a time frame division unit 111, a spectral magnitude vector extraction unit 112 and an acoustic feature formation unit 113.

The time frame division unit 111 is configured to divide the received microphone signal into time frames according to a preset time period;

The spectral magnitude vector extraction unit 112 is configured to extract a spectral magnitude vector from each time frame;

The acoustic feature formation unit 113 is configured to normalize the spectral magnitude vector to form the acoustic feature.

Optionally, the time frame division unit 111 of Fig. 10 includes, but is not limited to: a time frame division subunit.

The time frame division subunit is configured to divide the received microphone signal into time frames according to the preset time period, with every two adjacent time frames overlapping by half of the preset time period.

Optionally, the acoustic feature formation unit 113 of Fig. 10 includes, but is not limited to: a multi-frame normalization subunit.

The multi-frame normalization subunit is configured to concatenate the spectral magnitude vectors of the current time frame and of past time frames and to normalize the result to form the acoustic feature.
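As a non-limiting illustration, the feature extraction described above may be sketched as follows. The sketch assumes a frame length given in samples, a naive discrete Fourier transform for the spectral magnitudes, zero-mean and unit-variance normalization, and repetition of the first frame where past frames are missing. None of these choices is fixed by the description, and all function names are hypothetical.

```python
import cmath
import math
from typing import List


def split_frames(signal: List[float], frame_len: int) -> List[List[float]]:
    """Divide a signal into frames that overlap by half a frame."""
    hop = frame_len // 2
    return [signal[s:s + frame_len]
            for s in range(0, len(signal) - frame_len + 1, hop)]


def magnitude_spectrum(frame: List[float]) -> List[float]:
    """Spectral magnitude vector of one frame (non-negative frequencies only)."""
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2 + 1)]


def normalize(vec: List[float]) -> List[float]:
    """Zero-mean, unit-variance normalization (assumed scheme)."""
    mean = sum(vec) / len(vec)
    std = math.sqrt(sum((v - mean) ** 2 for v in vec) / len(vec)) or 1.0
    return [(v - mean) / std for v in vec]


def features_with_context(mags: List[List[float]], past: int) -> List[List[float]]:
    """Concatenate the current frame with `past` earlier frames, then normalize."""
    feats = []
    for i in range(len(mags)):
        context: List[float] = []
        for j in range(i - past, i + 1):
            context.extend(mags[max(j, 0)])  # repeat first frame at the start
        feats.append(normalize(context))
    return feats
```

With a frame length of 4 samples, an 8-sample signal yields three frames starting at samples 0, 2 and 4, and with one past frame of context each feature vector contains two spectral magnitude vectors.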

Optionally, as shown in Fig. 11, the ratio mask calculation module 120 of Fig. 9 further includes, but is not limited to: a speech determination submodule 121, a speech training set building submodule 122 and a model construction submodule 123.

The speech determination submodule 121 is configured to determine the speech that serves as the near-end signal and the far-end (reference) signal during training;

The speech training set building submodule 122 is configured to collect the far-end signal and the near-end signal obtained when the speech is used as the far end and the near end, respectively, and to build a speech training set from them, wherein the far-end signal is the echo signal, and the near-end signal and the echo signal form the microphone signal;

The model construction submodule 123 is configured to train the recurrent neural network with long short-term memory on the speech training set, thereby constructing the recurrent neural network model with long short-term memory.
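As a non-limiting illustration of how one training example might be prepared, the following sketch forms the microphone signal as the sum of the near-end signal and the echo signal and computes a training target per frequency bin. It assumes the ideal ratio mask definition commonly used in speech separation, the square root of the near-end energy divided by the total energy, with any noise term omitted; this formula is not given in the present passage, and the function names are hypothetical.

```python
import math
from typing import List


def mix_microphone(near: List[float], echo: List[float]) -> List[float]:
    """Microphone signal formed from the near-end signal and the echo signal."""
    return [s + d for s, d in zip(near, echo)]


def ideal_ratio_mask(near_mag: List[float], echo_mag: List[float],
                     eps: float = 1e-12) -> List[float]:
    """Per-bin training target (assumed common definition, noise term omitted).

    near_mag -- spectral magnitudes of the near-end signal for one frame
    echo_mag -- spectral magnitudes of the echo signal for the same frame
    eps      -- small constant guarding against division by zero
    """
    return [math.sqrt(s * s / (s * s + d * d + eps))
            for s, d in zip(near_mag, echo_mag)]
```

Under this definition the target lies between 0 and 1: it approaches 1 in bins dominated by near-end speech and 0 in bins dominated by echo.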

Optionally, as shown in Fig. 12, the model construction submodule 123 of Fig. 11 further includes, but is not limited to: a first acoustic feature unit 1231 and a first model construction unit 1232.

The first acoustic feature unit 1231 is configured to extract the acoustic features of the microphone signal and the far-end signal, respectively;

The first model construction unit 1232 is configured to estimate, according to the acoustic features of the microphone signal and the far-end signal, the ideal ratio mask for echo cancellation by means of the recurrent neural network with long short-term memory, thereby constructing the recurrent neural network model with long short-term memory.

Optionally, as shown in Fig. 13, the model construction submodule 123 of Fig. 11 may further include, but is not limited to: a linear AEC processing unit 1233, a second acoustic feature unit 1234 and a second model construction unit 1235.

The linear AEC processing unit 1233 is configured to process the microphone signal with a traditional AEC algorithm to obtain the linear AEC output;

The second acoustic feature unit 1234 is configured to extract acoustic features from the far-end signal and from the linear AEC output produced by the traditional AEC algorithm, respectively;

The second model construction unit 1235 is configured to estimate, according to the acoustic features of the far-end signal and the linear AEC output, the ideal ratio mask for echo cancellation by means of the recurrent neural network with long short-term memory, thereby constructing the recurrent neural network model with long short-term memory.

Optionally, as shown in Fig. 14, the model construction submodule 123 of Fig. 11 may further include, but is not limited to: a third acoustic feature unit 1236 and a third model construction unit 1237.

The third acoustic feature unit 1236 is configured to extract acoustic features from the far-end signal, the microphone signal and the linear AEC output, respectively;

The third model construction unit 1237 is configured to estimate, according to the acoustic features of the far-end signal, the microphone signal and the linear AEC output, the ideal ratio mask for echo cancellation by means of the recurrent neural network with long short-term memory, thereby constructing the recurrent neural network model with long short-term memory.

Optionally, the present invention further provides an electronic device that performs all or some of the steps of the deep-learning-based echo cancellation method shown in any of the above exemplary embodiments. The electronic device includes:

a processor; and

a memory communicatively connected to the processor; wherein

the memory stores readable instructions which, when executed by the processor, implement the method described in any of the above exemplary embodiments.

The specific manner in which the processor of the terminal in this embodiment performs operations has been described in detail in the embodiments of the deep-learning-based echo cancellation method and is not elaborated here.

In an exemplary embodiment, a storage medium is further provided. The storage medium is a computer-readable storage medium, which may be, for example, a transitory or non-transitory computer-readable storage medium including instructions.

It should be understood that the present invention is not limited to the precise structures described above and shown in the accompanying drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present invention is limited only by the appended claims.

Claims

1. A deep-learning-based echo cancellation method, characterized in that the method comprises:

extracting an acoustic feature from a received microphone signal, the microphone signal including a near-end signal and a far-end signal;

iteratively processing the acoustic feature in a pre-trained recurrent neural network model with long short-term memory, and calculating the ratio mask of the acoustic feature;

masking the acoustic feature using the ratio mask;

synthesizing the masked acoustic feature with the phase of the microphone signal to obtain the near-end signal after echo cancellation.

2. The method according to claim 1, characterized in that the step of extracting an acoustic feature from the received microphone signal comprises:

dividing the received microphone signal into time frames according to a preset time period, the microphone signal including a near-end signal and a far-end signal;

extracting a spectral magnitude vector from each time frame;

normalizing the spectral magnitude vector to form the acoustic feature.

3. The method according to claim 2, characterized in that the step of normalizing the spectral magnitude vector to form the acoustic feature comprises:

4. The method according to claim 1, characterized in that the method for constructing the pre-trained recurrent neural network model with long short-term memory comprises:

collecting the far-end signal and the near-end signal obtained when speech is used as the far end and the near end, respectively, and building a speech training set from them, wherein the far-end signal is the echo signal, and the near-end signal and the echo signal form the microphone signal;

training the recurrent neural network with long short-term memory on the speech training set, thereby constructing the recurrent neural network model with long short-term memory.

5. The method according to claim 4, characterized in that the step of training the recurrent neural network with long short-term memory on the speech training set to construct the recurrent neural network model with long short-term memory comprises:

estimating, according to the acoustic features of the microphone signal and the far-end signal, the ideal ratio mask for echo cancellation by means of the recurrent neural network with long short-term memory, thereby constructing the recurrent neural network model with long short-term memory.

6. The method according to claim 4, characterized in that the step of training the recurrent neural network with long short-term memory on the speech training set to construct the recurrent neural network model with long short-term memory may further comprise:

extracting acoustic features from the far-end signal and from the linear AEC output obtained by performing linear echo cancellation with a traditional AEC algorithm, respectively;

estimating, according to the acoustic features of the far-end signal and the linear AEC output, the ideal ratio mask for echo cancellation by means of the recurrent neural network with long short-term memory, thereby constructing the recurrent neural network model with long short-term memory.

7. The method according to claim 6, characterized in that the method may further comprise:

estimating, according to the acoustic features of the far-end signal, the microphone signal and the linear AEC output, the ideal ratio mask for echo cancellation by means of the recurrent neural network with long short-term memory, thereby constructing the recurrent neural network model with long short-term memory.

8. A deep-learning-based echo cancellation apparatus, characterized in that the apparatus comprises:

an acoustic feature extraction module, configured to extract an acoustic feature from a received input signal, the input signal including a microphone signal and a far-end signal;

a ratio mask calculation module, configured to iteratively process the acoustic feature in a pre-trained recurrent neural network model with long short-term memory and to calculate the ratio mask of the acoustic feature;

a speech synthesis module, configured to synthesize the masked acoustic feature with the phase of the microphone signal to obtain the near-end signal after echo cancellation.

9. An electronic device, characterized in that the electronic device comprises:

at least one processor; and

a memory storing instructions executable by the at least one processor, the instructions being executed by the at least one processor so that the at least one processor can perform the method according to any one of claims 1 to 7.

10. A computer-readable storage medium for storing a program, characterized in that the program, when executed, causes an electronic device to perform the method according to any one of claims 1 to 7.