Summary of the invention
To address the technical problems in the related art that echo cancellation performs poorly and requires a post-filter, the present disclosure provides a deep-learning-based echo cancellation method, apparatus, electronic device, and storage medium.
In a first aspect, a deep-learning-based echo cancellation method is provided, comprising:
extracting acoustic features from a received microphone signal, the microphone signal including a near-end signal and a far-end signal;
performing iterative computation on the acoustic features in a pre-trained recurrent neural network model with long short-term memory to calculate a ratio mask for the acoustic features;
masking the acoustic features using the ratio mask;
synthesizing the masked acoustic features with the phase of the microphone signal to obtain the echo-cancelled near-end signal.
Optionally, the step of extracting acoustic features from the received microphone signal comprises:
dividing the received microphone signal into time frames according to a preset time period, the microphone signal including a near-end signal and a far-end signal;
extracting a spectral magnitude vector from each time frame;
normalizing the spectral magnitude vector to form the acoustic features.
Optionally, the step of normalizing the spectral magnitude vector to form the acoustic features comprises:
combining the spectral magnitude vectors of the current time frame and of past time frames and normalizing them to form the acoustic features.
Optionally, the method for constructing the pre-trained recurrent neural network model with long short-term memory comprises:
determining the speakers whose speech serves as the near-end and far-end (reference) signals during training;
collecting the far-end signals and near-end signals produced when the speakers act as the far end and the near end, and establishing a speech training set from them, wherein the far-end signal is the echo signal, and the near-end signal and the echo signal form the microphone signal;
training on the speech training set with the recurrent neural network with long short-term memory to construct the recurrent neural network model with long short-term memory.
Optionally, the step of training on the speech training set with the recurrent neural network with long short-term memory to construct the recurrent neural network model with long short-term memory comprises:
extracting the acoustic features of the microphone signal and of the far-end (echo) signal, respectively;
estimating, from the acoustic features of the microphone signal and the far-end signal, the ideal ratio mask for echo cancellation with the recurrent neural network with long short-term memory, thereby constructing the recurrent neural network model with long short-term memory.
Optionally, the step of training on the speech training set with the recurrent neural network with long short-term memory to construct the recurrent neural network model with long short-term memory may alternatively comprise:
performing linear echo cancellation on the microphone signal with a traditional AEC algorithm;
extracting acoustic features from the far-end signal and from the linear AEC output produced by the traditional AEC algorithm, respectively;
estimating, from the acoustic features of the far-end signal and the linear AEC output, the ideal ratio mask for echo cancellation with the recurrent neural network with long short-term memory, thereby constructing the recurrent neural network model with long short-term memory.
Optionally, the method may further comprise:
extracting acoustic features from the far-end signal, the microphone signal, and the linear AEC output, respectively;
estimating, from the acoustic features of the far-end signal, the microphone signal, and the linear AEC output, the ideal ratio mask for echo cancellation with the recurrent neural network with long short-term memory, thereby constructing the recurrent neural network model with long short-term memory.
In a second aspect, a deep-learning-based echo cancellation apparatus is provided, comprising:
an acoustic feature extraction module, configured to extract acoustic features from a received input signal, the input signal including a microphone signal and a far-end signal;
a ratio mask computing module, configured to perform iterative computation on the acoustic features in a pre-trained recurrent neural network model with long short-term memory to calculate a ratio mask for the acoustic features;
a masking module, configured to mask the acoustic features using the ratio mask;
a speech synthesis module, configured to synthesize the masked acoustic features with the phase of the microphone signal to obtain the echo-cancelled near-end signal.
Optionally, the ideal ratio mask is used as the training target of the recurrent neural network model with long short-term memory.
In a third aspect, an electronic device is provided, comprising:
at least one processor; and
a memory communicatively connected to the at least one processor; wherein
the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, enable the at least one processor to perform the method of the first aspect.
In a fourth aspect, a computer-readable storage medium is provided for storing a program which, when executed, causes an electronic device to perform the method of the first aspect.
The technical solutions provided by the embodiments of the present disclosure can have the following beneficial effects:
When echo cancellation is performed, acoustic features are extracted from the received microphone signal, iterative computation is performed on the acoustic features in the pre-trained recurrent neural network model with long short-term memory to calculate their ratio mask, and the acoustic features are then masked using the ratio mask. The masked acoustic features are then synthesized with the phase of the microphone signal to achieve echo cancellation. Because this solution uses a pre-trained recurrent neural network model with long short-term memory, echo cancellation can be achieved in the presence of background noise, double-talk, and nonlinear distortion, which greatly improves the effectiveness of echo cancellation and broadens the scenarios to which it applies. Moreover, no post-filter is needed, which simplifies the electronic device and reduces its cost.
It should be understood that the above general description and the following detailed description are merely exemplary and do not limit the scope of the present disclosure.
Detailed description
Exemplary embodiments will now be described in detail, examples of which are illustrated in the accompanying drawings. When the following description refers to the drawings, the same numbers in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present invention. Rather, they are merely examples of apparatuses and methods consistent with some aspects of the invention as detailed in the appended claims.
Fig. 1 is a flowchart of a deep-learning-based echo cancellation method according to an exemplary embodiment. The deep-learning-based echo cancellation method can be used in electronic devices such as smartphones and computers. As shown in Fig. 1, the method may include step S110, step S120, step S130, and step S140.
Step S110: extract acoustic features from the received microphone signal, the microphone signal including a near-end signal and a far-end signal (i.e., the echo signal).
The microphone signal is the speech signal received when echo cancellation is performed. A recording device such as a microphone captures both the near-end signal and the echo signal; that is, the microphone signal includes the near-end signal and the far-end signal (i.e., the echo signal).
When performing echo cancellation, the electronic device may receive a speech signal captured by a recording device such as a microphone, may receive a speech signal sent by another electronic device, or may receive a speech signal in some other way; these are not described one by one here.
For example, in a teleconference a recording device such as a microphone captures the speech signal. The captured speech signal includes not only the near-end signal from the room in which the microphone is located, but also the far-end signal transmitted from the remote side and played through the loudspeaker.
Optionally, the recording device such as a microphone captures the input signal at a sampling frequency of 16 kHz.
An acoustic feature is a data feature that characterizes the speech signal.
When acoustic features are extracted from the received speech signal, the STFT (short-time Fourier transform) may be applied to the speech signal, a wavelet transform may be applied instead, or the acoustic features may be extracted from the received speech signal in some other form.
Optionally, as shown in Fig. 2, step S110 may include step S111, step S112, and step S113.
Step S111: divide the received microphone signal into time frames according to a preset time period.
The preset time period is a preconfigured time interval according to which the speech signal is divided into multiple time frames.
Optionally, the received microphone signal is divided into time frames according to the preset time period such that every two adjacent time frames overlap by half of the preset time period.
In a specific exemplary embodiment, the received speech signal is divided into multiple time frames of 20 milliseconds each, with an overlap of 10 milliseconds between every two adjacent time frames. A 320-point STFT is then applied to each time frame of the input signal, which yields 161 frequency bins.
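The framing described above can be sketched as follows. This is a minimal illustration assuming a 16 kHz sampling rate and NumPy; the function name `stft_magnitude_phase` and the Hann analysis window are assumptions that do not come from this disclosure.

```python
import numpy as np

def stft_magnitude_phase(signal, sample_rate=16000, frame_ms=20, hop_ms=10):
    """Split a signal into overlapping frames and return STFT magnitude and phase."""
    frame_len = sample_rate * frame_ms // 1000   # 320 samples at 16 kHz
    hop_len = sample_rate * hop_ms // 1000       # 160 samples, i.e. half-frame overlap
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    window = np.hanning(frame_len)               # window choice is an assumption
    frames = np.stack([
        signal[i * hop_len : i * hop_len + frame_len] * window
        for i in range(n_frames)
    ])
    # A 320-point real FFT yields 320 / 2 + 1 = 161 frequency bins per frame.
    spectrum = np.fft.rfft(frames, n=frame_len, axis=1)
    return np.abs(spectrum), np.angle(spectrum)

# One second of noise stands in for a recorded microphone signal.
one_second = np.random.default_rng(0).standard_normal(16000)
magnitude, phase = stft_magnitude_phase(one_second)
print(magnitude.shape)  # (99, 161)
```

The phase is returned alongside the magnitude because the synthesis step later reuses the phase of the microphone signal.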
Step S112: extract a spectral magnitude vector from each time frame.
Step S113: normalize the spectral magnitude vector to form the acoustic features.
In one exemplary embodiment, the STFT is applied to each time frame to extract a spectral magnitude vector, and each spectral magnitude vector forms an acoustic feature after normalization.
Optionally, multiple successive frames centered on the current time frame are concatenated into a larger vector to form the acoustic features, which improves the echo cancellation result.
For example, when the spectral magnitude vector is normalized, the spectral magnitude vectors of the current time frame and of past time frames are combined and normalized to form the acoustic features. Specifically, the previous 5 frames and the current time frame are spliced into one unified feature vector, which serves as the input of the present invention. The number of past time frames may also be fewer than 5, which improves the real-time performance of the application.
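The splicing of past frames with the current frame can be illustrated with the sketch below. It assumes a feature matrix with one row per time frame; the function name `splice_past_frames` and the choice to repeat the first frame as padding are illustrative assumptions, and the normalization step is left out because its exact form is not specified here.

```python
import numpy as np

def splice_past_frames(features, n_past=5):
    """Concatenate each frame with its n_past preceding frames (causal context)."""
    n_frames, _ = features.shape
    # Repeat the first frame so that early frames still get a full context (assumption).
    padded = np.concatenate([np.repeat(features[:1], n_past, axis=0), features])
    # Slice k holds the frame that lies (n_past - k) steps before the current one.
    return np.concatenate(
        [padded[k : k + n_frames] for k in range(n_past + 1)], axis=1
    )

frames = np.arange(12, dtype=float).reshape(4, 3)   # 4 frames, 3 bins each
spliced = splice_past_frames(frames, n_past=2)
print(spliced.shape)  # (4, 9)
```

With 161 bins per frame and 5 past frames, each spliced vector would have 6 × 161 = 966 entries; using fewer past frames shortens the vector accordingly.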
Therefore, when acoustic features are extracted from the speech signal, the speech signal is divided into time frames according to the preset time period. Choosing a reasonable time period allows the acoustic features extracted from each time frame to serve as input to the echo cancellation process, and selectively combining the spectral magnitude vectors of the current time frame and of past time frames to form the acoustic features can improve echo cancellation performance.
Step S120: perform iterative computation on the acoustic features in the pre-trained recurrent neural network model with long short-term memory to calculate the ratio mask of the acoustic features.
The ratio mask characterizes the relationship between the input signal and the near-end signal, and represents the trade-off between suppressing the echo and preserving the near-end signal.
Ideally, after the input signal is masked with the ratio mask, the echo in the input signal is cancelled and the near-end signal is restored.
The recurrent neural network (RNN) with long short-term memory (LSTM) (hereinafter, "the recurrent neural network with long short-term memory" is referred to as "LSTM") is trained in advance.
The acoustic features obtained in step S110 are fed to the LSTM model as input, and iterative computation in the LSTM model yields the ratio mask for those acoustic features.
In this step, the IRM (ideal ratio mask) serves as the target of the iterative computation. The IRM of each T-F (time-frequency) unit in the spectrogram is expressed in terms of S_STFT(t, f) and Y_STFT(t, f), which are the magnitudes of the near-end signal and of the microphone signal, respectively, in that time-frequency unit.
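Because both quantities are magnitudes, one natural reading of this definition is a per-unit ratio of the near-end magnitude to the microphone magnitude. The following is a sketch under that assumption and is not necessarily the exact expression used in this disclosure:

```latex
\mathrm{IRM}(t, f) = \frac{S_{\mathrm{STFT}}(t, f)}{Y_{\mathrm{STFT}}(t, f)}
```

Under this reading, values near 1 mark time-frequency units dominated by the near-end signal, and values near 0 mark units dominated by echo or noise.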
The ideal ratio mask is predicted during supervised training, and the ratio mask is then used to mask the acoustic features so as to obtain the echo-cancelled near-end signal.
Step S130: mask the acoustic features using the ratio mask.
Step S140: synthesize the masked acoustic features with the phase of the microphone signal to obtain the echo-cancelled near-end signal.
After training is complete, during inference or operation, the trained LSTM model is used directly to suppress echo and background noise. Specifically, the trained LSTM model operates on an input waveform to produce an estimated ratio mask. This ratio mask is then used to weight (or mask) the echo-corrupted acoustic features, producing the near-end signal with the echo removed.
In one exemplary embodiment, the masked spectral magnitude vector is passed, together with the phase of the microphone signal, to an inverse Fourier transform to output the corresponding near-end signal in the time domain.
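The masking and synthesis steps can be sketched as follows, assuming 320-sample frames with a 160-sample hop. The function names `apply_ratio_mask` and `resynthesize` and the plain overlap-add reconstruction are illustrative assumptions; a practical system would also compensate for the analysis window.

```python
import numpy as np

def apply_ratio_mask(mic_magnitude, ratio_mask):
    """Weight each time-frequency unit of the microphone magnitude by the mask."""
    return ratio_mask * mic_magnitude

def resynthesize(masked_magnitude, mic_phase, frame_len=320, hop_len=160):
    """Combine masked magnitudes with the microphone phase and overlap-add the frames."""
    spectrum = masked_magnitude * np.exp(1j * mic_phase)
    frames = np.fft.irfft(spectrum, n=frame_len, axis=1)   # back to the time domain
    output = np.zeros(hop_len * (len(frames) - 1) + frame_len)
    for i, frame in enumerate(frames):
        output[i * hop_len : i * hop_len + frame_len] += frame
    return output

# With a mask of all ones the frames pass through unchanged before overlap-add.
rng = np.random.default_rng(0)
mic_frames = rng.standard_normal((3, 320))
mic_spectrum = np.fft.rfft(mic_frames, axis=1)
mask = np.ones_like(np.abs(mic_spectrum))
masked = apply_ratio_mask(np.abs(mic_spectrum), mask)
waveform = resynthesize(masked, np.angle(mic_spectrum))
print(waveform.shape)  # (640,)
```

Reusing the microphone phase means only the magnitude is estimated by the network, which is why the phase must be kept from the feature extraction step.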
With the method described above, when echo cancellation is performed, acoustic features are extracted from the received input signal, iterative computation is performed on the acoustic features in the pre-trained recurrent neural network model with long short-term memory to calculate their ratio mask, and the acoustic features are then masked using the ratio mask. The masked acoustic features are then synthesized with the phase of the microphone signal to achieve echo cancellation. Because this solution uses a pre-trained recurrent neural network model with long short-term memory, echo cancellation can be achieved in the presence of background noise, double-talk, and nonlinear distortion, which greatly improves the effectiveness of echo cancellation and broadens the scenarios to which it applies. Moreover, no post-filter is needed, which simplifies the electronic device and reduces its cost.
Fig. 3 is a flowchart of one specific implementation of the method for constructing the recurrent neural network model with long short-term memory according to the embodiment corresponding to Fig. 1. As shown in Fig. 3, the method for constructing the recurrent neural network model with long short-term memory may include step S121, step S122, and step S123.
Step S121: determine the speakers whose speech serves as the near-end and far-end (reference) signals during training.
The speakers used for training can be chosen in several ways: specific speakers may be designated in advance, or the speakers may be selected at random.
So that echo cancellation is not restricted to the speakers used in training, training is performed with a variety of male and female voices.
In one exemplary embodiment, a preset number of speakers are randomly selected from the TIMIT data set (The DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus, an acoustic-phonetic continuous speech corpus built jointly by Texas Instruments, the Massachusetts Institute of Technology, and SRI International).
The TIMIT data set has a speech sampling frequency of 16 kHz and contains 6300 sentences in total: each of 630 speakers from eight major dialect regions of the United States reads 10 given sentences, and all sentences are manually segmented and labeled at the phone level. Of the speakers, 70% are male, and most speakers are white adults.
Step S122: collect the speech of the speakers as near-end signals and far-end reference signals, and establish a speech training set from them.
The echo signal is obtained from the far-end signal either by actual recording through a microphone or by artificial synthesis. The speech training set consists of near-end signals, far-end reference signals, and microphone signals, where each microphone signal is a mixture of a near-end signal and an echo signal.
Optionally, 100 pairs of speakers are randomly selected from the 630 speakers in the TIMIT data set as near-end and far-end speakers (40 male-female pairs, 30 male-male pairs, and 30 female-female pairs). Each speaker has 10 utterances sampled at 16 kHz. Seven utterances of each speaker are used to generate multiple microphone signals, each of which is a mixture of a randomly selected near-end utterance and the echo signal of a randomly selected far-end utterance. The remaining 3 utterances are used to generate 300 test microphone signals. The entire training set lasts about 50 hours. To further address generalization across speakers, another 10 pairs of speakers (4 male-female pairs, 3 male-male pairs, and 3 female-female pairs) are randomly selected from the remaining 430 speakers in the TIMIT data set to generate 100 test mixtures from speakers not seen in training. The echo signals are recorded with a smartphone in a room measuring 2.7 × 3 × 4.5 meters, and each recorded echo signal is then added to a near-end signal to form a microphone signal.
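The final mixing step, in which a recorded echo is added to a near-end utterance, can be sketched as below. The function name `make_microphone_signal` and the zero-padding of the shorter signal are assumptions for illustration; the disclosure does not state how length mismatches are handled.

```python
import numpy as np

def make_microphone_signal(near_end, echo):
    """Add a recorded echo to a near-end utterance, zero-padding the shorter signal."""
    length = max(len(near_end), len(echo))
    mic = np.zeros(length)
    mic[: len(near_end)] += near_end
    mic[: len(echo)] += echo
    return mic

# Tiny arrays stand in for a near-end utterance and a recorded echo.
mic = make_microphone_signal(np.array([1.0, 2.0, 3.0]), np.array([1.0, 1.0]))
print(mic)  # [2. 3. 3.]
```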
Step S123: train on the speech training set with the recurrent neural network with long short-term memory to construct the recurrent neural network model with long short-term memory.
LSTM is a type of temporal recurrent neural network, first published in 1997. Owing to its distinctive design, LSTM is well suited to processing and predicting important events that are separated by very long intervals and delays in a time series.
LSTM usually performs better than other temporal recurrent neural networks and hidden Markov models (HMMs), for example in unsegmented continuous handwriting recognition. In 2009, an artificial neural network model built with LSTM won first place in the ICDAR handwriting recognition competition. LSTM is also widely used for automatic speech recognition; in 2013 it set a record error rate of 17.7% on the TIMIT natural speech database. As a nonlinear model, LSTM can serve as a complex nonlinear unit for building larger deep neural networks.
LSTM is a specific type of RNN that can effectively capture long-term context. Compared with a traditional RNN, LSTM mitigates the vanishing and exploding gradient problems that arise over time during training. The memory cell of an LSTM block has three gates: an input gate, a forget gate, and an output gate. The input gate controls how much current information is added to the memory cell, the forget gate controls how much previous information is retained, and the output gate controls whether information is output. Specifically, LSTM can be described mathematically as follows.
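$i_t = \sigma(W_{ix} x_t + W_{ih} h_{t-1} + b_i)$
$f_t = \sigma(W_{fx} x_t + W_{fh} h_{t-1} + b_f)$
$o_t = \sigma(W_{ox} x_t + W_{oh} h_{t-1} + b_o)$
$z_t = g(W_{zx} x_t + W_{zh} h_{t-1} + b_z)$
$c_t = f_t \odot c_{t-1} + i_t \odot z_t$
$h_t = o_t \odot g(c_t)$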
where $i_t$, $f_t$ and $o_t$ are the outputs of the input gate, the forget gate, and the output gate, respectively. $x_t$ and $h_t$ denote the input feature and the hidden activation at time t, respectively. $z_t$ and $c_t$ denote the block input and the memory cell, respectively. σ denotes the sigmoid function, i.e. $\sigma(x) = 1/(1 + e^{-x})$, and g denotes the hyperbolic tangent function, i.e. $g(x) = (e^{x} - e^{-x})/(e^{x} + e^{-x})$. $b_i$, $b_f$, $b_o$ and $b_z$ are the biases of the input gate, the forget gate, the output gate, and the block input, respectively. The symbol ⊙ denotes element-wise multiplication. The input gate and the forget gate are computed from the previous activation and the current input, and perform context-dependent updates of the memory cell.
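A single time step of the gate equations above can be written out in NumPy as a minimal sketch. The parameter dictionary layout, the helper `init_params`, and the small random initialization are illustrative assumptions; only the gate arithmetic follows the equations.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step: gates, block input, memory cell, hidden activation."""
    i_t = sigmoid(p["W_ix"] @ x_t + p["W_ih"] @ h_prev + p["b_i"])   # input gate
    f_t = sigmoid(p["W_fx"] @ x_t + p["W_fh"] @ h_prev + p["b_f"])   # forget gate
    o_t = sigmoid(p["W_ox"] @ x_t + p["W_oh"] @ h_prev + p["b_o"])   # output gate
    z_t = np.tanh(p["W_zx"] @ x_t + p["W_zh"] @ h_prev + p["b_z"])   # block input
    c_t = f_t * c_prev + i_t * z_t        # element-wise update of the memory cell
    h_t = o_t * np.tanh(c_t)              # hidden activation
    return h_t, c_t

def init_params(n_in, n_hidden, seed=0):
    """Hypothetical helper: small random weights and zero biases for the four blocks."""
    rng = np.random.default_rng(seed)
    p = {}
    for gate in "ifoz":
        p[f"W_{gate}x"] = 0.1 * rng.standard_normal((n_hidden, n_in))
        p[f"W_{gate}h"] = 0.1 * rng.standard_normal((n_hidden, n_hidden))
        p[f"b_{gate}"] = np.zeros(n_hidden)
    return p

params = init_params(n_in=4, n_hidden=3)
h, c = lstm_step(np.ones(4), np.zeros(3), np.zeros(3), params)
print(h.shape, c.shape)  # (3,) (3,)
```

Running `lstm_step` over consecutive frames while carrying `h` and `c` forward is what lets the network use context from earlier frames.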
Fig. 4 is a schematic flow diagram of echo cancellation according to an exemplary embodiment. As shown in Fig. 4, the input is the received input signal and the output is the echo-cancelled near-end signal. In the figure, "1" marks steps involved only in training, "2" marks steps of the prediction (inference) stage, and "3" marks steps shared by training and prediction. As a supervised learning method, the present invention uses the ideal ratio mask (IRM) as the training target. The IRM is obtained by comparing the STFT of the microphone signal with the STFT of its corresponding near-end signal. In the training stage, the RNN with LSTM estimates the IRM from each input signal (including the microphone signal and the far-end signal), and the MSE (mean square error) between the estimated ratio mask and the IRM is then computed. The MSE over the entire training set is minimized through repeated epochs, with each training sample used exactly once per epoch. After training is complete, during inference or operation, the trained LSTM is used directly to suppress echo and background noise. Specifically, the trained LSTM processes the input signal and computes the ratio mask, the computed ratio mask is then applied to the input signal, and the result is finally resynthesized to obtain the echo-cancelled near-end signal.
The output of the top layer is passed through a sigmoidal function (see Fig. 4) to obtain the predicted ratio mask, which is then compared with the IRM. The comparison yields an MSE error that is used to adjust the LSTM weights.
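The comparison between the predicted mask and the IRM can be sketched as follows. The function names `predict_mask` and `mse_loss` are illustrative; the weight update itself (backpropagation through the LSTM) is omitted because its details are not given here.

```python
import numpy as np

def predict_mask(top_layer_output):
    """Squash the top-layer output into (0, 1) with a sigmoid to get a ratio mask."""
    return 1.0 / (1.0 + np.exp(-top_layer_output))

def mse_loss(predicted_mask, ideal_ratio_mask):
    """Mean square error between the predicted mask and the training target."""
    return float(np.mean((predicted_mask - ideal_ratio_mask) ** 2))

# A zero top-layer output gives a mask of 0.5 everywhere.
predicted = predict_mask(np.zeros((2, 161)))
loss = mse_loss(predicted, np.zeros((2, 161)))
print(loss)  # 0.25
```

Because the sigmoid bounds the prediction between 0 and 1, it matches the range of a ratio mask that scales each time-frequency unit down rather than up.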
Optionally, Fig. 5 is a flowchart of one specific implementation of step S123 in the method for constructing the recurrent neural network model with long short-term memory according to the embodiment corresponding to Fig. 3. As shown in Fig. 5, step S123 may include step S1231 and step S1232.
Step S1231: extract the acoustic features of the microphone signal and of the far-end signal, respectively.
Step S1232: from the acoustic features of the microphone signal and the far-end signal, estimate the ideal ratio mask for echo cancellation with the recurrent neural network with long short-term memory, and construct the recurrent neural network model with long short-term memory.
Optionally, Fig. 6 is a flowchart of another specific implementation of step S123 in the method for constructing the recurrent neural network model with long short-term memory according to the embodiment corresponding to Fig. 3. As shown in Fig. 6, step S123 may include step S1233, step S1234, and step S1235.
Step S1233: perform linear echo cancellation on the microphone signal with a traditional AEC algorithm.
The microphone signal is first processed by the traditional linear AEC echo cancellation algorithm, and the AEC output is used as an input signal of the LSTM to construct the recurrent neural network model with long short-term memory.
Step S1234: extract acoustic features from the far-end signal and from the linear AEC output, respectively.
Step S1235: from the acoustic features of the far-end signal and the linear AEC output, estimate the ideal ratio mask for echo cancellation with the recurrent neural network with long short-term memory, and construct the recurrent neural network model with long short-term memory.
Optionally, Fig. 7 is a flowchart of a further specific implementation of step S123 in the method for constructing the recurrent neural network model with long short-term memory according to the embodiment corresponding to Fig. 3. As shown in Fig. 7, in addition to step S1233, step S1234, and step S1235, step S123 may also include step S1236 and step S1237.
Step S1236: extract acoustic features from the far-end signal, the microphone signal, and the linear AEC output, respectively.
Step S1237: from the acoustic features of the far-end signal, the microphone signal, and the linear AEC output, estimate the ideal ratio mask for echo cancellation with the recurrent neural network with long short-term memory, and construct the recurrent neural network model with long short-term memory.
The recurrent neural network model with long short-term memory constructed through steps S1231 and S1232, in which the microphone signal and the far-end signal serve as input signals and the recurrent neural network with long short-term memory estimates the ideal ratio mask for echo cancellation, is referred to as LSTM1.
Through steps S1233, S1234, and S1235, the microphone signal is first processed by the traditional AEC algorithm to obtain the AEC output. The linear AEC output and the far-end signal then serve as input signals, and the recurrent neural network with long short-term memory estimates the ideal ratio mask for echo cancellation. The recurrent neural network model with long short-term memory constructed in this way is referred to as LSTM2.
Through steps S1233, S1236, and S1237, the far-end signal, the microphone signal, and the linear AEC output serve as input signals, and the recurrent neural network with long short-term memory estimates the ideal ratio mask for echo cancellation. The recurrent neural network model with long short-term memory constructed in this way is referred to as LSTM3.
Compared with LSTM1, LSTM3 further improves echo cancellation of the received input signal by using the output of the traditional AEC algorithm as an additional feature.
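The difference between the three models lies only in which features are fed to the network. The sketch below illustrates this, assuming that the per-frame features of each signal are joined along the feature axis; the function name `build_input` and the concatenation order are assumptions and are not specified in this disclosure.

```python
import numpy as np

def build_input(variant, mic_feat, far_feat, aec_feat=None):
    """Assemble the network input for LSTM1, LSTM2 or LSTM3 from per-signal features."""
    if variant == "LSTM1":
        parts = [mic_feat, far_feat]               # microphone + far-end
    elif variant == "LSTM2":
        parts = [aec_feat, far_feat]               # linear AEC output + far-end
    elif variant == "LSTM3":
        parts = [mic_feat, far_feat, aec_feat]     # all three signals
    else:
        raise ValueError(f"unknown variant: {variant}")
    if any(p is None for p in parts):
        raise ValueError(f"{variant} requires the linear AEC output features")
    return np.concatenate(parts, axis=1)

# Ten frames of 161-bin features stand in for each signal.
mic, far, aec = (np.zeros((10, 161)) for _ in range(3))
print(build_input("LSTM1", mic, far).shape)        # (10, 322)
print(build_input("LSTM3", mic, far, aec).shape)   # (10, 483)
```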
Table 1 shows the results for three performance metrics, STOI (short-time objective intelligibility), PESQ (perceptual evaluation of speech quality), and ERLE (echo return loss enhancement), when echo cancellation is performed with the three models LSTM1, LSTM2, and LSTM3. The three models used here each have two hidden layers with 512 units per layer. "None" is the result for the unprocessed signal; "Ideal" is the result for the ideal ratio mask, which can be regarded as the upper bound on the best achievable result.
Table 1: STOI, PESQ and ERLE results of the AEC systems tested
As shown in Table 1, compared with the traditional AEC algorithm, all three models LSTM1, LSTM2, and LSTM3 achieve better echo cancellation. Combining the traditional AEC algorithm with deep learning can further improve system performance. LSTM3 improves STOI more markedly than LSTM2 does.
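ERLE is commonly defined as the ratio, in decibels, of the microphone signal power to the power of the signal that remains after echo cancellation, usually measured while only the far end is talking. The sketch below follows that common definition; the function name `erle_db` and the `eps` guard against division by zero are illustrative and are not taken from this disclosure.

```python
import numpy as np

def erle_db(mic_signal, output_signal, eps=1e-12):
    """Echo return loss enhancement in dB: microphone power over residual power."""
    mic_power = np.mean(mic_signal ** 2)
    residual_power = np.mean(output_signal ** 2)
    return 10.0 * np.log10((mic_power + eps) / (residual_power + eps))

# Reducing the amplitude by a factor of 10 lowers the power by 20 dB.
mic = np.ones(1000)
residual = 0.1 * np.ones(1000)
print(round(erle_db(mic, residual), 3))  # 20.0
```

A higher ERLE means more of the echo has been removed; STOI and PESQ, by contrast, require dedicated reference implementations and are not sketched here.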
To further illustrate the results, Fig. 8 shows, according to an exemplary embodiment, spectrograms of a microphone signal recorded with a smartphone and of the near-end signal. Fig. 8(a) shows the spectrogram of the microphone signal; Fig. 8(b) shows the spectrogram of the corresponding near-end signal; Fig. 8(c) and Fig. 8(d) compare the spectra after echo cancellation with the conventional linear AEC algorithm and with the LSTM3 model, where Fig. 8(c) shows the spectrogram of the linear AEC output and Fig. 8(d) shows the spectrogram of the near-end signal obtained after echo cancellation by LSTM3. As can be seen, the output after echo cancellation by LSTM3 is very similar to the clean near-end signal. This shows that the proposed method preserves the near-end signal well while suppressing echo with nonlinear distortion as well as background noise.
With the method described above, echo cancellation performance can be effectively improved when echo cancellation is performed on the input signal with the constructed recurrent neural network model with long short-term memory.
The following are apparatus embodiments of the present disclosure, which can be used to carry out the above embodiments of the deep-learning-based echo cancellation method. For details not disclosed in the apparatus embodiments, refer to the embodiments of the deep-learning-based echo cancellation method of the present disclosure.
Fig. 9 is a block diagram of a deep-learning-based echo cancellation apparatus according to an exemplary embodiment. The apparatus includes, but is not limited to: an acoustic feature extraction module 110, a ratio mask computing module 120, a masking module 130, and a speech synthesis module 140.
The acoustic feature extraction module 110 is configured to extract acoustic features from a received input signal, the input signal including a microphone signal and a far-end signal;
the ratio mask computing module 120 is configured to perform iterative computation on the acoustic features in a pre-trained recurrent neural network model with long short-term memory to calculate the ratio mask of the acoustic features;
the masking module 130 is configured to mask the acoustic features using the ratio mask;
the speech synthesis module 140 is configured to synthesize the masked acoustic features with the phase of the microphone signal to obtain the echo-cancelled near-end signal.
For the implementation of the functions and roles of the modules in the above apparatus, see the implementation of the corresponding steps in the above deep-learning-based echo cancellation method; details are not repeated here.
Optionally, as shown in Fig. 10, the acoustic feature extraction module 110 shown in Fig. 9 includes, but is not limited to: a time frame division unit 111, a spectral magnitude vector extraction unit 112, and an acoustic feature forming unit 113.
The time frame division unit 111 is configured to divide the received microphone signal into time frames according to a preset time period;
the spectral magnitude vector extraction unit 112 is configured to extract a spectral magnitude vector from each time frame;
the acoustic feature forming unit 113 is configured to normalize the spectral magnitude vector to form the acoustic features.
Optionally, the time frame division unit 111 shown in Fig. 10 includes, but is not limited to: a time frame division subunit.
The time frame division subunit is configured to divide the received microphone signal into time frames according to the preset time period such that every two adjacent time frames overlap by half of the preset time period.
Optionally, the acoustic feature forming unit 113 shown in Fig. 10 includes, but is not limited to: a multi-frame normalization subunit.
The multi-frame normalization subunit is configured to combine the spectral magnitude vectors of the current time frame and of past time frames and normalize them to form the acoustic features.
Optionally, as shown in Figure 11, the ratio mask computing module 120 of Figure 9 further includes, but is not limited to: a speech determination submodule 121, a speech training set establishing submodule 122, and a model construction submodule 123.
The speech determination submodule 121 is configured to determine the speakers' speech that serves as the near-end and far-end (reference) signals during training;
The speech training set establishing submodule 122 is configured to collect the far-end signal and the near-end signal obtained when the speakers' speech serves as the far end and the near end, and to establish a speech training set from them, wherein the far-end signal is the echo signal, and the near-end signal and the echo signal form the microphone signal;
The model construction submodule 123 is configured to train the recurrent neural network with long short-term memory (LSTM) on the speech training set, thereby constructing the recurrent neural network model with long short-term memory.
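The relationship between the signals in one training example can be sketched as follows. This example takes the statement above at face value and treats the far-end signal itself as the echo signal; the function name `make_training_example` and the dictionary layout are assumptions made for illustration only.

```python
def make_training_example(near_end, far_end):
    """Build one training example from equal-length near-end and far-end signals.

    Assumption: the far-end signal is used directly as the echo signal,
    and the microphone signal is the sample-wise sum of near end and echo.
    """
    if len(near_end) != len(far_end):
        raise ValueError("near_end and far_end must have the same length")
    echo = list(far_end)
    microphone = [s + d for s, d in zip(near_end, echo)]
    return {
        "far_end": echo,               # reference signal fed to the model
        "near_end": list(near_end),    # training target
        "microphone": microphone,      # mixture observed at the microphone
    }
```

The near-end signal is kept in the example because it is the signal the model is expected to recover from the microphone mixture.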
Optionally, as shown in Figure 12, the model construction submodule 123 of Figure 11 further includes, but is not limited to: a first acoustic feature unit 1231 and a first model construction unit 1232.
The first acoustic feature unit 1231 is configured to extract the acoustic features of the microphone signal and of the far-end signal respectively;
The first model construction unit 1232 is configured to estimate, from the acoustic features of the microphone signal and the far-end signal, the ideal ratio mask for echo cancellation by means of the recurrent neural network with long short-term memory, thereby constructing the recurrent neural network model with long short-term memory.
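The ideal ratio mask that the network learns to estimate can be illustrated with the sketch below. This section does not give the mask formula, so the definition used here, the square root of the near-end energy divided by the sum of near-end and echo energy in each frequency bin, is one common form and should be read as an assumption. The names `ideal_ratio_mask` and `apply_mask` are likewise hypothetical.

```python
import math


def ideal_ratio_mask(near_mag, echo_mag, eps=1e-12):
    """Per-bin mask in [0, 1] from near-end and echo magnitudes.

    Assumption: mask = sqrt(S^2 / (S^2 + D^2)), a common IRM definition.
    """
    return [
        math.sqrt((s * s) / (s * s + d * d + eps))
        for s, d in zip(near_mag, echo_mag)
    ]


def apply_mask(mask, mic_mag):
    """Attenuate each microphone magnitude bin by its mask value."""
    return [m * y for m, y in zip(mask, mic_mag)]
```

A mask value near 1 keeps a bin dominated by near-end speech, while a value near 0 suppresses a bin dominated by echo. During training the mask is computed from the known near-end and echo signals; at inference the network predicts it from the acoustic features alone.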
Optionally, as shown in Figure 13, the model construction submodule 123 of Figure 11 may also include, but is not limited to: a linear AEC processing unit 1233, a second acoustic feature unit 1234, and a second model construction unit 1235.
The linear AEC processing unit 1233 is configured to process the microphone signal with a traditional AEC algorithm to obtain a linear AEC output;
The second acoustic feature unit 1234 is configured to extract acoustic features from the far-end signal and from the linear AEC output respectively;
The second model construction unit 1235 is configured to estimate, from the acoustic features of the far-end signal and the linear AEC output, the ideal ratio mask for echo cancellation by means of the recurrent neural network with long short-term memory, thereby constructing the recurrent neural network model with long short-term memory.
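The source does not name the traditional AEC algorithm used by the linear AEC processing unit 1233. As one plausible stand-in, the following sketch uses a normalized least-mean-squares (NLMS) adaptive filter, a widely used linear echo canceller; the function name, the tap count, and the step size are assumptions chosen for illustration.

```python
def nlms_echo_canceller(far_end, microphone, num_taps=4, step=0.5, eps=1e-8):
    """Return the residual (linear AEC output) of an NLMS adaptive filter.

    Assumption: NLMS stands in for the unspecified traditional AEC algorithm.
    """
    weights = [0.0] * num_taps
    history = [0.0] * num_taps  # most recent far-end sample first
    output = []
    for x, y in zip(far_end, microphone):
        history = [x] + history[:-1]
        # Estimate the echo as a filtered copy of the recent far-end samples.
        echo_estimate = sum(w * h for w, h in zip(weights, history))
        error = y - echo_estimate
        # Normalize the update by the energy of the current input window.
        energy = sum(h * h for h in history) + eps
        weights = [w + step * error * h / energy
                   for w, h in zip(weights, history)]
        output.append(error)
    return output
```

The residual returned here plays the role of the linear AEC output from which the second acoustic feature unit 1234 extracts features.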
Optionally, as shown in Figure 14, the model construction submodule 123 of Figure 11 may also include, but is not limited to: a third acoustic feature unit 1236 and a third model construction unit 1237.
The third acoustic feature unit 1236 is configured to extract acoustic features from the far-end signal, the microphone signal, and the linear AEC output respectively;
The third model construction unit 1237 is configured to estimate, from the acoustic features of the far-end signal, the microphone signal, and the linear AEC output, the ideal ratio mask for echo cancellation by means of the recurrent neural network with long short-term memory, thereby constructing the recurrent neural network model with long short-term memory.
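One way to combine the three feature streams into a single network input is frame-wise concatenation, sketched below. The source does not say how the features are combined, so concatenation and the helper name `concat_features` are assumptions of this example.

```python
def concat_features(*streams):
    """Concatenate per-frame feature vectors from several streams.

    Each stream is a list of frames, and each frame is a list of numbers.
    Assumption: all streams are time-aligned and have the same frame count.
    """
    if len({len(stream) for stream in streams}) > 1:
        raise ValueError("all streams must have the same number of frames")
    combined = []
    for frames in zip(*streams):
        # Join the vectors of this time frame end to end.
        combined.append([value for frame in frames for value in frame])
    return combined
```

The same helper covers the two-stream variants of Figures 12 and 13 by passing two streams instead of three.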
Optionally, the present invention also provides an electronic device that performs all or part of the steps of the deep-learning-based echo cancellation method shown in any of the above exemplary embodiments. The electronic device includes:
A processor; and
A memory communicatively connected to the processor; wherein,
The memory stores readable instructions which, when executed by the processor, implement the method described in any of the above exemplary embodiments.
The specific manner in which the processor of the terminal in this embodiment performs operations has been described in detail in the embodiments of the deep-learning-based echo cancellation method and will not be elaborated here.
In an exemplary embodiment, a storage medium is also provided. The storage medium is a computer-readable storage medium, which may be, for example, a transitory or non-transitory computer-readable storage medium that includes instructions.
It should be understood that the present invention is not limited to the precise structures described above and shown in the accompanying drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present invention is limited only by the appended claims.