CN111161752B - Echo cancellation method and device - Google Patents

Echo cancellation method and device Download PDF

Info

Publication number
CN111161752B
Authority
CN
China
Prior art keywords
signal
sample
frequency domain
echo
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911420690.5A
Other languages
Chinese (zh)
Other versions
CN111161752A (en)
Inventor
陈国明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Goertek Inc
Original Assignee
Goertek Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Goertek Inc filed Critical Goertek Inc
Priority to CN201911420690.5A priority Critical patent/CN111161752B/en
Publication of CN111161752A publication Critical patent/CN111161752A/en
Application granted granted Critical
Publication of CN111161752B publication Critical patent/CN111161752B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 - Noise filtering
    • G10L21/0216 - Noise filtering characterised by the method used for estimating noise
    • G10L21/0232 - Processing in the frequency domain
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 - Detection of presence or absence of voice signals
    • G10L2021/02082 - Noise filtering the noise being echo, reverberation of the speech

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Cable Transmission Systems, Equalization Of Radio And Reduction Of Echo (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The invention discloses an echo cancellation method, a voice activation method, an echo cancellation device, an audio device, and a computer-readable storage medium. The method comprises the following steps: acquiring a first audio signal and a second audio signal, wherein the first audio signal is the signal input to a loudspeaker and the second audio signal is the signal picked up by a microphone; estimating the echo signal caused by the first audio signal through a linear adaptive filtering algorithm to obtain an echo estimation signal; extracting characteristic parameters of the echo estimation signal as a first characteristic; extracting characteristic parameters of the error signal as a second characteristic; inputting the first characteristic and the second characteristic into a pre-trained neural network model, which outputs the gain of the user speech signal; and calculating the product of the error signal and the gain to obtain the user speech signal.

Description

Echo cancellation method and device
Technical Field
The present specification relates to acoustic technology, and more particularly, to an echo cancellation method, a voice activation method, an echo cancellation device, an audio device, and a computer-readable storage medium.
Background
The loudspeaker and the microphone of an audio device are usually quite close to each other. When the loudspeaker is playing an audio signal and the user is relatively far from the device, the echo signal caused by the sound played by the loudspeaker may be much larger than the user's voice command among the sound signals collected by the microphone, so the voice command cannot be acquired accurately. For a smart speaker, this means the wake-up word cannot be detected under such conditions and the device cannot be woken up by the user, which results in a poor user experience.
In addition, a loudspeaker introduces non-linear distortion when playing an audio signal, so a non-linear component is present in the echo. A traditional echo cancellation method can only fit the linear part of the echo signal and therefore cannot cancel the echo accurately. A new echo cancellation scheme is therefore needed.
Disclosure of Invention
Embodiments disclosed herein provide a new echo cancellation scheme.
According to a first aspect of the present disclosure, there is provided an echo cancellation method, including the steps of:
acquiring a first audio signal and a second audio signal, wherein the first audio signal is a signal input to a loudspeaker, and the second audio signal is a signal picked up by a microphone;
estimating an echo signal caused by the first audio signal by adopting a linear adaptive filtering algorithm according to the first audio signal and the error signal to obtain an echo estimation signal; the error signal is a difference signal between the second audio signal and the echo estimation signal;
extracting characteristic parameters of the echo estimation signal as a first characteristic;
extracting a characteristic parameter of the error signal as a second characteristic;
inputting the first characteristic and the second characteristic into a pre-trained neural network model, and outputting the gain of a user voice signal by the neural network;
the product of the error signal and the gain is calculated to obtain the user speech signal.
Optionally, the gain of the user speech signal is a subband gain; said calculating the product of the error signal and the gain to obtain the user speech signal, comprising:
performing frequency domain transformation on the error signal;
and carrying out frequency domain multiplication on the error signal subjected to frequency domain transformation and the sub-band gain, and carrying out inverse transformation from the frequency domain to the time domain on the multiplication result to obtain the user speech signal.
Optionally, the training process of the neural network model includes:
acquiring sample data, wherein the sample data comprises an echo estimation sample signal, a user voice sample signal and a microphone mixed sample signal, and the echo estimation sample signal is an echo estimation signal estimated by the linear adaptive filtering algorithm in a first scene; the user voice sample signal is a signal picked up by the microphone in a second scene; the microphone mixed sample signal is a signal picked up by the microphone in a third scene; the first scene is a scene that no user voice exists in a test environment and only the loudspeaker plays a first test audio signal, the second scene is a scene that the loudspeaker stops working and only the first test user voice exists in the test environment, and the third scene is a scene that the first test user voice exists in the test environment and the loudspeaker plays the first test audio signal;
carrying out frequency domain transformation on the user voice sample signal and the microphone mixed sample signal to obtain a user voice sample frequency domain signal and a microphone mixed sample frequency domain signal;
dividing the user voice sample frequency domain signal and the microphone mixed sample frequency domain signal according to a plurality of preset sub-bands;
calculating the energy of the user voice sample frequency domain signal on each sub-band;
calculating the energy of the microphone mixed sample frequency domain signal on each sub-band;
determining the sub-band gain of the sub-band according to the ratio of the energy of the user voice sample frequency domain signal on the sub-band to the energy of the microphone mixed sample frequency domain signal on the sub-band;
extracting characteristic parameters of echo estimation sample signals;
extracting characteristic parameters of a user voice sample signal;
inputting the characteristic parameters of the echo estimation sample signal and the characteristic parameters of the user voice sample signal into a neural network model, and training the neural network model by using the determined subband gain as supervision.
Optionally, the neural network model comprises first to fifth networks;
the inputting the first feature and the second feature into a pre-trained neural network model, and outputting a gain of a user voice signal by the neural network, includes:
inputting combined features spliced by the first features and the second features into a first network;
inputting the features extracted by the first network into a second network to obtain voice activation detection data;
inputting the combined features, the features extracted by the first network and the voice activation detection data into a third network to obtain noise spectrum estimation data;
inputting the combined features, voice activation detection data and noise spectrum estimation data into a fourth network to obtain enhanced voice data;
and inputting the enhanced voice data into a fifth network to obtain the gain of the voice signal of the user.
Optionally, the first network and the fifth network each adopt a fully-connected neural network; the fully-connected neural network adopts a Tanh activation function or a ReLU activation function;
the second network, the third network and the fourth network each adopt a long short-term memory network or a gated recurrent unit neural network.
Optionally, the linear adaptive filtering algorithm is any one of the following algorithms:
a least mean square filtering algorithm;
a recursive least mean square filtering algorithm;
a normalized least mean square filtering algorithm.
Optionally, the characteristic parameter of the echo estimation signal (y_est) comprises at least any one of the following characteristic parameters:
Mel frequency domain cepstrum parameters;
Bark frequency domain cepstrum parameters;
LPC cepstral parameters.
Optionally, the characteristic parameter of the error signal (e) includes at least any one of the following characteristic parameters:
cepstrum parameters;
a pitch parameter;
a perceptual linear prediction parameter;
an amplitude modulation spectral parameter.
According to a second aspect of the present disclosure, there is provided a voice activation method, comprising the echo cancellation method of any one of the foregoing aspects, and further comprising:
detecting whether the user voice signal is a preset wake-up word and, if so, waking up the audio device.
According to a third aspect of the present disclosure, there is provided an echo cancellation device comprising a processor and a memory, wherein the memory stores computer-readable instructions which, when executed by the processor, implement the echo cancellation method of any one of the foregoing aspects.
According to a fourth aspect of the present disclosure, there is provided an audio apparatus comprising a processor and a memory, the memory having stored therein computer-readable instructions which, when executed by the processor, implement the echo cancellation method of any one of the foregoing aspects.
According to a fifth aspect of the present disclosure, there is provided an audio device comprising a processor and a memory, the memory having stored therein computer-readable instructions which, when executed by the processor, implement the voice activation method of any one of the foregoing aspects.
According to a sixth aspect of the present disclosure, there is provided a computer-readable storage medium having stored thereon computer-readable instructions which, when executed by a processor, implement the echo cancellation method of any one of the foregoing aspects.
According to a seventh aspect of the present disclosure, there is provided a computer-readable storage medium having stored thereon computer-readable instructions which, when executed by a processor, implement the voice activation method of any one of the foregoing aspects.
The echo cancellation method disclosed in the embodiments of the invention uses an adaptive filtering algorithm to estimate the echo signal, then uses a pre-trained neural network model to estimate a signal gain from the echo estimation signal and the error signal, and applies that signal gain to obtain the user speech signal.
Features of embodiments of the present specification and advantages thereof will become apparent from the following detailed description of exemplary embodiments of the specification, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the specification and together with the description, serve to explain the principles of the embodiments of the specification.
Fig. 1 is a schematic diagram of an echo cancellation method provided in an embodiment of the present specification;
fig. 2 is a schematic diagram of a neural network model provided in another embodiment of the present disclosure.
Fig. 3 is a schematic diagram of a training process of a neural network model according to another embodiment of the present disclosure.
Detailed Description
Various exemplary embodiments of the present specification will now be described in detail with reference to the accompanying drawings.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the embodiments, their application, or uses.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be discussed further in subsequent figures.
< echo cancellation method >
Referring to fig. 1, an echo cancellation system according to an embodiment of the present invention is illustrated:
the audio device has a speaker 100 and a microphone 200.
The audio signal to be played, in, is amplified by the smart power amplifier to produce the first audio signal x, which is input to the loudspeaker 100 for playback. The signal picked up by the microphone 200 is the second audio signal d; it contains an echo signal y caused by the loudspeaker playing the first audio signal x and, when the user is speaking, also contains the user speech signal s.
The objective of the embodiment of the present invention is to eliminate the influence of the echo signal y and extract the user speech signal s as accurately as possible from the second audio signal d picked up by the microphone; that is, the user speech signal out finally output by the echo cancellation system should approach the user speech signal s as closely as possible. To this end, a linear adaptive filtering algorithm and a pre-trained neural network are used together to perform echo cancellation and extract the user speech signal.
The echo cancellation method according to an embodiment of the present invention is described below, and is implemented in the audio device having both a speaker and a microphone, which may be, for example, a smart speaker. The echo cancellation method provided by this embodiment includes steps S202 to S212.
S202, a first audio signal x and a second audio signal d are obtained, where the first audio signal x is a signal input to a speaker, and the second audio signal d is a signal picked up by a microphone.
The second audio signal d contains an echo signal y caused by the loudspeaker playing the first audio signal x, and also contains a user voice signal s in the user speaking state.
S204, according to the first audio signal x and the error signal e, estimating an echo signal y caused by the first audio signal x by adopting a linear adaptive filtering algorithm to obtain an echo estimation signal y_est.
The first audio signal x and the error signal e are input into a linear adaptive filter, and an echo estimation signal y_est is output by the linear adaptive filter, wherein the linear adaptive filter adopts a linear adaptive filtering algorithm. The error signal e is a difference signal between the second audio signal d and the echo estimation signal y_est, and the echo estimation signal y_est output by the linear adaptive filter is subtracted from the second audio signal d to obtain the error signal e, i.e., e = d - y_est.
The filter weight coefficients are iteratively solved from the first audio signal x and the error signal e. In one specific example, the filter weight coefficients are updated according to the following formula:
w(n+1) = w(n) + μ_n · e(n) · x(n) / (δ + x^T(n) · x(n))
wherein w(n+1) is the weight coefficient after the iteration and w(n) is the weight coefficient before the iteration; x(n) is a time-domain representation of the first audio signal x, and x^T(n) is the conjugate transpose of x(n); e(n) is a time-domain representation of the error signal e; δ is a small regularization constant; and μ_n is a step-size adjustment parameter with 0 < μ_n < 2.
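As an illustration of this update rule, the following NumPy sketch performs one NLMS-style adaptation step. The function name, the way the filter length is handled, and the default values of mu and delta are assumptions made for this example only and are not taken from the patent.

```python
import numpy as np

def nlms_step(w, x_buf, d_n, mu=0.5, delta=1e-6):
    """One normalized-LMS adaptation step.

    w     : current filter weight coefficients w(n), shape (L,)
    x_buf : the most recent L samples of the first audio signal x, newest first
    d_n   : current microphone sample d(n)
    """
    y_est_n = np.dot(w, x_buf)              # echo estimate y_est(n)
    e_n = d_n - y_est_n                     # error signal e(n) = d(n) - y_est(n)
    norm = delta + np.dot(x_buf, x_buf)     # regularized input energy x^T(n) x(n)
    w_next = w + (mu * e_n / norm) * x_buf  # weight update with 0 < mu < 2
    return w_next, y_est_n, e_n
```

Iterating this step over the samples of x and d yields the echo estimation signal y_est and the error signal e used in the following steps.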
Since the linear adaptive filtering algorithm can only estimate the linear part of the echo signal y, the error signal e contains the non-linear part of the echo signal y and also contains the user speech signal s in the user speaking state.
In a specific example, the linear adaptive filter may use a linear adaptive filtering algorithm such as:
a Least Mean Square (LMS) filtering algorithm,
a Recursive Least Mean Square (RLMS) filtering algorithm,
a Normalized Least Mean Square (NLMS) filtering algorithm.
In a specific example, the adaptive filtering algorithm may be performed in the time domain or in the frequency domain.
S206, extracting the characteristic parameter of the echo estimation signal y_est as a first characteristic.
In a specific example, the characteristic parameter of the echo estimation signal y_est includes at least any one of the following characteristic parameters:
Mel Frequency Cepstral Coefficients (MFCC);
Bark Frequency Cepstral Coefficients (BFCC);
Linear Prediction Cepstral Coefficients (LPCC).
And S208, extracting the characteristic parameter of the error signal e as a second characteristic.
In a specific example, the characteristic parameter of the error signal e includes at least any one of the following characteristic parameters:
cepstrum parameters;
a pitch parameter;
Perceptual Linear Prediction (PLP) parameters;
Amplitude Modulation Spectrum (AMS) parameters.
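For steps S206 and S208, a minimal feature-extraction sketch is shown below. It uses librosa to compute MFCCs for the echo estimation signal and cepstral coefficients plus a pitch track for the error signal; the exact feature set (BFCC, LPCC, PLP, AMS), the sample rate and the dimensions are assumed implementation choices, so this only illustrates one way the first and second characteristics could be produced and concatenated.

```python
import numpy as np
import librosa

def extract_features(y_est, e, sr=16000, n_mfcc=22):
    """Build a combined feature matrix from the echo estimate y_est and error signal e."""
    # First characteristic: cepstral description of the echo estimation signal
    first = librosa.feature.mfcc(y=y_est, sr=sr, n_mfcc=n_mfcc)   # (n_mfcc, frames)

    # Second characteristic: cepstral coefficients plus a per-frame pitch estimate of e
    ceps = librosa.feature.mfcc(y=e, sr=sr, n_mfcc=n_mfcc)
    f0 = librosa.yin(e, fmin=60, fmax=400, sr=sr)                 # pitch track, (frames,)

    # Align frame counts and concatenate frame-wise into the combined feature
    n = min(first.shape[1], ceps.shape[1], f0.shape[0])
    second = np.vstack([ceps[:, :n], f0[np.newaxis, :n]])
    return np.vstack([first[:, :n], second]).T                    # (frames, feature_dim)
```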
S210, inputting the first characteristic and the second characteristic into a pre-trained neural network model, and outputting the gain g of the user voice signal by the pre-trained neural network model.
In a specific example, the pre-trained neural network model may employ the following neural network:
a Deep Neural Network (DNN);
a Long Short-Term Memory network (LSTM);
a Gated Recurrent Unit network (GRU);
a Convolutional Neural Network (CNN).
In a specific example, the first feature and the second feature are pre-spliced into a combined feature, the combined feature is input into the pre-trained neural network model, and the pre-trained neural network model outputs the gain g of the user speech signal.
Referring to fig. 2, in a specific example, the pre-trained neural network model includes 5 sub-neural networks, namely a first network to a fifth network, arranged as follows: the output end of the first network is connected to the input end of the second network and the input end of the third network; the output end of the second network is connected to the input end of the third network and the input end of the fourth network; the output end of the third network is connected to the input end of the fourth network; and the output end of the fourth network is connected to the input end of the fifth network.
The first network and the fifth network each adopt a fully-connected neural network, where the fully-connected neural network uses a Tanh activation function or a ReLU activation function. The second network, the third network and the fourth network each adopt a long short-term memory network or a gated recurrent unit neural network.
Inputting a combined feature formed by splicing the first feature and the second feature into a pre-trained neural network model, and outputting a gain g of a user voice signal by the pre-trained neural network model, wherein the gain g comprises:
the combined features are input into the first network.
And inputting the features extracted by the first network into a second network to obtain voice activation detection data. The voice activity detection data is used to characterize whether the current user is speaking.
And inputting the combined features, the features extracted by the first network and the voice activation detection data into a third network to obtain the frequency spectrum estimation data of the noise.
And inputting the combined features, the voice activation detection data and the frequency spectrum estimation data of the noise into a fourth network to obtain enhanced voice data.
The enhanced voice data is input into the fifth network to obtain the gain g of the user voice signal.
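A hedged Keras sketch of this five-network layout is given below. The layer widths, the choice of GRU layers for the second to fourth networks, and the eighteen-sub-band output are assumptions made only to illustrate the connection pattern described above.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_gain_model(feat_dim, n_subbands=18):
    # Combined feature sequence (first + second characteristics, frame by frame)
    combined = layers.Input(shape=(None, feat_dim))

    # First network: fully connected layer with Tanh activation
    net1 = layers.Dense(24, activation="tanh")(combined)

    # Second network: recurrent layer producing voice activation detection data
    vad = layers.GRU(24, return_sequences=True)(net1)

    # Third network: noise spectrum estimation from combined features, net1 output and VAD data
    noise = layers.GRU(48, return_sequences=True)(
        layers.Concatenate()([combined, net1, vad]))

    # Fourth network: enhanced voice data from combined features, VAD data and noise estimate
    enhanced = layers.GRU(96, return_sequences=True)(
        layers.Concatenate()([combined, vad, noise]))

    # Fifth network: fully connected layer mapping to per-sub-band gains in [0, 1]
    gains = layers.Dense(n_subbands, activation="sigmoid")(enhanced)

    return Model(inputs=combined, outputs=gains)
```

The sigmoid output keeps each sub-band gain between 0 and 1, which matches the behaviour described below that the gain approaches 0 when no user speech is present.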
Compared with a conventional model based on a single deep neural network, the neural network model with the architecture adopted in this embodiment requires far fewer computation nodes, which reduces both the amount of computation and the storage space.
In one specific example, if the user is not speaking and the second audio signal d does not contain the user speech signal s, the gain g output by the neural network approaches 0.
S212, the product of the error signal e and the gain g is calculated to obtain the user speech signal out.
In one specific example, the gain g of the user speech signal is a set of sub-band gains. In step S212, the error signal e is transformed to the frequency domain; the frequency-domain error signal is multiplied by the sub-band gains, and the multiplication result is inverse-transformed from the frequency domain to the time domain to obtain the user speech signal.
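A sketch of this frequency-domain multiplication is shown below, assuming an FFT front end with overlap-add reconstruction and a fixed mapping from FFT bins to sub-bands; the frame size, hop size and band edges are illustrative values, not taken from the patent.

```python
import numpy as np

def apply_subband_gains(e, gains, band_edges, n_fft=512, hop=256):
    """e: time-domain error signal; gains: (frames, n_subbands); band_edges: FFT-bin boundaries."""
    window = np.hanning(n_fft)
    out = np.zeros(gains.shape[0] * hop + n_fft)
    for t in range(gains.shape[0]):
        start = t * hop
        frame = e[start:start + n_fft]
        if len(frame) < n_fft:
            frame = np.pad(frame, (0, n_fft - len(frame)))
        spec = np.fft.rfft(frame * window)                            # frequency-domain transform of e
        for b in range(len(band_edges) - 1):
            spec[band_edges[b]:band_edges[b + 1]] *= gains[t, b]      # per-sub-band multiplication
        out[start:start + n_fft] += np.fft.irfft(spec, n=n_fft)       # inverse transform + overlap-add
    return out[:len(e)]
```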
The following describes a training process of a neural network according to an embodiment of the present invention, which includes steps S302 to S318.
S302, obtaining a sample data set. The sample data set includes a plurality of sets of sample data. Each set of sample data comprises an echo estimation sample signal y_est_sample, a user speech sample signal s_sample, and a microphone mixed sample signal d_sample, which correspond to one another.
The sample data is obtained by actual measurement, and each group of sample data corresponds to one complete test procedure. A test room is prepared in advance; the audio device is placed in the test room, and a simulated mouth is also arranged in the test room. A complete test procedure includes steps S702-S708:
s702, preparing a first test audio signal and a first test user voice signal in advance.
S704, in a first scene, that is, in a scene where the simulated mouth is closed and only the first test audio signal is input to the speaker of the audio device for playback, obtaining the echo estimation sample signal y_est_sample from the first test audio signal and the signal picked up by the microphone of the audio device, using the linear adaptive filtering algorithm described above.
S706, in a second scenario, that is, in a scenario where the speaker stops working and only the first test user speech signal is played through the simulated mouth, a signal picked up by the microphone of the audio device is used as the user speech sample signal s_sample, which indicates that the user speech sample signal s_sample only contains the user speech.
S708, in a third scenario, that is, in a scenario where the first test audio signal is input to the speaker of the audio device and played back, and the first test user speech signal is played back through the simulated mouth, a signal picked up by the microphone of the audio device is used as the microphone mixed sample signal d_sample. It can be seen that the microphone mixed sample signal d_sample contains the user speech and also contains the echo signal.
S304, carrying out frequency domain transformation on the user voice sample signal s_sample and the microphone mixed sample signal d_sample to obtain the user voice sample frequency domain signal and the microphone mixed sample frequency domain signal.
And S306, dividing the user voice sample frequency domain signal and the microphone mixed sample frequency domain signal according to a plurality of preset sub-bands. In one embodiment, the human audible band is pre-divided into eighteen sub-bands.
And S308, calculating the energy of the user voice sample frequency domain signal on each sub-band.
And S310, calculating the energy of the microphone mixed sample frequency domain signal on each sub-band.
And S312, determining the sub-band gain of the sub-band according to the ratio of the energy of the user voice sample frequency domain signal on the sub-band to the energy of the microphone mixed sample frequency domain signal on the sub-band.
In one specific example, the subband gain is one-half of the ratio corresponding to the subband.
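The sub-band gain targets used as supervision can be computed as in the following sketch from the magnitude spectra of the user speech sample and the microphone mixed sample. Interpreting the gain as the square root of the per-band energy ratio (an amplitude gain derived from an energy ratio) and clipping it to [0, 1] are assumptions made for this example.

```python
import numpy as np

def subband_gain_targets(s_spec, d_spec, band_edges, eps=1e-10):
    """s_spec, d_spec: magnitude spectra of s_sample and d_sample, shape (frames, bins)."""
    n_bands = len(band_edges) - 1
    gains = np.zeros((s_spec.shape[0], n_bands))
    for b in range(n_bands):
        lo, hi = band_edges[b], band_edges[b + 1]
        e_s = np.sum(s_spec[:, lo:hi] ** 2, axis=1)   # energy of the user speech sample in the band
        e_d = np.sum(d_spec[:, lo:hi] ** 2, axis=1)   # energy of the microphone mixed sample in the band
        gains[:, b] = np.sqrt(e_s / (e_d + eps))      # gain from the energy ratio (assumed sqrt mapping)
    return np.clip(gains, 0.0, 1.0)
```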
And S314, extracting the characteristic parameters of the echo estimation sample signal y_est_sample.
S316, extracting the characteristic parameters of the user voice sample signal s_sample.
S318, referring to fig. 3, the characteristic parameters of the echo estimation sample signal y_est_sample are used as the first feature, the characteristic parameters of the user speech sample signal s_sample are used as the second feature, and both are input into the neural network model; the subband gain determined in step S312 is used as supervision, and the neural network model is trained so that the subband gain output by the neural network model continuously approaches the subband gain used as supervision. When the error between the subband gain output by the neural network model and the subband gain used as supervision is smaller than a preset threshold value, the training is considered successful.
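Paired with the earlier model sketch (build_gain_model), the supervised training of S318 can be written as a short fitting routine; the optimizer, learning rate, batch size, loss and validation split below are illustrative assumptions rather than values from the patent.

```python
import tensorflow as tf

def train_gain_model(model, features, gain_targets, epochs=100, batch_size=32):
    """features: (N, frames, feature_dim) built from y_est_sample and s_sample features;
    gain_targets: (N, frames, n_subbands) sub-band gains used as supervision."""
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss="mse")
    history = model.fit(features, gain_targets,
                        batch_size=batch_size, epochs=epochs, validation_split=0.1)
    return model, history
```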
In a specific example, the characteristic parameters of the microphone mixed sample signal d_sample may also be extracted as a third feature; in step S318, the characteristic parameters of the echo estimation sample signal y_est_sample, the characteristic parameters of the user speech sample signal s_sample, and the characteristic parameters of the microphone mixed sample signal d_sample are input into the neural network model, and the neural network model is trained by using the determined subband gain as supervision. Correspondingly, before step S210, the characteristic parameter of the second audio signal d is extracted as the third feature; the characteristic parameter of the echo estimation signal y_est, the characteristic parameter of the error signal e, and the characteristic parameter of the second audio signal d are input into the pre-trained neural network model, and the pre-trained neural network model outputs the subband gain of the user speech signal. In this embodiment, the characteristic parameters of the signal picked up by the microphone are also input into the model, which benefits generalization of the model, so that the model can be applied in environments with different signal-to-noise ratios.
The echo cancellation method disclosed in the embodiments of the invention first uses an adaptive filtering algorithm to estimate the echo signal, then uses a pre-trained neural network model to estimate the signal gain from the echo estimation signal and the error signal, and uses that signal gain to obtain the user speech signal.
The applicant has carried out extensive experimental verification of the echo cancellation method disclosed in the embodiments of the invention, and the experimental results show that it can effectively eliminate the residual noise remaining after linear adaptive filtering and accurately extract the user speech signal.
< Voice activation method >
The embodiment of the invention provides a voice activation method, which comprises the echo cancellation method of any one of the embodiments, and further comprises the following steps:
detecting whether the user voice signal is a preset wake-up word and, if so, waking up the audio device.
The voice activation method disclosed in the embodiments of the invention can still accurately extract the user speech signal and recognize the wake-up word even under strong echo interference.
< echo canceller >
An embodiment of the present invention provides an echo cancellation device, which includes a processor and a memory, where the memory stores computer readable instructions, and the computer readable instructions, when executed by the processor, implement the echo cancellation method of any one of the foregoing embodiments.
< Audio apparatus >
An embodiment of the present invention provides an audio apparatus, which includes a processor and a memory, where the memory stores computer readable instructions, and the computer readable instructions, when executed by the processor, implement the echo cancellation method of any one of the foregoing embodiments.
An embodiment of the present invention provides an audio apparatus, including a processor and a memory, where the memory stores computer-readable instructions, and the computer-readable instructions, when executed by the processor, implement the voice activation method of any one of the foregoing embodiments.
< computer-readable storage Medium >
An embodiment of the present invention provides a computer-readable storage medium, on which computer-readable instructions are stored, and when executed by a processor, the computer-readable instructions implement the echo cancellation method of any one of the foregoing embodiments.
Embodiments of the present invention provide a computer-readable storage medium, on which computer-readable instructions are stored, and the computer-readable instructions, when executed by a processor, implement the voice activation method of any of the foregoing embodiments.
All the embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from other embodiments. In particular, as for the device and medium embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference may be made to some descriptions of the method embodiments for relevant points.
The foregoing description of specific embodiments has been presented for purposes of illustration and description. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
Embodiments of the present description may be systems, methods, and/or computer program products. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied thereon for causing a processor to implement aspects of embodiments of the specification.
The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device over a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
Computer program instructions for carrying out operations for embodiments of the present specification may be assembly instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including an object-oriented programming language such as Smalltalk, C++, or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, the electronic circuitry can execute computer-readable program instructions to implement various aspects of embodiments of the specification by personalizing, with state information of the computer-readable program instructions, a custom electronic circuit, such as a programmable logic circuit, a Field Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA).
Aspects of embodiments of the present specification are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the specification. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present description. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions. It is well known to those skilled in the art that implementation by hardware, by software, and by a combination of software and hardware are equivalent.
The foregoing description of the embodiments of the present specification has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (11)

1. An echo cancellation method, comprising the steps of:
acquiring a first audio signal (x) and a second audio signal (d), wherein the first audio signal (x) is a signal input to a loudspeaker, and the second audio signal (d) is a signal picked up by a microphone;
estimating an echo signal caused by the first audio signal (x) by adopting a linear adaptive filtering algorithm according to the first audio signal (x) and the error signal (e) to obtain an echo estimation signal (y_est); said error signal (e) is a difference signal between the second audio signal (d) and the echo estimation signal (y_est);
extracting a characteristic parameter of the echo estimation signal (y_est) as a first characteristic;
extracting a characteristic parameter of the error signal (e) as a second characteristic;
inputting the first characteristic and the second characteristic into a pre-trained neural network model, and outputting a gain (g) of a user voice signal by the neural network; the neural network model includes first to fifth networks;
the inputting the first feature and the second feature into a pre-trained neural network model, and outputting a gain (g) of a user voice signal by the neural network, comprises:
inputting combined features spliced by the first features and the second features into a first network; inputting the features extracted by the first network into a second network to obtain voice activation detection data; inputting the combined features, the features extracted by the first network and the voice activation detection data into a third network to obtain noise spectrum estimation data; inputting the combined features, voice activation detection data and noise spectrum estimation data into a fourth network to obtain enhanced voice data; inputting the enhanced voice data into a fifth network to obtain the gain (g) of the user voice signal;
the product of the error signal (e) and the gain (g) is calculated to obtain the user speech signal.
2. The method of claim 1, the gain (g) of the user speech signal being a subband gain; said calculating a product of the error signal (e) and the gain (g) to obtain a user speech signal, comprising:
frequency domain transforming the error signal (e);
and carrying out frequency domain multiplication on the error signal (e) subjected to frequency domain transformation and the sub-band gain, and carrying out inverse transformation from the frequency domain to the time domain on the multiplication result to obtain a user speech signal.
3. The method of claim 2, the training process of the neural network model comprising:
acquiring sample data, wherein the sample data comprises an echo estimation sample signal (y_est_sample), a user voice sample signal (s_sample), and a microphone mixed sample signal (d_sample), and the echo estimation sample signal (y_est_sample) is an echo estimation signal estimated by the linear adaptive filtering algorithm in a first scene; the user voice sample signal (s_sample) is a signal picked up by the microphone in a second scene; the microphone mixed sample signal is a signal picked up by the microphone under a third scene; the first scene is a scene that no user voice exists in a test environment and only the loudspeaker plays a first test audio signal, the second scene is a scene that the loudspeaker stops working and only the first test user voice exists in the test environment, and the third scene is a scene that the first test user voice exists in the test environment and the loudspeaker plays the first test audio signal;
carrying out frequency domain transformation on the user voice sample signal (s_sample) and the microphone mixed sample signal (d_sample) to obtain a user voice sample frequency domain signal and a microphone mixed sample frequency domain signal;
dividing the user voice sample frequency domain signal and the microphone mixed sample frequency domain signal according to a plurality of preset sub-bands;
calculating the energy of the user voice sample frequency domain signal on each sub-band;
calculating the energy of the microphone mixed sample frequency domain signal on each sub-band;
determining the sub-band gain of the sub-band according to the ratio of the energy of the user voice sample frequency domain signal on the sub-band to the energy of the microphone mixed sample frequency domain signal on the sub-band;
extracting characteristic parameters of an echo estimation sample signal (y_est_sample);
extracting characteristic parameters of a user voice sample signal (s_sample);
inputting the characteristic parameters of the echo estimation sample signal (y_est_sample) and the characteristic parameters of the user voice sample signal (s_sample) into a neural network model, and training the neural network model by using the determined subband gain as supervision.
4. The method of claim 1, the first network and the fifth network each employing a fully-connected neural network; the fully-connected neural network adopts a Tanh activation function or a ReLU activation function;
the second network, the third network and the fourth network each adopt a long short-term memory network or a gated recurrent unit neural network.
5. The method of claim 1, the linear adaptive filtering algorithm being any one of:
a least mean square filtering algorithm;
a recursive least mean square filtering algorithm;
a normalized least mean square filtering algorithm.
6. The method according to claim 1, wherein the characteristic parameters of the echo estimation signal (y_est) comprise at least any one of the following characteristic parameters:
Mel frequency domain cepstrum parameters;
Bark frequency domain cepstrum parameters;
LPC cepstral parameters.
7. The method according to claim 1, wherein the characteristic parameters of the error signal (e) comprise at least one of the following characteristic parameters:
cepstrum parameters;
a pitch parameter;
a perceptual linear prediction parameter;
an amplitude modulation spectral parameter.
8. A voice activation method comprising the echo cancellation method according to any one of claims 1 to 7; further comprising:
and detecting whether the user voice signal is a preset wake-up word, and if so, waking up the audio device.
9. An echo cancellation device comprising a processor and a memory, said memory having stored therein computer readable instructions which, when executed by said processor, implement the method of any of claims 1-7.
10. An audio device comprising a processor and a memory, the memory having stored therein computer-readable instructions that, when executed by the processor, implement the method of any of claims 1-8.
11. A computer readable storage medium having computer readable instructions stored thereon which, when executed by a processor, implement the method of any one of claims 1-8.
CN201911420690.5A 2019-12-31 2019-12-31 Echo cancellation method and device Active CN111161752B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911420690.5A CN111161752B (en) 2019-12-31 2019-12-31 Echo cancellation method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911420690.5A CN111161752B (en) 2019-12-31 2019-12-31 Echo cancellation method and device

Publications (2)

Publication Number Publication Date
CN111161752A CN111161752A (en) 2020-05-15
CN111161752B (en) 2022-10-14

Family

ID=70560487

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911420690.5A Active CN111161752B (en) 2019-12-31 2019-12-31 Echo cancellation method and device

Country Status (1)

Country Link
CN (1) CN111161752B (en)

Families Citing this family (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111726464B (en) * 2020-06-29 2021-04-20 珠海全志科技股份有限公司 Multichannel echo filtering method, filtering device and readable storage medium
CN111816177B (en) * 2020-07-03 2021-08-10 北京声智科技有限公司 Voice interruption control method and device for elevator and elevator
CN111883155B (en) * 2020-07-17 2023-10-27 海尔优家智能科技(北京)有限公司 Echo cancellation method, device and storage medium
CN111883154B (en) * 2020-07-17 2023-11-28 海尔优家智能科技(北京)有限公司 Echo cancellation method and device, computer-readable storage medium, and electronic device
CN111885275B (en) * 2020-07-23 2021-11-26 海尔优家智能科技(北京)有限公司 Echo cancellation method and device for voice signal, storage medium and electronic device
CN111833896B (en) * 2020-07-24 2023-08-01 北京声加科技有限公司 Voice enhancement method, system, device and storage medium for fusing feedback signals
CN114242106A (en) * 2020-09-09 2022-03-25 中车株洲电力机车研究所有限公司 Voice processing method and device
CN112750449B (en) * 2020-09-14 2024-02-20 腾讯科技(深圳)有限公司 Echo cancellation method, device, terminal, server and storage medium
CN112614502B (en) * 2020-12-10 2022-01-28 四川长虹电器股份有限公司 Echo cancellation method based on double LSTM neural network
CN112634923B (en) * 2020-12-14 2021-11-19 广州智讯通信系统有限公司 Audio echo cancellation method, device and storage medium based on command scheduling system
CN112712816B (en) * 2020-12-23 2023-06-20 北京达佳互联信息技术有限公司 Training method and device for voice processing model and voice processing method and device
CN112863535B (en) * 2021-01-05 2022-04-26 中国科学院声学研究所 Residual echo and noise elimination method and device
CN112634933B (en) * 2021-03-10 2021-06-22 北京世纪好未来教育科技有限公司 Echo cancellation method and device, electronic equipment and readable storage medium
CN113077812B (en) * 2021-03-19 2024-07-23 北京声智科技有限公司 Voice signal generation model training method, echo cancellation method, device and equipment
CN113707166B (en) * 2021-04-07 2024-06-07 腾讯科技(深圳)有限公司 Voice signal processing method, device, computer equipment and storage medium
CN113362819B (en) * 2021-05-14 2022-06-14 歌尔股份有限公司 Voice extraction method, device, equipment, system and storage medium
CN113257267B (en) * 2021-05-31 2021-10-15 北京达佳互联信息技术有限公司 Method for training interference signal elimination model and method and equipment for eliminating interference signal
CN113421579B (en) * 2021-06-30 2024-06-07 北京小米移动软件有限公司 Sound processing method, device, electronic equipment and storage medium
CN114171049B (en) * 2021-12-24 2024-09-17 上海领世通信技术发展有限公司 Echo cancellation method and device, electronic equipment and storage medium
CN114758669B (en) * 2022-06-13 2022-09-02 深圳比特微电子科技有限公司 Audio processing model training method and device, audio processing method and device and electronic equipment
CN115762552B (en) * 2023-01-10 2023-06-27 阿里巴巴达摩院(杭州)科技有限公司 Method for training echo cancellation model, echo cancellation method and corresponding device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108665895A (en) * 2018-05-03 2018-10-16 百度在线网络技术(北京)有限公司 Methods, devices and systems for handling information
EP3474280A1 (en) * 2017-10-19 2019-04-24 Nxp B.V. Signal processor for signal enhancement and associated methods
CN109841206A (en) * 2018-08-31 2019-06-04 大象声科(深圳)科技有限公司 A kind of echo cancel method based on deep learning

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10074380B2 (en) * 2016-08-03 2018-09-11 Apple Inc. System and method for performing speech enhancement using a deep neural network-based signal
US10771621B2 (en) * 2017-10-31 2020-09-08 Cisco Technology, Inc. Acoustic echo cancellation based sub band domain active speaker detection for audio and video conferencing applications

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3474280A1 (en) * 2017-10-19 2019-04-24 Nxp B.V. Signal processor for signal enhancement and associated methods
CN108665895A (en) * 2018-05-03 2018-10-16 百度在线网络技术(北京)有限公司 Methods, devices and systems for handling information
CN109841206A (en) * 2018-08-31 2019-06-04 大象声科(深圳)科技有限公司 A kind of echo cancel method based on deep learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Nonlinear Prediction of Speech by Echo State Networks";Ziyue Zhao 等;《2018 26th European Signal Processing Conference 》;20181203;全文 *
"会议电话中的实时回声消除算法研究与实现";陈林;《中国优秀硕士学位论文全文数据库(信息科技辑)》;20200615;全文 *

Also Published As

Publication number Publication date
CN111161752A (en) 2020-05-15

Similar Documents

Publication Publication Date Title
CN111161752B (en) Echo cancellation method and device
US10504539B2 (en) Voice activity detection systems and methods
KR20180115984A (en) Method and apparatus for integrating and removing acoustic echo and background noise based on deepening neural network
CN112700786B (en) Speech enhancement method, device, electronic equipment and storage medium
JP5634959B2 (en) Noise / dereverberation apparatus, method and program thereof
CN107871499B (en) Speech recognition method, system, computer device and computer-readable storage medium
CN106558315B (en) Heterogeneous microphone automatic gain calibration method and system
US12073818B2 (en) System and method for data augmentation of feature-based voice data
CN108899047A (en) The masking threshold estimation method, apparatus and storage medium of audio signal
CN111477238B (en) Echo cancellation method and device and electronic equipment
CN112037809A (en) Residual echo suppression method based on multi-feature flow structure deep neural network
CN111986679A (en) Speaker confirmation method, system and storage medium for responding to complex acoustic environment
KR102410850B1 (en) Method and apparatus for extracting reverberant environment embedding using dereverberation autoencoder
Kim et al. Efficient implementation of the room simulator for training deep neural network acoustic models
Eklund Data augmentation techniques for robust audio analysis
CN110998723A (en) Signal processing device using neural network, signal processing method using neural network, and signal processing program
CN114333874B (en) Method for processing audio signal
WO2021152566A1 (en) System and method for shielding speaker voice print in audio signals
CN110931040B (en) Filtering sound signals acquired by a speech recognition system
CN109741761B (en) Sound processing method and device
JP2019035862A (en) Input sound mask processing learning device, input data processing function learning device, input sound mask processing learning method, input data processing function learning method, and program
JP2015049406A (en) Acoustic signal analyzing device, method, and program
Han et al. Reverberation and noise robust feature compensation based on IMM
Chen et al. CITISEN: A Deep Learning-Based Speech Signal-Processing Mobile Application
CN113763978B (en) Voice signal processing method, device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant