CN117727317A - Echo cancellation method, device, audio equipment and storage medium

Info

Publication number: CN117727317A
Application number: CN202311778667.XA
Authority: CN
Inventors: 魏子凯; 卢县; 董璘
Original assignee: Bestechnic Shanghai Co Ltd
Current assignee: Bestechnic Shanghai Co Ltd
Priority date: 2023-12-21
Filing date: 2023-12-21
Publication date: 2024-03-19

Abstract

The application provides an echo cancellation method, an echo cancellation device, audio equipment and a storage medium. The echo cancellation method comprises the following steps: acquiring an initial echo cancellation result of microphone data after linear echo cancellation; inputting the far-end voice data and the initial echo cancellation result into a preset deep learning residual echo cancellation network to obtain a frequency domain signal mask; the deep learning residual echo cancellation network is used for determining a nonlinear echo part in the initial echo cancellation result by taking the far-end voice data as a reference signal, and the frequency domain signal mask represents information to be suppressed and/or information to be reserved in the initial echo cancellation result; and carrying out residual echo cancellation on the initial echo cancellation result based on the frequency domain signal mask to obtain microphone data after residual echo cancellation. The echo cancellation method can improve the signal quality after residual echo cancellation of microphone data in different scenes.

Description

Echo cancellation method, device, audio equipment and storage medium

Technical Field

The present invention relates to the field of speech processing, and in particular, to an echo cancellation method, an echo cancellation device, an audio device, and a storage medium.

Background

In the voice call process, the sound played by the loudspeaker may be collected by the microphone, so that the voice data collected by the microphone carries the sound played by the loudspeaker, namely, the phenomenon of acoustic echo exists, and the quality of the voice call is affected.

Acoustic echo comprises linear echo and nonlinear echo. Traditional acoustic echo cancellation methods can cancel only the linear echo part, so nonlinear echo remains in the output; that is, residual echo exists.

Currently, a residual echo suppressor is used to cancel the residual echo in the output of the conventional acoustic echo cancellation method. However, the residual echo suppressor may distort the signal during processing, which degrades the quality of the voice signal, and it does not cancel residual echo well across a wide range of scenarios.

Disclosure of Invention

In view of the foregoing, the present application is directed to an echo cancellation method, apparatus, audio device, and storage medium, so as to improve signal quality after residual echo cancellation of microphone data in different scenarios.

In a first aspect, an embodiment of the present application provides an echo cancellation method, including: acquiring an initial echo cancellation result of microphone data after linear echo cancellation; inputting far-end voice data and the initial echo cancellation result into a preset deep learning residual echo cancellation network to obtain a frequency domain signal mask outputted by the deep learning residual echo cancellation network; the deep learning residual echo cancellation network is used for determining a nonlinear echo part in the initial echo cancellation result by taking the far-end voice data as a reference signal, and the frequency domain signal mask represents information to be suppressed and/or information to be reserved in the initial echo cancellation result; and carrying out residual echo cancellation on the initial echo cancellation result based on the frequency domain signal mask to obtain microphone data after residual echo cancellation.

In this embodiment, the deep learning residual echo cancellation network is used to determine a frequency domain signal mask that characterizes the information to be suppressed and/or the information to be retained in the initial echo cancellation result, that is, characterizes the residual echo. The frequency domain signal mask is then used to perform residual echo cancellation on the initial echo cancellation result, which reduces the signal distortion caused by processing the initial echo cancellation result directly and improves the quality of the microphone data after echo cancellation. In addition, compared with a residual echo suppressor, the deep learning residual echo cancellation network can be trained with voice data from different scenes, so the accuracy of residual echo cancellation in different scenes can be effectively improved, and with it the signal quality of the microphone data after residual echo cancellation. Furthermore, compared with performing echo cancellation on the microphone data directly with a deep learning network, performing linear echo cancellation on the microphone data first effectively reduces the complexity of the data and hence the computational complexity, so that the echo cancellation method can be deployed on devices with limited computing resources, which broadens its range of application.

In one embodiment, before the far-end voice data and the initial echo cancellation result are input to the preset deep learning residual echo cancellation network, the method further includes: converting the far-end voice data and the initial echo cancellation result into mel spectrums respectively; correspondingly, the inputting the far-end voice data and the initial echo cancellation result into a preset deep learning residual echo cancellation network comprises: inputting the respective mel spectrums of the far-end voice data and the initial echo cancellation result into the deep learning residual echo cancellation network.

In the embodiment of the application, the far-end voice data and the initial echo cancellation result are converted into mel spectrums. The mel spectrum is more consistent with human auditory perception and reflects the energy distribution and essential characteristics of the voice signal, so using it compresses the frequency channels into a limited number of representative frequency bands. This reduces the complexity of the data and the computational workload of the deep learning residual echo cancellation network, so that the network can be applied to audio equipment with limited computing resources.

In one embodiment, the converting the far-end voice data and the initial echo cancellation result into mel spectrums respectively includes: and respectively inputting the far-end voice data and the initial echo cancellation result into a preset Mel filter to obtain a Mel frequency spectrum of the far-end voice data and a Mel frequency spectrum of the initial echo cancellation result.

In the embodiment of the application, the preset Mel filter can convert voice data into Mel frequency spectrum, and the Mel filter is used for converting far-end voice data and an initial echo cancellation result, so that the complexity of converting Mel frequency spectrum can be effectively reduced, the consumption of computing resources is reduced, and meanwhile, the conversion efficiency can be improved.

In an embodiment, the obtaining the initial echo cancellation result of the microphone data after the linear echo cancellation includes: performing short-time Fourier transform on the microphone data and the far-end voice data to obtain frequency domain data of the microphone data and frequency domain data of the far-end voice data; inputting the frequency domain data of the microphone data and the frequency domain data of the far-end voice data into a preset adaptive filter for filtering to obtain the initial echo cancellation result output by the adaptive filter; the adaptive filter is configured to cancel a linear echo portion of frequency domain data of the microphone data with the frequency domain data of the far-end speech data as a reference signal.

In the embodiment of the application, the data is filtered by using the preset adaptive filter, so that the complexity of linear echo cancellation can be effectively simplified, and the computing resources required by the linear echo cancellation are reduced.

In one embodiment, the adaptive filter comprises an adaptive kalman filter.

When an adaptive Kalman filter is used for filtering, the filter continuously uses the filtering itself to judge whether the dynamics of the system have changed, and estimates and corrects the model parameters and noise statistics accordingly, which improves the filter design and reduces the actual filtering error.

In an embodiment, the performing residual echo cancellation on the initial echo cancellation result based on the frequency domain signal mask includes: multiplying the frequency domain signal mask with the initial echo cancellation result to obtain a product; and performing inverse Fourier transform on the product to obtain the microphone data after the residual echo is eliminated.

In this embodiment of the present application, the frequency domain signal mask characterizes information to be suppressed and/or information to be reserved in the initial echo cancellation result, and then the initial echo cancellation result is multiplied by the frequency domain signal mask to cancel residual echo of the initial echo cancellation result and/or reserve other voice signals except for the residual echo. The method is simple to realize, the complexity of residual echo cancellation can be effectively reduced, and the consumption of calculation resources is reduced.

In one embodiment, the deep learning residual echo cancellation network is obtained by: acquiring a double-talk echo data set, wherein the double-talk echo data set comprises frequency domain data of microphone voice training data and frequency domain data of far-end voice training data acquired in different scenes; inputting the double-talk echo data set into the deep learning residual echo cancellation network to obtain a learning result of the deep learning residual echo cancellation network; calculating a loss function based on the learning result and the double-talk echo dataset; if the loss function is larger than a preset threshold, adjusting parameters of the deep learning residual echo cancellation network, and repeating the training process from acquiring the double-talk echo data set to calculating the loss function; and determining that the deep learning residual echo cancellation network training is completed until the loss function is lower than a preset threshold value.

In the embodiment of the application, the deep learning residual echo cancellation network is trained, and training is determined to be complete once the loss function is lower than the preset threshold, so that the frequency domain signal mask determined by the deep learning residual echo cancellation network has higher accuracy, which improves the efficiency of residual echo cancellation and the quality of the microphone data after residual echo cancellation.

In a second aspect, an embodiment of the present application provides an echo cancellation device, including: the linear echo cancellation module is used for acquiring an initial echo cancellation result of microphone data after linear echo cancellation; the deep learning module is used for inputting the far-end voice data and the initial echo cancellation result into a preset deep learning residual echo cancellation network to obtain a frequency domain signal mask output by the deep learning residual echo cancellation network; the deep learning residual echo cancellation network is used for determining a nonlinear echo part in the initial echo cancellation result by taking the far-end voice data as a reference signal, and the frequency domain signal mask represents information to be suppressed and/or information to be reserved in the initial echo cancellation result; and the residual echo cancellation module is used for performing residual echo cancellation on the initial echo cancellation result based on the frequency domain signal mask to obtain microphone data after residual echo cancellation.

In a third aspect, embodiments of the present application provide an audio device, including: a speaker; a microphone; a processor configured to perform the echo cancellation method according to any one of the first aspects.

In a fourth aspect, embodiments of the present application provide a readable storage medium having a program stored therein, which when run on a processor causes the processor to perform the echo cancellation method according to any one of the first aspects.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments of the present application will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and should not be considered as limiting the scope, and other related drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

Fig. 1 is a flowchart of an echo cancellation method according to an embodiment of the present application;

Fig. 2 is a schematic diagram of signal transmission according to an embodiment of the present application;

Fig. 3 is a schematic structural diagram of a deep learning residual echo cancellation network according to an embodiment of the present application;

Fig. 4 is a schematic diagram of an echo cancellation device according to an embodiment of the present application;

Fig. 5 is a schematic diagram of an audio device according to an embodiment of the present application.

Reference numerals: echo cancellation device 200; linear echo cancellation module 210; deep learning module 220; residual echo cancellation module 230; audio device 300; speaker 310; microphone 320; processor 330.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present application more apparent, the present application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.

Referring to Fig. 1, which is a flowchart of an echo cancellation method according to an embodiment of the present application, the echo cancellation method comprises the following steps:

S110, obtaining an initial echo cancellation result of microphone data after linear echo cancellation.

Referring to Fig. 2, which is a signal transmission schematic diagram provided in an embodiment of the present application, and taking a call as an example, the collected voice signal in the embodiment of the present application includes two parts: microphone data and far-end voice data. The far-end voice data is the voice data played by an audio playing end such as a speaker. The microphone data is the voice data collected by a microphone and may also be referred to as near-end data.

The microphone data may contain the far-end voice data, which in this case is the echo that needs to be removed. For example, during a phone call, the voice played by the phone's speaker may be picked up by the microphone and transmitted back to the far end, so that the far-end listener hears an echo.

The echo includes linear echo and nonlinear echo. In the embodiment of the present application, the linear echo in the microphone data may be cancelled first to obtain an initial echo cancellation result. It may be understood that the initial echo cancellation result is itself microphone data.

In one embodiment, the linear echo cancellation may be performed as follows, to obtain an initial echo cancellation result: firstly, carrying out short-time Fourier transform on microphone data and far-end voice data to obtain frequency domain data of the microphone data and frequency domain data of the far-end voice data; and inputting the frequency domain data of the microphone data and the frequency domain data of the far-end voice data into a preset adaptive filter for filtering, and obtaining an initial echo cancellation result output by the adaptive filter.

In this embodiment, the adaptive filter may be pre-constructed, and the adaptive filter may be configured to use frequency domain data of the far-end voice data as a reference signal, so as to cancel a linear echo portion of the frequency domain data of the microphone data.

In Fig. 2, STFT denotes the short-time Fourier transform. In this embodiment, the microphone data and the far-end voice data are time domain signals, so they can be converted into frequency domain data through the short-time Fourier transform. The frequency domain microphone data and far-end voice data are then input into the adaptive filter for filtering, so as to cancel the linear echo in the microphone data.

Therefore, the complexity of linear echo cancellation can be simplified through the preset adaptive filter, and the computing resource required by echo cancellation is reduced.
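The flow described above (short-time Fourier transform followed by adaptive filtering against the far-end reference) can be sketched as follows. This is a minimal illustration, not the filter of the application: it assumes a Hann window, a hypothetical frame length of 256 and hop of 128, and a single-tap normalized LMS (NLMS) filter per frequency bin in place of the preset adaptive filter. The function names `stft` and `linear_echo_cancel` and the step size `mu` are likewise assumptions made for the example.

```python
import numpy as np

def stft(x, frame_len=256, hop=128):
    """Short-time Fourier transform with a Hann window; returns (frames, bins)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack(
        [x[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=1)

def linear_echo_cancel(mic_spec, far_spec, mu=0.5, eps=1e-8):
    """Per-bin single-tap NLMS (an assumed stand-in for the preset adaptive filter).

    The far-end spectrum is the reference signal; the error after subtracting
    the estimated linear echo is returned as the initial echo cancellation result.
    """
    weights = np.zeros(mic_spec.shape[1], dtype=complex)
    out = np.empty_like(mic_spec)
    for t in range(mic_spec.shape[0]):
        x, y = far_spec[t], mic_spec[t]
        err = y - weights * x                      # microphone minus estimated echo
        weights += mu * np.conj(x) * err / (np.abs(x) ** 2 + eps)
        out[t] = err
    return out
```

With a microphone signal that is purely a scaled copy of the far-end signal, the output energy of later frames should fall far below the microphone energy as the per-bin weights converge.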

In one embodiment, the adaptive filter may comprise an adaptive Kalman filter.

In some embodiments, other types of adaptive filters may be used as well. In some embodiments, conventional acoustic echo cancellation methods may also be used for linear echo cancellation, and specific implementations may refer to the prior art and are not further developed herein.
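To make the Kalman option more concrete, the following is a minimal sketch of a per-bin, single-tap Kalman filter working on frequency domain frames. It rests on several assumptions that do not come from the application: the echo path is modelled as one complex coefficient per bin following a random walk, and the process and observation noise variances (`process_var`, `obs_var`) are fixed, hypothetical values. An adaptive variant would additionally re-estimate those statistics during filtering.

```python
import numpy as np

def kalman_echo_cancel(mic_spec, far_spec, process_var=1e-4, obs_var=1e-2):
    """Single-tap Kalman filter per frequency bin (illustrative sketch only).

    mic_spec, far_spec: complex arrays of shape (frames, bins).
    Returns the prior error per frame (the initial echo cancellation result)
    and the final echo-path estimate.
    """
    n_frames, n_bins = mic_spec.shape
    w = np.zeros(n_bins, dtype=complex)   # estimated echo-path coefficient
    p = np.ones(n_bins)                   # variance of that estimate
    out = np.empty_like(mic_spec)
    for t in range(n_frames):
        x, y = far_spec[t], mic_spec[t]
        p_pred = p + process_var                   # predict: random-walk state model
        err = y - w * x                            # innovation before the update
        gain = p_pred * np.conj(x) / (np.abs(x) ** 2 * p_pred + obs_var)
        w = w + gain * err                         # correct the echo-path estimate
        p = (1.0 - (gain * x).real) * p_pred       # shrink the estimate variance
        out[t] = err
    return out, w
```

The Kalman gain plays the role of a step size that is large while the echo-path estimate is uncertain and shrinks as the estimate settles.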

A deep learning network places high demands on the processing capability of the device chip, and echo cancellation places high demands on real-time performance. As a result, a method that uses a deep learning network directly to cancel the whole echo, both linear and nonlinear, in the microphone data cannot be applied to devices with limited processor performance, such as Bluetooth headphones and speakers. Therefore, in the embodiment of the present application, an initial echo cancellation result after linear echo cancellation may be obtained first and then input to the deep learning residual echo cancellation network, so as to reduce the computing resources required by the deep learning residual echo cancellation network.

S120, inputting the far-end voice data and the initial echo cancellation result into a preset deep learning residual echo cancellation network to obtain a frequency domain signal mask output by the deep learning residual echo cancellation network.

In this embodiment, the deep learning residual echo cancellation network is a pre-constructed neural network model that is configured to determine the nonlinear echo portion of the initial echo cancellation result by using the far-end voice data as a reference signal. It can be understood that the microphone data collected by the microphone contains the far-end voice data, which, for the microphone data, is the echo to be eliminated. After linear echo cancellation, part of the nonlinear echo still remains, and this remaining nonlinear echo also originates from the far-end voice data. The far-end voice data can therefore be used as a reference signal to determine the residual echo data in the initial echo cancellation result that matches the far-end voice data, namely the nonlinear echo in the initial echo cancellation result. The construction of the neural network model can refer to the prior art and is not elaborated here.

In this embodiment, after the nonlinear echo portion is determined by using the far-end voice data as the reference signal, the frequency domain signal mask may be used to characterize the information to be suppressed and/or the information to be retained in the initial echo cancellation result, that is, to characterize the nonlinear echo portion and/or the remaining portion other than the nonlinear echo in the initial echo cancellation result.

In one embodiment, before the far-end voice data and the initial echo cancellation result are input to the preset deep learning residual echo cancellation network, the far-end voice data and the initial echo cancellation result may be converted into mel spectrums, respectively. Correspondingly, when the far-end voice data and the initial echo cancellation result are input into the preset deep learning residual echo cancellation network, the mel frequency spectrums of the far-end voice data and the initial echo cancellation result can be input into the deep learning residual echo cancellation network.

Compared with the raw frequency domain data of the voice signal, the mel spectrum reflects the characteristics of the voice signal more compactly, so feeding the mel spectrum to the deep learning residual echo cancellation network reduces the computing resources the network requires. Specifically, the mel spectrum is a frequency domain representation of a sound signal that is more consistent with human auditory perception. In addition, the mel spectrum converts the sound signal into a set of energy values that reflect the energy distribution of the sound signal and better capture the essential characteristics of the sound. Converting the sound signal into a mel spectrum therefore compresses the frequency channels into a limited number of representative frequency bands, which greatly reduces the complexity of the data and thus the complexity of computations on the spectrum. Thus, in this embodiment, the frequency domain data of the far-end voice data and of the initial echo cancellation result may first be converted into mel spectrums, so as to reduce the complexity of the data and the computing resources the deep learning residual echo cancellation network requires to determine the frequency domain signal mask.

In some embodiments of the present application, the mel-frequency spectrum conversion may be performed by using a mel filter, that is, the far-end voice data and the initial echo cancellation result are respectively input to a preset mel filter, so as to obtain a mel frequency spectrum of the far-end voice data and a mel frequency spectrum of the initial echo cancellation result.

The mel filter is an implementation manner for converting data into a mel spectrum, and through the mel filter, the calculation amount for converting the mel spectrum can be reduced and the conversion can be realized relatively quickly, so in this embodiment, the mel filter can be preset to convert the mel spectrum, so that the calculation resource required by echo cancellation is further reduced, and the conversion efficiency of the mel spectrum is improved.

In the above embodiments, the specific principles and implementation of the mel spectrum and the mel filter may refer to the prior art, and are not further developed herein.
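As an illustration of what a preset mel filter does, the sketch below builds a bank of triangular filters spaced evenly on the mel scale and applies it to a power spectrum. The mel formula used (2595 * log10(1 + f / 700)) is one common convention. The band count of 32, the FFT size of 256, the 16 kHz sample rate and the function names are assumptions chosen for the example and are not specified by the application.

```python
import numpy as np

def hz_to_mel(hz):
    """Convert frequency in Hz to mel (one common convention)."""
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(n_mels=32, n_fft=256, sample_rate=16000):
    """Triangular filters with centres evenly spaced on the mel scale.

    Returns an array of shape (n_mels, n_fft // 2 + 1).
    """
    n_bins = n_fft // 2 + 1
    bin_freqs = np.linspace(0.0, sample_rate / 2.0, n_bins)
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(sample_rate / 2.0), n_mels + 2))
    bank = np.zeros((n_mels, n_bins))
    for m in range(n_mels):
        lo, centre, hi = edges[m], edges[m + 1], edges[m + 2]
        rising = (bin_freqs - lo) / (centre - lo)
        falling = (hi - bin_freqs) / (hi - centre)
        bank[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def to_mel(spec, bank):
    """Map complex spectra (frames, bins) to mel band energies (frames, n_mels)."""
    return (np.abs(spec) ** 2) @ bank.T
```

With these example sizes, 129 frequency bins per frame are reduced to 32 band energies, which is the kind of compression the description attributes to the mel spectrum.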

Referring to Fig. 3, which is a schematic structural diagram of a deep learning residual echo cancellation network according to an embodiment of the present application, the deep learning residual echo cancellation network may include a first convolution layer, a second convolution layer, a GRU (Gated Recurrent Unit), a first deconvolution layer, and a second deconvolution layer connected in sequence, where the first convolution layer is connected to the second deconvolution layer by a residual connection and the second convolution layer is connected to the first deconvolution layer by a residual connection. The first and second convolution layers may be identical in structure, with "first" and "second" used only to distinguish them; the same applies to the first and second deconvolution layers.

In this embodiment, after receiving the far-end voice data and the initial echo cancellation result, the deep learning residual echo cancellation network combines the two and passes the combined data through each network layer in turn, and the frequency domain signal mask is obtained from the computation of these layers.
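The layer order and residual connections can be sketched as a forward pass in NumPy. Everything beyond the layer order is an assumption made for illustration: the weights are random and untrained, the kernel size equals the stride (2), the channel counts are 4 and 8, the residual connections are implemented as additions, ReLU and sigmoid activations are used, and the two inputs are combined by stacking them as channels. The class name `TinyResidualEchoNet` is hypothetical, and the sketch outputs one mask value per input band.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv1d(x, w):
    """Strided 1-D convolution over frequency; kernel size == stride == w.shape[2]."""
    c_in, n_freq = x.shape
    k = w.shape[2]
    blocks = x.reshape(c_in, n_freq // k, k)
    return np.einsum("cfk,ock->of", blocks, w)

def deconv1d(x, w):
    """Transposed 1-D convolution that restores the resolution halved by conv1d."""
    c_out, k = w.shape[1], w.shape[2]
    return np.einsum("cf,cok->ofk", x, w).reshape(c_out, x.shape[1] * k)

class TinyResidualEchoNet:
    """conv1 -> conv2 -> GRU -> deconv1 -> deconv2 with two additive skips."""

    def __init__(self, n_bands=32, c1=4, c2=8, seed=0):
        rng = np.random.default_rng(seed)
        init = lambda *shape: 0.1 * rng.standard_normal(shape)
        self.w_conv1 = init(c1, 2, 2)      # 2 input channels: far-end, initial result
        self.w_conv2 = init(c2, c1, 2)
        self.w_deconv1 = init(c2, c1, 2)
        self.w_deconv2 = init(c1, 1, 2)
        n = c2 * (n_bands // 4)
        self.gru_w = init(3, n, n)         # input weights: update, reset, candidate
        self.gru_u = init(3, n, n)         # recurrent weights, same order
        self.h = np.zeros(n)               # GRU state carried across frames

    def step(self, far_end, initial_result):
        """Process one frame of band features and return a mask in (0, 1)."""
        x = np.stack([far_end, initial_result])           # combine the two inputs
        e1 = np.maximum(conv1d(x, self.w_conv1), 0.0)     # first convolution layer
        e2 = np.maximum(conv1d(e1, self.w_conv2), 0.0)    # second convolution layer
        v = e2.ravel()
        z = sigmoid(self.gru_w[0] @ v + self.gru_u[0] @ self.h)
        r = sigmoid(self.gru_w[1] @ v + self.gru_u[1] @ self.h)
        cand = np.tanh(self.gru_w[2] @ v + r * (self.gru_u[2] @ self.h))
        self.h = (1.0 - z) * cand + z * self.h
        g = self.h.reshape(e2.shape)
        d1 = np.maximum(deconv1d(g + e2, self.w_deconv1), 0.0)  # skip from conv2
        d2 = deconv1d(d1 + e1, self.w_deconv2)                   # skip from conv1
        return sigmoid(d2[0])
```

Calling `step` once per frame keeps the GRU state between frames, which is how a recurrent layer can track how the echo evolves over time.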

In one embodiment, the deep learning residual echo cancellation network may be obtained by: acquiring a double-talk echo data set, and inputting the double-talk echo data set into a deep learning residual echo cancellation network to obtain a learning result of the deep learning residual echo cancellation network; calculating a loss function based on the learning result and the double-talk echo data set; if the loss function is larger than a preset threshold, parameters of the deep learning residual echo cancellation network are adjusted, and the training process from obtaining the double-talk echo data set to calculating the loss function is repeated; and determining that the deep learning residual echo cancellation network training is completed until the loss function is lower than a preset threshold value.

In this embodiment, the double-talk echo data set includes frequency domain data of microphone voice training data and frequency domain data of far-end voice training data collected in different scenes. The deep learning residual echo cancellation network is trained through the voice data in different scenes, so that the trained deep learning residual echo cancellation network has a good calculation effect on the voice data in different scenes, and a more accurate frequency domain signal mask is obtained. Compared with a residual echo suppressor, the method does not need to be configured and adjusted for different application scenes, so that the application range and convenience of echo cancellation are improved.

In this embodiment, a loss function is calculated from the learning result and the double-talk echo data set. If the loss function is greater than a preset threshold, the configuration parameters of each network layer are adjusted; the optimization method may refer to the prior art, for example the Adam algorithm, and is not elaborated here. When the loss function is smaller than or equal to the preset threshold, the frequency domain signal mask output by the deep learning residual echo cancellation network has higher accuracy, and training can be determined to be complete.
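The threshold-controlled training procedure can be illustrated with a toy example. The sketch below is not the network of the application: a single sigmoid unit stands in for the deep learning residual echo cancellation network, synthetic features and target masks stand in for the double-talk echo data set, and the mean squared error is an assumed loss. The Adam update follows the standard formulation with commonly used default constants. The function name and the `max_steps` guard are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_until_threshold(features, target_mask, threshold=1e-3,
                          lr=0.05, max_steps=5000):
    """Adjust parameters with Adam until the loss is at or below the threshold.

    features: (samples, dims) stand-in inputs; target_mask: (samples,) in (0, 1).
    Returns the parameters, the last computed loss and the step count.
    """
    inputs = np.hstack([features, np.ones((len(features), 1))])  # append a bias term
    params = np.zeros(inputs.shape[1])
    m1 = np.zeros_like(params)              # Adam first-moment estimate
    m2 = np.zeros_like(params)              # Adam second-moment estimate
    beta1, beta2, eps = 0.9, 0.999, 1e-8
    for step in range(1, max_steps + 1):
        mask = sigmoid(inputs @ params)     # "learning result" of the toy model
        diff = mask - target_mask
        loss = np.mean(diff ** 2)
        if loss <= threshold:               # training is considered complete
            return params, loss, step
        # Gradient of the mean squared error through the sigmoid.
        grad = inputs.T @ (2.0 * diff * mask * (1.0 - mask)) / len(diff)
        m1 = beta1 * m1 + (1.0 - beta1) * grad
        m2 = beta2 * m2 + (1.0 - beta2) * grad ** 2
        m1_hat = m1 / (1.0 - beta1 ** step)
        m2_hat = m2 / (1.0 - beta2 ** step)
        params -= lr * m1_hat / (np.sqrt(m2_hat) + eps)
    return params, loss, max_steps
```

The `max_steps` guard is an addition for safety in the example, since a threshold that is set too low might otherwise never be reached.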

S130, carrying out residual echo cancellation on the initial echo cancellation result based on the frequency domain signal mask, and obtaining microphone data after residual echo cancellation.

In this embodiment, the frequency domain signal mask characterizes information to be suppressed and/or information to be retained in the initial echo cancellation result, where the information to be suppressed is a nonlinear echo in microphone data, and the information to be retained is the rest of speech data except for the nonlinear echo in the microphone data. The residual echo in the initial echo cancellation result may be suppressed and/or the remaining speech signals except for the residual echo may be retained based on the frequency domain signal mask, thereby obtaining microphone data after the residual echo cancellation.

In one embodiment, the residual echo cancellation of the initial echo cancellation result based on the frequency domain signal mask may include: multiplying the frequency domain signal mask with the initial echo cancellation result to obtain a product; and then carrying out inverse Fourier transform on the product to obtain microphone data after the residual echo is eliminated.

In this embodiment, the frequency domain signal mask can be regarded as a set of weights, and multiplying it with the initial echo cancellation result suppresses the residual echo and/or retains the remaining information other than the residual echo. The initial echo cancellation result is frequency domain data, so its product with the frequency domain signal mask is also frequency domain data, whereas the data played by the audio device is time domain data. Therefore, after the product of the initial echo cancellation result and the frequency domain signal mask is obtained, an inverse Fourier transform can be applied to the product to obtain the microphone data after residual echo cancellation.

The residual echo cancellation can be carried out on the initial echo cancellation result by multiplying the frequency domain signal mask and the initial echo cancellation result, so that the calculation complexity of carrying out residual echo cancellation on the initial echo cancellation result is effectively reduced, and the required calculation resources are reduced.
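The multiply-then-invert step can be sketched as follows. The sketch assumes a real-valued mask with the same shape as the spectrum, a Hann window used for both analysis and synthesis, a hypothetical frame length of 256 with a hop of 128, and weighted overlap-add for the inverse short-time transform. The helper `stft` is included only so that the example is self-contained.

```python
import numpy as np

def stft(x, frame_len=256, hop=128):
    """Forward short-time Fourier transform (helper for the example)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack(
        [x[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=1)

def apply_mask_and_invert(initial_spec, mask, frame_len=256, hop=128):
    """Multiply the mask with the initial result, then return a time domain signal."""
    window = np.hanning(frame_len)
    masked = mask * initial_spec                        # element-wise product
    frames = np.fft.irfft(masked, n=frame_len, axis=1) * window
    n_frames = frames.shape[0]
    out = np.zeros(hop * (n_frames - 1) + frame_len)
    norm = np.zeros_like(out)
    for i in range(n_frames):
        start = i * hop
        out[start : start + frame_len] += frames[i]     # overlap-add
        norm[start : start + frame_len] += window ** 2  # window compensation
    return out / np.maximum(norm, 1e-8)
```

A mask of all ones leaves the signal unchanged apart from the first and last samples, where the window is zero, while a mask value near zero removes the corresponding time-frequency component.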

In the embodiment of the application, the deep learning residual echo cancellation network is used to determine the frequency domain signal mask, which characterizes the information to be suppressed and/or the information to be retained in the initial echo cancellation result, that is, characterizes the residual echo. The frequency domain signal mask is then used to perform residual echo cancellation on the initial echo cancellation result, which reduces the signal distortion caused by processing the initial echo cancellation result directly and therefore improves the quality of the microphone data after echo cancellation. In addition, compared with a residual echo suppressor, the deep learning residual echo cancellation network can be trained with voice data from different scenes, so the accuracy of residual echo cancellation in different scenes can be effectively improved, and with it the signal quality of the microphone data after residual echo cancellation. Furthermore, compared with performing echo cancellation on the microphone data directly with a deep learning network, performing linear echo cancellation on the microphone data first effectively reduces the complexity of the data and hence the computational complexity, so that the echo cancellation method can be deployed on devices with limited computing resources, which broadens its range of application.

Based on the same inventive concept, the embodiment of the present application further provides an echo cancellation device. Referring to Fig. 4, which is a schematic diagram of the echo cancellation device provided in an embodiment of the present application, the echo cancellation device 200 includes: a linear echo cancellation module 210, a deep learning module 220, and a residual echo cancellation module 230.

The linear echo cancellation module 210 is configured to obtain an initial echo cancellation result of microphone data after linear echo cancellation.

The deep learning module 220 is configured to input the far-end speech data and the initial echo cancellation result into a preset deep learning residual echo cancellation network, so as to obtain a frequency domain signal mask output by the deep learning residual echo cancellation network; the deep learning residual echo cancellation network is used for determining a nonlinear echo part in an initial echo cancellation result by taking far-end voice data as a reference signal, and the frequency domain signal mask represents information to be suppressed and/or information to be reserved in the initial echo cancellation result.

The residual echo cancellation module 230 is configured to perform residual echo cancellation on the initial echo cancellation result based on the frequency domain signal mask, so as to obtain microphone data after residual echo cancellation.

In one embodiment, the deep learning module 220 is configured to convert the far-end voice data and the initial echo cancellation result into respective mel spectra, and to input the mel spectrum of the far-end voice data and the mel spectrum of the initial echo cancellation result into the deep learning residual echo cancellation network.

In one embodiment, the deep learning module 220 is configured to input the far-end voice data and the initial echo cancellation result to a predetermined mel filter, respectively, to obtain a mel spectrum of the far-end voice data and a mel spectrum of the initial echo cancellation result.
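The mel filter step described above may be illustrated by the following minimal Python sketch, which is not the implementation of the present application. It assumes that the far-end voice data and the initial echo cancellation result are available as complex short-time Fourier transform frames, and that the preset mel filter is a bank of triangular filters. The function names, the 2595/700 mel formula, and the sizes used are assumptions introduced for illustration only.

```python
import numpy as np

def hz_to_mel(f):
    # Common mel-scale formula (an assumption; the application does not fix one).
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular mel filters, shape (n_mels, n_fft // 2 + 1)."""
    n_bins = n_fft // 2 + 1
    bin_hz = np.linspace(0.0, sample_rate / 2.0, n_bins)
    # n_mels + 2 edges, equally spaced on the mel scale.
    edges_hz = mel_to_hz(np.linspace(0.0, hz_to_mel(sample_rate / 2.0), n_mels + 2))
    fb = np.zeros((n_mels, n_bins))
    for i in range(n_mels):
        lo, mid, hi = edges_hz[i], edges_hz[i + 1], edges_hz[i + 2]
        rising = (bin_hz - lo) / (mid - lo)
        falling = (hi - bin_hz) / (hi - mid)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def to_mel(spectrum, fb):
    """Complex STFT frames (n_frames, n_bins) -> mel magnitudes (n_frames, n_mels)."""
    return np.abs(spectrum) @ fb.T
```

In such a sketch the same filter bank would be applied to both signals, for example `to_mel(far_spec, fb)` and `to_mel(initial_spec, fb)`, so that the two network inputs share one frequency resolution.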

In one embodiment, the linear echo cancellation module 210 is configured to: perform a short-time Fourier transform on the microphone data and the far-end voice data to obtain frequency domain data of the microphone data and frequency domain data of the far-end voice data; and input the frequency domain data of the microphone data and the frequency domain data of the far-end voice data into a preset adaptive filter for filtering, to obtain the initial echo cancellation result output by the adaptive filter. The adaptive filter is configured to cancel a linear echo portion of the frequency domain data of the microphone data, with the frequency domain data of the far-end voice data as a reference signal.

In one embodiment, the adaptive filter comprises an adaptive Kalman filter.
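The linear echo cancellation stage described above may be sketched as follows. This is a hedged illustration only: in place of the adaptive Kalman filter named in this embodiment, it substitutes a simpler single-tap normalized least mean squares (NLMS) filter per frequency bin, because that keeps the example short. The function names, the Hann window, the frame sizes, and the step size `mu` are assumptions and do not come from the application.

```python
import numpy as np

def stft(x, n_fft=256, hop=128):
    """Short-time Fourier transform, returns (n_frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def linear_aec(mic_spec, far_spec, mu=0.5, eps=1e-8):
    """Per-bin single-tap NLMS (a stand-in for the Kalman filter).

    The far-end spectrum is the reference signal; the returned error
    spectrum plays the role of the initial echo cancellation result.
    """
    n_frames, n_bins = mic_spec.shape
    w = np.zeros(n_bins, dtype=complex)        # one echo-path weight per bin
    out = np.empty_like(mic_spec)
    for t in range(n_frames):
        x = far_spec[t]
        e = mic_spec[t] - w * x                # subtract estimated linear echo
        w = w + mu * np.conj(x) * e / (np.abs(x) ** 2 + eps)
        out[t] = e
    return out
```

A single tap per bin can only model an echo path shorter than one frame; a practical filter would use several taps per bin or the Kalman update of this embodiment.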

In one embodiment, the residual echo cancellation module 230 is configured to multiply the frequency domain signal mask with the initial echo cancellation result to obtain a product, and to perform an inverse Fourier transform on the product to obtain the microphone data after residual echo cancellation.
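The masking and inverse transform described above may be sketched as follows, under stated assumptions rather than as the implementation of the present application. The sketch assumes that the mask already holds one real-valued gain per frame and per frequency bin of the initial echo cancellation result, and that frames are recombined by overlap-add. The function name and the frame sizes are hypothetical.

```python
import numpy as np

def apply_mask_and_reconstruct(initial_spec, mask, n_fft=256, hop=128):
    """Multiply the mask with the spectrum, then return a time-domain signal.

    initial_spec: complex array (n_frames, n_fft // 2 + 1)
    mask:         real array of the same shape (assumed layout)
    """
    product = mask * initial_spec                    # element-wise masking
    frames = np.fft.irfft(product, n=n_fft, axis=1)  # inverse Fourier transform
    out = np.zeros(hop * (len(frames) - 1) + n_fft)
    for i, frame in enumerate(frames):
        # Plain overlap-add; window normalization is left out and would
        # depend on the analysis window actually used.
        out[i * hop : i * hop + n_fft] += frame
    return out
```

A mask value near 0 suppresses the corresponding bin and a value near 1 keeps it, which matches the description of the mask as marking information to be suppressed and/or reserved.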

In one embodiment, the deep learning residual echo cancellation network of the deep learning module 220 is obtained by: acquiring a double-talk echo data set, wherein the double-talk echo data set comprises frequency domain data of microphone voice training data and frequency domain data of far-end voice training data acquired in different scenes; inputting the double-talk echo data set into the deep learning residual echo cancellation network to obtain a learning result of the deep learning residual echo cancellation network; calculating a loss function based on the learning result and the double-talk echo data set; if the loss function is larger than a preset threshold, adjusting parameters of the deep learning residual echo cancellation network and repeating the above process; and determining that training of the deep learning residual echo cancellation network is completed once the loss function is lower than the preset threshold.
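The threshold-controlled training procedure described above may be illustrated by the following toy sketch. It is not the network of the present application: a single learnable gain per frequency bin stands in for the deep learning residual echo cancellation network so that the loop is runnable, and a mean squared error stands in for the unspecified loss function. The names `inputs`, `targets`, `lr`, and the `max_steps` guard are assumptions added for illustration.

```python
import numpy as np

def train_until_threshold(inputs, targets, threshold=1e-3, lr=0.5,
                          max_steps=10_000):
    """Adjust parameters until the loss falls below a preset threshold.

    inputs:  magnitudes containing residual echo, shape (n_frames, n_bins)
    targets: desired magnitudes, same shape (assumed training labels)
    """
    n_bins = inputs.shape[1]
    gain = np.ones(n_bins)                     # toy stand-in for network weights
    for step in range(max_steps):
        estimate = gain * inputs               # "learning result"
        loss = np.mean((estimate - targets) ** 2)
        if loss < threshold:                   # training is complete
            return gain, loss, step
        # Gradient of the mean squared error with respect to each gain.
        grad = 2.0 * np.mean((estimate - targets) * inputs, axis=0) / n_bins
        gain = gain - lr * grad                # adjust parameters and repeat
    raise RuntimeError("loss did not fall below the threshold")
```

The `max_steps` guard is not part of the described procedure; it only prevents an endless loop when the threshold cannot be reached.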

Based on the same inventive concept, an embodiment of the present application also provides an audio device. Referring to fig. 5, fig. 5 is a schematic diagram of an audio device according to an embodiment of the present application. The audio device 300 includes: a speaker 310, a microphone 320, and a processor 330. The voice data played by the speaker 310 is the far-end voice data, and the voice data collected by the microphone 320 is the microphone data, which may include the far-end voice data.

In an embodiment of the present application, the processor 330 is configured to perform the echo cancellation method provided in any of the foregoing embodiments of the present application.

In this embodiment, the audio device 300 may be a Bluetooth headset, a sound device, or the like.

Based on the same inventive concept, the embodiments of the present application also provide a readable storage medium having a program stored thereon, which when run on a processor causes the processor to perform the method provided in the above embodiments.

The readable storage medium may be any available medium that can be accessed by a processor, or a data storage device, such as a server or a data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, or a magnetic tape), an optical medium (e.g., a DVD (digital versatile disc)), a semiconductor medium (e.g., an SSD (solid state disk)), or the like.

If the echo cancellation method is implemented in the form of software functional modules and sold or used as a stand-alone product, it may be stored in a computer readable storage medium. Based on this understanding, the technical solution of the present application, in essence, or the part of it that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product that is stored in a storage medium and comprises several instructions for causing a processor to perform all or part of the steps of the methods described in the various embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, an optical disk, or the like.

In the embodiments provided in the present application, it should be understood that the disclosed method and apparatus may be implemented in other manners as well. The device embodiments described above are merely illustrative. The functional modules in the embodiments of the present application may be integrated together to form a single part, or each module may exist alone, or two or more modules may be integrated to form a single part.

The above embodiments can be freely combined without conflict, and the combined embodiments are covered in the protection scope of the present application.

The foregoing describes merely specific embodiments of the present application, but the protection scope of the present application is not limited thereto. Any change or substitution that a person skilled in the art can readily conceive of within the technical scope disclosed in the present application shall be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.

Claims

1. An echo cancellation method, comprising:

acquiring an initial echo cancellation result of microphone data after linear echo cancellation;

inputting far-end voice data and the initial echo cancellation result into a preset deep learning residual echo cancellation network to obtain a frequency domain signal mask output by the deep learning residual echo cancellation network; the deep learning residual echo cancellation network is used for determining a nonlinear echo part in the initial echo cancellation result by taking the far-end voice data as a reference signal, and the frequency domain signal mask represents information to be suppressed and/or information to be reserved in the initial echo cancellation result;

and carrying out residual echo cancellation on the initial echo cancellation result based on the frequency domain signal mask to obtain microphone data after residual echo cancellation.

2. The method of echo cancellation according to claim 1, wherein before said inputting the far-end speech data and the initial echo cancellation result into a preset deep learning residual echo cancellation network, the method further comprises:

respectively converting the far-end voice data and the initial echo cancellation result into a mel frequency spectrum;

correspondingly, the inputting the far-end voice data and the initial echo cancellation result into a preset deep learning residual echo cancellation network comprises the following steps:

and inputting respective mel spectrums of the far-end voice data and the initial echo cancellation result into the deep learning residual echo cancellation network.

3. The method of echo cancellation according to claim 2, wherein said converting said far-end speech data and said initial echo cancellation result, respectively, into mel-frequency spectra comprises:

and respectively inputting the far-end voice data and the initial echo cancellation result into a preset Mel filter to obtain a Mel frequency spectrum of the far-end voice data and a Mel frequency spectrum of the initial echo cancellation result.

4. The method of echo cancellation according to claim 1, wherein said acquiring an initial echo cancellation result of microphone data after linear echo cancellation comprises:

performing short-time Fourier transform on the microphone data and the far-end voice data to obtain frequency domain data of the microphone data and frequency domain data of the far-end voice data;

inputting the frequency domain data of the microphone data and the frequency domain data of the far-end voice data into a preset adaptive filter for filtering to obtain the initial echo cancellation result output by the adaptive filter; the adaptive filter is configured to cancel a linear echo portion of frequency domain data of the microphone data with the frequency domain data of the far-end speech data as a reference signal.

5. The method of echo cancellation according to claim 4, wherein the adaptive filter comprises an adaptive Kalman filter.

6. The method of echo cancellation according to claim 1, wherein said residual echo cancellation of said initial echo cancellation result based on said frequency domain signal mask comprises:

multiplying the frequency domain signal mask with the initial echo cancellation result to obtain a product;

and performing inverse Fourier transform on the product to obtain the microphone data after the residual echo is eliminated.

7. The echo cancellation method according to any one of claims 1 to 6, wherein the deep learning residual echo cancellation network is obtained by:

acquiring a double-talk echo data set, wherein the double-talk echo data set comprises frequency domain data of microphone voice training data and frequency domain data of far-end voice training data acquired in different scenes;

inputting the double-talk echo data set into the deep learning residual echo cancellation network to obtain a learning result of the deep learning residual echo cancellation network;

calculating a loss function based on the learning result and the double-talk echo dataset;

if the loss function is larger than a preset threshold, adjusting parameters of the deep learning residual echo cancellation network, and repeating the training process from acquiring the double-talk echo data set to calculating the loss function;

and determining that training of the deep learning residual echo cancellation network is completed once the loss function is lower than the preset threshold.

8. An echo cancellation device, comprising:

the linear echo cancellation module is used for acquiring an initial echo cancellation result of microphone data after linear echo cancellation;

the deep learning module is used for inputting the far-end voice data and the initial echo cancellation result into a preset deep learning residual echo cancellation network to obtain a frequency domain signal mask output by the deep learning residual echo cancellation network; the deep learning residual echo cancellation network is used for determining a nonlinear echo part in the initial echo cancellation result by taking the far-end voice data as a reference signal, and the frequency domain signal mask represents information to be suppressed and/or information to be reserved in the initial echo cancellation result;

and the residual echo cancellation module is used for performing residual echo cancellation on the initial echo cancellation result based on the frequency domain signal mask to obtain microphone data after residual echo cancellation.

9. An audio device, comprising:

a speaker; a microphone;

a processor configured to perform the echo cancellation method according to any one of claims 1 to 7.

10. A readable storage medium, characterized in that a program is stored in the readable storage medium, which when run on a processor causes the processor to perform the echo cancellation method according to any one of claims 1-7.