CN111524498B

CN111524498B - Filtering method and device and electronic equipment

Info

Publication number: CN111524498B
Application number: CN202010280555.1A
Authority: CN
Inventors: 张勇
Original assignee: Vivo Mobile Communication Co Ltd
Current assignee: Vivo Mobile Communication Co Ltd
Priority date: 2020-04-10
Filing date: 2020-04-10
Publication date: 2023-06-16
Anticipated expiration: 2040-04-10
Also published as: CN111524498A

Abstract

The embodiment of the invention provides a filtering method, a filtering device and electronic equipment, which are applied to the technical field of communication and are used for solving the problem of poor voice call quality of the electronic equipment. The method comprises the following steps: acquiring an estimated echo signal corresponding to a first voice signal output by a loudspeaker of electronic equipment; subtracting the estimated echo signal from a first audio signal received by a microphone of the electronic device to generate a second audio signal, the first audio signal including a second speech signal and a first echo signal fed back to the microphone by the loudspeaker; calculating a weighted filtering parameter according to the energy parameter of a residual echo signal in the second audio signal, wherein the residual echo signal is the portion of the echo signal that remains in the second audio signal; and carrying out filtering processing on the second audio signal according to the weighted filtering parameter to generate a target audio signal. The method and the device are applied to acoustic echo cancellation scenarios.

Description

Filtering method and device and electronic equipment

Technical Field

The embodiment of the invention relates to the technical field of communication, in particular to a filtering method, a filtering device and electronic equipment.

Background

With the development of communication technology, electronic devices having hands-free voice communication systems are increasing, and users can make voice calls with other users using a speaker and a microphone of the electronic device without having to pick up and place the electronic device at the ear.

At present, due to the fact that the distance between a speaker and a microphone of an electronic device is relatively short, sound output by the speaker of the electronic device may be fed back to the microphone of the electronic device, and therefore the voice communication quality of the electronic device is affected by the acoustic echo phenomenon.

Disclosure of Invention

The embodiment of the invention provides a filtering method, a filtering device and electronic equipment, which are used for solving the problem of poor voice call quality of the electronic equipment.

In order to solve the technical problems, the application is realized as follows:

In a first aspect, an embodiment of the present invention provides a filtering method, including: acquiring an estimated echo signal corresponding to a first voice signal output by a loudspeaker of electronic equipment; subtracting the estimated echo signal from a first audio signal received by a microphone of the electronic device to generate a second audio signal, the first audio signal including a second speech signal and a first echo signal fed back to the microphone by the loudspeaker; calculating a weighted filtering parameter according to the energy parameter of a residual echo signal in the second audio signal, wherein the residual echo signal is the portion of the echo signal that remains in the second audio signal; and carrying out filtering processing on the second audio signal according to the weighted filtering parameter to generate a target audio signal.

In a second aspect, an embodiment of the present invention further provides a filtering apparatus, including an acquisition module, a generation module and a calculation module. The acquisition module is used for acquiring an estimated echo signal corresponding to a first voice signal output by a loudspeaker of the electronic equipment; the generation module is used for subtracting the estimated echo signal acquired by the acquisition module from a first audio signal received by a microphone of the electronic equipment to generate a second audio signal, wherein the first audio signal comprises a second voice signal and a first echo signal fed back to the microphone by the loudspeaker; the calculation module is used for calculating a weighted filtering parameter according to the energy parameter of the residual echo signal in the second audio signal generated by the generation module, wherein the residual echo signal is the portion of the echo signal that remains in the second audio signal; and the generation module is also used for carrying out filtering processing on the second audio signal according to the weighted filtering parameter calculated by the calculation module to generate a target audio signal.

In a third aspect, an embodiment of the present invention provides an electronic device, including a processor, a memory, and a computer program stored on the memory and executable on the processor, the computer program implementing the steps of the filtering method according to the first aspect when executed by the processor.

In a fourth aspect, embodiments of the present invention provide a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the filtering method according to the first aspect.

In the embodiment of the invention, after the electronic device obtains the estimated echo signal corresponding to the first voice signal output by the speaker of the electronic device, the estimated echo signal may be subtracted from the first audio signal (including the second voice signal and the first echo signal fed back to the microphone by the speaker) received by the microphone of the electronic device, so as to generate the second audio signal. Since the second audio signal also comprises a residual echo signal, the residual echo signal still affects the voice call quality. Therefore, the electronic device may calculate the weighted filtering parameter according to the energy parameter of the residual echo signal remaining in the second audio signal, and perform filtering processing on the second audio signal according to the weighted filtering parameter, so as to generate the target audio signal. Through the scheme, the electronic equipment can filter the first audio signal through the adaptive filter to obtain the second audio signal, and then the electronic equipment can carry out secondary filtering on the residual echo signal according to the estimated energy parameter of the residual echo signal in the second audio signal, so that the echo signal is further restrained, and the conversation quality of the electronic equipment is further improved.

Drawings

FIG. 1 is a schematic diagram of one possible way in which an acoustic echo is generated according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a possible operating system according to an embodiment of the present invention;

FIG. 3 is a schematic flow chart of a conventional filtering method according to an embodiment of the present invention;

FIG. 4 is a schematic flow chart of a filtering method according to an embodiment of the present invention;

FIG. 5 is a second flow chart of a filtering method according to the embodiment of the invention;

FIG. 6 is a third flow chart of a filtering method according to an embodiment of the present invention;

FIG. 7 is a fourth flow chart of a filtering method according to an embodiment of the present invention;

FIG. 8 is a fifth flow chart of a filtering method according to an embodiment of the present invention;

FIG. 9 is a first graph of a filtering method according to an embodiment of the present invention;

FIG. 10 is a graph of a speech spectrum after filtering by a conventional filtering method according to an embodiment of the present invention;

FIG. 11 is a second graph of a filtering method according to an embodiment of the present invention;

FIG. 12 is a schematic structural diagram of a filtering device according to an embodiment of the present invention;

FIG. 13 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.

In this context, "/" means "or"; for example, A/B may mean A or B. "And/or" herein merely describes an association relationship between associated objects and means that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist together, or B exists alone.

It should be noted that "plurality" herein means two or more than two.

It should be noted that, in the embodiments of the present invention, words such as "exemplary" or "such as" are used to mean serving as an example, instance, or illustration. Any embodiment or design described herein as "exemplary" or "e.g." in an embodiment should not be taken as preferred or advantageous over other embodiments or designs. Rather, the use of words such as "exemplary" or "such as" is intended to present related concepts in a concrete fashion.

It should be noted that, in order to clearly describe the technical solution of the embodiment of the present invention, in the embodiment of the present invention, the words "first", "second", etc. are used to distinguish the same item or similar items having substantially the same function or effect, and those skilled in the art will understand that the words "first", "second", etc. do not limit the number and execution order. For example, the first audio signal and the second audio signal are used to distinguish between different audio signals, rather than to describe a particular order of the audio signals.

The acoustic echo phenomenon in the embodiment of the invention refers to the phenomenon in which the voice signal played by the loudspeaker is picked up by the microphone and transmitted back to the far end, so that the far-end user hears his or her own voice.

Wherein acoustic echoes are further divided into direct echoes and indirect echoes. The direct echo refers to the fact that a voice signal played by a loudspeaker directly enters a microphone without any reflection, and the indirect echo refers to an echo set generated by the fact that the voice signal played by the loudspeaker enters the microphone after being reflected once or more times through different paths.

For example, as shown in FIG. 1, a voice signal played by the loudspeaker that enters the microphone through echo path 1 is a direct echo, and one that enters the microphone through echo path 2 is an indirect echo.

The execution main body of the filtering method provided by the embodiment of the invention can be the electronic equipment (including mobile electronic equipment and non-mobile electronic equipment), or can be a functional module and/or a functional entity which can realize the filtering method in the electronic equipment, and the implementation main body can be specifically determined according to actual use requirements. The filtering method provided by the embodiment of the invention is exemplified by an electronic device.

The electronic device in the embodiment of the invention can be a mobile electronic device or a non-mobile electronic device. The mobile electronic device may be a mobile phone, a tablet computer, a notebook computer, a palmtop computer, a vehicle-mounted terminal device, a wearable device, an ultra-mobile personal computer (UMPC), a netbook or a personal digital assistant (PDA), etc.; the non-mobile electronic device may be a personal computer (PC), a television (TV), a teller machine, a self-service machine, or the like; the embodiment of the present invention is not particularly limited.

The electronic device in the embodiment of the invention can be an electronic device with an operating system. The operating system may be an Android operating system, an iOS operating system, or another possible operating system, and the embodiment of the present invention is not specifically limited.

The software environment to which the filtering method according to the embodiment of the present invention is applied will be described below by taking the operating system shown in fig. 2 as an example.

Fig. 2 is a schematic diagram of a possible architecture of an operating system according to an embodiment of the present invention. In fig. 2, the architecture of the operating system includes 4 layers, respectively: an application program layer, an application program framework layer, a system runtime layer and a kernel layer (specifically, a Linux kernel layer).

The application layer includes various applications (including system applications and third party applications) in the operating system.

The application framework layer is a framework of applications, and developers can develop some applications based on the application framework layer while adhering to the development principle of the framework of the applications.

The system runtime layer includes libraries (also referred to as system libraries) and an operating system runtime environment. Libraries mainly provide various resources required by the operating system. The operating system runtime environment is used to provide a software environment for the operating system.

The kernel layer is an operating system layer of the operating system, and belongs to the bottommost layer of the operating system software layer. The kernel layer provides core system services and hardware-related drivers for the operating system based on the Linux kernel.

Taking the operating system shown in fig. 2 as an example, in the embodiment of the present invention, a developer may develop a software program for implementing the filtering method provided in the embodiment of the present invention based on the system architecture of the operating system shown in fig. 2, so that the filtering method may be operated based on the operating system shown in fig. 2. I.e. the processor or the electronic device may implement the filtering method provided by the embodiments of the invention by running the software program in an operating system.

In the related art, an electronic device generally employs an adaptive filter to perform acoustic echo cancellation, and FIG. 3 is a schematic structural diagram of a conventional acoustic echo cancellation device. The principle of the device is as follows: first, the adaptive filter builds a model of the echo path and uses it to estimate the real echo signal; the estimated echo signal is then subtracted from the near-end signal output by the microphone to generate an error signal; the error signal is fed back to the adaptive filter, whose weights are gradually adjusted so that the estimated echo signal approaches the real echo signal, thereby achieving echo cancellation.

As shown in FIG. 3, x(m) is the voice signal transmitted from the far end; y(m) is the echo signal produced when the far-end voice signal x(m) is output by the loudspeaker and enters the microphone after being reflected; s(m) is the local speech signal formed by the user speaking into the microphone; d(m) is the near-end speech signal input to the microphone, where d(m) = y(m) + s(m); y'(m) is the estimated echo signal calculated by the adaptive filter; e(m) is the error signal obtained as the difference between the near-end speech signal d(m) and the estimated echo signal y'(m), i.e., e(m) = d(m) - y'(m). Double talk detection (DTD) is used to determine whether the current talk state is near-end talk, double talk or far-end talk; if the DTD determines that the current talk state is double talk or far-end talk, the adaptive filter performs filtering.
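The conventional structure described above can be sketched as follows. This is only a minimal illustration under stated assumptions: the text does not name the adaptation rule, so a normalized least-mean-squares (NLMS) update is assumed here, the function name `nlms_echo_canceller` and the parameters `taps`, `mu` and `eps` are hypothetical, and double talk detection is omitted, so the weights adapt on every sample.

```python
def nlms_echo_canceller(x, d, taps=8, mu=0.5, eps=1e-8):
    """Return (y_est, e): estimated echo y'(m) and error e(m) = d(m) - y'(m).

    x is the far-end signal x(m), d is the near-end microphone signal d(m).
    NLMS is an assumed adaptation rule, not one stated in the source text.
    """
    w = [0.0] * taps      # adaptive filter weights (echo path model)
    buf = [0.0] * taps    # most recent far-end samples, newest first
    y_est, e = [], []
    for xm, dm in zip(x, d):
        buf = [xm] + buf[:-1]
        y = sum(wi * bi for wi, bi in zip(w, buf))   # estimated echo y'(m)
        err = dm - y                                 # error signal e(m)
        norm = sum(b * b for b in buf) + eps         # input energy for normalization
        w = [wi + mu * err * bi / norm for wi, bi in zip(w, buf)]
        y_est.append(y)
        e.append(err)
    return y_est, e
```

With a purely linear echo path and no near-end speech, the error of such a filter decays toward zero; with the nonlinear loudspeaker and microphone behaviour discussed in the text, some residual echo would be expected to remain in e(m).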

However, in current communication systems, devices such as the speaker and the microphone have nonlinear characteristics, so an adaptive filter with linear characteristics cannot completely and accurately model the real echo signal, and some misadjustment remains even after the adaptive filter converges. As a result, the error signal output after filtering by the adaptive filter inevitably contains a residual echo signal.

In order to solve the above problem, the present invention provides a filtering method that first obtains an estimated echo signal, then obtains an error signal as the difference between the near-end speech signal and the estimated echo signal, and then designs a perceptual weighting filter according to an energy parameter of the residual echo signal in the error signal so as to perform secondary filtering on the residual echo signal, thereby improving the call quality of the electronic device.

The filtering method according to the embodiment of the present invention is described below with reference to the filtering method flowchart shown in fig. 4, and fig. 4 is a schematic flowchart of the filtering method according to the embodiment of the present invention, including steps 201 to 204:

Step 201: The electronic device obtains an estimated echo signal corresponding to a first voice signal output by a loudspeaker of the electronic device.

In the embodiment of the present invention, the speaker may be a speaker of the electronic device, or may be a speaker externally connected to the electronic device, which is not limited in the embodiment of the present invention.

The estimated echo signal may be an estimate, generated by the electronic device, of the echo of the first voice signal.

Optionally, in the embodiment of the present invention, the electronic device may estimate the estimated echo signal by using a voice signal that does not pass through the speaker and corresponds to the first voice signal.

The step 201 may specifically include the following step 201a:

Step 201a: The electronic device inputs an original voice signal corresponding to the first voice signal output by the loudspeaker of the electronic device into the adaptive filter for echo estimation, to obtain an estimated echo signal corresponding to the original voice signal.

The original voice signal is illustratively a voice signal that does not pass through a speaker of the electronic device.

In one example, the original speech signal described above is also referred to as a far-end speech signal.

It should be noted that, the electronic device performs echo estimation using the original voice signal, and may ignore signal attenuation existing between the original voice signal and the first voice signal. If signal attenuation is taken into account, the electronic device may subtract the estimated attenuated signal from the original speech signal and then perform echo estimation.

Step 202: the electronic device subtracts the estimated echo signal from a first audio signal received by a microphone of the electronic device to generate a second audio signal.

The first audio signal may include a second voice signal and a first echo signal fed back to the microphone by the speaker.

The second speech signal is illustratively a signal produced by the user speaking into the microphone.

The first echo signal is a signal of the first voice signal reflected into the microphone.

In one example, the first audio signal described above is also referred to as a near-end speech signal.

Optionally, in the embodiment of the present invention, in order to reduce the computational complexity for the electronic device, the electronic device may perform time-frequency conversion on the audio signal and then carry out the subsequent calculations on the converted signal.

For example, in the case where the second audio signal is a frequency domain signal, the above step 202 may specifically include the following steps 202a and 202b:

Step 202a: The electronic device subtracts the estimated echo signal from a first audio signal received by a microphone of the electronic device to generate a time domain signal.

In one example, the time domain signal described above is also referred to as an error signal.

Step 202b: the electronic device performs time-frequency conversion on the time domain signal to generate the frequency domain signal.

It will be appreciated that the electronic device performs a time-frequency transformation on the time-domain signal to generate the frequency-domain signal, i.e. to generate the second audio signal.

For example, the electronic device may perform other operations besides time-frequency conversion of the time-domain signal, so as to facilitate subsequent signal computation by the electronic device. All operations performed by the electronic device on the time domain signal may be generally referred to as performing signal analysis on the time domain signal.

For example, as shown in FIG. 5, the signal analysis performed by the electronic device on the time domain signal includes: framing the time domain signal e(m), then performing windowing and time-frequency conversion, and finally converting the time domain signal e(m) into the frequency domain signal E(Ω).

Note that the electronic device may transform the signal from the time domain to the frequency domain using the discrete Fourier transform (DFT); when implemented on a computer, the DFT may be computed with the fast Fourier transform (FFT).
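The framing, windowing and time-frequency conversion steps can be sketched as follows. This is a minimal sketch, not the patent's implementation: the function name `analyze`, the frame length, the hop size and the choice of a periodic Hann window are all assumptions, and a direct DFT is written out for clarity where a practical implementation would use an FFT.

```python
import cmath
import math

def analyze(signal, frame_len=8, hop=4):
    """Split e(m) into overlapping frames, window each frame, and DFT it.

    Returns a list of spectra E(Ω), one list of frame_len complex bins per frame.
    The Hann window and 50% overlap are illustrative assumptions.
    """
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    spectra = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        spectrum = [
            sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len))
            for k in range(frame_len)
        ]
        spectra.append(spectrum)
    return spectra
```

For example, `analyze([math.cos(math.pi * m / 2) for m in range(32)])` yields seven frames whose largest low-frequency bin is bin 2, the bin that matches the cosine's frequency.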

Step 203: the electronic device calculates weighted filter parameters based on energy parameters of the residual echo signal in the second audio signal.

In the embodiment of the present invention, the residual echo signal is the portion of the echo signal that remains in the second audio signal.

It will be appreciated that the second audio signal described above includes a residual echo signal because the first echo signal in the first audio signal cannot be completely cancelled.

Illustratively, the energy parameter may be at least one of: signal strength, signal energy, signal power spectral density (power spectral density, PSD).

In one example, since the first echo signal is generated from the original speech signal, the electronic device may estimate the energy parameter of the residual echo signal in the second audio signal from the second audio signal and the original speech signal as described above.

For example, as shown in FIG. 6, the electronic device may compute a coherence function C_xe(Ω) from the second audio signal E(Ω) and the frequency-domain original speech signal X(Ω) in order to estimate the PSD of the residual echo signal, which can be denoted R_b(Ω). The specific process is as follows: the time domain signal e(m) is transformed into the frequency domain signal E(Ω) by the signal analysis method shown in FIG. 5, and the original speech signal x(m) is transformed into the frequency-domain original speech signal X(Ω) by the same signal analysis method; the coherence function C_xe(Ω) of the second audio signal E(Ω) and the frequency-domain original speech signal X(Ω) can then be calculated. The formula for C_xe(Ω) is as follows:

where R_xe(Ω) represents the cross-power spectrum of the second audio signal E(Ω) and the frequency-domain original speech signal X(Ω), R_x(Ω) represents the PSD of the frequency-domain original speech signal X(Ω), and R_e(Ω) represents the PSD of the second audio signal E(Ω).

The coherence function C_xe(Ω) characterizes the degree of linear correlation between the components of the two signals at each frequency, and its value lies in the range [0, 1]. The more correlated the signals are, the closer its value is to 1; the less correlated they are, the closer its value is to 0. Accordingly, the more echo signal remains in the second audio signal, the larger the coherence function C_xe(Ω) is; conversely, the less echo signal remains, the smaller C_xe(Ω) is.
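The expression for the coherence function does not survive in this text. As an assumption, and not necessarily the patent's own expression, the standard magnitude-squared coherence is consistent with the three quantities defined above (the cross-power spectrum of E(Ω) and X(Ω), and the PSDs of X(Ω) and E(Ω)) and with the stated value range [0, 1]:

```latex
C_{xe}(\Omega) = \frac{\left| R_{xe}(\Omega) \right|^{2}}{R_{x}(\Omega)\, R_{e}(\Omega)},
\qquad 0 \le C_{xe}(\Omega) \le 1
```

The upper bound follows from the Cauchy-Schwarz inequality applied to the averaged spectra, which is why this form takes the value 1 only when E(Ω) is a linearly filtered copy of X(Ω).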

After the coherence function C_xe(Ω) is obtained, the electronic device may calculate the power spectral density R_b(Ω) of the residual echo signal in the second audio signal. The formula is as follows:

R_b(Ω) = R_e(Ω)C_xe(Ω)
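The estimate of the residual echo PSD can be sketched as follows. This is a sketch under several assumptions: the spectra are averaged with first-order recursive smoothing (the text does not say how the PSDs and cross-power spectrum are estimated), the coherence is taken in the standard magnitude-squared form, and the function name `residual_echo_psd` and the parameters `alpha` and `eps` are hypothetical.

```python
def residual_echo_psd(X_frames, E_frames, alpha=0.9, eps=1e-12):
    """Return (C_xe, R_b) per frequency bin after smoothing over all frames.

    X_frames and E_frames are lists of per-frame spectra X(Ω) and E(Ω).
    Recursive smoothing with factor alpha is an assumed estimator.
    """
    bins = len(X_frames[0])
    R_x = [0.0] * bins     # PSD of the far-end (original) speech signal
    R_e = [0.0] * bins     # PSD of the second audio signal
    R_xe = [0j] * bins     # cross-power spectrum of X and E
    for X, E in zip(X_frames, E_frames):
        for k in range(bins):
            R_x[k] = alpha * R_x[k] + (1 - alpha) * abs(X[k]) ** 2
            R_e[k] = alpha * R_e[k] + (1 - alpha) * abs(E[k]) ** 2
            R_xe[k] = alpha * R_xe[k] + (1 - alpha) * X[k] * E[k].conjugate()
    # Magnitude-squared coherence, limited to [0, 1] against rounding error.
    C_xe = [min(1.0, abs(R_xe[k]) ** 2 / (R_x[k] * R_e[k] + eps))
            for k in range(bins)]
    # Residual echo PSD: R_b(Ω) = R_e(Ω) C_xe(Ω).
    R_b = [R_e[k] * C_xe[k] for k in range(bins)]
    return C_xe, R_b
```

When E(Ω) is a scaled copy of X(Ω) in every frame, the coherence comes out close to 1 and R_b(Ω) is close to R_e(Ω); when the two are unrelated from frame to frame, the coherence and therefore R_b(Ω) become small.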

Illustratively, the formula for the weighted filtering parameter G(Ω) is as follows:

where A is an empirical energy threshold.

Note that if the weighted filtering parameter G(Ω) is greater than 1, the second audio signal would be amplified rather than attenuated, so it should be ensured that G(Ω) is less than or equal to 1. The formula for the weighted filtering parameter G(Ω) is then as follows:

Step 204: The electronic device performs filtering processing on the second audio signal according to the weighted filtering parameter to generate a target audio signal.

For example, the electronic device may multiply the second audio signal with the weighted filtering parameter, and further filter the second audio signal to generate the target audio signal S' (Ω). The formula is as follows:

S'(Ω) = G(Ω)E(Ω)
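The multiplication in step 204 can be sketched per frequency bin as follows. The function name `apply_weighted_filter` is hypothetical, and limiting each gain to the range [0, 1] is an assumption that follows the requirement that G(Ω) not exceed 1; how G(Ω) itself is computed is not shown here.

```python
def apply_weighted_filter(E, G):
    """Return S'(Ω) = G(Ω) E(Ω) per bin, with each gain limited to [0, 1]."""
    return [min(1.0, max(0.0, g)) * e for g, e in zip(G, E)]
```

For instance, a gain of 0.5 halves the amplitude of a bin, while a gain above 1 is clipped so that the bin passes through unchanged.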

The target audio signal may include, for example, a filtered second speech signal.

According to the filtering method provided by the embodiment of the invention, after the electronic device obtains the estimated echo signal corresponding to the first voice signal output by the loudspeaker of the electronic device, the estimated echo signal can be subtracted from the first audio signal (comprising the second voice signal and the first echo signal fed back to the microphone by the loudspeaker) received by the microphone of the electronic device, so as to generate the second audio signal. However, since the second audio signal also includes a residual echo signal, the residual echo signal still affects the voice call quality. Therefore, the electronic device may calculate the weighted filtering parameter according to the energy parameter of the residual echo signal remaining in the second audio signal, and perform filtering processing on the second audio signal according to the weighted filtering parameter, so as to generate the target audio signal. Through this scheme, the electronic device can filter the first audio signal through the adaptive filter to obtain the second audio signal, and can then carry out secondary filtering on the residual echo signal according to the estimated energy parameter of the residual echo signal in the second audio signal, so that the echo signal is further suppressed and the call quality of the electronic device is further improved.

Optionally, in an embodiment of the present invention, after the step 204, the method may further include the following step 205:

Step 205: The electronic device performs inverse time-frequency transformation on the target audio signal to generate a target time-domain audio signal.

For example, the electronic device may perform other operations in addition to inverse time-frequency transformation of the target audio signal, and all operations performed by the electronic device on the target audio signal may be collectively referred to as signal synthesis.

For example, the target time-domain audio signal may be denoted as s'(m). As shown in FIG. 7, the signal synthesis performed by the electronic device on the target audio signal S'(Ω) includes: performing inverse time-frequency transformation on the target audio signal S'(Ω), and then performing frame combination to finally obtain the target time-domain audio signal s'(m), i.e., synthesizing and outputting a complete speech signal.

Thus, after the electronic equipment filters the frequency domain audio signals, the frequency domain audio signals can be converted into time domain audio signals to be output, and then the output of the voice signals is completed.
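The signal synthesis step can be sketched as an inverse DFT per frame followed by overlap-add. This is a minimal sketch under assumptions: the function name `synthesize` is hypothetical, frame combination is assumed to be overlap-add, and the frames are assumed to have been produced with an analysis window whose overlapped copies sum to one (for example a periodic Hann window with 50% overlap), so that no synthesis window is needed.

```python
import cmath
import math

def synthesize(spectra, hop):
    """Inverse-DFT each frame S'(Ω) and overlap-add the frames into s'(m)."""
    frame_len = len(spectra[0])
    out = [0.0] * (hop * (len(spectra) - 1) + frame_len)
    for i, spectrum in enumerate(spectra):
        for n in range(frame_len):
            sample = sum(
                spectrum[k] * cmath.exp(2j * math.pi * k * n / frame_len)
                for k in range(frame_len)
            ) / frame_len
            out[i * hop + n] += sample.real   # the imaginary part is rounding noise
    return out
```

If the spectra are left unmodified, the samples covered by two overlapping frames reproduce the original time-domain signal; only the first and last half-frames are attenuated by the window.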

Optionally, in embodiments of the present invention, a masking effect exists in the human auditory system, i.e., the phenomenon in which a sound at one frequency prevents the auditory system from perceiving a sound at another frequency. Therefore, to ensure that the residual echo signal in the target audio signal does not affect the voice call quality of the electronic device, it is only necessary to ensure that the energy of the residual echo signal is less than the auditory masking threshold of the human ear.

The step 203 may specifically include the following steps 203a and 203b:

Step 203a: The electronic device calculates an auditory masking threshold corresponding to the second audio signal.

Illustratively, the electronic device may calculate the auditory masking threshold, which may be denoted R_T(Ω), from a psychoacoustic model. The psychoacoustic model is an abstract mathematical model, built on the study of the human auditory system, that reflects the perceptual characteristics of human hearing; it describes how the human auditory system perceives and masks speech and noise. According to the psychoacoustic model, the frequency band of the input signal is re-divided into critical bands (unit: Bark), and the auditory masking threshold of each critical band is then estimated. The noise is shaped so that the noise power in each critical band is smaller than the masking threshold of that band, so that the noise is masked by the speech signal and the perceived distortion is minimized.
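The first stage of such a model, re-dividing the spectrum into critical bands, can be sketched as follows. This is only an illustration under assumptions: the Hz-to-Bark mapping uses the widely cited Zwicker approximation, which the text does not specify, the function names `hz_to_bark` and `band_energies` are hypothetical, and the later stages of a masking threshold calculation (spreading across bands and the masking offset) are not shown.

```python
import math

def hz_to_bark(f_hz):
    """Approximate critical-band rate in Bark (Zwicker approximation, assumed)."""
    return 13.0 * math.atan(0.00076 * f_hz) + 3.5 * math.atan((f_hz / 7500.0) ** 2)

def band_energies(psd, sample_rate):
    """Sum PSD values into critical bands.

    psd holds the bins from 0 Hz up to sample_rate / 2 inclusive.
    Returns a dict mapping the integer Bark band index to its summed energy.
    """
    n_bins = len(psd)
    energies = {}
    for k, p in enumerate(psd):
        f = k * (sample_rate / 2.0) / (n_bins - 1)   # centre frequency of bin k
        band = int(hz_to_bark(f))                    # critical band index
        energies[band] = energies.get(band, 0.0) + p
    return energies
```

At a 16 kHz sampling rate this grouping yields bands 0 through 21, with many narrow bands at low frequencies and a few wide ones at high frequencies, which mirrors the frequency resolution of hearing.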

In one example, the electronic device may estimate the approximate speech signal by spectral subtraction and then calculate the auditory masking threshold based on the estimated approximate speech signal.

Step 203b: the electronic device calculates weighted filter parameters based on the energy parameter of the residual echo signal in the second audio signal and the auditory masking threshold.

Illustratively, there is no fixed order of execution between calculating the energy parameter of the residual echo signal in the second audio signal and calculating the auditory masking threshold: the auditory masking threshold may be calculated after, before, or at the same time as the energy parameter of the residual echo signal in the second audio signal, which is not limited by the embodiments of the present invention.

Noise suppression aims to remove as much noise as possible while keeping the distortion of the speech signal as small as possible, but the two goals cannot be fully satisfied at the same time: when the speech distortion is small, the residual noise after noise reduction is relatively large; conversely, when the residual noise after noise reduction is small, the speech distortion is relatively large. Therefore, in order to reduce the speech distortion as much as possible, the noise need not be completely suppressed; it is sufficient to ensure that the energy of the residual echo signal in the target audio signal is smaller than or equal to the auditory masking threshold, so that the residual echo signal in the target audio signal is not perceived by human ears.

In one example, according to the design principle of removing as much noise as possible on the premise of reducing speech distortion as much as possible, the weighted filtering parameter G (Ω) may be expressed as follows:

In this way, the electronic device can use the auditory masking threshold as a filtering reference, ensuring that the residual echo signal in the target audio signal is less than or equal to the auditory masking threshold. That is, the residual echo is not perceived by human ears, while the distortion of the speech signal in the target audio signal is kept to a minimum.
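The masking constraint described above can be illustrated with a minimal Python sketch. It assumes, as a simplification and not as the patent's expression for G(Ω), that a gain G(Ω) scales the residual echo PSD by G(Ω)^2, so the largest gain that keeps the filtered residual echo at or below the auditory masking threshold is min(1, sqrt(R_T(Ω)/R_b(Ω))). The function name and the per-bin list representation are hypothetical.

```python
import math
from typing import List, Sequence


def masking_limited_gain(r_b: Sequence[float], r_t: Sequence[float]) -> List[float]:
    """Per-bin gain that keeps the residual echo at or below the masking threshold.

    r_b: residual echo PSD per frequency bin (assumed non-negative).
    r_t: auditory masking threshold per frequency bin (assumed non-negative).
    Assumption: filtering with gain g scales the residual echo PSD by g**2.
    """
    gains = []
    for pb, pt in zip(r_b, r_t):
        if pb <= pt:
            # The residual echo is already masked, so no attenuation is needed.
            gains.append(1.0)
        else:
            # Solve g**2 * pb == pt for the largest admissible gain.
            gains.append(math.sqrt(pt / pb))
    return gains
```

With this choice, bins whose residual echo is already inaudible are left untouched, which is one way to avoid unnecessary speech distortion.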

Optionally, in an embodiment of the present invention, the second audio signal further includes a background noise signal. In addition to the second voice signal and the first echo signal fed back to the microphone by the speaker, the microphone may also receive a background noise signal (e.g., the voice of another user or music). Both the first echo signal and the background noise signal affect the call quality of the electronic device. Existing acoustic echo cancellation devices can only cancel the echo signal and cannot cancel the background noise signal, so the electronic device needs to cancel the background noise signal while canceling the echo signal.

In this case, step 203 may specifically include the following step 203c:

Step 203c: the electronic device calculates a weighted filtering parameter based on the energy parameter of the residual echo signal and the energy parameter of the background noise signal in the second audio signal.

The electronic device can estimate the PSD of the background noise signal, which can be expressed as R_n(Ω).

The background noise signal may be background music or speech of other users, which is not limited by the embodiment of the present invention.

It can be understood that the background noise signal estimation is an important part in the voice enhancement algorithm, and if the background noise signal estimation is too high, the weak voice signal will be filtered out, so that the voice signal distortion is large; if the background noise signal is estimated to be too low, too many background noise signals will remain in the voice signal, and the conversation quality is affected.

Illustratively, the background noise signal described above may include: stationary noise signals and non-stationary noise signals. The estimation method is different for different background noise signals.

Example 1: when the background noise signal is a stationary noise signal, the electronic device may obtain the PSD of the background noise signal by averaging the noise signal power spectrum over silence segments.
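The averaging in Example 1 can be sketched in Python as follows. The sketch assumes that the silence frames have already been identified and that each frame is given as a list of per-bin power values. The function name is hypothetical, and how silence is detected is outside the scope of this sketch.

```python
from typing import List, Sequence


def average_noise_psd(silence_frames: Sequence[Sequence[float]]) -> List[float]:
    """Estimate a stationary noise PSD by averaging silence-frame power spectra.

    silence_frames: power spectra (one list of per-bin powers per frame) taken
    from segments that are assumed to contain no speech.
    """
    if not silence_frames:
        raise ValueError("at least one silence frame is required")
    num_bins = len(silence_frames[0])
    totals = [0.0] * num_bins
    for frame in silence_frames:
        if len(frame) != num_bins:
            raise ValueError("all frames must have the same number of bins")
        for k, power in enumerate(frame):
            totals[k] += power
    # Divide the accumulated power in each bin by the number of frames.
    return [total / len(silence_frames) for total in totals]
```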

Example 2: when the background noise signal is a non-stationary noise signal, the background noise signal changes rapidly over time and therefore needs to be tracked and corrected continuously.

In particular, the PSD of a non-stationary noise signal may be estimated in at least two ways.

In a first possible implementation:

Illustratively, since the speech signal and the background noise signal are independent of each other, the power spectrum of the noisy speech signal is equal to the sum of the power spectral densities of the speech signal and the background noise signal. In the gaps between speech segments or syllables, the power spectrum of the noisy speech is typically equal to the power spectral density of the background noise signal. Thus, the electronic device may search for the minimum value within a noise estimation window and use it as an estimate of the background noise signal power spectral density. Because a minimum search biases the estimate low, an unbiased estimate of the background noise signal power spectral density can be obtained by multiplying the minimum by a bias factor derived from the statistics of local minima.
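The minimum search with bias compensation can be sketched in Python as follows. This is a minimal illustration: the window length and the bias factor are passed in as fixed, hypothetical parameters, whereas the text derives the bias factor from the statistics of local minima. The function name is also hypothetical.

```python
from typing import List, Sequence


def min_tracking_noise_psd(
    psd_frames: Sequence[Sequence[float]],
    window: int,
    bias: float,
) -> List[List[float]]:
    """Per-frame noise PSD estimate from a sliding-window minimum search.

    psd_frames: (smoothed) noisy-speech power spectra, one list per frame.
    window: number of most recent frames searched for the minimum (assumed).
    bias: constant compensating the low bias of the minimum (assumed value).
    """
    estimates = []
    for t, frame in enumerate(psd_frames):
        # Frames inside the noise estimation window ending at frame t.
        recent = psd_frames[max(0, t - window + 1): t + 1]
        estimates.append(
            [bias * min(f[k] for f in recent) for k in range(len(frame))]
        )
    return estimates
```

A longer window makes the estimate more robust during long speech segments but slower to follow rising noise, which is the usual trade-off in this kind of tracker.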

In a second possible implementation:

For example, the electronic device may estimate the background noise signal through two smoothing passes combined with minimum tracking. In the first pass, the electronic device makes a rough estimate of the speech in each frequency band. In the second pass, it eliminates the strong speech components through minimum tracking and smooths the background noise signal, thereby obtaining an estimate of the power spectral density of the background noise signal.
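A heavily simplified Python sketch of this two-pass idea is given below. The text does not specify the smoothing constant, the window length, or the rule used to detect strong speech, so `alpha`, `window` and `speech_ratio` are hypothetical parameters, and the detection rule (smoothed power exceeding a multiple of the tracked minimum) is an assumption made only for illustration. The first frame is also assumed to contain noise only.

```python
from typing import List, Sequence


def two_pass_noise_psd(
    psd_frames: Sequence[Sequence[float]],
    alpha: float = 0.5,
    window: int = 3,
    speech_ratio: float = 4.0,
) -> List[List[float]]:
    """Simplified two-pass noise PSD estimate (all constants are assumptions)."""
    if not psd_frames:
        return []

    # Pass 1: recursive smoothing gives a rough per-bin power track.
    smoothed: List[List[float]] = []
    prev = list(psd_frames[0])
    for frame in psd_frames:
        prev = [alpha * p + (1.0 - alpha) * x for p, x in zip(prev, frame)]
        smoothed.append(prev)

    # Pass 2: minimum tracking flags strong speech; flagged bins are excluded
    # from the noise update, and the remaining bins are smoothed again.
    noise = list(psd_frames[0])  # assumes the first frame is noise only
    estimates: List[List[float]] = []
    for t, frame in enumerate(psd_frames):
        recent = smoothed[max(0, t - window + 1): t + 1]
        updated = []
        for k, x in enumerate(frame):
            local_min = min(s[k] for s in recent)
            strong_speech = smoothed[t][k] > speech_ratio * local_min
            if strong_speech:
                updated.append(noise[k])  # hold the previous estimate
            else:
                updated.append(alpha * noise[k] + (1.0 - alpha) * x)
        noise = updated
        estimates.append(noise)
    return estimates
```

In this sketch a short, strong burst leaves the noise estimate unchanged, which is the behavior the second pass is meant to provide.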

Illustratively, after the electronic device calculates R_n(Ω), in order to cancel the residual echo signal and the background noise signal in the second audio signal simultaneously, the above-mentioned weighted filtering parameter G(Ω) may be expressed as:

where B is an empirical energy threshold.

In one example, the electronic device may calculate the weighted filter parameters based on the energy parameters of the residual echo signal and the energy parameters of the background noise signal in the second audio signal and the auditory masking threshold, based on the design principle of removing as much noise as possible while minimizing speech distortion.

Specifically, the weighted filtering parameter G (Ω) may be expressed as:

where α is an empirical factor, which takes a constant value.

By way of example, the time-domain signal e(m) comprises: a background noise signal n(m), a second speech signal s(m) and a first echo signal y(m). As shown in fig. 8, the electronic device performs signal analysis on the time-domain signal e(m) to obtain a second audio signal E(Ω), from which the electronic device may then estimate the auditory masking threshold, the background noise signal PSD and the residual echo signal PSD. Furthermore, the electronic device can calculate the weighted filter parameter G(Ω) based on the auditory masking threshold R_T(Ω), the background noise signal power spectral density R_n(Ω) and the residual echo signal power spectral density R_b(Ω). The electronic device may then filter the second audio signal E(Ω) according to the weighted filter parameter G(Ω) to obtain a target audio signal S'(Ω). Finally, the electronic device may perform signal synthesis on the target audio signal S'(Ω) to obtain a target time-domain audio signal s'(m).
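The residual echo PSD used in this flow is given in the claims as R_b(Ω) = R_e(Ω)·C_xe(Ω), where C_xe(Ω) is the coherence function between the second audio signal E(Ω) and the frequency-domain original speech signal X(Ω). A minimal Python sketch of that estimate follows. It assumes that the coherence takes the common magnitude-squared form |R_xe(Ω)|^2 / (R_x(Ω)·R_e(Ω)) and that the spectra are estimated by plain averaging over frames. Both choices, the function name, and the small constant `eps` are assumptions rather than details taken from the text.

```python
from typing import List, Sequence


def residual_echo_psd(
    x_frames: Sequence[Sequence[complex]],
    e_frames: Sequence[Sequence[complex]],
    eps: float = 1e-12,
) -> List[float]:
    """Estimate R_b = R_e * C_xe per frequency bin from frame spectra.

    x_frames: spectra X of the original (far-end) speech, one list per frame.
    e_frames: spectra E of the second audio signal, one list per frame.
    Assumption: C_xe is the magnitude-squared coherence |R_xe|^2 / (R_x * R_e).
    """
    num_frames = len(x_frames)
    num_bins = len(x_frames[0])
    r_b = []
    for k in range(num_bins):
        # Auto power spectra and cross power spectrum, averaged over frames.
        r_x = sum(abs(x[k]) ** 2 for x in x_frames) / num_frames
        r_e = sum(abs(e[k]) ** 2 for e in e_frames) / num_frames
        r_xe = sum(
            x[k] * complex(e[k]).conjugate() for x, e in zip(x_frames, e_frames)
        ) / num_frames
        # eps guards against division by zero in empty bins.
        c_xe = min(1.0, abs(r_xe) ** 2 / (r_x * r_e + eps))
        r_b.append(r_e * c_xe)
    return r_b
```

When E(Ω) is a scaled copy of X(Ω) the coherence approaches 1 and the whole of R_e(Ω) is attributed to residual echo. When the two are uncorrelated the coherence approaches 0 and so does R_b(Ω).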

The smaller R_b(Ω) and R_n(Ω) are, the less residual echo signal and background noise signal need to be filtered out, and the larger G(Ω) is. Conversely, the larger R_b(Ω) and R_n(Ω) are, the more residual echo signal and background noise signal need to be filtered out, and the smaller G(Ω) is, so that multiplying the second audio signal E(Ω) by G(Ω) filters out more of the residual echo signal and background noise signal. When G(Ω) is greater than 1, multiplying the second audio signal E(Ω) by G(Ω) amplifies the unwanted residual echo signal and background noise signal instead of filtering them, so the weighted filter parameter G(Ω) should be less than or equal to 1.
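The behavior described here (a gain that decreases as R_b(Ω) and R_n(Ω) grow, and that never exceeds 1) can be illustrated with a minimal Python sketch. The gain rule used below, 1 − (R_b + R_n)/R_e clamped to the range [floor, 1], is a generic spectral-subtraction-style stand-in chosen only for illustration. It is not the patent's expression for G(Ω), and the function names and the `floor` parameter are hypothetical.

```python
from typing import List, Sequence


def weighted_gain(
    r_e: Sequence[float],
    r_b: Sequence[float],
    r_n: Sequence[float],
    floor: float = 0.0,
) -> List[float]:
    """Illustrative per-bin gain, clamped so that floor <= G <= 1.

    r_e: PSD of the second audio signal; r_b: residual echo PSD;
    r_n: background noise PSD. The gain rule itself is an assumption.
    """
    gains = []
    for pe, pb, pn in zip(r_e, r_b, r_n):
        if pe <= 0.0:
            gains.append(floor)  # empty bin: nothing to preserve
            continue
        g = 1.0 - (pb + pn) / pe
        # Clamp: never amplify (G <= 1) and never go below the floor.
        gains.append(min(1.0, max(floor, g)))
    return gains


def apply_gain(spectrum: Sequence[complex], gains: Sequence[float]) -> List[complex]:
    """Filter E by multiplying each frequency bin by its gain G."""
    return [g * e for g, e in zip(gains, spectrum)]
```

For example, `weighted_gain([4.0], [1.0], [1.0])` yields a gain of 0.5, while a bin with no residual echo and no noise keeps a gain of 1.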

In this way, the electronic device may calculate the weighted filtering parameter according to the energy parameter of the residual echo signal and the energy parameter of the background noise signal in the second audio signal, and perform filtering processing on the second audio signal according to the weighted filtering parameter, so as to generate the target audio signal. Through the scheme, the electronic equipment can carry out secondary filtering on the residual echo signals in the second audio signals and also can carry out filtering on background noise, so that the electronic equipment can realize the simultaneous filtering of the residual echo signals and the background noise signals without independently filtering the background noise, and therefore, the communication quality of the electronic equipment can be improved, and the filtering process can be simplified.

The filtering method provided by the embodiment of the invention is illustrated by a spectrogram obtained through experimental detection.

Specifically, fig. 9 is a spectrogram of the first audio signal collected by the microphone. The part framed by box 1 (i.e. 31 in fig. 9) is the first echo signal, and the part framed by box 2 (i.e. 32 in fig. 9) is the second speech signal. It can be seen from fig. 9 that the first audio signal contains a large amount of the first echo signal. Fig. 10 is a time-frequency plot of the signal output after filtering by an adaptive filter with a length of 1024 points. It can be seen from fig. 10 that part of the residual echo still remains in the time-frequency signal, as shown by the parts framed by boxes 3 to 5 (i.e. 41 in fig. 10), and that the adaptive filter does not process the background noise. Fig. 11 is a time-frequency plot of the signal after the weighted filtering. It can be seen from fig. 11 that the residual echo signal in the output target time-domain audio signal is effectively suppressed.

Fig. 12 is a schematic diagram of a possible structure of a filtering apparatus according to an embodiment of the present invention. As shown in fig. 12, a filtering apparatus 600 includes: an acquisition module 601, a generation module 602, and a calculation module 603, wherein: the acquisition module 601 is configured to acquire an estimated echo signal corresponding to a first voice signal output by a speaker of the electronic device; the generation module 602 is configured to subtract the estimated echo signal acquired by the acquisition module 601 from a first audio signal received by a microphone of the electronic device, to generate a second audio signal, where the first audio signal includes a second speech signal and a first echo signal fed back to the microphone by the speaker; the calculation module 603 is configured to calculate a weighted filtering parameter according to an energy parameter of a residual echo signal in the second audio signal generated by the generation module 602, where the residual echo signal is the echo signal remaining in the second audio signal; the generation module 602 is further configured to perform filtering processing on the second audio signal according to the weighted filtering parameter calculated by the calculation module 603, and generate a target audio signal.

Optionally, the calculating module 603 is specifically configured to calculate an auditory masking threshold corresponding to the second audio signal generated by the generating module 602; and calculating a weighted filtering parameter according to the energy parameter of the residual echo signal in the second audio signal and the auditory masking threshold.

Optionally, the calculating module 603 is specifically configured to calculate the weighted filtering parameter according to the energy parameter of the residual echo signal and the energy parameter of the background noise signal in the second audio signal, where the second audio signal generated by the generating module 602 further includes the background noise signal.

Optionally, the acquiring module 601 is specifically configured to input an original voice signal corresponding to the first voice signal output by the speaker of the electronic device into the adaptive filter to perform echo estimation, so as to obtain an estimated echo signal corresponding to the original voice signal.

According to the filtering device provided by the embodiment of the invention, after the estimated echo signal corresponding to the first voice signal output by the loudspeaker of the electronic equipment is obtained, the estimated echo signal can be subtracted from the first audio signal (comprising the second voice signal and the first echo signal fed back to the microphone by the loudspeaker) received by the microphone of the electronic equipment to generate the second audio signal. However, the second audio signal still includes a residual echo signal, which continues to affect the voice call quality. Therefore, the filtering device can calculate the weighted filtering parameter according to the energy parameter of the residual echo signal remaining in the second audio signal, and perform filtering processing on the second audio signal according to the weighted filtering parameter to generate the target audio signal. Through the scheme, the filtering device can filter the first audio signal through the adaptive filter to obtain the second audio signal, and can then perform secondary filtering on the residual echo signal according to the estimated energy parameter of the residual echo signal in the second audio signal, so that the echo signal is further suppressed and the call quality of the electronic equipment is further improved.

The filtering device provided by the embodiment of the invention can realize each process in the embodiment of the method, and in order to avoid repetition, the description is omitted here.

Fig. 13 is a schematic hardware structure of an electronic device implementing various embodiments of the present application, where the electronic device 100 includes, but is not limited to: radio frequency unit 101, network module 102, audio output unit 103, input unit 104, sensor 105, display unit 106, user input unit 107, interface unit 108, memory 109, processor 110, and power supply 111. It will be appreciated by those skilled in the art that the structure of the electronic device 100 shown in fig. 13 does not constitute a limitation of the electronic device, and the electronic device 100 may include more or less components than illustrated, or may combine certain components, or may have a different arrangement of components. In an embodiment of the present invention, the electronic device 100 includes, but is not limited to, a mobile phone, a tablet computer, a notebook computer, a palm computer, a vehicle-mounted terminal device, a wearable device, a pedometer, and the like.

The processor 110 is configured to: obtain an estimated echo signal corresponding to a first voice signal output by a speaker of the electronic device; subtract the estimated echo signal from a first audio signal received by a microphone of the electronic device to generate a second audio signal, the first audio signal including a second speech signal and a first echo signal fed back to the microphone by the speaker; calculate a weighted filtering parameter according to an energy parameter of a residual echo signal in the second audio signal, wherein the residual echo signal is the echo signal remaining in the second audio signal; and perform filtering processing on the second audio signal according to the weighted filtering parameter to generate a target audio signal.

Optionally, the processor 110 is specifically configured to calculate an auditory masking threshold corresponding to the second audio signal; and calculating a weighted filtering parameter according to the energy parameter of the residual echo signal in the second audio signal and the auditory masking threshold.

Optionally, the processor 110 is specifically configured to calculate the weighted filtering parameter according to an energy parameter of the residual echo signal and an energy parameter of the background noise signal in the second audio signal, in case the second audio signal further comprises the background noise signal.

Optionally, the processor 110 is specifically configured to input an original voice signal corresponding to the first voice signal output by the speaker of the electronic device into the adaptive filter to perform echo estimation, so as to obtain an estimated echo signal corresponding to the original voice signal.

According to the electronic device provided by the embodiment of the invention, after the electronic device obtains the estimated echo signal corresponding to the first voice signal output by the loudspeaker of the electronic device, the estimated echo signal can be subtracted from the first audio signal (including the second voice signal and the first echo signal fed back to the microphone by the loudspeaker) received by the microphone of the electronic device, so as to generate the second audio signal. However, the second audio signal still includes a residual echo signal, which continues to affect the voice call quality. Therefore, the electronic device may calculate the weighted filtering parameter according to the energy parameter of the residual echo signal remaining in the second audio signal, and perform filtering processing on the second audio signal according to the weighted filtering parameter, so as to generate the target audio signal. Through the scheme, the electronic device can filter the first audio signal through the adaptive filter to obtain the second audio signal, and can then perform secondary filtering on the residual echo signal according to the estimated energy parameter of the residual echo signal in the second audio signal, so that the echo signal is further suppressed and the call quality of the electronic device is further improved.

It should be understood that, in the embodiment of the present invention, the radio frequency unit 101 may be configured to receive and send signals while information is being sent or received or during a call. Specifically, it receives downlink data from a base station and passes the data to the processor 110 for processing, and it transmits uplink data to the base station. Typically, the radio frequency unit 101 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier, a duplexer, and the like. In addition, the radio frequency unit 101 may also communicate with networks and other devices through a wireless communication system.

The electronic device 100 provides wireless broadband internet access to users, such as helping users send and receive e-mail, browse web pages, access streaming media, etc., through the network module 102.

The audio output unit 103 may convert audio data received by the radio frequency unit 101 or the network module 102 or stored in the memory 109 into an audio signal and output as sound. Also, the audio output unit 103 may also provide audio output (e.g., a call signal reception sound, a message reception sound, etc.) related to a specific function performed by the electronic device 100. The audio output unit 103 includes a speaker, a buzzer, a receiver, and the like.

The input unit 104 is used for receiving an audio or video signal. The input unit 104 may include a graphics processor (Graphics Processing Unit, GPU) 1041 and a microphone 1042, the graphics processor 1041 processing image data of still pictures or video obtained by an image capturing device (e.g., a camera) in a video capturing mode or an image capturing mode. The processed image frames may be displayed on the display unit 106. The image frames processed by the graphics processor 1041 may be stored in the memory 109 (or other storage medium) or transmitted via the radio frequency unit 101 or the network module 102. Microphone 1042 may receive sound and be capable of processing such sound into audio data. The processed audio data may be converted into a format output that can be transmitted to the mobile communication base station via the radio frequency unit 101 in the case of a telephone call mode.

The electronic device 100 also includes at least one sensor 105, such as a light sensor, a motion sensor, and other sensors. Specifically, the light sensor includes an ambient light sensor and a proximity sensor, wherein the ambient light sensor can adjust the brightness of the display panel 1061 according to the brightness of ambient light, and the proximity sensor can turn off the display panel 1061 and/or the backlight when the electronic device 100 moves to the ear. As one of the motion sensors, the accelerometer sensor can detect the acceleration in all directions (generally three axes), and can detect the gravity and direction when stationary, and can be used for recognizing the gesture of the electronic equipment (such as horizontal and vertical screen switching, related games, magnetometer gesture calibration), vibration recognition related functions (such as pedometer and knocking), and the like; the sensor 105 may further include a fingerprint sensor, a pressure sensor, an iris sensor, a molecular sensor, a gyroscope, a barometer, a hygrometer, a thermometer, an infrared sensor, etc., which are not described herein.

The display unit 106 is used to display information input by a user or information provided to the user. The display unit 106 may include a display panel 1061, and the display panel 1061 may be configured in the form of a liquid crystal display (Liquid Crystal Display, LCD), an Organic Light-Emitting Diode (OLED), or the like.

The user input unit 107 may be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the electronic device 100. Specifically, the user input unit 107 includes a touch panel 1071 and other input devices 1072. The touch panel 1071, also referred to as a touch screen, may collect touch operations thereon or thereabout by a user (e.g., operations of the user on the touch panel 1071 or thereabout using any suitable object or accessory such as a finger, stylus, etc.). The touch panel 1071 may include two parts of a touch detection device and a touch controller. The touch detection device detects the touch azimuth of a user, detects a signal brought by touch operation and transmits the signal to the touch controller; the touch controller receives touch information from the touch detection device, converts the touch information into touch point coordinates, and sends the touch point coordinates to the processor 110, and receives and executes commands sent by the processor 110. Further, the touch panel 1071 may be implemented in various types such as resistive, capacitive, infrared, and surface acoustic wave. The user input unit 107 may include other input devices 1072 in addition to the touch panel 1071. In particular, other input devices 1072 may include, but are not limited to, a physical keyboard, function keys (e.g., volume control keys, switch keys, etc.), a trackball, a mouse, and a joystick, which are not described in detail herein.

Further, the touch panel 1071 may be overlaid on the display panel 1061, and when the touch panel 1071 detects a touch operation thereon or nearby, the touch operation is transmitted to the processor 110 to determine the type of touch event, and then the processor 110 provides a corresponding visual output on the display panel 1061 according to the type of touch event. Although in fig. 13, the touch panel 1071 and the display panel 1061 are two independent components for implementing the input and output functions of the electronic device 100, in some embodiments, the touch panel 1071 may be integrated with the display panel 1061 to implement the input and output functions of the electronic device 100, which is not limited herein.

The interface unit 108 is an interface to which an external device is connected to the electronic apparatus 100. For example, the external devices may include a wired or wireless headset port, an external power (or battery charger) port, a wired or wireless data port, a memory card port, a port for connecting a device having an identification module, an audio input/output (I/O) port, a video I/O port, an earphone port, and the like. The interface unit 108 may be used to receive input (e.g., data information, power, etc.) from an external device and transmit the received input to one or more elements within the electronic apparatus 100 or may be used to transmit data between the electronic apparatus 100 and an external device.

Memory 109 may be used to store software programs as well as various data. The memory 109 may mainly include a storage program area and a storage data area. The storage program area may store an operating system and application programs required for at least one function (such as a sound playing function, an image playing function, etc.); the storage data area may store data (such as audio data, a phonebook, etc.) created according to the use of the electronic device. In addition, memory 109 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device.

The processor 110 is a control center of the electronic device 100, connects various parts of the entire electronic device 100 using various interfaces and lines, and performs various functions of the electronic device 100 and processes data by running or executing software programs and/or modules stored in the memory 109, and calling data stored in the memory 109, thereby performing overall monitoring of the electronic device 100. Processor 110 may include one or more processing units; alternatively, the processor 110 may integrate an application processor that primarily handles operating systems, user interfaces, applications, etc., with a modem processor that primarily handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 110.

The electronic device 100 may also include a power supply 111 (e.g., a battery) for powering the various components, and optionally the power supply 111 may be logically connected to the processor 110 via a power management system that performs functions such as managing charging, discharging, and power consumption.

In addition, the electronic device 100 includes some functional modules, which are not shown, and will not be described herein.

Optionally, the embodiment of the present invention further provides an electronic device, including a processor, a memory, and a computer program stored in the memory and capable of running on the processor 110, where the computer program when executed by the processor implements each process of the foregoing filtering method embodiment, and the process can achieve the same technical effect, so that repetition is avoided, and details are not repeated herein.

The embodiment of the invention also provides a computer-readable storage medium on which a computer program is stored. When executed by a processor, the computer program implements the processes of the filtering method embodiment described above and can achieve the same technical effects, so that, to avoid repetition, no further description is given here. The computer-readable storage medium may be, for example, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk), comprising several instructions for causing an electronic device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to perform the method described in the embodiments of the present application.

The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the above-described embodiments, which are merely illustrative and not restrictive, and many forms may be made by those of ordinary skill in the art without departing from the spirit of the present application and the scope of the claims, which are also within the protection of the present application.

Claims

1. A filtering method applied to an electronic device, the method comprising:

acquiring an estimated echo signal corresponding to a first voice signal output by a loudspeaker of the electronic equipment;

subtracting the estimated echo signal from a first audio signal received by a microphone of the electronic device to generate a second audio signal, wherein the first audio signal comprises a second voice signal and a first echo signal fed back to the microphone by the loudspeaker;

estimating a power spectral density R_b(Ω) of a residual echo signal in the second audio signal according to a coherence function C_xe(Ω) between the second audio signal and the frequency domain original speech signal; wherein the formula of C_xe(Ω) is:

R_xe(Ω) represents a cross-correlation power spectrum of the second audio signal E(Ω) and the frequency domain original speech signal X(Ω), R_x(Ω) represents the power spectral density of the frequency domain original speech signal X(Ω), and R_e(Ω) represents the power spectral density of the second audio signal E(Ω); the formula of R_b(Ω) is: R_b(Ω) = R_e(Ω)C_xe(Ω); the frequency domain original voice signal is obtained by transforming an original voice signal corresponding to the first voice signal into a frequency domain; the coherence function between the second audio signal and the frequency domain original voice signal is used for representing the linear correlation degree between each frequency component of the second audio signal and the frequency domain original voice signal;

calculating a weighted filtering parameter according to the power spectral density of a residual echo signal in the second audio signal, wherein the residual echo signal is the residual echo signal in the second audio signal;

and carrying out filtering processing on the second audio signal according to the weighted filtering parameters to generate a target audio signal.

2. The method of claim 1, wherein calculating weighted filter parameters from the power spectral density of the residual echo signal in the second audio signal comprises:

calculating an auditory masking threshold corresponding to the second audio signal;

and calculating a weighted filtering parameter according to the power spectral density of the residual echo signal in the second audio signal and the auditory masking threshold.

3. The method according to claim 1 or 2, wherein the second audio signal further comprises: a background noise signal; said calculating weighted filter parameters from the power spectral density of the residual echo signal in said second audio signal, comprising:

and calculating a weighted filtering parameter according to the power spectral density of the residual echo signal in the second audio signal and the power spectral density of the background noise signal.

4. The method of claim 1, wherein the obtaining an estimated echo signal corresponding to a first speech signal output by a speaker of the electronic device comprises:

and inputting an original voice signal corresponding to the first voice signal output by a loudspeaker of the electronic equipment into an adaptive filter for echo estimation to obtain an estimated echo signal corresponding to the original voice signal.

5. A filtering apparatus, the apparatus comprising: the device comprises an acquisition module, a generation module and a calculation module;

the acquisition module is used for acquiring an estimated echo signal corresponding to a first voice signal output by a loudspeaker of the electronic equipment;

the generation module is configured to subtract the estimated echo signal acquired by the acquisition module from a first audio signal received by a microphone of the electronic device to generate a second audio signal, wherein the first audio signal includes a second speech signal and a first echo signal fed back to the microphone by the speaker;

the calculation module is configured to estimate, according to a coherence function $C_{xe}(\Omega)$ between the second audio signal and the frequency domain original voice signal, a power spectral density $R_b(\Omega)$ of a residual echo signal in the second audio signal; wherein the formula for $C_{xe}(\Omega)$ is:

the calculation module is further configured to calculate a weighted filtering parameter according to the power spectral density of the residual echo signal in the second audio signal generated by the generation module, wherein the residual echo signal is the echo signal remaining in the second audio signal;

the generation module is further configured to perform filtering processing on the second audio signal according to the weighted filtering parameter calculated by the calculation module, so as to generate a target audio signal.

6. The apparatus according to claim 5, wherein the calculation module is specifically configured to calculate an auditory masking threshold corresponding to the second audio signal generated by the generation module, and to calculate the weighted filtering parameter according to the power spectral density of the residual echo signal in the second audio signal and the auditory masking threshold.

7. The apparatus according to claim 5 or 6, wherein the calculation module is specifically configured to, in a case where the second audio signal generated by the generation module further comprises a background noise signal, calculate the weighted filtering parameter according to the power spectral density of the residual echo signal in the second audio signal and the power spectral density of the background noise signal.

8. The apparatus of claim 5, wherein the acquisition module is specifically configured to input an original voice signal corresponding to the first voice signal output by the speaker of the electronic device into an adaptive filter for echo estimation, to obtain an estimated echo signal corresponding to the original voice signal.

9. An electronic device comprising a processor, a memory and a computer program stored on the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the steps of the filtering method according to any one of claims 1 to 4.

10. A computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the filtering method according to any one of claims 1 to 4.
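
For illustration only, the residual echo estimate recited in claim 1, $R_b(\Omega) = R_e(\Omega)\,C_{xe}(\Omega)$, can be sketched as follows. The sketch assumes that the coherence function is the standard magnitude-squared coherence $C_{xe}(\Omega) = |R_{xe}(\Omega)|^2 / (R_x(\Omega) R_e(\Omega))$, which is not reproduced in the claim text above. The frame-averaged spectral estimates, the function names, the frame sizes, and the Wiener-style gain with a floor are likewise hypothetical choices made for the example; they are not the weighted filtering parameter defined by the claims.

```python
import numpy as np


def averaged_spectra(x, e, frame_len=256, hop=128):
    """Frame-averaged estimates of R_x, R_e (PSDs) and R_xe (cross spectrum).

    x: original (far-end) voice signal, e: second audio signal (after
    subtracting the estimated echo). Frame sizes are illustrative assumptions.
    """
    window = np.hanning(frame_len)
    n_bins = frame_len // 2 + 1
    r_x = np.zeros(n_bins)
    r_e = np.zeros(n_bins)
    r_xe = np.zeros(n_bins, dtype=complex)
    n_frames = 0
    for start in range(0, min(len(x), len(e)) - frame_len + 1, hop):
        X = np.fft.rfft(window * x[start:start + frame_len])
        E = np.fft.rfft(window * e[start:start + frame_len])
        r_x += np.abs(X) ** 2
        r_e += np.abs(E) ** 2
        r_xe += np.conj(X) * E
        n_frames += 1
    if n_frames == 0:
        raise ValueError("signals are shorter than one frame")
    return r_x / n_frames, r_e / n_frames, r_xe / n_frames


def residual_echo_psd(x, e, eps=1e-12):
    """Return (C_xe, R_e, R_b) with R_b = R_e * C_xe, as in claim 1.

    C_xe is computed as the magnitude-squared coherence (an assumption).
    """
    r_x, r_e, r_xe = averaged_spectra(x, e)
    coherence = np.abs(r_xe) ** 2 / (r_x * r_e + eps)
    coherence = np.clip(coherence, 0.0, 1.0)  # guard against rounding
    r_b = r_e * coherence
    return coherence, r_e, r_b


def suppression_gain(r_e, r_b, floor=0.1, eps=1e-12):
    """Hypothetical Wiener-style gain: attenuate bins dominated by residual echo."""
    return np.maximum(1.0 - r_b / (r_e + eps), floor)
```

Under these assumptions, a second audio signal that is a linearly filtered copy of the original voice signal yields a coherence near one in every bin, so nearly all of its power is attributed to residual echo. A signal that is independent of the original voice signal yields a coherence near zero and is passed almost unattenuated.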