CN111524532A - Echo suppression method, device, equipment and storage medium

Info

Publication number: CN111524532A
Application number: CN202010354137.2A
Authority: CN
Inventors: 罗本彪; 潘思伟; 董斐; 雍雅琴; 林福辉
Original assignee: Spreadtrum Communications Shanghai Co Ltd
Current assignee: Spreadtrum Communications Shanghai Co Ltd
Priority date: 2020-04-29
Filing date: 2020-04-29
Publication date: 2020-08-11
Anticipated expiration: 2040-04-29
Also published as: CN111524532B

Abstract

The application provides an echo suppression method, device, equipment and storage medium, wherein the method comprises the following steps: acquiring a voice signal to be processed from a microphone of terminal equipment; determining the echo signal in the medium-high frequency signal of the voice signal to be processed according to characteristic values of the medium-low frequency signal in the voice signal to be processed; and suppressing the echo signal and then sending the processed voice signal to the far-end equipment. The method can eliminate echo while avoiding influence on near-end voice, thereby improving the quality of voice communication.

Description

Echo suppression method, device, equipment and storage medium

Technical Field

The present application relates to audio technologies, and in particular, to an echo suppression method, apparatus, device, and storage medium.

Background

In an audio system for voice communication, due to the coupling of the speaker and the microphone, the voice of the far-end speaker of the voice communication is played through the terminal speaker, and may be collected by the microphone together with the voice of the user himself, i.e., the near-end speaker. The voice of the far-end speaker collected by the microphone is called echo, the voice of the near-end speaker collected by the microphone is called near-end voice, and in order to ensure the voice quality, the echo needs to be eliminated as much as possible.

At present, the common echo cancellation method includes two parts, namely adaptive filtering and nonlinear echo processing, wherein part of the linear echo is removed by the adaptive filtering, and the remaining nonlinear echo is suppressed by a nonlinear processing module. However, if the nonlinear echo remaining after adaptive filtering is to be completely suppressed during nonlinear processing, the near-end speech is also likely to suffer large losses. For example, in a period or a frequency band where the nonlinear echo is large, the signal is erased or replaced by an average value. Such methods can effectively suppress the nonlinear echo, but the near-end speech loss is severe, which affects the communication experience.

Therefore, there is a need for an echo suppression method that can eliminate echo as much as possible and preserve near-end speech as much as possible.

Disclosure of Invention

The application provides an echo suppression method, device, equipment and storage medium, so as to realize that near-end voice is kept as much as possible while echo is eliminated as much as possible.

In a first aspect, the present application provides an echo suppression method, including:

acquiring a voice signal to be processed from a microphone of terminal equipment;

determining an echo signal in the voice signal to be processed according to a characteristic value of a low-frequency signal in the voice signal to be processed, wherein the frequency of the echo signal is greater than that of the low-frequency signal;

and after the echo signal in the voice signal to be processed is subjected to suppression processing, sending the processed voice signal to the far-end equipment.

Optionally, the characteristic values of the low-frequency signal include: a peak characteristic value of the low-frequency signal;

determining an echo signal in the voice signal to be processed according to the characteristic value of the low-frequency signal in the voice signal to be processed, including:

and determining the frequency point with the amplitude value higher than the peak characteristic value in the voice signal to be processed as the frequency point of the echo signal.

Optionally, the method further includes:

determining a low-frequency peak value of the voice signal to be processed according to the amplitude value of the low-frequency signal of the voice signal to be processed;

determining a first low-frequency characteristic vector according to the low-frequency peak value and a preset gain vector; and elements in the first low-frequency characteristic vector are peak characteristic values corresponding to each frequency point in the voice signal to be processed.

Optionally, determining the frequency point with the amplitude higher than the peak characteristic value in the voice signal to be processed as the frequency point of the echo signal includes:

if the amplitude of a first frequency point in the voice signal to be processed is larger than a peak value characteristic value corresponding to the first frequency point in the first low-frequency characteristic vector, determining the first frequency point as a first echo frequency point of the echo signal; the frequency of the first frequency point is greater than the frequency of the low-frequency signal.

Optionally, the suppressing the echo signal in the to-be-processed speech signal includes:

and determining the amplitude after the first echo frequency point is suppressed according to the characteristic value corresponding to the first echo frequency point in the first low-frequency characteristic vector.

Optionally, the characteristic value of the low-frequency signal further includes: a mean characteristic value of the low frequency signal;

the determining the echo signal in the voice signal to be processed according to the characteristic value of the low-frequency signal in the voice signal to be processed further comprises:

determining a target frequency point of the voice signal to be processed, wherein the amplitude value of the target frequency point is higher than the average characteristic value and lower than the peak characteristic value;

and determining the frequency point of the echo signal in the target frequency point according to the correlation coefficient between the target frequency point and the corresponding frequency point in the far-end voice signal.

Optionally, the method further includes:

determining a low-frequency mean value of the voice signal to be processed according to the amplitude of the low-frequency signal of the voice signal to be processed;

determining a second low-frequency feature vector according to the low-frequency mean value and a preset gain vector; and elements in the second low-frequency feature vector are mean feature values corresponding to each frequency point in the voice signal to be processed.

Optionally, the determining a target frequency point in the to-be-processed speech signal, where the amplitude is higher than the mean characteristic value and lower than the peak characteristic value, includes:

and if the amplitude of a second frequency point in the voice signal to be processed is smaller than the peak characteristic value corresponding to the second frequency point in the first low-frequency characteristic vector and is larger than the mean characteristic value corresponding to the second frequency point in the second low-frequency characteristic vector, determining the second frequency point as a target frequency point, wherein the frequency of the second frequency point is larger than the frequency of the low-frequency signal.

Optionally, the determining the frequency point of the echo signal in the target frequency point according to the correlation coefficient between the target frequency point and the corresponding frequency point in the far-end voice signal includes:

and determining a frequency point, of the target frequency points, with a correlation coefficient with a corresponding frequency point in the far-end voice signal larger than a threshold value, as a second echo frequency point of the echo signal.

Optionally, the suppressing the echo signal in the to-be-processed speech signal further includes:

and determining the amplitude after the suppression of the second echo frequency point according to the amplitude of the second echo frequency point and the correlation coefficient of the second echo frequency point and the corresponding frequency point in the far-end voice signal.

Optionally, the method further includes:

and determining the correlation coefficient of each frequency point in the voice signal to be processed and the corresponding frequency point in the far-end voice signal according to the amplitude of each frequency point in the voice signal to be processed and the amplitude of the corresponding frequency point in the far-end voice signal.

Optionally, the acquiring, from the terminal device microphone, a voice signal to be processed includes:

collecting a speech signal from the microphone;

and carrying out self-adaptive processing on the voice signal to obtain the voice signal to be processed.

In a second aspect, the present application provides an echo suppression device, comprising:

the acquisition module is used for acquiring a voice signal to be processed from a terminal equipment microphone;

the determining module is used for determining an echo signal in the voice signal to be processed according to a characteristic value of a low-frequency signal in the voice signal to be processed, wherein the frequency of the echo signal is greater than that of the low-frequency signal;

and the suppression module is used for suppressing the echo signal in the voice signal to be processed and then sending the processed voice signal to the far-end equipment.

the determination module is to:

Optionally, the determining module is further configured to:

Optionally, the suppression module is configured to:

the determination module is further to:

Optionally, the determining module is further configured to:

Optionally, the determining module is configured to:

Optionally, the suppression module is further configured to:

Optionally, the determining module is further configured to:

Optionally, the obtaining module is configured to:

collecting a speech signal from the microphone;

In a third aspect, the present application provides an electronic device, comprising: a memory, a processor, a speaker, and a microphone;

the memory is connected with the processor;

the memory is used for storing a computer program;

the processor is configured to implement the method according to any of the first aspect as described above when the computer program is executed.

In a fourth aspect, the present application provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of any of the first aspects above.

The application provides an echo suppression method, a device, equipment and a storage medium, wherein in the method, after a voice signal to be processed is acquired from a microphone of terminal equipment, an echo signal in the medium-high frequency signal of the voice signal to be processed is determined according to a characteristic value of a low-frequency signal in the voice signal to be processed; the echo signal is then suppressed, and the processed voice signal is sent to the far-end equipment. Because the echoes of the loudspeaker are mainly concentrated at medium-high frequencies, the echo signals in the medium-high frequencies of the voice signal to be processed can be screened out in a targeted manner with the low-frequency signal as a reference, so that the influence on near-end voice can be avoided while the echoes are eliminated, and the quality of voice communication is improved.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to these drawings without inventive exercise.

FIG. 1 is a schematic diagram of the transmission of sound signals for a speaker and microphone;

FIG. 2 is a schematic flow chart of an echo suppression method according to the present application;

FIG. 3 is a schematic flow chart of determining an echo signal according to the present application;

FIG. 4 is a schematic diagram of echo return loss enhancement;

FIG. 5 is a schematic structural diagram of an echo suppression device provided in the present application;

FIG. 6 is a schematic structural diagram of an electronic device provided in the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

In an audio system for voice communication, due to the coupling of the speaker and the microphone, sound played by the speaker at the receiving end is picked up by the microphone on the device and transmitted back to the far-end talker, forming an acoustic echo. The acoustic echo comprises direct echo and indirect echo, wherein the direct echo is sound that is played by the loudspeaker and travels directly to the microphone; the indirect echo is the set of echoes of sound played by the speaker that enter the microphone after being reflected one or more times along different paths. The signal received by the microphone may fall into three stages: a pure echo stage, a pure voice stage and a double-talk stage, wherein in the pure echo stage the microphone only receives the echo of the loudspeaker, in the pure voice stage the microphone only receives the near-end voice of the near-end speaker, and in the double-talk stage the microphone receives both echo and near-end voice.

The acoustic echo is delayed by the channel and then transmitted back to the far-end speaker, so that the far-end speaker hears his or her own voice, which affects the quality of voice communication. Particularly, in the case of hands-free conversation, echo with excessive energy can interfere with the far-end talker's semantic understanding and greatly affect the communication experience. With the advance of communication technology, the quality requirements for voice communication keep increasing, and Acoustic Echo Cancellation (AEC) of mobile communication terminals has also become an important concern for voice communication quality.

In order to remove the influence of echo on communication, an Adaptive Digital Filter (ADF) and a Nonlinear Processing (NLP) method are commonly used to implement acoustic echo cancellation. Illustratively, as shown in FIG. 1, a far-end reference signal, i.e., a far-end speech signal 101, is played by a speaker 102 to obtain an amplified signal 103, and the amplified signal 103 generates an echo signal 105 via an acoustic echo path 104. The near-end speech signal 106, i.e. the speech signal of the near-end user speaking, and the echo signal 105 are jointly picked up by the microphone 107, forming a mixed signal 108. By simulating the acoustic echo path 104 using the adaptive filter 109, the adaptively simulated echo signal 110 may be made to approximate the true echo signal 105, and then the simulated echo signal 110 is subtracted from the mixed signal 108, resulting in a residual echo signal 111. The adaptive filter 109 may use any one of a variety of adaptive filtering algorithms, including but not limited to the Least Mean Square (LMS) algorithm, Normalized Least Mean Square (NLMS) algorithm, Affine Projection (AP) algorithm, Fast Affine Projection (FAP) algorithm, Least Squares (LS) algorithm, Recursive Least Squares (RLS) algorithm, and the like.
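As a minimal sketch of the adaptive filtering stage described above, the following Python function applies the NLMS algorithm, one of the listed options, to a far-end signal and a microphone signal and returns the residual signal. The function name, the number of taps, the step size `mu` and the regularization constant `eps` are illustrative assumptions and are not specified in the present application.

```python
def nlms_echo_cancel(far_end, mic, taps=4, mu=0.5, eps=1e-8):
    """Sketch of an NLMS echo canceller (parameter values are assumptions).

    far_end: far-end reference samples (signal 101)
    mic:     microphone samples (mixed signal 108)
    Returns (residual, weights), where residual corresponds to signal 111.
    """
    weights = [0.0] * taps      # estimate of the acoustic echo path 104
    history = [0.0] * taps      # most recent far-end samples, newest first
    residual = []
    for x, d in zip(far_end, mic):
        history = [x] + history[:-1]
        # Simulated echo (signal 110) from the current path estimate.
        y = sum(w * h for w, h in zip(weights, history))
        e = d - y               # residual echo plus near-end speech
        # Normalized update: step size scaled by the input energy.
        norm = sum(h * h for h in history) + eps
        weights = [w + mu * e * h / norm for w, h in zip(weights, history)]
        residual.append(e)
    return residual, weights
```

With a purely linear echo path and no near-end speech, the residual of this sketch decays toward zero; a nonlinear loudspeaker leaves a residual that such a linear filter cannot remove, which is the case the nonlinear processing stage addresses.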

However, the optimal adaptive filter can only eliminate a part of linear echo, and cannot eliminate nonlinear echo. Due to the miniaturization of the handheld device, the adopted micro-speaker is much smaller than a speaker with a conventional size, and in order to meet the requirement of the volume of the hands-free communication, the micro-speaker often works in a nonlinear area, so that the audio distortion is serious. In addition, the material and structural characteristics of the handheld device may also create non-linear transmission paths. Thus, it is far from sufficient to rely on adaptive filters to cancel the echo. That is to say, the residual echo signal 111 still contains a large amount of nonlinear echoes, and in order to improve the performance of echo cancellation, the nonlinear processing module 112 is further required to process the residual echo signal 111 to obtain the transmission signal 113.

However, if the nonlinear echo remaining after adaptive filtering is to be completely suppressed during nonlinear processing, the near-end speech is also likely to suffer large losses. For example, in a period or a frequency band where the nonlinear echo is large, the signal is erased or replaced by an average value. Such methods can effectively suppress the nonlinear echo, but the near-end speech loss is severe, which affects the communication experience.

In order to eliminate echo as much as possible and to retain the near-end speech signal as much as possible, the present application proposes an echo suppression method, which performs nonlinear echo suppression according to the general nonlinear characteristics and frequency characteristics of the speaker. The inherent non-linear characteristics of the loudspeaker cause the loudspeaker to produce additional frequency components that are not present in the excitation signal, these components typically being integer multiples of the fundamental of the input signal. The frequency characteristic of the loudspeaker is the rule that the output sound pressure changes along with the frequency change of the input signal, and reflects the radiation capability of the loudspeaker to sound waves with different frequencies. For micro-speakers, the radiation capability of low frequencies is weak, for example, in the frequency range below 400Hz, the frequency response curve has a significant downward trend.
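As an illustrative worked example of how a nonlinearity produces components at integer multiples of the fundamental, suppose the loudspeaker is modeled as a memoryless cubic nonlinearity. This model and the coefficient $a$ are assumptions made only for illustration and do not come from the present application. For a single-tone input $x(t) = \cos\omega t$:

$$
y(t) = x(t) + a\,x^{3}(t) = \cos\omega t + a\cos^{3}\omega t
     = \left(1 + \tfrac{3a}{4}\right)\cos\omega t + \tfrac{a}{4}\cos 3\omega t ,
$$

using the identity $\cos^{3}\theta = \tfrac{1}{4}\left(3\cos\theta + \cos 3\theta\right)$. The output therefore contains a component at $3\omega$ that is absent from the input, so under this assumed model the distortion energy appears at a higher frequency than the excitation.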

Because the micro-speaker has the above nonlinear characteristics and frequency characteristics, the energy of the far-end speaker's voice played through the speaker is mainly at medium-high frequencies, with little low-frequency energy. Even after echo cancellation by adaptive filtering, the medium-high frequency energy exceeds the low-frequency energy. In practice, by obtaining and comparing spectrograms of the far-end speech signal 101, the mixed signal 108, and the adaptively filtered residual echo signal 111 in FIG. 1, it can be found that the medium-high frequency energy of the mixed signal 108 is greater than its low-frequency energy, and the medium-high frequency energy of the residual echo signal 111 after adaptive filtering is still greater than its low-frequency energy. That is, the echo signals are mainly concentrated at medium-high frequencies, and medium-high frequency echoes exist in both the mixed signal 108 and the residual echo signal 111, regardless of whether adaptive filtering is performed. Therefore, the present application proposes using the characteristics of the low-frequency part of the mixed signal 108 or the residual echo signal 111 to suppress the medium-high frequency echo therein, so that the influence on the near-end speech signal can be reduced while the echo is suppressed.

The echo suppression method provided by the present application is described in detail below with reference to specific embodiments. It is to be understood that the following specific embodiments may be combined with one another, and that the same or similar concepts or processes may not be repeated in some embodiments.

Fig. 2 is a schematic flow chart of an echo suppression method according to the present application. The main execution body of the method is an echo suppression device, which can be implemented by software and/or hardware, for example, the echo suppression device can be disposed in a terminal device. The method of the embodiment comprises the following steps:

S201, acquiring a voice signal to be processed from a microphone of the terminal equipment.

The terminal device in this embodiment may be a mobile terminal, such as a mobile phone, or other electronic devices with a voice communication function, such as a computer, a vehicle-mounted device, and the like, which is not limited in this embodiment.

The speech signal to be processed in this embodiment may be a mixed signal formed by the near-end speech signal and the echo signal, or may be a signal after the mixed signal is further subjected to adaptive processing. For example, the voice signal to be processed may be a mixed signal 108 formed by the near-end voice signal 106 and the echo signal 105 collected from the microphone as shown in fig. 1, or may be a residual echo signal 111 obtained by adaptively processing the mixed signal 108 after the mixed signal 108 is collected from the microphone. The method of the adaptive processing in this embodiment is not particularly limited, and the adaptive filtering algorithm may be any one of the prior art.

S202, according to the characteristic value of the low-frequency signal in the voice signal to be processed, determining an echo signal in the voice signal to be processed, wherein the frequency of the echo signal is greater than that of the low-frequency signal.

As stated above, the echo signals of the micro-speaker are mainly concentrated at medium-high frequencies, and medium-high frequency echoes exist in the speech signal to be processed whether or not adaptive filtering is performed. Therefore, in this embodiment, the characteristic value of the low-frequency signal of the voice signal to be processed is used to determine the medium-high frequency echo signal. Since the low-frequency signal contains little echo, the characteristic value of the low-frequency signal can be used as a reference for screening the medium-high frequency signal; the echo signal found in the medium-high frequency signal is determined as the echo signal of the voice signal to be processed.

S203, after the echo signal in the voice signal to be processed is subjected to suppression processing, sending the processed voice signal to the far-end equipment.

Since the echo signal in the speech signal to be processed is determined in S202, the echo signal can be suppressed in this step, so that the echo is eliminated as much as possible after the suppression processing, and the near-end speech is also retained.

In the echo suppression method provided by this embodiment, after a to-be-processed voice signal is acquired from a terminal device microphone, an echo signal in the medium-high frequency signal of the to-be-processed voice signal is determined according to a feature value of a low-frequency signal in the to-be-processed voice signal; the echo signal is then suppressed, and the processed voice signal is sent to the far-end equipment. Because the echoes of the loudspeaker are mainly concentrated at medium-high frequencies, the echo signals in the medium-high frequencies of the voice signal to be processed can be screened out in a targeted manner with the low-frequency signal as a reference, so that the influence on near-end voice can be avoided while the echoes are eliminated, and the quality of voice communication is improved.

It can be understood that the speech signal to be processed in the above embodiment is a frequency domain signal, that is, a signal obtained by performing frequency domain conversion on a time domain signal collected from a microphone. If the speech signal to be processed is a signal subjected to adaptive filtering and the adaptive filter is a frequency domain adaptive filter, frequency domain conversion is not required.
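The frequency domain conversion mentioned above can be sketched as follows. The present application does not specify the transform; this sketch assumes a plain discrete Fourier transform of one real-valued frame and returns the amplitude of each frequency point from 0 up to half the sampling rate. The function name is hypothetical, and a practical implementation would typically use a windowed FFT instead of this direct summation.

```python
import cmath
import math


def amplitude_spectrum(frame):
    """Amplitude of each frequency point k = 0 .. N/2 of a real frame.

    Direct O(N^2) DFT, written out for clarity only.
    """
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
        spectrum.append(abs(acc))
    return spectrum
```

For a 16-sample frame holding exactly three cycles of a unit-amplitude cosine, the amplitude is concentrated at frequency point 3 and is close to zero elsewhere.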

In addition, in the above embodiment, the ranges of the low-frequency signal and the medium-high frequency signal may be set according to actual situations. For example, the low-frequency cutoff frequency may be selected in the range of 200Hz to 1000Hz, so that the low-frequency range is, for example, 0Hz to 200Hz or 0Hz to 1000Hz; this embodiment is not limited thereto.
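To apply such a cutoff to a frequency domain signal, the cutoff frequency has to be mapped to frequency point indices. The following sketch shows one way to do this; the function name, the sampling rate and the transform length are assumptions chosen for illustration and are not given in the present application.

```python
def low_band_bins(cutoff_hz, sample_rate_hz, fft_size):
    """Indices of the frequency points whose frequency is <= cutoff_hz.

    Frequency point k corresponds to k * sample_rate_hz / fft_size Hz.
    """
    bin_width = sample_rate_hz / fft_size
    last_bin = min(int(cutoff_hz // bin_width), fft_size // 2)
    return range(0, last_bin + 1)
```

For example, assuming a 16000Hz sampling rate and a 256-point transform, each frequency point spans 62.5Hz, so a 1000Hz cutoff selects frequency points 0 to 16 as the low-frequency signal and the remaining points form the medium-high frequency signal.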

Based on the above embodiment, the determination of the echo signal in the voice signal to be processed in S202 according to the feature value of the low-frequency signal in the voice signal to be processed is described in detail.

Optionally, the characteristic values of the low-frequency signal include: a peak characteristic value of the low-frequency signal. Correspondingly, in S202, determining an echo signal in the to-be-processed speech signal according to the feature value of the low-frequency signal in the to-be-processed speech signal includes: determining the frequency point with an amplitude higher than the peak characteristic value in the voice signal to be processed as a frequency point of the echo signal.

The peak characteristic value of the low-frequency signal can be determined according to the peak amplitude value of the low-frequency signal, and for the medium-high frequency point in the voice signal to be processed, if the amplitude value is higher than the peak characteristic value, the frequency point is determined as the frequency point of the echo signal. If the amplitude is not higher than the peak characteristic value, the frequency point may be the frequency point of the echo signal or the frequency point of the near-end voice signal.

In order to further screen the echo signal, optionally, the characteristic value of the low-frequency signal further includes: a mean characteristic value of the low-frequency signal. Correspondingly, in S202, determining an echo signal in the to-be-processed speech signal according to the feature value of the low-frequency signal in the to-be-processed speech signal may further include: determining a target frequency point whose amplitude is higher than the mean characteristic value and lower than the peak characteristic value in the voice signal to be processed; and determining the frequency point of the echo signal among the target frequency points according to the correlation coefficient between the target frequency point and the corresponding frequency point in the far-end voice signal.

The mean characteristic value of the low-frequency signal can be determined according to the mean value of the amplitude values of the low-frequency signal, and for a target frequency point with an amplitude value higher than the mean characteristic value and lower than the peak characteristic value in the medium-high frequency of the voice signal to be processed, whether the target frequency point is the frequency point of the echo signal needs to be further determined according to the correlation coefficient between the target frequency point and the corresponding frequency point in the far-end voice signal.

The correlation coefficient is determined as follows: the correlation coefficient between each frequency point in the voice signal to be processed and the corresponding frequency point in the far-end voice signal is determined according to the amplitude of each frequency point in the voice signal to be processed and the amplitude of the corresponding frequency point in the far-end voice signal. The correlation coefficient reflects the degree of similarity between the voice signal to be processed and the far-end voice signal; a frequency point with a large correlation coefficient is similar to the far-end voice signal.

Among the medium-high frequency target frequency points whose amplitude is higher than the mean characteristic value and lower than the peak characteristic value, a frequency point whose correlation coefficient with the corresponding frequency point in the far-end voice signal is greater than a threshold is determined as a frequency point of the echo signal, and a frequency point whose correlation coefficient with the corresponding frequency point in the far-end voice signal is less than or equal to the threshold is determined as a frequency point of the near-end voice signal. In addition, a medium-high frequency point in the voice signal to be processed whose amplitude is lower than the mean characteristic value is also determined as a near-end voice frequency point.
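The screening rules above can be summarized in a short sketch. The function name, the label strings and the default threshold of 0.5 are illustrative assumptions; the feature vectors and correlation coefficients are taken as given inputs here. The text does not state how an amplitude exactly equal to a characteristic value is handled, so this sketch arbitrarily treats an amplitude equal to the peak value as a target frequency point and an amplitude equal to the mean value as near-end voice.

```python
def classify_frequency_points(amplitude, peak_feature, mean_feature,
                              correlation, num_low_bins, threshold=0.5):
    """Label each frequency point as 'low', 'echo' or 'near_end'.

    All list arguments have one entry per frequency point. The first
    num_low_bins points are the low-frequency reference band and are
    not screened.
    """
    labels = []
    for k, amp in enumerate(amplitude):
        if k < num_low_bins:
            labels.append("low")
        elif amp > peak_feature[k]:
            # Above the peak characteristic value: first echo frequency point.
            labels.append("echo")
        elif amp > mean_feature[k]:
            # Target frequency point: decided by the correlation coefficient.
            is_echo = correlation[k] > threshold
            labels.append("echo" if is_echo else "near_end")
        else:
            labels.append("near_end")
    return labels
```

Only the points labeled as echo would then be passed to the suppression processing, which leaves the points labeled as near-end voice unchanged.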

According to the method, the medium-high frequency signal is screened using the peak characteristic value and the mean characteristic value of the low-frequency signal of the voice signal to be processed, together with the correlation coefficient. The correlation coefficient reflects the correlation between the far-end voice signal and the voice signal to be processed at a frequency point, that is, the possibility that the frequency point belongs to the echo signal. Candidate echo frequency points are therefore first identified from the characteristics of the low-frequency signal and then further judged by the correlation coefficient, so that the echo signal in the medium-high frequency signal can be determined more accurately and then suppressed.

The peak characteristic value, the mean characteristic value, and the correlation coefficient of the low-frequency signal are further described below, and the determination of the echo signal is described in detail with reference to the peak characteristic value, the mean characteristic value, and the correlation coefficient.

First, for the peak characteristic value of the low-frequency signal: a low-frequency peak value of the voice signal to be processed is determined according to the amplitude of the low-frequency signal of the voice signal to be processed; a first low-frequency characteristic vector is determined according to the low-frequency peak value and a preset gain vector; and the elements in the first low-frequency characteristic vector are the peak characteristic values corresponding to each frequency point in the voice signal to be processed.

By way of example, the first low frequency feature vector is determined using the following equation:

$$\mathrm{MaxEW}(k) = \mathrm{MaxEW}_l \cdot g(k) \qquad \text{formula (1)}$$

Here, $\mathrm{MaxEW}_l$ is the low-frequency peak value of the voice signal to be processed, $g(k)$ is a gain vector preset according to general voice characteristics, the vector length of $g(k)$ is the same as the length of the voice signal to be processed, and $k$ denotes the frequency point. $\mathrm{MaxEW}(k)$ is the first low-frequency feature vector, which may also be referred to as the peak feature vector, and each element in $\mathrm{MaxEW}(k)$ is the peak characteristic value corresponding to the corresponding frequency point of the speech signal to be processed. For example, the value range of the elements in $g(k)$ may be (0, 10), or may be other values greater than 0, and may be set according to actual needs. For example, if the suppression of the medium-high frequencies is to be weakened, the elements of the gain vector $g(k)$ may take values greater than 1, so that the values of the elements in $\mathrm{MaxEW}(k)$ are larger; accordingly, after the amplitudes of the medium-high frequency points are compared with $\mathrm{MaxEW}(k)$, relatively fewer echo frequency points are screened out.
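Formula (1) and the effect of the gain vector can be illustrated with a short sketch. The function names, the example amplitudes and the number of low-frequency points are hypothetical values chosen for illustration only.

```python
def peak_feature_vector(amplitude, num_low_bins, gain):
    """Formula (1): peak characteristic value for every frequency point."""
    if len(gain) != len(amplitude):
        raise ValueError("gain vector must match the spectrum length")
    low_peak = max(amplitude[:num_low_bins])   # low-frequency peak value
    return [low_peak * g for g in gain]


def first_echo_points(amplitude, num_low_bins, gain):
    """Medium-high frequency points whose amplitude exceeds formula (1)."""
    peak = peak_feature_vector(amplitude, num_low_bins, gain)
    return [k for k in range(num_low_bins, len(amplitude))
            if amplitude[k] > peak[k]]


# Hypothetical amplitudes: points 0-2 are the low-frequency signal.
amplitude = [0.2, 0.8, 0.5, 1.0, 0.9, 1.5, 0.3]
unit_gain = first_echo_points(amplitude, 3, [1.0] * 7)    # points 3, 4, 5
larger_gain = first_echo_points(amplitude, 3, [1.5] * 7)  # point 5 only
```

With a gain of 1.0 the threshold at every point is the low-frequency peak 0.8, so three medium-high frequency points are screened out; raising the gain to 1.5 raises the threshold to about 1.2 and leaves a single point, which matches the weakened suppression described above.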

Similarly, for the mean characteristic value of the low-frequency signal: a low-frequency mean value of the voice signal to be processed is determined from the amplitudes of its low-frequency signal, and a second low-frequency feature vector is determined from the low-frequency mean value and the preset gain vector. The elements of the second low-frequency feature vector are the mean characteristic values corresponding to the respective frequency points of the voice signal to be processed.

By way of example, the second low frequency feature vector is determined using the following equation:

$\mathrm{AvgEW}(k) = \mathrm{AvgEWl} \cdot g(k)$    formula (2)

Here, AvgEWl is the low-frequency mean value of the voice signal to be processed, g(k) is a gain vector preset according to general voice characteristics, the length of g(k) is the same as the length of the voice signal to be processed, and k denotes the frequency point. AvgEW(k) is the second low-frequency feature vector, which may also be called the mean feature vector; each element of AvgEW(k) is the mean characteristic value for the corresponding frequency point of the voice signal to be processed. For example, the elements of g(k) may take values in the range (0, 10), or other values greater than 0, set according to actual needs. Similarly, if the suppression of the medium and high frequencies is to be weakened, the elements of g(k) may take values greater than 1, so that the elements of AvgEW(k) are larger; accordingly, when the amplitudes of the medium-high frequency points are compared with AvgEW(k), fewer echo frequency points are screened out.
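The effect of the gain vector described above can be illustrated with a small sketch of formula (2). It assumes the same list-of-complex-values frame layout, with the first `num_low` points taken as the low-frequency band; `num_low`, the helper names, and the gain values are hypothetical.

```python
def mean_feature_vector(spectrum, num_low, gain):
    """Return AvgEW(k) = AvgEWl * g(k) for every frequency point k."""
    low_amplitudes = [abs(value) for value in spectrum[:num_low]]
    avg_ewl = sum(low_amplitudes) / len(low_amplitudes)  # low-frequency mean
    return [avg_ewl * g for g in gain]


def count_points_above(spectrum, num_low, feature):
    """Count medium-high points whose amplitude exceeds the feature value."""
    return sum(
        1 for k in range(num_low, len(spectrum)) if abs(spectrum[k]) > feature[k]
    )


spectrum = [4 + 0j, 2 + 0j, 5 + 0j, 3.5 + 0j, 1 + 0j]
num_low = 2
unit_gain = mean_feature_vector(spectrum, num_low, [1.0] * 5)
boosted_gain = mean_feature_vector(spectrum, num_low, [1.5] * 5)

flagged_unit = count_points_above(spectrum, num_low, unit_gain)
flagged_boosted = count_points_above(spectrum, num_low, boosted_gain)
```

With a gain of 1.0 two medium-high points exceed the mean characteristic value; raising the gain to 1.5 leaves only one, which matches the statement that gains above 1 weaken the suppression.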

The correlation coefficient of each frequency point in the voice signal to be processed and the corresponding frequency point in the far-end voice signal is determined by adopting the following method:

$\mathrm{AmpX}_n(k) = \rho\,\mathrm{AmpX}_{n-1}(k) + (1-\rho)\,X(k)\,\mathrm{conj}(X(k))$    formula (3)

$\mathrm{AmpD}_n(k) = \rho\,\mathrm{AmpD}_{n-1}(k) + (1-\rho)\,D(k)\,\mathrm{conj}(D(k))$    formula (4)

$\mathrm{AmpXD}_n(k) = \rho\,\mathrm{AmpXD}_{n-1}(k) + (1-\rho)\,X(k)\,\mathrm{conj}(D(k))$    formula (5)

$\mathrm{CohereXD}(k) = \dfrac{\mathrm{AmpXD}_n(k)\,\mathrm{conj}(\mathrm{AmpXD}_n(k))}{\mathrm{AmpX}_n(k)\,\mathrm{AmpD}_n(k) + \varepsilon}$    formula (6)

Here, X(k) denotes the far-end voice signal, D(k) denotes the voice signal to be processed, and k denotes the frequency point. AmpX(k) is the squared modulus of the far-end voice signal, AmpD(k) is the squared modulus of the voice signal to be processed, and AmpXD(k) is the correlation coefficient, or cross-correlation coefficient, of the far-end voice signal and the voice signal to be processed. The subscript n denotes the n-th frame. That is, AmpX_n(k) is the squared modulus of the far-end voice signal for the n-th frame and AmpX_{n-1}(k) is that for the (n-1)-th frame; AmpD_n(k) is the squared modulus of the voice signal to be processed for the n-th frame and AmpD_{n-1}(k) is that for the (n-1)-th frame; AmpXD_n(k) is the cross-correlation coefficient of the far-end voice signal and the voice signal to be processed for the n-th frame and AmpXD_{n-1}(k) is that for the (n-1)-th frame. CohereXD(k) is the normalized cross-correlation coefficient of the far-end voice signal and the voice signal to be processed, with a value range of (0, 1). ρ is a smoothing factor (ρ = 0.9 in this example), conj denotes the complex conjugate, all products and the quotient are taken element-wise per frequency point, and ε is a small quantity that prevents the denominator from being 0. The closer CohereXD(k) is to 1 at a given frequency point, the higher the correlation between the far-end voice signal and the voice signal to be processed at that frequency point, and the more likely it is that the frequency point is an echo signal.
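The recursion in formulas (3) to (6) can be sketched per frame as follows. This is only an illustration under stated assumptions: the dictionary layout of the state, the zero start-up values, the function names, and the magnitude of the guard value `EPS` are not given in the application.

```python
RHO = 0.9      # smoothing factor used in the example above
EPS = 1e-12    # small guard value; its magnitude is an assumption


def new_state(num_points):
    """Smoothed spectra of the previous frame, starting at zero (assumed)."""
    return {
        "amp_x": [0.0] * num_points,
        "amp_d": [0.0] * num_points,
        "amp_xd": [0j] * num_points,
    }


def update_coherence(state, x, d, rho=RHO, eps=EPS):
    """Apply formulas (3)-(6) for one frame and return CohereXD(k)."""
    coherence = []
    for k in range(len(x)):
        # Formulas (3) and (4): smoothed squared moduli.
        power_x = (x[k] * x[k].conjugate()).real
        power_d = (d[k] * d[k].conjugate()).real
        state["amp_x"][k] = rho * state["amp_x"][k] + (1 - rho) * power_x
        state["amp_d"][k] = rho * state["amp_d"][k] + (1 - rho) * power_d
        # Formula (5): smoothed complex cross term.
        cross_term = x[k] * d[k].conjugate()
        state["amp_xd"][k] = rho * state["amp_xd"][k] + (1 - rho) * cross_term
        # Formula (6): squared modulus of the cross term, normalised.
        cross = state["amp_xd"][k]
        numerator = (cross * cross.conjugate()).real
        denominator = state["amp_x"][k] * state["amp_d"][k] + eps
        coherence.append(numerator / denominator)
    return coherence


# Point 0: D is a scaled copy of X in every frame (echo-like).
# Point 1: the phase of D rotates from frame to frame (unrelated to X).
rotations = [1 + 0j, 1j, -1 + 0j, -1j]
state = new_state(2)
for n in range(200):
    x = [1 + 1j, 1 + 0j]
    d = [0.5 * x[0], rotations[n % 4]]
    coh = update_coherence(state, x, d)
```

After the loop, `coh[0]` is very close to 1 because the two signals are fully correlated at that point, while `coh[1]` stays small because the rotating phase averages out in the smoothed cross term.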

The process of determining the echo signal is described with reference to Fig. 3; every medium-high frequency point of the voice signal to be processed is compared and judged according to the process shown in Fig. 3. First, the amplitude (modulus) of the frequency point is compared with the peak characteristic value of the corresponding frequency point in the first low-frequency feature vector. If the amplitude of a first frequency point in the voice signal to be processed is greater than the peak characteristic value corresponding to the first frequency point in the first low-frequency feature vector, the first frequency point is determined as a first echo frequency point of the echo signal, where the frequency of the first frequency point is greater than the frequency of the low-frequency signal. If the amplitude of a second frequency point is smaller than the peak characteristic value corresponding to the second frequency point in the first low-frequency feature vector, the second frequency point requires further judgment.

Further, if the amplitude of a second frequency point in the voice signal to be processed is smaller than the peak characteristic value corresponding to the second frequency point in the first low-frequency feature vector and greater than the mean characteristic value corresponding to the second frequency point in the second low-frequency feature vector, the second frequency point is determined as a target frequency point, where the frequency of the second frequency point is greater than the frequency of the low-frequency signal. For a target frequency point, the correlation coefficient is used to judge whether it is a frequency point of the echo signal.

Specifically, among the target frequency points, a frequency point whose correlation coefficient with the corresponding frequency point in the far-end voice signal is greater than a threshold is determined as a second echo frequency point of the echo signal. A target frequency point whose correlation coefficient with the corresponding frequency point in the far-end voice signal is less than or equal to the threshold is determined as a frequency point of the near-end voice signal.

In addition, if the amplitude of a third frequency point among the medium-high frequency points of the voice signal to be processed is smaller than the mean characteristic value corresponding to the third frequency point in the second low-frequency feature vector, the third frequency point is determined as a frequency point of the near-end voice signal.
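The three-way decision described above for a single medium-high frequency point can be sketched as below. The label strings, the function name, and the threshold value 0.5 are hypothetical; the text does not state a threshold, nor how an amplitude exactly equal to a characteristic value is handled, so the treatment of equality here is an assumption.

```python
FIRST_ECHO = "first_echo"
SECOND_ECHO = "second_echo"
NEAR_END = "near_end"


def classify_point(amplitude, peak_value, mean_value, coherence, threshold):
    """Classify one medium-high frequency point.

    amplitude  -- modulus of the frequency point
    peak_value -- MaxEW(k) for this point
    mean_value -- AvgEW(k) for this point
    coherence  -- CohereXD(k) for this point
    """
    if amplitude > peak_value:
        return FIRST_ECHO
    if amplitude > mean_value:
        # Target frequency point: the correlation coefficient decides.
        return SECOND_ECHO if coherence > threshold else NEAR_END
    # At or below the mean characteristic value (equality is an assumption).
    return NEAR_END


THRESHOLD = 0.5  # hypothetical value, not specified in the text
labels = [
    classify_point(10.0, 8.0, 4.0, 0.1, THRESHOLD),
    classify_point(6.0, 8.0, 4.0, 0.9, THRESHOLD),
    classify_point(6.0, 8.0, 4.0, 0.3, THRESHOLD),
    classify_point(2.0, 8.0, 4.0, 0.99, THRESHOLD),
]
```

Note that the last point is kept as near-end voice even though its correlation coefficient is high, because its amplitude is below the mean characteristic value and the correlation coefficient is only consulted for target frequency points.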

As can be seen from the above, the frequency points of the echo signal determined by the method of the present application fall into two types, namely first echo frequency points and second echo frequency points, and the two types can be suppressed by different methods.

For a first echo frequency point, the amplitude after suppression is determined according to the characteristic value corresponding to the first echo frequency point in the first low-frequency feature vector.

Optionally, the amplitude of the first echo frequency point is suppressed to be the same as MaxEW(k):

$\mathrm{EW}_{01}(k) = \mathrm{EW}_1(k) \cdot \mathrm{MaxEW}(k) / \mathrm{abs}(\mathrm{EW}_1(k))$    formula (7)

Here, EW₁(k) denotes a first echo frequency point in the voice signal to be processed, k denotes the frequency point, abs denotes the modulus of a complex number, and EW₀₁(k) denotes the first echo frequency point after suppression.
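A minimal sketch of formula (7) follows: the complex value is rescaled so that its modulus equals MaxEW(k) while its phase is left unchanged. The function name and the zero-amplitude guard are illustrative additions; a first echo frequency point has an amplitude above MaxEW(k), so the guard is not expected to trigger in practice.

```python
import cmath


def suppress_first_echo(ew1, max_ew_k):
    """Scale a first echo frequency point so that its modulus is MaxEW(k)."""
    magnitude = abs(ew1)
    if magnitude == 0.0:
        return ew1  # defensive guard (assumption), avoids division by zero
    return ew1 * max_ew_k / magnitude


original = 6 + 8j  # modulus 10
suppressed = suppress_first_echo(original, 2.5)
phase_before = cmath.phase(original)
phase_after = cmath.phase(suppressed)
```

Here the modulus drops from 10 to 2.5 and the phase of the frequency point is preserved.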

For a second echo frequency point, the amplitude after suppression is determined according to the amplitude of the second echo frequency point and the correlation coefficient between the second echo frequency point and the corresponding frequency point in the far-end voice signal.

Optionally, the amplitude of the second echo frequency point after suppression is determined with the following formula:

$\mathrm{EW}_{02}(k) = \mathrm{EW}_2(k) \cdot (1 - \mathrm{CohereXD}(k))$    formula (8)

Here, EW₂(k) denotes a second echo frequency point in the voice signal to be processed, k denotes the frequency point, and EW₀₂(k) denotes the second echo frequency point after suppression.
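Formula (8) attenuates a second echo frequency point in proportion to how strongly it correlates with the far-end signal. The sketch below assumes the correlation coefficient is available as a real number; clamping it to the interval from 0 to 1 is a defensive assumption and is not stated in the text.

```python
def suppress_second_echo(ew2, coherence):
    """Attenuate a second echo frequency point by (1 - CohereXD(k))."""
    # Clamp to [0, 1] so rounding errors cannot flip the sign (assumption).
    bounded = min(max(coherence, 0.0), 1.0)
    return ew2 * (1.0 - bounded)


strongly_correlated = suppress_second_echo(2 + 2j, 0.75)
weakly_correlated = suppress_second_echo(2 + 2j, 0.25)
```

A point with a correlation coefficient of 0.75 keeps a quarter of its amplitude, whereas one with 0.25 keeps three quarters, so points that look more like echo are suppressed harder.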

When the echo frequency points in the voice signal to be processed are suppressed with this method, actual measurement shows that the spectrogram of the suppressed voice signal has markedly lower medium-high frequency energy than the spectrogram before suppression. In addition, after processing, the spectrogram of the double-talk stage is very similar to that of the pure-voice stage; that is, the near-end voice signal loses very little in the double-talk stage. This shows that, even when the signal-to-echo ratio is below zero, the method both suppresses nonlinear echo effectively and retains the required near-end voice to the greatest extent.

The effectiveness of echo cancellation can be measured by Echo Return Loss Enhancement (ERLE); the higher the value, the better the echo cancellation performance. Fig. 4 compares the ERLE after NLMS adaptive filtering alone with the ERLE after adaptive filtering plus the nonlinear processing (NLP) suppression method of the present application. Because the micro-speaker is strongly nonlinear, the maximum ERLE after adaptive filtering alone is only about 10 dB, and the ERLE of most pure-echo segments is below 10 dB, because a considerable amount of high-frequency nonlinear echo cannot be effectively suppressed. After the processing of the present application, the ERLE of the pure-echo segments improves markedly, reaching 40 dB in places and generally rising by more than 20 dB over the value before processing. In the double-talk segments, the ERLE improves by more than 10 dB without noticeable voice loss, which shows that the method has better echo cancellation performance.
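The text does not give a formula for ERLE. The sketch below uses the commonly cited definition, ten times the base-10 logarithm of the ratio between the power of the signal before echo processing and the power of the residual after processing, evaluated over a pure-echo segment. The function name, the sample values, and the guard value `eps` are illustrative assumptions.

```python
import math


def erle_db(before, after, eps=1e-12):
    """ERLE in dB from time-domain samples before and after echo processing."""
    power_before = sum(sample * sample for sample in before) / len(before)
    power_after = sum(sample * sample for sample in after) / len(after)
    # eps keeps the ratio finite if the residual is exactly zero.
    return 10.0 * math.log10((power_before + eps) / (power_after + eps))


microphone_echo = [1.0, -1.0, 1.0, -1.0]   # toy pure-echo segment
residual_echo = [0.1, -0.1, 0.1, -0.1]     # the same segment after suppression
erle = erle_db(microphone_echo, residual_echo)
```

In this toy segment the residual amplitude is one tenth of the input, which corresponds to an ERLE of about 20 dB.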

In addition to the above advantages, the method of the present application only needs to traverse the low-frequency signal to find its maximum and its mean, construct the feature vectors, compare them with the medium-high frequency points, and suppress accordingly. The nonlinear processing therefore has a low computational complexity of O(n), which is difficult for conventional nonlinear processing algorithms to achieve.

In the voice signal obtained after the processing of the present application, the energy difference between the pure-echo segments and the double-talk segments is very large, so Double-Talk Detection (DTD) is more accurate. After double-talk detection, the result can be further suppressed to obtain the final echo cancellation output. In addition, objective tests in the standard test environments of 3GPP and Vodafone show that, when the signal-to-echo ratio is below zero, both the amount of echo cancellation and the voice loss in the double-talk stage are better than the test results of most algorithms on the market.

Fig. 5 is a schematic structural diagram of an echo suppression device according to the present application. As shown in fig. 5, the echo suppressing apparatus 50 includes:

an obtaining module 501, configured to obtain a to-be-processed voice signal from a terminal device microphone;

a determining module 502, configured to determine an echo signal in the voice signal to be processed according to a feature value of a low-frequency signal in the voice signal to be processed, where a frequency of the echo signal is greater than a frequency of the low-frequency signal;

the suppression module 503 is configured to perform suppression processing on the echo signal in the voice signal to be processed, and then send the processed voice signal to the far-end device.

Optionally, the characteristic values of the low-frequency signal include: peak eigenvalues of the low frequency signal;

the determination module 502 is configured to:

Optionally, the determining module 502 is further configured to:

determining a first low-frequency characteristic vector according to the low-frequency peak value and a preset gain vector; and the elements in the first low-frequency characteristic vector are peak characteristic values corresponding to each frequency point in the voice signal to be processed.

Optionally, the determining module 502 is further configured to:

Optionally, the suppression module 503 is configured to:

Optionally, the characteristic value of the low frequency signal further includes: mean characteristic value of the low frequency signal;

the determining module 502 is further configured to:

determining a target frequency point of which the amplitude value is higher than the mean characteristic value and lower than the peak characteristic value in the voice signal to be processed;

and determining the frequency point of the echo signal in the target frequency point according to the correlation coefficient of the target frequency point and the corresponding frequency point in the far-end voice signal.

Optionally, the determining module 502 is further configured to:

and if the amplitude of the second frequency point in the voice signal to be processed is smaller than the peak characteristic value corresponding to the second frequency point in the first low-frequency characteristic vector and is larger than the mean characteristic value corresponding to the second frequency point in the second low-frequency characteristic vector, determining the second frequency point as a target frequency point, wherein the frequency of the second frequency point is larger than that of the low-frequency signal.

Optionally, the determining module 502 is configured to:

and determining the frequency points of the target frequency points, wherein the correlation coefficient of the frequency points and the corresponding frequency points in the far-end voice signals is greater than the threshold value, as second echo frequency points of the echo signals.

Optionally, the suppression module 503 is further configured to:

Optionally, the determining module 502 is further configured to:

Optionally, the obtaining module 501 is configured to:

collecting a speech signal from a microphone;

and carrying out self-adaptive processing on the voice signal to obtain a voice signal to be processed.

The apparatus of this embodiment may be configured to implement the technical solutions of the above method embodiments, and the implementation principles and technical effects are similar, which are not described herein again.

Fig. 6 is a schematic structural diagram of an electronic device provided in the present application. As shown in fig. 6, the electronic device 60 includes: a memory 601, a processor 602, a transceiver 603, a speaker 604, and a microphone 605, wherein the memory 601 is in communication with the processor 602; illustratively, the memory 601, the processor 602, the transceiver 603, the speaker 604 and the microphone 605 may communicate via a communication bus 606, the memory 601 being used to store a computer program, which the processor 602 executes to implement the above-described echo suppression method.

Optionally, the Processor may be a Central Processing Unit (CPU), or may be another general-purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), or the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of a method embodiment disclosed in this application may be directly implemented by a hardware processor, or may be implemented by a combination of hardware and software modules in a processor.

An embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program carries out the method in any of the above method embodiments.

All or some of the steps of the above method embodiments may be carried out by hardware associated with program instructions. The program may be stored in a readable memory. When executed, the program performs the steps of the method embodiments described above. The memory (storage medium) includes: read-only memory (ROM), RAM, flash memory, hard disk, solid state disk, magnetic tape, floppy disk, optical disc, and any combination thereof.

Embodiments of the present application are described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processing unit of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processing unit of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

It will be apparent to those skilled in the art that various changes and modifications may be made in the embodiments of the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the embodiments of the present application fall within the scope of the claims of the present application and their equivalents, the present application is also intended to encompass such modifications and variations.

In the present application, the term "include" and variations thereof may refer to non-limiting inclusion; the term "or" and variations thereof may mean "and/or". The terms "first," "second," and the like in this application are used to distinguish between similar elements and not necessarily to describe a particular sequential or chronological order. In the present application, "a plurality" means two or more. "And/or" describes an association between associated objects and covers three cases; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship.

Claims

1. An echo suppression method, comprising:

2. The method of claim 1, wherein the eigenvalues of the low frequency signal comprise: a peak eigenvalue of the low frequency signal;

3. The method of claim 2, further comprising:

4. The method according to claim 3, wherein the determining the frequency points of the to-be-processed voice signal with amplitudes higher than the peak characteristic value as the frequency points of the echo signal comprises:

5. The method according to claim 4, wherein the suppressing the echo signal in the speech signal to be processed comprises:

6. The method according to any one of claims 3-5, wherein the eigenvalues of the low frequency signal further comprise: a mean characteristic value of the low frequency signal;

7. The method of claim 6, further comprising:

8. The method according to claim 7, wherein the determining the target frequency point in the to-be-processed speech signal whose amplitude is higher than the mean characteristic value and lower than the peak characteristic value comprises:

9. The method according to claim 7 or 8, wherein the determining the frequency point of the echo signal in the target frequency point according to the correlation coefficient between the target frequency point and the corresponding frequency point in the far-end voice signal comprises:

10. The method according to claim 9, wherein the suppressing the echo signal in the speech signal to be processed further comprises:

11. The method of claim 6, further comprising:

12. The method according to any one of claims 1-5, wherein the obtaining the voice signal to be processed from the terminal device microphone comprises:

collecting a speech signal from the microphone;

13. An echo suppression device, comprising:

14. The apparatus of claim 13, wherein the eigenvalues of the low frequency signal comprise: a peak eigenvalue of the low frequency signal;

the determining module is used for determining the frequency point with the amplitude value higher than the peak characteristic value in the voice signal to be processed as the frequency point of the echo signal.

15. The apparatus of claim 14, wherein the determining module is further configured to:

16. The apparatus of claim 14, wherein the determining module is further configured to:

17. The apparatus of claim 16, wherein the suppression module is configured to:

18. The apparatus according to any one of claims 15-17, wherein the feature value of the low frequency signal further comprises: a mean characteristic value of the low frequency signal;

the determination module is further to:

19. The apparatus of claim 18, wherein the determining module is further configured to:

20. The apparatus of claim 19, wherein the determining module is further configured to:

21. The apparatus of claim 19 or 20, wherein the determining module is further configured to:

22. The apparatus of claim 21, wherein the suppression module is further configured to:

23. The apparatus of claim 18, wherein the determining module is further configured to:

24. The apparatus of any one of claims 13-17, wherein the obtaining module is configured to:

collecting a speech signal from the microphone;

25. An electronic device, comprising: a memory, a processor, a speaker, and a microphone;

the memory is connected with the processor;

the memory is used for storing a computer program;

the processor is adapted to carry out the method of any one of claims 1-12 when the computer program is executed.

26. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-12.