CN112217948B

CN112217948B - Echo processing method, device, equipment and storage medium for voice call

Info

Publication number: CN112217948B
Application number: CN202011078262.1A
Authority: CN
Inventors: 马士乾; 宋辉; 张毅; 沙永涛; 邓承韵
Original assignee: Beijing Didi Infinity Technology and Development Co Ltd
Current assignee: Beijing Didi Infinity Technology and Development Co Ltd
Priority date: 2020-10-10
Filing date: 2020-10-10
Publication date: 2022-01-18
Anticipated expiration: 2040-10-10
Also published as: CN112217948A

Abstract

The embodiment of the disclosure provides an echo processing method, device and equipment for voice call and a storage medium. The method comprises the following steps: acquiring a first voice signal and receiving a second voice signal, wherein the first voice signal is a voice signal acquired at the near end of a voice call, and the second voice signal is a voice signal acquired at the far end of the voice call; according to the second voice signal, carrying out linear processing on the first voice signal to obtain a first residual echo signal and an estimated echo signal; according to the first voice signal and the estimated echo signal, carrying out nonlinear processing on the first residual echo signal to obtain a second residual echo signal; and carrying out nonlinear processing on the second residual echo signal to obtain a final voice signal. According to the method disclosed by the embodiment of the invention, the residual echo signals are subjected to staged nonlinear processing after linear processing, so that the echo cancellation effect of voice call is improved.

Description

Echo processing method, device, equipment and storage medium for voice call

Technical Field

Embodiments of the present disclosure relate to the field of voice processing technologies, and in particular, to an echo processing method, apparatus, device, and storage medium for voice calls.

Background

In a voice call scene, especially a voice call scene based on vehicle-mounted bluetooth, the call is often influenced by echo, so that the call quality cannot be guaranteed.

To improve call quality, echo cancellation may be performed on the speech. The main approach is to process the speech linearly with a filter. Because residual echo may remain in the linearly processed speech, nonlinear processing may then be applied to it.

In the above manner, the process of the nonlinear processing depends too much on the echo cancellation effect of the linear processing process, and if the echo cancellation effect of the linear processing process is not obvious or does not work, the echo cancellation effect of the nonlinear processing also drops obviously, resulting in a poor echo cancellation effect.

Disclosure of Invention

Embodiments of the present disclosure provide an echo processing method, apparatus, device and storage medium for voice call, so as to solve the problem of poor echo cancellation effect in voice call.

In a first aspect, an embodiment of the present disclosure provides an echo processing method for a voice call, including:

acquiring a first voice signal and receiving a second voice signal, wherein the first voice signal is a voice signal acquired at the near end of a voice call, and the second voice signal is a voice signal acquired at the far end of the voice call;

according to the second voice signal, performing linear processing on the first voice signal to obtain a first residual echo signal and an estimated echo signal;

according to the first voice signal and the estimated echo signal, carrying out nonlinear processing on the first residual echo signal to obtain a second residual echo signal;

and carrying out nonlinear processing on the second residual echo signal to obtain a final voice signal.

In a second aspect, an embodiment of the present disclosure provides an echo processing apparatus for a voice call, including:

the acquisition module is used for acquiring a first voice signal and receiving a second voice signal, wherein the first voice signal is a voice signal acquired at the near end of a voice call, and the second voice signal is a voice signal acquired at the far end of the voice call;

the linear processing module is used for carrying out linear processing on the first voice signal according to the second voice signal to obtain a first residual echo signal and an estimated echo signal;

the first nonlinear processing module is used for carrying out nonlinear processing on the first residual echo signal according to the first voice signal and the estimated echo signal to obtain a second residual echo signal;

and the second nonlinear processing module is used for carrying out nonlinear processing on the second residual echo signal to obtain a final voice signal.

In a third aspect, an embodiment of the present disclosure provides a terminal device, including:

a memory and a processor;

the memory is to store program instructions;

the processor is configured to invoke program instructions in the memory to perform the method according to the first aspect.

In a fourth aspect, embodiments of the present disclosure provide a computer-readable storage medium having stored thereon a computer program which, when executed, implements the method as described in the first aspect above.

In a fifth aspect, embodiments of the present disclosure provide a program product comprising program instructions, the program product comprising a computer program which, when executed by a processor, implements the method of the first aspect.

The echo processing method, device, equipment and storage medium for voice call provided by the embodiments of the present disclosure collect a first voice signal and receive a second voice signal, perform linear processing on the first voice signal according to the second voice signal to obtain a first residual echo signal and an estimated echo signal, perform nonlinear processing on the first residual echo signal according to the first voice signal and the estimated echo signal to obtain a second residual echo signal, and perform nonlinear processing on the second residual echo signal to obtain a final voice signal. Therefore, after the first voice signal is subjected to linear processing, the residual echo signal is subjected to non-linear processing step by step, the dependence degree of the non-linear processing process on the echo cancellation effect of the linear processing process is reduced, the echo cancellation effect of the non-linear processing is prevented from being reduced, and the echo cancellation effect is improved.

Various possible embodiments of the present disclosure and technical advantages thereof will be described in detail below.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.

FIG. 1 is an exemplary diagram of an application scenario in which embodiments of the present disclosure are applicable;

FIG. 2 is a flowchart illustrating an echo processing method for a voice call according to an embodiment of the disclosure;

FIG. 3 is a flowchart illustrating an echo processing method for a voice call according to another embodiment of the disclosure;

FIG. 4 is a flowchart illustrating an echo processing method for a voice call according to another embodiment of the disclosure;

FIG. 5 is a schematic structural diagram of an echo processing device for a voice call according to an embodiment of the present disclosure;

FIG. 6 is a schematic structural diagram of a terminal device according to an embodiment of the present disclosure;

FIG. 7 is a block diagram of a terminal device according to an embodiment of the disclosure.

The foregoing drawings show certain embodiments of the disclosure, which are described in more detail below. These drawings and the written description are not intended to limit the scope of the disclosed concepts in any way, but rather to illustrate the concepts of the disclosure to those skilled in the art by reference to specific embodiments.

Detailed Description

Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.

First, terms related to embodiments of the present disclosure are explained:

echo cancellation: referred to in this disclosure as Acoustic Echo Cancellation (AEC). In voice communication, acoustic echo refers to echo formed when a terminal device at a near end plays a voice signal of a far-end user, and the played voice is collected by a microphone of the terminal device at the near end through air or other propagation media and transmitted to a receiver of the terminal device at the far end. The near end is also called a home end or a called end, and the far end is also called an opposite end.

Single far-end call state: the state refers to a state in which only the far-end is speaking in the voice call, that is, a state in which only the far-end user is speaking, and the single far-end call state is also referred to as a mute state.

Single near-end call state: the state refers to a state in which only the near-end utters in the voice call, that is, a state in which only the near-end user speaks, and the single near-end call state is also referred to as an uttered state.

Double-end conversation state: the state refers to a state in which the near end and the far end are speaking simultaneously in a voice call, that is, a state in which the near end user and the far end user are speaking simultaneously, and a double-end call state is also called an active state.

Subband of speech signal: also called sub-bands, means that the original speech signal is converted from the time domain to the frequency domain, and the speech signal converted to the frequency domain is divided into several sub-bands according to the frequency.
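As an illustration of the subband definition above, the following minimal sketch converts one time-domain frame to the frequency domain and groups adjacent frequency bins into subbands. The function names, the naive DFT, and the equal-width grouping are assumptions made for illustration only; the disclosure does not specify how the subbands are formed.

```python
import cmath
import math
from typing import List


def to_frequency_domain(frame: List[float]) -> List[complex]:
    """Naive DFT of one real-valued frame; returns bins 0..N/2."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n // 2 + 1)
    ]


def split_into_subbands(spectrum: List[complex], num_bands: int) -> List[List[complex]]:
    """Group adjacent bins into up to num_bands contiguous subbands (assumed equal width)."""
    width = math.ceil(len(spectrum) / num_bands)
    return [spectrum[i:i + width] for i in range(0, len(spectrum), width)]


def subband_energies(frame: List[float], num_bands: int) -> List[float]:
    """Energy of each subband of one frame."""
    bands = split_into_subbands(to_frequency_domain(frame), num_bands)
    return [sum(abs(x) ** 2 for x in band) for band in bands]
```

A practical implementation would typically use an FFT and overlapping, windowed frames; the naive DFT is used here only to keep the sketch free of dependencies.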

In voice communication, when a far-end user speaks, the near-end terminal device receives the far-end voice signal sent by the far-end terminal device and plays it through a loudspeaker. When the near-end user speaks, the near-end terminal device sends the collected near-end voice signal to the far-end terminal device, and the near-end voice signal comprises both the echo of the far-end voice signal and the speech of the near-end user. Therefore, when listening to the near-end voice signal, the far-end user also hears the echo of the far-end voice signal, which reduces the clarity of the near-end voice signal and degrades the call quality. The near-end terminal device and the far-end terminal device can be terminal devices with a voice call function, such as a mobile phone, a smart watch, a vehicle-mounted communication device and the like.

Taking a terminal device at a near end as an in-vehicle communication device as an example, fig. 1 shows an example of an application scenario to which the embodiment of the present disclosure is applied. As shown in fig. 1, the application scenario includes a vehicle 101, a vehicle-mounted communication device 102 located on the vehicle 101, and a near-end terminal device 103, where the vehicle-mounted communication device 102 establishes a wireless connection, such as a bluetooth connection, with the near-end terminal device 103. The terminal device 103 at the near end is, for example, a mobile phone, a smart watch, or a tablet computer.

In the voice call, the near-end terminal device 103 receives a far-end voice signal sent by the far-end terminal device and forwards it to the vehicle-mounted communication device 102, and the vehicle-mounted communication device 102 plays the far-end voice signal, for example, through a speaker on the vehicle 101. The vehicle-mounted communication device 102 may also collect the speech of the user in the vehicle 101, for example, through a microphone in the vehicle 101. In doing so, it may also collect the far-end voice signal that propagates directly through the air to the vicinity of the microphone, or that reaches the vicinity of the microphone through the air after one or more reflections. The vehicle-mounted communication device 102 transmits the collected voice signals as near-end voice signals to the near-end terminal device 103, and the near-end terminal device 103 transmits the near-end voice signals to the far-end terminal device. When the far-end user listens to the near-end voice signal, the far-end user hears an echo of his or her own speech, which affects the call quality.

Therefore, echo cancellation of the near-end speech signal is required. Particularly, in a vehicle-mounted communication scene, echo of voice call is obvious, noise in a vehicle is large, and if echo of voice call is not completely eliminated, the reduction of call quality is more obvious under the influence of noise.

In the echo cancellation process, the acoustic echo includes a direct echo and an indirect echo. The direct echo, also called a linear echo, is the sound that the microphone of the near-end terminal device picks up directly when the speaker of the near-end terminal device plays a far-end voice signal. The indirect echo, also called a nonlinear echo, is the played far-end voice signal that is picked up by the microphone of the near-end terminal device after being reflected one or more times. Therefore, it is usually necessary to perform linear processing on the voice signal collected by the microphone of the near-end terminal device to eliminate the linear echo in the near-end voice signal, and then perform nonlinear processing on the linearly processed voice signal to eliminate the nonlinear echo in the near-end voice signal.

However, the echo cancellation effect of the non-linear processing procedure is too dependent on the echo cancellation effect of the linear processing procedure, and when the echo cancellation effect of the linear processing procedure is not obvious or does not have the echo cancellation effect, the echo cancellation effect of the non-linear processing procedure is obviously reduced.

The embodiment of the disclosure provides an echo processing method for voice communication, which includes acquiring a first voice signal and receiving a second voice signal, performing linear processing on the first voice signal according to the second voice signal to obtain a first residual echo signal and an estimated echo signal, performing nonlinear processing on the first residual echo signal according to the first voice signal and the estimated echo signal to obtain a second residual echo signal, and performing nonlinear processing on the second residual echo signal to obtain a final voice signal. Therefore, by carrying out stepwise nonlinear processing on the first voice signal after linear processing, the dependence of the echo cancellation effect of the nonlinear processing on the echo cancellation effect of the linear processing is reduced, the echo cancellation effect of the nonlinear processing is improved, and the call quality of the voice call is further improved. The first voice signal is a voice signal acquired at the near end of the voice call, and may be understood as the near-end voice signal mentioned in the description of the application scenario plus an acoustic echo of the far-end voice signal. The second voice signal is a voice signal acquired at the far end of the voice call, and may be understood as the far-end voice signal mentioned in the description of the application scenario.

The following describes technical solutions of the embodiments of the present disclosure and how to solve the above technical problems with specific embodiments. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present disclosure will be described below with reference to the accompanying drawings.

Fig. 2 is a flowchart illustrating an echo processing method for a voice call according to an embodiment of the disclosure. As shown in fig. 2, the method includes:

S201, collecting a first voice signal and receiving a second voice signal.

The first voice signal is a voice signal collected at the near end of the voice call and may also be referred to as a collected signal, and the second voice signal is a signal collected at the far end of the voice call and may also be referred to as a reference signal. The first speech signal includes an echo of the second speech signal.

Specifically, the voice call process includes: (I) collecting the first voice signal and sending it to the far-end terminal device, so that the far-end user can hear the voice of the near-end user; and (II) receiving the second voice signal and playing it through the loudspeaker, so that the near-end user can hear the voice of the far-end user. When both (I) and (II) are performed, the call state of the voice call is the double-end call state, that is, both the far-end user and the near-end user are speaking. When only (I) is performed, the call state is the single near-end call state, that is, only the near-end user is speaking. When only (II) is performed, the call state is the single far-end call state, that is, only the far-end user is speaking. The played second voice signal propagates through air or other transmission media and is then collected by the microphone of the near-end terminal device, producing an echo of the second voice signal that is transmitted back to the far-end terminal device. Therefore, in the double-end call state and the single far-end call state, the echo of the second voice signal in the first voice signal needs to be cancelled. In the single near-end call state, the echo of the second voice signal in the first voice signal may or may not be cancelled. When the far-end user is not speaking, that is, in the single near-end call state, the collected second voice signal is zero.

S202, according to the second voice signal, linear processing is carried out on the first voice signal, and a first residual echo signal and an estimated echo signal are obtained.

The first residual echo signal is the first voice signal after linear processing. The estimated echo signal is the echo signal contained in the first voice signal, as estimated by the linear processing.

Specifically, in the process of performing linear processing on the first voice signal, a voice model of the echo of the second voice signal in the first voice signal is established based on the correlation between the second voice signal and the echo in the first voice signal. The echo of the second voice signal is estimated through the voice model to obtain the estimated echo signal, which is stored, and the estimated echo signal is subtracted from the first voice signal to obtain the first residual echo signal. The voice model is a pre-established echo path simulation function that approximates the echo path of the second voice signal; the voice model is not limited herein.

As an example, the voice model may be represented as fe = f(fs), where fs represents the second voice signal, fe is the estimated echo signal, and f() is the echo path simulation function.

S203, carrying out nonlinear processing on the first residual echo signal according to the first voice signal and the estimated echo signal to obtain a second residual echo signal.

Specifically, after the linear processing, a filter coefficient of a filter for the nonlinear processing is determined according to the first voice signal and the estimated echo signal, and the first residual echo signal is subjected to the nonlinear processing by that filter, based on the determined filter coefficient, to obtain a second residual echo signal. This realizes the second round of echo cancellation of the first voice signal.

S204, carrying out nonlinear processing on the second residual echo signal to obtain a final voice signal.

Specifically, after the first residual echo signal is subjected to nonlinear processing to obtain a second residual echo signal, the second residual echo signal is subjected to nonlinear processing through a filter for nonlinear processing to obtain a final voice signal, so that third round echo cancellation and second round nonlinear echo processing of the first voice signal are realized, and the degree of dependence of an echo cancellation effect of the nonlinear processing on an echo cancellation effect of the linear processing is reduced by performing stepwise nonlinear processing on the first residual echo signal.

After the final voice signal is obtained, the final voice signal can be sent to the remote terminal device.

In the embodiment of the present disclosure, a first voice signal is subjected to linear processing to obtain a first residual echo signal, the first residual echo signal is subjected to nonlinear processing to obtain a second residual echo signal, and the second residual echo signal is subjected to nonlinear processing to obtain a final voice signal, so that by performing linear processing and step-by-step nonlinear processing on the first voice signal, the degree of dependence of an echo cancellation effect of the nonlinear processing on an echo cancellation effect of the linear processing is reduced, the echo cancellation effect is improved, and further, the call quality of a voice call is improved.
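The data flow of S201 to S204 can be sketched as a three-stage pipeline. This is a minimal skeleton, not the implementation of the disclosure: the function name `cancel_echo` and the three pluggable stage callables are hypothetical, and the actual filters are left to the caller.

```python
from typing import Callable, List, Tuple

Signal = List[float]


def cancel_echo(
    first_voice: Signal,
    second_voice: Signal,
    linear_stage: Callable[[Signal, Signal], Tuple[Signal, Signal]],
    first_nonlinear_stage: Callable[[Signal, Signal, Signal], Signal],
    second_nonlinear_stage: Callable[[Signal], Signal],
) -> Signal:
    """Chain the linear stage and the two nonlinear stages (hypothetical helper)."""
    # S202: linear processing yields the first residual echo signal and the estimated echo signal.
    first_residual, estimated_echo = linear_stage(first_voice, second_voice)
    # S203: the first nonlinear stage also sees the first voice signal and the estimated echo.
    second_residual = first_nonlinear_stage(first_residual, first_voice, estimated_echo)
    # S204: the second nonlinear stage operates on the second residual echo signal alone.
    return second_nonlinear_stage(second_residual)
```

Passing the stages as callables keeps the staging explicit: the second nonlinear stage receives only the output of the first, which mirrors the step-by-step structure described above.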

Based on the embodiment shown in fig. 2, one possible implementation manner of S202 is: according to the second voice signal, performing linear processing on the first voice signal through a first preset filtering algorithm to obtain a first residual echo signal and an estimated echo signal. The first preset filtering algorithm is a filtering algorithm used to adjust the voice model so that it approximates the echo path of the second voice signal more and more closely, thereby improving the echo cancellation effect of the linear processing.

Optionally, the first preset filtering algorithm is a Normalized Least Mean Square (NLMS) filtering algorithm, so as to improve the echo cancellation effect of the linear processing.
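For readers unfamiliar with NLMS, the following is a minimal time-domain sketch of a linear stage that returns both the residual signal and the estimated echo. The function name and the parameters `taps`, `mu`, and `eps` are assumptions chosen for illustration; the disclosure does not state the filter length, step size, or whether the filter operates in the time or frequency domain.

```python
from typing import List, Tuple


def nlms_linear_stage(
    near: List[float],
    far: List[float],
    taps: int = 8,      # assumed filter length
    mu: float = 0.5,    # assumed step size, 0 < mu < 2
    eps: float = 1e-8,  # regularization to avoid division by zero
) -> Tuple[List[float], List[float]]:
    """Return (first residual echo signal, estimated echo signal)."""
    weights = [0.0] * taps
    history = [0.0] * taps  # most recent far-end samples, newest first
    residual: List[float] = []
    echo_est: List[float] = []
    for d, x in zip(near, far):
        history = [x] + history[:-1]
        # Estimated echo: far-end history filtered by the current echo-path model.
        y = sum(w * h for w, h in zip(weights, history))
        e = d - y
        # Normalized update: step scaled by the energy of the far-end history.
        step = mu * e / (sum(h * h for h in history) + eps)
        weights = [w + step * h for w, h in zip(weights, history)]
        echo_est.append(y)
        residual.append(e)
    return residual, echo_est
```

The weights play the role of the echo path simulation function f(): as they adapt, the estimated echo approaches the actual echo and the residual shrinks.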

Based on the embodiment shown in fig. 2, one possible implementation manner of S203 is: carrying out nonlinear processing on the first residual echo signal through a second preset filtering algorithm according to the first voice signal and the estimated echo signal to obtain a second residual echo signal. One possible implementation of S204 is: carrying out nonlinear processing on the second residual echo signal through the second preset filtering algorithm to obtain a final voice signal, thereby improving the echo cancellation effect of the nonlinear processing. The second preset filtering algorithm is a filtering algorithm used for nonlinear processing of the voice signal.

Optionally, the second preset filtering algorithm is a wiener filtering algorithm, and the first residual echo signal is subjected to progressive nonlinear processing through the wiener filtering algorithm, so that an echo cancellation effect of the nonlinear processing is improved. The performance of the wiener filtering algorithm depends on the echo cancellation effect of linear processing, if the echo cancellation effect of the linear processing is not obvious, the performance of the wiener filtering algorithm is obviously reduced, therefore, the first residual echo signal is subjected to nonlinear processing through the wiener filtering algorithm to obtain a second residual echo signal, the second residual echo signal is subjected to nonlinear processing through the wiener filtering algorithm to realize step-by-step nonlinear processing, and the dependence degree of the performance of the wiener filtering algorithm on the linear processing is reduced.

Fig. 3 is a flowchart illustrating an echo processing method for a voice call according to another embodiment of the disclosure. As shown in fig. 3, the method includes:

S301, collecting a first voice signal and receiving a second voice signal.

S302, according to the second voice signal, linear processing is carried out on the first voice signal through a first preset filtering algorithm, and a first residual echo signal and an estimated echo signal are obtained.

S303, carrying out nonlinear processing on the first residual echo signal through a second preset filtering algorithm according to the first voice signal and the estimated echo signal to obtain a second residual echo signal.

Specifically, for S301 to S303, refer to the corresponding contents of the embodiment shown in fig. 2; details are not described again.

S304, updating the second preset filtering algorithm, and performing nonlinear processing on the second residual echo signal according to the updated second preset filtering algorithm to obtain a final voice signal.

Specifically, the second preset filtering algorithm may be updated by updating its filter coefficient. The second residual echo signal is then subjected to nonlinear processing through the updated second preset filtering algorithm to obtain the final voice signal. Processing the second residual echo signal with the updated algorithm avoids the situation in which a fixed second preset filtering algorithm, whose filter coefficient is not smooth, leaves the final voice signal unsmooth and even accompanied by noise. In other words, updating the second preset filtering algorithm can offset the lack of smoothness that nonlinear processing introduces into the voice signal.

In the embodiment of the disclosure, not only is the dependence of the echo cancellation effect of the nonlinear processing on the echo cancellation effect of the linear processing reduced by performing the nonlinear processing on the first residual echo signal step by step, but also the updated second preset filtering algorithm is used to process the second residual echo signal after the nonlinear processing is performed on the first residual echo signal by using the second preset filtering algorithm, so as to balance the phenomenon that the voice signal is not smooth caused by the nonlinear processing, thereby effectively improving the communication quality of the voice communication.

Fig. 4 is a flowchart illustrating an echo processing method for a voice call according to an embodiment of the present disclosure. As shown in fig. 4, the method includes:

S401, collecting a first voice signal and receiving a second voice signal.

S402, according to the second voice signal, the first voice signal is subjected to linear processing to obtain a first residual echo signal and an estimated echo signal.

S403, performing nonlinear processing on the first residual echo signal according to the first voice signal and the estimated echo signal to obtain a second residual echo signal.

Specifically, S401 to S403 may refer to corresponding contents of the embodiment shown in fig. 2, and are not described again.

S404, detecting the second voice signal to obtain a detection result.

Wherein, the detection result is that the far-end user is speaking or the far-end user is not speaking.

Specifically, the second voice signal is detected to determine whether the remote user speaks, and a detection result is obtained.

S405, determining the call state of the voice call according to the detection result.

Specifically, if the far-end user is speaking, it is determined that the voice call state may be a single far-end call state or a double-end call state, and if the far-end user is not speaking, it is determined that the voice call state may be a single near-end call state.

S406, if the call state of the voice call is determined to be a single far-end call state or a double-end call state, performing nonlinear processing on the second residual echo signal to obtain a final voice signal.

Specifically, in a voice call scenario, compared with a single far-end call state and a single near-end call state, the occurrence time of a double-end call state is relatively short, and compared with a single near-end call state, the influence of echo on the call quality in the single far-end call state and the double-end call state is relatively large. In order to eliminate the echo in the voice signal as much as possible and to retain the voice quality in the single near-end call state to the maximum extent, the second residual echo signal is subjected to the nonlinear processing only when the call state of the voice call is the single far-end call state or the double-end call state. And under the condition that the call state of the voice call is a single near-end call state, directly determining the second residual echo signal as a final voice signal.
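The gating described in S406 can be sketched as follows. The function name `final_voice_signal` and its parameters are hypothetical; `far_end_speaking` stands for the detection result of S404 and S405, and `second_nonlinear_stage` stands for whatever second-stage filter is used.

```python
from typing import Callable, List

Signal = List[float]


def final_voice_signal(
    second_residual: Signal,
    far_end_speaking: bool,
    second_nonlinear_stage: Callable[[Signal], Signal],
) -> Signal:
    """Apply the second nonlinear stage only when the far end is speaking."""
    if far_end_speaking:
        # Single far-end or double-end call state: run the second nonlinear stage.
        return second_nonlinear_stage(second_residual)
    # Single near-end call state: the second residual echo signal is the final voice signal.
    return list(second_residual)
```

Skipping the second stage when only the near end is speaking avoids attenuating near-end speech that contains no far-end echo.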

Specifically, the process of performing the nonlinear processing on the second residual echo signal may refer to corresponding contents in the embodiments shown in fig. 2 or fig. 3, and is not described again.

In the embodiment of the disclosure, when the call state of the voice call is a single far-end call state or a double-end call state, the first voice signal is subjected to linear processing and stepwise nonlinear processing to eliminate an echo in the first voice signal as much as possible and improve an echo cancellation effect, and when the call state of the voice call is a single near-end call state, the first voice signal is subjected to linear processing and nonlinear processing, but nonlinear processing is not stepwise, so as to ensure voice quality in the single near-end call state.

Based on the embodiment shown in fig. 4, one possible implementation manner of S404 is: and detecting the second Voice signal through a preset Voice Activity Detection (VAD) algorithm to obtain a Detection result so as to improve the Detection accuracy.

Alternatively, the detection result may include the number of consecutive voice frames of the second voice signal. One possible implementation of S404 is: and detecting the second voice signal through VAD algorithm to obtain the continuous voice frame number of the second voice signal. One possible implementation of S405 is: if the continuous voice frame number of the second voice signal is less than or equal to the preset frame number threshold, the far-end user is not speaking, the conversation state of the voice conversation is determined to be a single near-end conversation state, otherwise, the far-end user is speaking, and the conversation state of the voice conversation is determined to be a single far-end conversation state or a double-end conversation state.
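The frame-count decision can be sketched as below. The energy-threshold test in `frame_is_voiced` is only a stand-in for the preset VAD algorithm, which the disclosure does not specify, and counting the run of voiced frames that ends at the most recent frame is one assumed interpretation of "continuous voice frame number". All names and thresholds are hypothetical.

```python
from typing import List

Frame = List[float]


def frame_is_voiced(frame: Frame, energy_threshold: float) -> bool:
    """Stand-in VAD (assumption): a non-empty frame is voiced if its mean energy exceeds a threshold."""
    return sum(x * x for x in frame) / len(frame) > energy_threshold


def consecutive_voice_frames(frames: List[Frame], energy_threshold: float) -> int:
    """Length of the run of voiced frames ending at the most recent frame."""
    count = 0
    for frame in reversed(frames):
        if not frame_is_voiced(frame, energy_threshold):
            break
        count += 1
    return count


def far_end_is_speaking(
    frames: List[Frame], energy_threshold: float, frame_count_threshold: int
) -> bool:
    """False (single near-end state) when the count is at or below the threshold, else True."""
    return consecutive_voice_frames(frames, energy_threshold) > frame_count_threshold
```

Requiring more than a threshold number of consecutive voiced frames makes the decision less sensitive to isolated frames that a VAD misclassifies as speech.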

Based on any of the embodiments shown in fig. 2 to 4, in the process of obtaining the second residual echo signal by performing the nonlinear processing on the first residual echo signal through the second preset filtering algorithm according to the first voice signal and the estimated echo signal, one possible implementation manner is as follows: determining a filter coefficient of the second preset filtering algorithm according to the first voice signal, the first residual echo signal and the estimated echo signal, and performing nonlinear processing on the first residual echo signal through the second preset filtering algorithm using the determined filter coefficient, to obtain the second residual echo signal. Because the filter coefficient of the second preset filtering algorithm is determined based on the first voice signal, the first residual echo signal and the estimated echo signal, the rationality of the filter coefficient is improved, which in turn improves the echo cancellation effect of the second preset filtering algorithm.

In the process of determining the filter coefficient of the second preset filtering algorithm according to the first voice signal, the first residual echo signal and the estimated echo signal, one possible implementation manner is as follows: a filtering weighting coefficient of the second preset filtering algorithm is determined first; its value may be set, for example, empirically. The filter coefficient is then determined according to the filtering weighting coefficient, the cross-power spectral density of the first voice signal and the first residual echo signal, and the self-power spectral density of the estimated echo signal, so as to improve the rationality of the filter coefficient and the echo cancellation effect of the second preset filtering algorithm.

Further, the calculation formula of the filter coefficient can be expressed as:

where H denotes the filter coefficient of the second preset filtering algorithm, P_ed denotes the cross-power spectral density of the first voice signal and the first residual echo signal, P_yy denotes the self-power spectral density of the estimated echo signal, and a denotes the filtering weighting coefficient.

Further, when determining the filtering weighting coefficient of the second preset filtering algorithm, the filtering weighting coefficient may be determined according to the energy of the first voice signal and the energy of the first residual echo signal, so as to improve the rationality of the filtering weighting coefficient. For example, the value of the filtering weighting coefficient may be determined within a preset value range of the filtering weighting coefficient (e.g., 2 to 8) according to the ratio of the energy of the first voice signal to the energy of the first residual echo signal. The preset value range of the filtering weighting coefficient may be set empirically.
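One way to picture this step is the sketch below. The description only states that the coefficient is chosen within a preset range according to the energy ratio; the specific mapping used here (the ratio itself, clipped to the range) is an assumption made purely for illustration, as are the names `energy` and `weighting_coefficient` and the small `eps` guard against division by zero.

```python
import numpy as np


def energy(signal) -> float:
    """Sum of squared magnitudes of the samples (or frequency bins)."""
    return float(np.sum(np.abs(np.asarray(signal)) ** 2))


def weighting_coefficient(first_voice, first_residual,
                          a_min: float = 2.0, a_max: float = 8.0,
                          eps: float = 1e-12) -> float:
    """Pick a filtering weighting coefficient inside [a_min, a_max].

    ASSUMPTION: the energy ratio is used directly and clipped to the preset
    range; the source does not specify the mapping.
    """
    ratio = energy(first_voice) / (energy(first_residual) + eps)
    return float(np.clip(ratio, a_min, a_max))
```

Under this assumed mapping, a first residual echo signal that is much weaker than the first voice signal (the linear stage removed a large echo component) yields a coefficient near the upper end of the range, and a residual close to the first voice signal in energy yields a coefficient at the lower end.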

Based on any of the embodiments shown in fig. 3 to 4, in the process of updating the second preset filtering algorithm, one possible implementation manner is: determining a filtering weighting coefficient corresponding to the second residual echo signal according to a plurality of preset grading threshold values; updating the filtering weighting coefficient of the second preset filtering algorithm according to the filtering weighting coefficient corresponding to the second residual echo signal; and updating the filter coefficient of the second preset filtering algorithm according to the updated filtering weighting coefficient, to obtain an updated second preset filtering algorithm. By setting a plurality of grading threshold values, different filtering weighting coefficients are determined for second residual echo signals with different energies, so that the filtering weighting coefficient, and in turn the filter coefficient of the second preset filtering algorithm, can be flexibly adjusted. The energy of the second residual echo signal in the single far-end call state differs markedly from that in the double-end call state. Therefore, as the grading threshold values are refined, the echo processing strength in the single far-end call state can be distinguished from that in the double-end call state without applying an absolute, either-or distinction between the two states; the filtering weighting coefficient is kept within a reasonable range, missed echo is avoided, and the echo cancellation effect is improved.

Specifically, when the filtering weighting coefficient corresponding to the second residual echo signal is determined according to the plurality of preset grading threshold values, the energy of the second residual echo signal may be compared with the plurality of grading threshold values. For example, if the energy of the second residual echo signal is greater than grading threshold B and less than grading threshold C, the filtering weighting coefficient corresponding to the threshold interval [grading threshold B, grading threshold C] is the filtering weighting coefficient corresponding to the second residual echo signal.

Furthermore, the energy of the residual echo signal containing voice features may be obtained from the second residual echo signal, and the filtering weighting coefficient corresponding to the second residual echo signal may be determined according to this energy and the plurality of grading threshold values, so that the rationality of the filtering weighting coefficient is improved and the echo cancellation effect of the nonlinear processing is further improved. Specifically, the portion of the second residual echo signal that exhibits characteristics of human speech (for example, the frequency characteristics of a speaking voice) may be identified to obtain the residual echo signal containing voice features, and the energy of that signal is then obtained.

Optionally, when the second residual echo signal includes a plurality of second residual echo signal subbands, the second residual echo signal subbands including the speech feature may be identified from the plurality of second residual echo signal subbands, and the energy of all the second residual echo signal subbands including the speech feature is obtained through statistics.

Further, the larger the energy of the residual echo signal containing the speech feature, the smaller the filter weighting coefficient. According to the calculation formula of the filter coefficients, the smaller the filter weighting coefficient is, the larger the filter coefficient of the second preset filter algorithm is, so that echoes can be prevented from being leaked in the nonlinear processing process through the larger filter coefficient in the double-end call state, and the echo cancellation effect is improved.
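The graded-threshold lookup described above can be sketched as follows. The threshold and coefficient values in the usage note, the handling of an energy exactly equal to a threshold, and the names `speech_band_energy` and `graded_weighting_coefficient` are assumptions for illustration; which sub-bands contain voice features is taken as given (the `has_speech` flags), since the description does not fix a detection method.

```python
from bisect import bisect_right
from typing import Sequence


def speech_band_energy(subbands: Sequence[Sequence[complex]],
                       has_speech: Sequence[bool]) -> float:
    """Total energy of the sub-bands flagged as containing voice features."""
    return float(sum(
        sum(abs(v) ** 2 for v in band)
        for band, flagged in zip(subbands, has_speech) if flagged
    ))


def graded_weighting_coefficient(speech_energy: float,
                                 thresholds: Sequence[float],
                                 coefficients: Sequence[float]) -> float:
    """Look up the coefficient of the threshold interval containing the energy.

    `thresholds` must be ascending and `coefficients` must have one more entry
    than `thresholds`, ordered from the lowest-energy interval upward.
    """
    if list(thresholds) != sorted(thresholds):
        raise ValueError("thresholds must be in ascending order")
    if len(coefficients) != len(thresholds) + 1:
        raise ValueError("need exactly len(thresholds) + 1 coefficients")
    # ASSUMPTION: an energy equal to a threshold falls in the upper interval.
    return coefficients[bisect_right(thresholds, speech_energy)]
```

With hypothetical thresholds `[1.0, 10.0, 100.0]` and coefficients `[8.0, 6.0, 4.0, 2.0]`, the coefficient decreases as the speech-bearing energy grows, which matches the stated relationship between that energy and the filtering weighting coefficient.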

Based on any of the embodiments shown in fig. 2-4, one possible implementation manner of S202, S302, or S402 is: performing sub-band decomposition on the first voice signal and the second voice signal respectively to obtain a first voice signal sub-band and a second voice signal sub-band; and performing linear processing on the first voice signal sub-band according to the second voice signal sub-band to obtain a first residual echo signal sub-band and an estimated echo signal sub-band. One possible implementation of S203, S303, or S403 is: performing nonlinear processing on the first residual echo signal sub-band according to the first voice signal sub-band and the estimated echo signal sub-band to obtain a second residual echo signal sub-band. One possible implementation of S204, S304, or S406 is: performing nonlinear processing on the second residual echo signal sub-band, and performing sub-band synthesis on the second residual echo signal sub-band after the nonlinear processing to obtain the final voice signal. In this way, by exploiting the facts that each sub-band has a small bandwidth and can be processed independently, the linear processing and stepwise nonlinear processing of the first voice signal are realized on the first voice signal sub-bands, which improves the echo cancellation effect of the voice call. The plurality of first residual echo signal sub-bands constitute the first residual echo signal, the plurality of estimated echo signal sub-bands constitute the estimated echo signal, and the plurality of second residual echo signal sub-bands constitute the second residual echo signal.

Specifically, the first voice signal and the second voice signal may each be converted from the time domain to the frequency domain through a Fourier transform, and sub-band decomposition is then performed on each of them in the frequency domain to obtain the first voice signal sub-band and the second voice signal sub-band.
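A minimal single-frame sketch of this decomposition and the matching synthesis is given below, using NumPy's real FFT. Grouping adjacent frequency bins into roughly equal sub-bands, the function names, and the omission of windowing and overlap-add are simplifying assumptions; a practical system would typically process overlapping windowed frames or use a filter bank.

```python
import numpy as np


def subband_decompose(frame, num_subbands: int):
    """Transform one time-domain frame to the frequency domain and split the
    resulting bins into `num_subbands` groups of adjacent bins."""
    spectrum = np.fft.rfft(np.asarray(frame, dtype=float))
    return np.array_split(spectrum, num_subbands)


def subband_synthesize(subbands, frame_length: int):
    """Concatenate the sub-bands and transform back to the time domain."""
    spectrum = np.concatenate(subbands)
    return np.fft.irfft(spectrum, n=frame_length)
```

Because each sub-band is a separate array, the later linear and nonlinear stages can operate on the sub-bands independently before `subband_synthesize` reassembles the frame.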

Specifically, the first preset filtering algorithm may be used for the linear processing of the first voice signal sub-band, and the second preset filtering algorithm described above may be used for the stepwise nonlinear processing of the first residual echo signal sub-band.
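For orientation, the linear stage can be sketched with a textbook NLMS (normalized least mean squares) adaptive filter, assuming that this is the kind of algorithm meant by the first preset filtering algorithm. The sketch works on real-valued time-domain samples rather than on sub-bands, and the function name `nlms` and the parameters `taps`, `mu` and `eps` are illustrative choices, not values taken from the disclosure.

```python
import numpy as np


def nlms(far_end, near_end, taps: int = 32, mu: float = 0.5, eps: float = 1e-8):
    """Estimate the echo of `far_end` contained in `near_end`.

    Returns (residual, echo_estimate), corresponding to the first residual
    echo signal and the estimated echo signal of the linear stage.
    """
    far_end = np.asarray(far_end, dtype=float)
    near_end = np.asarray(near_end, dtype=float)
    weights = np.zeros(taps)           # adaptive estimate of the echo path
    history = np.zeros(taps)           # most recent far-end samples, newest first
    residual = np.zeros_like(near_end)
    echo_estimate = np.zeros_like(near_end)
    for n in range(len(near_end)):
        history = np.roll(history, 1)
        history[0] = far_end[n]
        echo_estimate[n] = weights @ history
        residual[n] = near_end[n] - echo_estimate[n]
        # Step size normalized by the far-end power in the filter window.
        weights += mu * residual[n] * history / (history @ history + eps)
    return residual, echo_estimate
```

In this sketch the residual and the echo estimate always sum to the near-end input, so the two outputs carry exactly the information the subsequent nonlinear stage needs.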

As an example, when the second preset filtering algorithm is the wiener filtering algorithm, the formula for performing the nonlinear processing on the first residual echo signal subband or the second residual echo signal subband by the wiener filtering algorithm may be represented as:

E'(k) = E(k) × H

where k denotes the k-th first residual echo signal subband or the k-th second residual echo signal subband, E'(k) denotes the residual echo amplitude of the k-th first residual echo signal subband, E(k) denotes the frequency-domain representation of the k-th first residual echo signal subband, and H is the filter coefficient of the second preset filtering algorithm.
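Applying the filter coefficient to each sub-band, once per nonlinear stage, can be sketched as below. The sketch treats H as a single scalar gain supplied by the caller, which is an assumption; how H is computed from the power spectral densities and the weighting coefficient is not reproduced here, and the name `apply_filter_coefficient` is hypothetical.

```python
import numpy as np


def apply_filter_coefficient(subbands, gain: float):
    """Scale every sub-band spectrum by the filter coefficient: E'(k) = E(k) * H."""
    return [np.asarray(band, dtype=complex) * gain for band in subbands]
```

Running the function twice, first with the initial coefficient and then with the updated one, mirrors the stepwise nonlinear processing: `apply_filter_coefficient(apply_filter_coefficient(bands, h1), h2)` scales each sub-band by `h1 * h2` overall.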

Fig. 5 is a schematic structural diagram of an echo processing device for voice call according to an embodiment of the present disclosure. As shown in fig. 5, the apparatus includes:

an obtaining module 501, configured to obtain a first voice signal and a second voice signal, where the first voice signal is a voice signal acquired at a near end of a voice call, and the second voice signal is a voice signal acquired at a far end of the voice call;

a linear processing module 502, configured to perform linear processing on the first voice signal according to the second voice signal to obtain a first residual echo signal and an estimated echo signal;

a first nonlinear processing module 503, configured to perform nonlinear processing on the first residual echo signal according to the first voice signal and the estimated echo signal, so as to obtain a second residual echo signal;

a second nonlinear processing module 504, configured to perform nonlinear processing on the second residual echo signal to obtain a final voice signal.

In a possible implementation manner, the linear processing module 502 is specifically configured to:

according to the second voice signal, performing linear processing on the first voice signal through a first preset filtering algorithm to obtain a first residual echo signal and an estimated echo signal;

the first nonlinear processing module 503 is specifically configured to:

according to the first voice signal and the estimated echo signal, carrying out nonlinear processing on the first residual echo signal through a second preset filtering algorithm to obtain a second residual echo signal;

the second non-linear processing module 504 is specifically configured to:

and carrying out nonlinear processing on the second residual echo signal through a second preset filtering algorithm to obtain a final voice signal.

In a possible implementation, the first predetermined filtering algorithm is a normalized minimum mean square error filtering algorithm, and the second predetermined filtering algorithm is a wiener filtering algorithm.

In a possible implementation manner, the second nonlinear processing module 504 is specifically configured to:

and updating the second preset filtering algorithm, and carrying out nonlinear processing on the second residual echo signal according to the updated second preset filtering algorithm.

In one possible implementation, the apparatus further includes:

the voice detection module is used for detecting the second voice signal through a preset voice activity detection algorithm to obtain a detection result;

the state determining module is used for determining the conversation state of the voice conversation according to the detection result;

the second non-linear processing module 504 is specifically configured to:

and if the call state of the voice call is determined to be a single far-end call state or a double-end call state, carrying out nonlinear processing on the second residual echo signal to obtain a final voice signal.

In one possible implementation, the detection result includes a number of consecutive speech frames of the second speech signal; a state determination module specifically configured to:

and if the continuous voice frame number of the second voice signal is less than or equal to a preset frame number threshold, determining that the call state of the voice call is a single near-end call state, and otherwise, determining that the call state of the voice call is a single far-end call state or a double-end call state.

In a possible implementation manner, the first nonlinear processing module 503 is specifically configured to:

determining a filter coefficient of a second preset filter algorithm according to the first voice signal, the first residual echo signal and the estimated echo signal; and carrying out nonlinear processing on the first residual echo signal through a second preset filtering algorithm to obtain a second residual echo signal.

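In a possible implementation manner, the first nonlinear processing module 503 is specifically configured to:

determining a filtering weighting coefficient of a second preset filtering algorithm; and determining a filter coefficient according to the filtering weighting coefficient, the cross-power spectral density of the first voice signal and the first residual echo signal, and the self-power spectral density of the estimated echo signal.

In a possible implementation manner, the first nonlinear processing module 503 is specifically configured to:

determining the filtering weighting coefficient according to the energy of the first voice signal and the energy of the first residual echo signal.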

In a possible implementation manner, the second nonlinear processing module 504 is specifically configured to:

determining a filtering weighting coefficient corresponding to the second residual echo signal according to a plurality of preset grading threshold values; updating the filtering weighting coefficient of a second preset filtering algorithm according to the filtering weighting coefficient corresponding to the second residual echo signal; and updating the filter coefficient of the second preset filtering algorithm according to the updated filtering weighting coefficient.

In a possible implementation manner, the second nonlinear processing module 504 is specifically configured to:

obtaining the energy of the residual echo signal containing voice features in the second residual echo signal; and determining the filtering weighting coefficient corresponding to the second residual echo signal according to the energy and the plurality of grading threshold values.

In one possible implementation, the larger the energy of the residual echo signal containing the speech feature, the smaller the filtering weighting factor.

In a possible implementation manner, the linear processing module 502 is specifically configured to:

respectively carrying out sub-band decomposition on the first voice signal and the second voice signal to obtain a first voice signal sub-band and a second voice signal sub-band; according to the second voice signal sub-band, performing linear processing on the first voice signal sub-band to obtain a first residual echo signal sub-band and an estimated echo signal sub-band;

the first nonlinear processing module 503 is specifically configured to:

according to the first voice signal sub-band and the estimated echo signal sub-band, carrying out nonlinear processing on the first residual echo signal sub-band to obtain a second residual echo signal sub-band;

the second non-linear processing module 504 is specifically configured to:

carrying out nonlinear processing on the second residual echo signal sub-band; and performing sub-band synthesis on the second residual echo signal sub-band after the nonlinear processing to obtain a final voice signal.

The echo processing device for voice call provided in fig. 5 can implement the above corresponding method embodiments, and the implementation principle and technical effect are similar, which are not described herein again.

Fig. 6 is a schematic structural diagram of a terminal device according to an embodiment of the present disclosure. As shown in fig. 6, the terminal device may include: a processor 601 and a memory 602. The memory 602 is used for storing computer-executable instructions, and the processor 601, when executing the computer-executable instructions, implements the method according to any of the embodiments described above.

The processor 601 may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like. The memory 602 may include a random access memory (RAM) and may further include a non-volatile memory, such as at least one disk memory.

An embodiment of the present disclosure also provides a computer-readable storage medium having stored therein instructions, which, when run on a computer, cause the computer to perform the method of any of the embodiments described above.

An embodiment of the present disclosure also provides a program product comprising a computer program, the computer program being stored in a storage medium, the computer program being readable from the storage medium by at least one processor, the at least one processor being capable of implementing the method of any of the above embodiments when executing the computer program.

Fig. 7 is a block diagram illustrating a terminal device, which may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, etc., according to one exemplary embodiment.

The apparatus 700 may include one or more of the following components: a processing component 702, a memory 704, a power component 706, a multimedia component 708, an audio component 710, an input/output (I/O) interface 712, a sensor component 714, and a communication component 716.

The processing component 702 generally controls overall operation of the device 700, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 702 may include one or more processors 720 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 702 may include one or more modules that facilitate interaction between the processing component 702 and other components. For example, the processing component 702 may include a multimedia module to facilitate interaction between the multimedia component 708 and the processing component 702.

The memory 704 is configured to store various types of data to support operations at the apparatus 700. Examples of such data include instructions for any application or method operating on device 700, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 704 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.

The power supply component 706 provides power to the various components of the device 700. The power supply component 706 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the apparatus 700.

The multimedia component 708 includes a screen that provides an output interface between the device 700 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 708 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the device 700 is in an operation mode, such as a photographing mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.

The audio component 710 is configured to output and/or input audio signals. For example, audio component 710 includes a Microphone (MIC) configured to receive external audio signals when apparatus 700 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signal may further be stored in the memory 704 or transmitted via the communication component 716. In some embodiments, audio component 710 also includes a speaker for outputting audio signals.

The I/O interface 712 provides an interface between the processing component 702 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.

The sensor assembly 714 includes one or more sensors for providing status assessment of various aspects of the apparatus 700. For example, the sensor assembly 714 may detect an open/closed state of the device 700 and the relative positioning of components, such as the display and keypad of the device 700. The sensor assembly 714 may also detect a change in position of the device 700 or a component of the device 700, the presence or absence of user contact with the device 700, the orientation or acceleration/deceleration of the device 700, and a change in temperature of the device 700. The sensor assembly 714 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 714 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 714 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.

The communication component 716 is configured to facilitate wired or wireless communication between the apparatus 700 and other devices. The apparatus 700 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 716 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 716 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.

In an exemplary embodiment, the apparatus 700 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.

In an exemplary embodiment, a non-transitory computer readable storage medium comprising instructions, such as the memory 704 comprising instructions, executable by the processor 720 of the device 700 to perform the above-described method is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.

A non-transitory computer-readable storage medium is also provided, in which instructions, when executed by a processor of a terminal device, enable the terminal device to perform the echo processing method for a voice call described above.

Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. The embodiments of the disclosure are intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.

It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims

1. An echo processing method for a voice call, the method comprising:

2. The method of claim 1, wherein performing linear processing on the first speech signal to obtain a first residual echo signal and an estimated echo signal according to the second speech signal comprises:

and performing linear processing on the first voice signal through a first preset filtering algorithm according to the second voice signal to obtain the first residual echo signal and the estimated echo signal.

3. The method according to claim 2, wherein the first predetermined filtering algorithm is a normalized minimum mean square error filtering algorithm and the second predetermined filtering algorithm is a wiener filtering algorithm.

4. The method of claim 1, further comprising:

detecting the second voice signal through a preset voice activity detection algorithm to obtain a detection result;

determining the call state of the voice call according to the detection result;

the performing nonlinear processing on the second residual echo signal to obtain a final speech signal includes:

5. The method of claim 4, wherein the detection result comprises a number of consecutive frames of speech of the second speech signal;

the determining the call state of the voice call according to the detection result comprises:

and if the continuous voice frame number of the second voice signal is less than or equal to a preset frame number threshold, determining that the voice call state is a single near-end call state, otherwise, determining that the voice call state is a single far-end call state or a double-end call state.

6. The method according to claim 1, wherein said performing a non-linear processing on said first residual echo signal according to said first speech signal and said estimated echo signal by said second predetermined filtering algorithm to obtain said second residual echo signal comprises:

determining a filter coefficient of the second preset filter algorithm according to the first voice signal, the first residual echo signal and the estimated echo signal;

and carrying out nonlinear processing on the first residual echo signal through the second preset filtering algorithm to obtain a second residual echo signal.

7. The method of claim 6, wherein determining the filter coefficients of the second predetermined filtering algorithm based on the first speech signal, the first residual echo signal, and the estimated echo signal comprises:

determining a filtering weighting coefficient of the second preset filtering algorithm;

and determining the filter coefficient according to the filter weighting coefficient, the cross-power spectral density of the first voice signal and the first residual echo signal, and the self-power spectral density of the estimated echo.

8. The method of claim 7, wherein determining the filter weighting coefficients of the second predetermined filter algorithm comprises:

determining the filtering weighting factor according to the energy of the first voice signal and the energy of the first residual echo signal.

9. The method of claim 1, wherein the updating the second predetermined filtering algorithm comprises:

determining a filtering weighting coefficient corresponding to the second residual echo signal according to a plurality of preset grading threshold values;

updating the filtering weighting coefficient of the second preset filtering algorithm according to the filtering weighting coefficient corresponding to the second residual echo signal;

and updating the filter coefficient of the second preset filter algorithm according to the updated filter weighting coefficient.

10. The method according to claim 9, wherein said determining the filtering weighting coefficient corresponding to the second residual echo signal according to a plurality of preset scaling thresholds comprises:

obtaining the energy of the residual echo signal containing the voice characteristics in the second residual echo signal;

and determining a filtering weighting coefficient corresponding to the second residual echo signal according to the energy and the plurality of grading threshold values.

11. The method of claim 10, wherein the filtering weighting factor is smaller for larger energy of the residual echo signal containing speech features.

12. The method according to any of claims 1-11, wherein said performing linear processing on said first speech signal based on said second speech signal to obtain a first residual echo signal and an estimated echo signal comprises:

respectively performing subband decomposition on the first voice signal and the second voice signal to obtain a first voice signal subband and a second voice signal subband;

performing linear processing on the first voice signal sub-band according to the second voice signal sub-band to obtain a first residual echo signal sub-band and an estimated echo signal sub-band;

the performing nonlinear processing on the first residual echo signal according to the first speech signal and the estimated echo signal to obtain a second residual echo signal includes:

carrying out nonlinear processing on the second residual echo signal sub-band;

and performing sub-band synthesis on the second residual echo signal sub-band after the nonlinear processing to obtain the final voice signal.

13. An echo processing device for a voice call, the device comprising:

the first nonlinear processing module is used for carrying out nonlinear processing on the first residual echo signal through a second preset filtering algorithm according to the first voice signal and the estimated echo signal to obtain a second residual echo signal;

and the second nonlinear processing module is used for updating the second preset filtering algorithm and carrying out nonlinear processing on the second residual echo signal according to the updated second preset filtering algorithm.

14. A terminal device, comprising: a memory and a processor;

the memory is to store program instructions;

the processor is configured to invoke program instructions in the memory to perform the method of any of claims 1-12.

15. A computer-readable storage medium, having stored thereon a computer program which, when executed, implements the method of any one of claims 1-12.