
CN109309764B - Audio data processing method and device, electronic equipment and storage medium

Info

Publication number: CN109309764B
Application number: CN201710632689.3A
Authority: CN
Inventors: 李洋; 纪璇; 陈伟
Original assignee: Beijing Sogou Technology Development Co Ltd
Current assignee: Beijing Sogou Technology Development Co Ltd
Priority date: 2017-07-28
Filing date: 2017-07-28
Publication date: 2021-09-03
Anticipated expiration: 2037-07-28
Also published as: CN109309764A

Abstract

The embodiment of the invention provides an audio data processing method, an audio data processing device, electronic equipment and a storage medium, so as to effectively eliminate echo in recorded audio. The method comprises the following steps: acquiring a voice signal, and determining a far-end signal according to frame shift, wherein the frame shift is not equal to the block length; determining a preset number of target far-end signals according to the far-end signals, wherein a part of the target far-end signals are the same as a part of the target far-end signals of a preset frame, the preset number is related to the frame length and the block length, and the preset frame is related to the frame shift and the block length; and carrying out echo cancellation processing according to the voice signal and the target far-end signal to obtain a target signal with echo cancellation.

Description

Audio data processing method and device, electronic equipment and storage medium

Technical Field

The present application relates to the field of technologies, and in particular, to an audio data processing method, an audio data processing apparatus, an electronic device, and a readable storage medium.

Background

With the rapid development of communication technology, terminals such as mobile phones and tablet computers are more and more popular, and great convenience is brought to life, study and work of people.

When using a terminal, a user can interact with other users through voice, video and the like, for example by making a phone call or a video call. During these interactions, the terminal usually turns on a microphone (Mic) to record voice and send it to the opposite communication terminal, and also plays the voice data of the opposite terminal through a loudspeaker. As a result, the audio data recorded by the microphone includes both the voice of the local user and the sound of the opposite end played by the loudspeaker; the recorded loudspeaker sound is called echo. To improve communication quality and prevent the echo from interfering with the normal speech content in the audio, the echo needs to be removed.

Disclosure of Invention

The embodiment of the invention provides an audio data processing method, which is used for effectively eliminating echo in recorded audio.

Correspondingly, the embodiment of the invention also provides an audio data processing device, electronic equipment and a storage medium, which are used for ensuring the realization and application of the method.

In order to solve the above problem, an embodiment of the present invention discloses an audio data processing method, including: acquiring a voice signal, and determining a far-end signal according to frame shift, wherein the frame shift is not equal to the block length; determining a preset number of target far-end signals according to the far-end signals, wherein a part of the target far-end signals are the same as a part of the target far-end signals of a preset frame, the preset number is related to the frame length and the block length, and the preset frame is related to the frame shift and the block length; and carrying out echo cancellation processing according to the voice signal and the target far-end signal to obtain a target signal with echo cancellation.

Optionally, the determining the far-end signal according to the frame shift includes: determining a far-end signal of a first length as a function of a frame shift, wherein the first length is associated with the frame shift; and splicing the far-end signals according to the frame length and the frame shift to obtain the far-end signals with the second length, wherein the second length is related to the frame length.

Optionally, splicing the far-end signal according to the frame length and the frame shift to obtain a far-end signal with a second length, including: determining a far-end signal of a third length before the far-end signal of the first length according to the frame length, wherein the third length is the difference value between the first length and the second length; and splicing the remote signal with the third length and the remote signal with the first length to obtain a remote signal with a second length.

Optionally, the determining a preset number of target far-end signals according to the far-end signals includes: determining a first number of target far-end signals with a fourth length according to the far-end signals with the second length; and acquiring a target far-end signal with a fourth length of a second number stored in a previous setting frame, wherein the sum of the first number and the second number is a preset number, and the fourth length is related to the block length.

Optionally, performing echo cancellation processing according to the voice signal and the target far-end signal to obtain a target signal with echo cancelled, including: determining echo signals to be eliminated correspondingly according to the preset number of target far-end signals with the fourth length; and subtracting the echo signal to be eliminated from the voice signal to obtain a target signal for eliminating the echo.

Optionally, the determining, according to the preset number of target far-end signals with the fourth length, corresponding echo signals to be cancelled includes: processing the first number of target far-end signals with the fourth length to obtain a first far-end signal of a frequency domain; acquiring a second number of target far-end signals with a fourth length, which correspond to the second far-end signals of the frequency domain; and processing the first far-end signal and the second far-end signal of the frequency domain with the spatial impulse response to obtain an echo signal to be eliminated.

Optionally, the method further includes: and updating the frame number corresponding to the target far-end signal with the fourth length of the second number.

The embodiment of the invention also discloses an audio data processing device, which comprises: the signal acquisition module is used for acquiring a voice signal and determining a far-end signal according to frame shift, wherein the frame shift is not equal to the block length; a signal processing module, configured to determine a preset number of target far-end signals according to the far-end signals, where a part of the target far-end signals is the same as a part of target far-end signals of a previous set frame, the preset number is related to a frame length and a block length, and the set frame is related to a frame shift and a block length; and the echo cancellation module is used for carrying out echo cancellation processing according to the voice signal and the target far-end signal to obtain a target signal of echo cancellation.

Optionally, the signal acquisition module includes: a far-end acquisition submodule for determining a far-end signal of a first length according to a frame shift, wherein the first length is associated with the frame shift; and the splicing submodule is used for splicing the far-end signals according to the frame length and the frame shift to obtain the far-end signals with the second length, and the second length is related to the frame length.

Optionally, the splicing sub-module is configured to determine, according to a frame length, a far-end signal of a third length before the far-end signal of the first length, where the third length is a difference between the first length and the second length; and splicing the remote signal with the third length and the remote signal with the first length to obtain a remote signal with a second length.

Optionally, the signal processing module includes: the target determining submodule is used for determining a first number of target far-end signals with a fourth length according to the far-end signals with the second length; and the cache obtaining sub-module is used for obtaining a target far-end signal with a fourth length of a second number stored in a preset frame, wherein the sum of the first number and the second number is a preset number, and the fourth length is related to the block length.

Optionally, the echo cancellation module includes: the echo determining submodule is used for determining the echo signal to be eliminated according to the preset number of the target far-end signals with the fourth length; and the eliminating submodule is used for subtracting the echo signal to be eliminated from the voice signal to obtain a target signal for eliminating the echo.

Optionally, the echo determining submodule is configured to process the first number of target far-end signals with the fourth length to obtain a frequency-domain first far-end signal; acquiring a second number of target far-end signals with a fourth length, which correspond to the second far-end signals of the frequency domain; and processing the first far-end signal and the second far-end signal of the frequency domain with the spatial impulse response to obtain an echo signal to be eliminated.

Optionally, the echo cancellation module is further configured to update the frame number corresponding to the target far-end signal of the second number and the fourth length.

An embodiment of the present invention further discloses an electronic device, which includes a memory, and one or more programs, where the one or more programs are stored in the memory, and configured to be executed by one or more processors, where the one or more programs include instructions for:

acquiring a voice signal, and determining a far-end signal according to frame shift, wherein the frame shift is not equal to the block length;

determining a preset number of target far-end signals according to the far-end signals, wherein a part of the target far-end signals are the same as a part of the target far-end signals of a preset frame, the preset number is related to the frame length and the block length, and the preset frame is related to the frame shift and the block length;

and carrying out echo cancellation processing according to the voice signal and the target far-end signal to obtain a target signal with echo cancellation.

Optionally, the method further comprises instructions for: and updating the frame number corresponding to the target far-end signal with the fourth length of the second number.

The embodiment of the invention also discloses a readable storage medium, which is characterized in that when the instructions in the storage medium are executed by a processor of the electronic equipment, the electronic equipment can execute the audio data processing method according to one or more of the invention embodiments.

The embodiment of the invention has the following advantages:

The embodiment of the invention can collect a voice signal and determine a far-end signal according to the frame shift, where the frame shift is not equal to the block length, and then determine a preset number of target far-end signals according to the far-end signal. Part of the target far-end signals are the same as those already determined for a previous set frame, where the preset number is related to the frame length and the block length, and the set frame is related to the frame shift and the block length. When echo cancellation processing is performed according to the voice signal and the target far-end signals, the repeated part of the target far-end signals does not need to be recalculated, which effectively reduces the amount of computation. The echo-cancelled target signal is then obtained, so that the echo in the voice signal is effectively eliminated and the voice delay is shortened.

Drawings

FIG. 1 is a flow chart of the steps of an embodiment of an audio data processing method of the present application;

FIG. 2 is a flow chart of steps of another audio data processing method embodiment of the present application;

FIG. 3 is a block diagram of an embodiment of an audio data processing apparatus according to the present application;

FIG. 4 is a block diagram of another audio data processing apparatus embodiment of the present application;

FIG. 5 is a block diagram illustrating a configuration of an electronic device for audio data processing in accordance with an exemplary embodiment;

FIG. 6 is a schematic structural diagram of an electronic device for audio data processing according to another exemplary embodiment of the present application.

Detailed Description

In order to make the aforementioned objects, features and advantages of the present application more comprehensible, the present application is described in further detail with reference to the accompanying drawings and the detailed description.

In communications involving speech, acoustic echo is difficult to avoid. The far-end signal is transmitted to the near end through a telephone or a network, then is played through a loudspeaker, and after the far-end signal is transmitted through the space, the far-end signal which is picked up by a near-end microphone and then is transmitted back is acoustic echo. A mathematical model of a speech signal received by a microphone can be expressed in the time domain as:

y(n) = h(n) * x(n) + d(n)

where y(n) is the voice signal collected by the microphone; x(n) is the far-end signal; h(n) is the spatial impulse response; h(n) * x(n) is the convolution of x(n) and h(n), representing the signal picked up by the near-end microphone after the far-end signal has propagated through the space; and d(n) is the near-end signal, i.e. the echo-cancelled target signal.
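The signal model above can be illustrated with a minimal Python sketch. The function names and the toy values for h, x and d are assumptions made only for illustration; they do not come from the embodiment.

```python
def convolve(h, x):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(h) + len(x) - 1)
    for i, h_i in enumerate(h):
        for j, x_j in enumerate(x):
            out[i + j] += h_i * x_j
    return out


def microphone_signal(h, x, d):
    """Return y(n) = (h * x)(n) + d(n) for n = 0 .. len(d) - 1."""
    echo = convolve(h, x)[:len(d)]
    return [e + d_n for e, d_n in zip(echo, d)]


# Hypothetical toy values, chosen only for illustration.
h = [0.5, 0.25]             # spatial impulse response
x = [1.0, 0.0, -1.0, 2.0]   # far-end signal
d = [0.1, 0.2, 0.3, 0.4]    # near-end signal
y = microphone_signal(h, x, d)
```

Subtracting the same convolution from y recovers d exactly in this idealised case, because h is assumed to be known; in practice h has to be estimated by the adaptive filter.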

An Acoustic Echo Cancellation (AEC) algorithm can be used to cancel the acoustic Echo signal, and the AEC algorithm generally has two steps, the first step is an adaptive filtering algorithm, and the second step is a residual Echo post-filtering algorithm, so as to obtain a target signal for Echo Cancellation. The embodiment of the invention is improved based on an AEC algorithm so as to more effectively eliminate echo.

Referring to fig. 1, a flowchart illustrating steps of an embodiment of an audio data processing method according to the present application is shown, which may specifically include the following steps:

Step 102, collecting a voice signal, and determining a far-end signal according to a frame shift, wherein the frame shift is not equal to a block length.

When a terminal or other device is used for communication that includes voice, a microphone can record the voice data, that is, collect the voice signal. A microphone is an energy-conversion device that converts a sound signal into an electrical signal; in this embodiment it may be a microphone built into the device or an external microphone connected to the device. The collected voice signal includes an echo, which is the signal received by the microphone after being played by the loudspeaker, i.e. the far-end signal received by the microphone. The device can also obtain the far-end signal itself, which is transmitted to the device through a telephone or a network during communication and played through the loudspeaker. In the embodiment of the present invention, the far-end signal may be determined according to a frame length and a frame shift, where the frame shift is not equal to the block length.

In one example, a Partitioned Block Frequency Domain Adaptive Filter (PBFDAF) may be used as the adaptive filtering algorithm. In the PBFDAF algorithm, the block length of a partition is set equal to the length of the block filter, while the frame shift of speech processing is not equal to the block length. Assuming that the block length is N, the length of the block filter is also N; assuming that the frame shift is M, N ≠ M, where N and M are integers, e.g. N and M may each be set to a power of 2. Both the block length and the frame shift are parameters of the PBFDAF algorithm.
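As a hedged illustration of these parameter constraints, the following sketch validates a choice of N and M. The function names are hypothetical. The sketch additionally assumes that both values are powers of 2 and that M < N, which the text presents only as options, and it returns the ratio N / M that the later worked example denotes b.

```python
def is_power_of_two(value: int) -> bool:
    """True for 1, 2, 4, 8, ..."""
    return value > 0 and (value & (value - 1)) == 0


def check_block_and_shift(block_len: int, frame_shift: int) -> int:
    """Validate N (block_len) and M (frame_shift); return b = N // M.

    Assumptions of this sketch (not requirements stated by the source):
    both values are powers of 2 and the frame shift is smaller than the
    block length.
    """
    if not (is_power_of_two(block_len) and is_power_of_two(frame_shift)):
        raise ValueError("this sketch assumes N and M are powers of 2")
    if frame_shift == block_len:
        raise ValueError("frame shift must not equal block length")
    if frame_shift > block_len:
        raise ValueError("this sketch assumes M < N")
    return block_len // frame_shift


b = check_block_and_shift(block_len=256, frame_shift=64)
```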

Step 104, determining a preset number of target far-end signals according to the far-end signals, wherein a part of the target far-end signals are the same as a part of the target far-end signals of a preset frame, the preset number is related to the frame length and the block length, and the preset frame is related to the frame shift and the block length.

After the far-end signal is determined in the PBFDAF algorithm, the target far-end signals can be determined from it; the target far-end signals are the far-end signals that need to be processed to perform echo cancellation. In the embodiment of the present invention, the target far-end signals consist of a preset number of far-end signals, each of a preset length; the preset number may be determined according to the frame length, the block length and the like, and the preset length is related to the block length. Because the far-end signals are collected according to the frame shift, the corresponding target far-end signals are also related to the frame length and the block length, and the set frame is related to the frame shift and the block length, comparing the preset number of target far-end signals of the current frame with those of the previous set frame shows that some of them are repeated. Assuming that the current frame is the i-th frame and the set frame is the (i-b)-th frame, part of the target far-end signals of the i-th frame are the same as part of the target far-end signals of the (i-b)-th frame. The repeated target far-end signals therefore do not need to be recalculated during processing; only the non-repeated target far-end signals need to be calculated.

Step 106, carrying out echo cancellation processing according to the voice signal and the target far-end signal to obtain a target signal with echo cancellation.

Then, the target far-end signal is processed, and the processing is a processing of converting a time domain into a frequency domain, for example, various processing operations based on Fourier Transform, such as Discrete Fourier Transform (DFT) and Fast Fourier Transform (FFT), so as to obtain a far-end signal in the frequency domain, which can determine noise, i.e., an echo signal received by the microphone, together with the spatial impulse response h (n) in the frequency domain. The frequency domain result of the Fourier transform corresponding to the target far-end signal of the repeated part does not need to be repeatedly calculated, and only the calculation result corresponding to the previous set frame needs to be obtained.

From the far-end signal in the frequency domain, the noise, i.e. the echo signal received by the microphone, can be determined: the product of the frequency-domain far-end signal and the frequency-domain spatial impulse response, which corresponds to the convolution h(n) * x(n) in the time domain, gives the signal picked up by the near-end microphone after the far-end signal has propagated through the space. Echo cancellation is then performed on the voice signal to remove the echo signal and obtain the echo-cancelled target signal, thereby eliminating the echo in the recorded audio data. For example, during voice or video calls, the data recorded by the microphone then contains as little echo as possible when it is transmitted to the opposite terminal, which ensures call quality. If the voice signal is y(n), the convolution of the far-end signal and the spatial impulse response is calculated in the time domain, or equivalently their product is calculated in the frequency domain, to obtain the echo signal h(n) * x(n); the echo-cancelled target signal is then d(n) = y(n) - h(n) * x(n).
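The equivalence between time-domain convolution and a frequency-domain product can be sketched as follows. This is an illustration only: it uses a naive discrete Fourier transform instead of an FFT, zero-pads both sequences so that the circular convolution equals the linear one, and treats h as known. All names and toy values are assumptions.

```python
import cmath


def dft(seq):
    """Naive discrete Fourier transform (O(n^2), for illustration only)."""
    n = len(seq)
    return [sum(seq[t] * cmath.exp(-2j * cmath.pi * f * t / n)
                for t in range(n)) for f in range(n)]


def idft_real(spec):
    """Inverse DFT, returning the real part of each sample."""
    n = len(spec)
    return [sum(spec[f] * cmath.exp(2j * cmath.pi * f * t / n)
                for f in range(n)).real / n for t in range(n)]


def echo_via_frequency_domain(h, x):
    """Compute h(n) * x(n) as a product of zero-padded spectra."""
    size = len(h) + len(x) - 1
    h_spec = dft(list(h) + [0.0] * (size - len(h)))
    x_spec = dft(list(x) + [0.0] * (size - len(x)))
    return idft_real([a * b for a, b in zip(h_spec, x_spec)])


# Hypothetical toy values.
h = [0.5, 0.25]
x = [1.0, 0.0, -1.0, 2.0]
d = [0.1, 0.2, 0.3, 0.4]
echo = echo_via_frequency_domain(h, x)[:len(d)]
y = [e + d_n for e, d_n in zip(echo, d)]          # y(n) = h(n) * x(n) + d(n)
target = [y_n - e for y_n, e in zip(y, echo)]     # d(n) = y(n) - h(n) * x(n)
```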

For the PBFDAF algorithm, whose parameters include the block length and the frame shift, a block length that is too large benefits echo cancellation but also causes a large voice delay, while a block length that is too small reduces the voice delay but degrades echo cancellation performance. Compared with the prior art, in the scheme of the embodiment of the invention, the frame shift is not equal to the block length, and the frame shift can be set to be smaller than the block length, so that the echo can be effectively cancelled while the voice delay is reduced.
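To give a rough sense of the delay trade-off, the sketch below converts a number of buffered samples into milliseconds. The sample rate of 16 kHz and the values N = 256 and M = 64 are assumptions for illustration; the source does not specify them, and the figures cover only the time needed to collect the samples, not any other processing latency.

```python
def buffering_delay_ms(num_samples: int, sample_rate_hz: int) -> float:
    """Time needed to collect num_samples at the given sample rate."""
    return 1000.0 * num_samples / sample_rate_hz


SAMPLE_RATE_HZ = 16000   # assumed sample rate
N = 256                  # assumed block length
M = 64                   # assumed frame shift, smaller than N

delay_if_shift_equals_block = buffering_delay_ms(N, SAMPLE_RATE_HZ)
delay_with_smaller_shift = buffering_delay_ms(M, SAMPLE_RATE_HZ)
```

Under these assumed values, waiting for a full block takes 16 ms, whereas waiting for one frame shift takes 4 ms, while the filter can still operate on blocks of length N.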

In summary, a voice signal may be collected and a far-end signal determined according to a frame shift, where the frame shift is not equal to the block length, and a preset number of target far-end signals are then determined according to the far-end signal. Part of the target far-end signals are the same as those already determined for a previous set frame, where the preset number is related to the frame length and the block length, and the set frame is related to the frame shift and the block length. When echo cancellation processing is performed according to the voice signal and the target far-end signals, the repeated part of the target far-end signals does not need to be recalculated, which effectively reduces the amount of computation. The echo-cancelled target signal is then obtained, so that the echo in the voice signal is effectively eliminated and the voice delay is shortened.

Referring to fig. 2, a flowchart illustrating steps of another embodiment of an audio data processing method according to the present application is shown, which may specifically include the following steps:

Step 202, collecting voice signals.

Step 204, determining a far-end signal of a first length according to the frame shift, wherein the first length is related to the frame shift.

Step 206, splicing the far-end signals according to the frame length and the frame shift to obtain the far-end signals with the second length, wherein the second length is related to the frame length.

When a terminal or other equipment is used, communication including voice can be performed, a microphone can be used for collecting voice signals in the process, the collected voice signals include echoes, and the echoes are signals received by the microphone after being played through a loudspeaker, namely far-end signals received by the microphone. And the remote signal can be transmitted to the equipment through a telephone or a network in the communication process and played through a loudspeaker. In the embodiment of the present invention, the far-end signal may be determined according to a frame length and a frame shift, where the frame shift is not equal to the block length. Assuming that the block length is N and the frame shift is M, the length of the block filter is also N, N ≠ M, where N and M are positive integers.

In echo cancellation, a far-end signal of a first length may be acquired, where the first length is related to the frame shift M. The far-end signal is received continuously, and the embodiment of the invention performs echo cancellation periodically: once a far-end signal of the first length has been received, it is obtained, and one processing period is executed for each far-end signal of the first length. The far-end signal to be spliced is then determined based on the frame length, and the far-end signal to be spliced and the far-end signal of the first length are spliced to obtain a far-end signal of a second length, where the second length is related to the frame length.

In an alternative embodiment, the splicing the far-end signals according to the frame length and the frame shift to obtain the far-end signal of the second length includes: determining a far-end signal of a third length before the far-end signal of the first length according to the frame length, wherein the third length is the difference value between the first length and the second length; and splicing the remote signal with the third length and the remote signal with the first length to obtain a remote signal with a second length. That is, the second length may be determined according to the frame length, and then the difference between the second length and the first length is determined as the third length, and the far-end signal of the third length before the far-end signal of the first length is obtained. And then splicing the remote signal with the third length and the remote signal with the first length according to a sequence such as a time sequence to obtain the remote signal with the second length.
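The splicing step can be sketched as a small helper that keeps the most recent samples of the third length and prepends them to each newly arrived segment of the first length. The class name is hypothetical, and the sketch assumes the first length equals the frame shift M, the second length equals the frame length L, and the history starts out as zeros.

```python
from collections import deque


class FrameSplicer:
    """Builds frames of length L from successive hops of M new samples."""

    def __init__(self, frame_len: int, frame_shift: int):
        if not 0 < frame_shift <= frame_len:
            raise ValueError("require 0 < frame_shift <= frame_len")
        self.frame_shift = frame_shift
        third_len = frame_len - frame_shift   # second length minus first length
        # Assumption: history is zero before any far-end samples arrive.
        self.history = deque([0.0] * third_len, maxlen=third_len)

    def push(self, new_samples):
        """Splice the stored history and M new samples in time order."""
        if len(new_samples) != self.frame_shift:
            raise ValueError("expected exactly frame_shift new samples")
        frame = list(self.history) + list(new_samples)
        self.history.extend(new_samples)      # oldest samples drop out
        return frame


splicer = FrameSplicer(frame_len=8, frame_shift=2)
first_frame = splicer.push([1.0, 2.0])
second_frame = splicer.push([3.0, 4.0])
```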

Step 208, determining a first number of far-end signals with a fourth length according to the far-end signals with the second length.

Step 210, obtaining a fourth-length far-end signal of a second number stored in a previous setting frame, where a sum of the first number and the second number is a preset number, and the fourth length is related to a block length.

The remote signals with the fourth length may be determined according to the remote signals with the second length, wherein a preset number of remote signals with the fourth length may be determined as the target remote signals, the fourth length is related to the block length, and the preset number is related to the frame length and the block length.

In the embodiment of the present invention, the number of repetitions in a predetermined number of target remote signals is defined as a second number, and if the number of non-repetitions is defined as a first number, the first number + the second number is defined as a predetermined number. The second number may be determined according to the block length and the frame shift. The first number of far-end signals of a fourth length may be determined with reference to the far-end signals of the second length, e.g. on the basis of the far-end signals of the second length. The stored second number of the remote signals with the fourth length may also be obtained from a buffer, a memory, and the like, and then the preset number of the target remote signals with the fourth length may be formed by the first number of the remote signals with the fourth length and the second number of the remote signals with the fourth length.

Step 212, performing fourier transform processing on the first number of target far-end signals with the fourth length to obtain a frequency-domain first far-end signal.

Step 214, obtaining and storing the second number of target far-end signals with the fourth length, which correspond to the second far-end signals of the frequency domain.

Step 216, processing the first far-end signal and the second far-end signal of the frequency domain with the spatial impulse response to obtain an echo signal to be eliminated.

Then a fast Fourier transform (FFT) may be performed on the non-repeated part of the target far-end signals, that is, the first number of target far-end signals of the fourth length, to obtain the frequency-domain first far-end signal. For the repeated part, that is, the second number of target far-end signals of the fourth length, the frequency-domain second far-end signal obtained by earlier Fourier transform processing has already been stored and can simply be retrieved. The frequency-domain first far-end signal and the frequency-domain second far-end signal together form the frequency-domain target far-end signal, which can be assembled according to time-sequence information, and the frequency-domain target far-end signal is then multiplied by the spatial impulse response to obtain the echo signal to be cancelled.

The current frame and the set frame that precedes it share the repeated target far-end signals. When the set frame was processed, the FFT of the second number of target far-end signals of the fourth length was calculated and the resulting frequency-domain second far-end signal was stored, so that when processing advances from the set frame to the current frame, the previously calculated second far-end signal can be obtained directly without repeating the FFT.

Step 218, performing echo cancellation processing according to the voice signal and the echo signal to obtain a target signal for echo cancellation.

After the echo signal is obtained, echo cancellation processing may be performed on the echo signal and the voice signal; for example, the inverse Fourier transform is applied to the echo signal to obtain the echo signal in the time domain, and the echo signal is subtracted from the voice signal in the time domain, the result being the echo-cancelled target signal. Of course, the echo signal may not be completely eliminated in this process, and other processing operations, such as processing based on a residual echo post-filtering algorithm, may also be performed to further eliminate the echo.

A voice signal of the fourth length corresponding to the target far-end signal is also acquired. Performing echo cancellation processing according to the voice signal and the echo signal to obtain the echo-cancelled target signal includes: subtracting the echo signal from the voice signal of the fourth length to obtain the echo-cancelled target signal. Since the embodiment of the present invention performs echo cancellation periodically, the voice signal may be segmented at a certain period after it is acquired; for example, a voice signal of the fourth length corresponding to the target far-end signal may be obtained according to time information, and the time-domain echo signal is then subtracted from the voice signal of the fourth length to obtain the echo-cancelled target signal.
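A minimal sketch of this time-domain subtraction is shown below. The helper names are hypothetical, and the segment is selected by sample index as a simple stand-in for the "time information" mentioned above; the segment length is left as a parameter.

```python
def voice_segment(voice_stream, end_index, length):
    """Return the `length` voice samples that end just before end_index."""
    start = end_index - length
    if start < 0 or end_index > len(voice_stream):
        raise ValueError("requested segment lies outside the stream")
    return voice_stream[start:end_index]


def cancel_echo(voice, echo):
    """Subtract the time-domain echo estimate from the aligned voice segment."""
    if len(voice) != len(echo):
        raise ValueError("voice and echo segments must have equal length")
    return [v - e for v, e in zip(voice, echo)]


# Hypothetical toy values.
voice_stream = [1.0, 2.0, 3.0, 4.0]
echo_estimate = [0.5, 0.5]
segment = voice_segment(voice_stream, end_index=4, length=2)
target = cancel_echo(segment, echo_estimate)
```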

In the embodiment of the invention, after the repeated target far-end signal and the second echo information thereof are obtained by first calculation, the frame number can be stored and configured, and then the corresponding frame number can be updated after the repeated target far-end signal and the second echo information thereof are obtained each time, so that the next obtaining is facilitated.

In one example, assuming that the block length is N, the frame shift is M, and the frame length is L, the number k of voice blocks in one frame is L/N, and k is usually a power of 2. In the process of periodic echo cancellation:

At time t, the M newly arrived far-end signal points, i.e. one frame shift of data, are acquired as the far-end signal of the first length M: x(0),x(1),......,x(M-1)

Assuming that the second length is the same as the frame length L, the third length is (L-M), and the new points are spliced with the previous (L-M) points, so that the far-end signal of the current frame, that is, the far-end signal of length L, is:

x(M-L),x(M-L+1),......,x(M-1)

The preset number is k, that is, the number of speech blocks in a frame, and the fourth length is 2 × N, that is, twice the block length; the k target far-end signals of length 2 × N are then:

x(M-2*N),x(M-2*N+1),......,x(M-1)

x(M-3*N),x(M-3*N+1),......,x(M-N-1)

……

x(M-(k+1)*N),x(M-(k+1)*N+1),......,x(M-(k-1)*N-1)

where k is L/N, the last target far-end signal with length 2 × N can be further expressed as:

x(M-L-N),x(M-L-N+1),......,x(M-L+N-1)
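The partition into k overlapping blocks of length 2 × N can be sketched as below. The function name `target_blocks` is hypothetical; the sketch assumes the far-end buffer is a Python list whose last element is the newest sample x(M-1) and that it holds at least (k+1) × N samples.

```python
def target_blocks(buffer, block_len, num_blocks):
    """Return k blocks of length 2N; block j ends (j - 1) * N samples before the newest sample."""
    end = len(buffer)
    if end < (num_blocks + 1) * block_len:
        raise ValueError("buffer must hold at least (k + 1) * N samples")
    blocks = []
    for j in range(1, num_blocks + 1):
        stop = end - (j - 1) * block_len     # one past x(M-(j-1)*N-1)
        start = stop - 2 * block_len         # index of x(M-(j+1)*N)
        blocks.append(buffer[start:stop])
    return blocks


# Hypothetical example: N = 2, k = 2, so the buffer needs 6 samples.
blocks = target_blocks([0, 1, 2, 3, 4, 5], block_len=2, num_blocks=2)
```

Adjacent blocks overlap by N samples, which is what allows blocks from an earlier frame to coincide with blocks of the current frame.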

based on the above principle, the far-end signal of the (i-b)-th frame is:

x(M-b*M-L),x(M-b*M-L+1),......,x(M-b*M-1)

where b is N/M, so the far-end signal of the (i-b)-th frame may be further represented as:

x(M-N-L),x(M-N-L+1),......,x(M-N-1)

for the (i-b)-th frame, the corresponding k target far-end signals of length 2 × N are:

x(M-3*N),x(M-3*N+1),......,x(M-N-1)

x(M-4*N),x(M-4*N+1),......,x(M-2*N-1)

……

x(M-(k+2)*N),x(M-(k+2)*N+1),......,x(M-k*N-1)

Comparing the k target far-end signals of length 2 × N of the i-th frame with those of the (i-b)-th frame shows that k-1 of them are identical: the 2nd to k-th target far-end signals of the i-th frame are the same as the 1st to (k-1)-th target far-end signals of the (i-b)-th frame.
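The overlap between the two frames can be checked numerically. The following Python sketch uses hypothetical values (N = 4, M = 2, L = 16, and a far-end signal whose samples equal their own indices); the helper `blocks_ending_at` is an illustrative name, not part of the original.

```python
def blocks_ending_at(signal, end, block_len, num_blocks):
    """k blocks of length 2N for the frame whose newest sample is signal[end - 1]."""
    return [signal[end - (j + 1) * block_len: end - (j - 1) * block_len]
            for j in range(1, num_blocks + 1)]


N, M, L = 4, 2, 16            # block length, frame shift, frame length (hypothetical)
k, b = L // N, N // M         # k = 4 blocks per frame, b = 2 frames per block
signal = list(range(64))      # sample value equals its index, for easy inspection

end_i = 40                                              # frame i ends here
current = blocks_ending_at(signal, end_i, N, k)         # blocks of frame i
earlier = blocks_ending_at(signal, end_i - b * M, N, k) # blocks of frame i-b

# Blocks 2..k of frame i coincide with blocks 1..k-1 of frame i-b.
shared = current[1:] == earlier[:-1]
```

Only the first block of frame i contains samples that frame i-b has not already seen.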

Therefore, the FFT results of the far-end signal of the (i-b)-th frame can be cached and reused when echo cancellation is performed on the i-th frame, which avoids k-1 FFT calculations for the i-th frame and effectively reduces the amount of calculation.

Each frame of echo cancellation therefore needs the FFT results of the far-end signal from b frames earlier. The cached results are then updated in real time, that is, the frame number is updated, for the echo cancellation of subsequent frames. This cycle repeats, thereby reducing the amount of computation.
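The caching cycle can be sketched as follows. This is an illustration under stated assumptions: `transform` is a stand-in that merely copies the block and counts calls (a real system would run a 2N-point FFT there), the cache is a `collections.deque` holding the block spectra of the last b frames, and all names and parameter values are hypothetical.

```python
from collections import deque

stats = {"fft_calls": 0}


def transform(block):
    """Stand-in for a 2N-point FFT; it only records how often it is invoked."""
    stats["fft_calls"] += 1
    return tuple(block)


def blocks_ending_at(signal, end, block_len, num_blocks):
    """k blocks of length 2N for the frame whose newest sample is signal[end - 1]."""
    return [signal[end - (j + 1) * block_len: end - (j - 1) * block_len]
            for j in range(1, num_blocks + 1)]


N, M, L = 4, 2, 16
k, b = L // N, N // M
signal = list(range(200))

cache = deque(maxlen=b)    # block spectra of frames i-b .. i-1; cache[0] is frame i-b
results = []
for frame_no in range(5):
    blocks = blocks_ending_at(signal, 40 + frame_no * M, N, k)
    if len(cache) == b:
        # Transform only the newest block; reuse k-1 spectra from frame i-b.
        spectra = [transform(blocks[0])] + cache[0][:-1]
    else:
        # Start-up: no frame i-b is cached yet, so transform every block.
        spectra = [transform(blk) for blk in blocks]
    cache.append(spectra)  # appending to a full deque drops frame i-b
    results.append(spectra)
```

With these values, the first b frames need k transforms each and every later frame needs one, so the five frames take 2 × 4 + 3 = 11 transforms instead of 20.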

For the i-th frame, FFT is applied to the first target far-end signal to obtain the first far-end signal in the frequency domain, while the stored second far-end signals in the frequency domain are fetched for the other k-1 target far-end signals of length 2 × N. Together these give the target far-end signals in the frequency domain, which are multiplied by the spatial impulse response to obtain the echo signal to be eliminated.
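The frequency-domain multiplication can be illustrated with a small Python sketch. Several points here are assumptions not stated in the text: the impulse response is fixed (no adaptation is shown), it is split into k partitions of N taps that are zero-padded to 2N, the last N samples of the inverse transform are kept (the usual overlap-save convention), and a naive DFT replaces the FFT so that the example stays self-contained.

```python
import cmath


def dft(values, inverse=False):
    """Naive O(n^2) discrete Fourier transform; a real system would use an FFT."""
    n = len(values)
    sign = 1 if inverse else -1
    out = []
    for f in range(n):
        acc = sum(v * cmath.exp(sign * 2j * cmath.pi * f * t / n)
                  for t, v in enumerate(values))
        out.append(acc / n if inverse else acc)
    return out


def estimate_echo(block_spectra, filter_spectra, block_len):
    """Sum the per-block products and return the last N time-domain samples."""
    size = 2 * block_len
    total = [0j] * size
    for spectrum, weights in zip(block_spectra, filter_spectra):
        for f in range(size):
            total[f] += spectrum[f] * weights[f]
    return [s.real for s in dft(total, inverse=True)[block_len:]]


N, k = 2, 2
far_end = [1, 2, 3, 4, 5, 6]      # hypothetical far-end samples, newest last
impulse = [1, 2, 3, 4]            # hypothetical impulse response with k * N taps
end = len(far_end)

blocks = [far_end[end - (j + 2) * N: end - j * N] for j in range(k)]
block_spectra = [dft(blk) for blk in blocks]
filter_spectra = [dft(impulse[j * N:(j + 1) * N] + [0] * N) for j in range(k)]
echo = estimate_echo(block_spectra, filter_spectra, N)
```

Up to floating-point rounding, `echo` matches the direct time-domain convolution of `far_end` with `impulse` at the two newest sample positions.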

In the embodiment of the present invention, a voice signal of length 2 × N at time t is further obtained from the collected voice signals, that is:

y(M-2*N),y(M-2*N+1),......,y(M-1)

then, the echo signal to be eliminated is inverse-transformed from the frequency domain to the time domain, and adaptive cancellation is performed between this echo signal and the voice signal to obtain an estimated target signal d(n).

The above is only an example. In actual processing, the lengths (including the first length, the second length, the third length, and the fourth length) may also be set as required; for example, each length may be set in a certain proportion to its corresponding parameter.

Based on the above process, with b = N/M, setting b greater than or equal to 2 makes the frame shift smaller than the block length, so the delay of the PBFDAF algorithm with a general frame shift length can be shortened without affecting echo cancellation. Moreover, the process maintains echo cancellation performance while the FFT length of the signal remains unchanged. The partially repeated target far-end signals also effectively reduce the amount of calculation and improve processing efficiency.
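To make the delay reduction concrete, the following Python sketch computes how long the algorithm must wait to collect one frame shift of samples. The 16 kHz sample rate and the block length of 256 are assumptions chosen for illustration; the text does not fix either value.

```python
def buffering_delay_ms(frame_shift, sample_rate_hz):
    """Time needed to collect one frame shift of samples, in milliseconds."""
    return 1000.0 * frame_shift / sample_rate_hz


N = 256                 # block length (hypothetical)
SAMPLE_RATE = 16000     # sample rate in Hz (hypothetical)

# Frame shift M = N / b for b = 1 (classic case), 2 and 4.
delays = {b: buffering_delay_ms(N // b, SAMPLE_RATE) for b in (1, 2, 4)}
```

Under these assumptions the buffering delay falls from 16 ms at b = 1 to 8 ms at b = 2 and 4 ms at b = 4, while the FFT length 2 × N stays the same.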

Further, assuming that the amount of calculation of PBFDAF per N points is C, the amount of calculation of the above processing is b × C, since N = b × M. If b is greater than or equal to 2, the frame shift is shortened, so the convergence rate of the adaptive filter algorithm can be increased.
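One part of this cost can be counted directly. The Python sketch below counts only the far-end FFTs needed per N input points, with and without reuse of the cached results; it is not a full cost model of PBFDAF, and the function name and parameter values are hypothetical.

```python
def far_end_ffts_per_block(block_len, frame_shift, frame_len, reuse):
    """Number of far-end FFTs needed to process N input points."""
    frames = block_len // frame_shift    # b frames are processed per N points
    blocks = frame_len // block_len      # k blocks per frame
    per_frame = 1 if reuse else blocks   # with reuse, only the newest block is transformed
    return frames * per_frame


without_cache = far_end_ffts_per_block(256, 128, 2048, reuse=False)
with_cache = far_end_ffts_per_block(256, 128, 2048, reuse=True)
```

For N = 256, M = 128 and L = 2048 (b = 2, k = 8), this gives 16 far-end FFTs per N points without reuse and 2 with reuse, which matches the factor of b in the text.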

Therefore, with reasonably set parameters, the block frequency-domain adaptive algorithm based on a general frame shift length maintains echo cancellation performance, increases the convergence rate of the algorithm, and shortens the delay of the AEC algorithm, all without adding excessive calculation.

It should be noted that, for simplicity of description, the method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the illustrated order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments of the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred and that no particular act is required to implement the invention.

On the basis of the above embodiment, the embodiment of the invention also provides an audio data processing device. The device can be applied to terminal equipment such as mobile phones and tablet computers.

Referring to fig. 3, a block diagram of an embodiment of an audio data processing apparatus according to the present application is shown, which may specifically include the following modules:

the signal acquiring module 302 is configured to acquire a speech signal and determine a far-end signal according to a frame shift, where the frame shift is not equal to the block length.

A signal processing module 304, configured to determine a preset number of target far-end signals according to the far-end signals, where a part of the target far-end signals is the same as a part of target far-end signals of a previous set frame, where the preset number is related to a frame length and a block length, and the set frame is related to a frame shift and a block length.

And the echo cancellation module 306 is configured to perform echo cancellation processing according to the voice signal and the target far-end signal, so as to obtain a target signal for echo cancellation.

Referring to fig. 4, a block diagram of another embodiment of the audio data processing apparatus of the present application is shown, which may specifically include the following modules:

Wherein, the signal acquisition module 302 includes:

and the voice acquisition submodule 3022 is used for acquiring a voice signal.

A far-end acquisition sub-module 3024 configured to determine a far-end signal of a first length according to the frame shift, wherein the first length is related to the frame shift.

The splicing submodule 3026 is configured to splice the far-end signals according to the frame length and the frame shift, so as to obtain a far-end signal with a second length, where the second length is related to the frame length.

The splicing submodule 3026 is configured to determine, according to the frame length, a far-end signal of a third length preceding the far-end signal of the first length, where the third length is the difference between the second length and the first length; and to splice the far-end signal of the third length and the far-end signal of the first length to obtain a far-end signal of the second length.

The signal processing module 304 includes:

the target determining submodule 3042 is configured to determine, according to the far-end signals with the second length, a first number of target far-end signals with a fourth length.

The buffer obtaining sub-module 3044 is configured to obtain a target far-end signal of a fourth length of a second number stored in a previous setting frame, where a sum of the first number and the second number is a preset number, and the fourth length is related to a block length.

The echo cancellation module 306 includes:

the echo determination submodule 3062 is configured to determine, according to the preset number of target far-end signals with the fourth length, an echo signal to be cancelled.

The cancellation submodule 3064 is configured to subtract the echo signal to be cancelled from the voice signal to obtain a target signal for echo cancellation.

The echo determination submodule 3062 is configured to process the first number of target far-end signals with the fourth length to obtain a frequency-domain first far-end signal; acquiring a second number of target far-end signals with a fourth length, which correspond to the second far-end signals of the frequency domain; and processing the first far-end signal and the second far-end signal of the frequency domain with the spatial impulse response to obtain an echo signal to be eliminated.

The echo cancellation module 306 is further configured to update the frame number corresponding to the target far-end signal of the second number and the fourth length.

Based on the above processing procedure, assuming that b is equal to N/M, if b is set to be greater than or equal to 2, the frame shift is smaller than the block length, so that the delay of the PBFDAF algorithm of the general frame shift length can be shortened without affecting echo cancellation. Moreover, the process can ensure the performance of echo cancellation under the condition that the FFT length of the signal is kept unchanged.

Further, assuming that the amount of calculation of PBFDAF per N points is C, the amount of calculation of the above processing is b × C, since N = b × M. If b is greater than or equal to 2, the frame shift is shortened, so the convergence rate of the adaptive filter algorithm can be increased.

When echo cancellation is performed on the i-th frame, the cached FFT results of the far-end signal of the (i-b)-th frame can be reused, which avoids k-1 FFT calculations for the i-th frame and effectively reduces the amount of calculation. Each frame of echo cancellation therefore needs the FFT results of the far-end signal from b frames earlier. The cached results are then updated in real time, that is, the frame number is updated, for the echo cancellation of subsequent frames. This cycle repeats, thereby reducing the amount of computation.

Because the device embodiment is basically similar to the method embodiment, it is described only briefly; for relevant details, refer to the description of the method embodiment.

Fig. 5 is a block diagram illustrating a structure of an electronic device 500 for audio data processing according to an example embodiment. For example, the electronic device 500 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, or the like; or may be a server-side device, such as a server.

Referring to fig. 5, electronic device 500 may include one or more of the following components: processing component 502, memory 504, power component 506, multimedia component 508, audio component 510, input/output (I/O) interface 512, sensor component 514, and communication component 516.

The processing component 502 generally controls overall operation of the electronic device 500, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 502 may include one or more processors 520 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 502 can include one or more modules that facilitate interaction between the processing component 502 and other components. For example, the processing component 502 can include a multimedia module to facilitate interaction between the multimedia component 508 and the processing component 502.

The memory 504 is configured to store various types of data to support operation at the device 500. Examples of such data include instructions for any application or method operating on the electronic device 500, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 504 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.

The power component 506 provides power to the various components of the electronic device 500. The power component 506 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the electronic device 500.

The multimedia component 508 includes a screen that provides an output interface between the electronic device 500 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 508 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the electronic device 500 is in an operating mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.

The audio component 510 is configured to output and/or input audio signals. For example, the audio component 510 includes a Microphone (MIC) configured to receive external audio signals when the electronic device 500 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 504 or transmitted via the communication component 516. In some embodiments, audio component 510 further includes a speaker for outputting audio signals.

The I/O interface 512 provides an interface between the processing component 502 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.

The sensor assembly 514 includes one or more sensors for providing various aspects of status assessment for the electronic device 500. For example, the sensor assembly 514 may detect an open/closed state of the device 500, the relative positioning of components, such as a display and keypad of the electronic device 500, the sensor assembly 514 may detect a change in the position of the electronic device 500 or a component of the electronic device 500, the presence or absence of user contact with the electronic device 500, orientation or acceleration/deceleration of the electronic device 500, and a change in the temperature of the electronic device 500. The sensor assembly 514 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 514 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 514 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.

The communication component 516 is configured to facilitate wired or wireless communication between the electronic device 500 and other devices. The electronic device 500 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 516 receives a broadcast signal or broadcast-associated information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 516 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.

In an exemplary embodiment, the electronic device 500 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.

In an exemplary embodiment, a non-transitory computer-readable storage medium comprising instructions, such as the memory 504 comprising instructions, executable by the processor 520 of the electronic device 500 to perform the above-described method is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.

A non-transitory computer readable storage medium in which instructions, when executed by a processor of an electronic device, enable the electronic device to perform a method of audio data processing, the method comprising: acquiring a voice signal, and determining a far-end signal according to frame shift, wherein the frame shift is not equal to the block length; determining a preset number of target far-end signals according to the far-end signals, wherein a part of the target far-end signals are the same as a part of the target far-end signals of a preset frame, the preset number is related to the frame length and the block length, and the preset frame is related to the frame shift and the block length; and carrying out echo cancellation processing according to the voice signal and the target far-end signal to obtain a target signal with echo cancellation.

Fig. 6 is a schematic structural diagram of an electronic device 600 for audio data processing according to another exemplary embodiment of the present application. The electronic device 600 may be a server, which may vary greatly due to different configurations or capabilities, and may include one or more Central Processing Units (CPUs) 622 (e.g., one or more processors) and memory 632, one or more storage media 630 (e.g., one or more mass storage devices) storing applications 642 or data 644. Memory 632 and storage medium 630 may be, among other things, transient or persistent storage. The program stored in the storage medium 630 may include one or more modules (not shown), each of which may include a series of instruction operations for the server. Still further, the central processor 622 may be configured to communicate with the storage medium 630 to execute a series of instruction operations in the storage medium 630 on the server.

The server may also include one or more power supplies 626, one or more wired or wireless network interfaces 650, one or more input-output interfaces 658, one or more keyboards 656, and/or one or more operating systems 641, such as Windows Server, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.

In an exemplary embodiment, the server is configured so that the one or more central processors 622 execute one or more programs including instructions for: acquiring a voice signal, and determining a far-end signal according to frame shift, wherein the frame shift is not equal to the block length; determining a preset number of target far-end signals according to the far-end signals, wherein a part of the target far-end signals are the same as a part of the target far-end signals of a preset frame, the preset number is related to the frame length and the block length, and the preset frame is related to the frame shift and the block length; and carrying out echo cancellation processing according to the voice signal and the target far-end signal to obtain a target signal with echo cancellation.

The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.

Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

While preferred embodiments of the present invention have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention.

Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or terminal that comprises the element.

The foregoing has introduced in detail an audio data processing method, an audio data processing apparatus, an electronic device and a storage medium provided by the present application, and specific examples are applied herein to explain the principles and embodiments of the present application, and the descriptions of the foregoing examples are only used to help understand the method and core ideas of the present application; meanwhile, for a person skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims

1. A method of audio data processing, comprising:

collecting voice signals;

determining a far-end signal of a first length as a function of a frame shift, wherein the first length is associated with the frame shift;

splicing the far-end signals according to the frame length and the frame shift to obtain the far-end signals with a second length, wherein the second length is related to the frame length, the frame shift is smaller than the block length, and the block length is the block length of the blocks in the filtering;

and carrying out echo cancellation processing according to the voice signal and the target far-end signal to obtain a target signal subjected to echo cancellation, wherein the frequency domain result of part of the target far-end signal of the previous set frame is multiplexed in the echo cancellation processing.

2. The method of claim 1, wherein splicing the far-end signals according to frame length and frame shift to obtain a far-end signal of a second length comprises:

determining a far-end signal of a third length before the far-end signal of the first length according to the frame length, wherein the third length is the difference value between the first length and the second length;

and splicing the remote signal with the third length and the remote signal with the first length to obtain a remote signal with a second length.

3. The method of claim 1, wherein said determining a predetermined number of target far-end signals from said far-end signals comprises:

determining a first number of target far-end signals with a fourth length according to the far-end signals with the second length;

and acquiring a target far-end signal with a fourth length of a second number stored in a previous setting frame, wherein the sum of the first number and the second number is a preset number, and the fourth length is related to the block length.

4. The method of claim 3, wherein performing echo cancellation processing according to the speech signal and a target far-end signal to obtain an echo-cancelled target signal comprises: determining echo signals to be eliminated correspondingly according to the preset number of target far-end signals with the fourth length;

and subtracting the echo signal to be eliminated from the voice signal to obtain a target signal for eliminating the echo.

5. The method according to claim 4, wherein the determining the corresponding echo signal to be cancelled according to the preset number of target far-end signals with the fourth length comprises: processing the first number of target far-end signals with the fourth length to obtain a first far-end signal of a frequency domain;

acquiring a second number of target far-end signals with a fourth length, which correspond to the second far-end signals of the frequency domain;

and processing the first far-end signal and the second far-end signal of the frequency domain with the spatial impulse response to obtain an echo signal to be eliminated.

6. The method of claim 3, further comprising:

and updating the frame number corresponding to the target far-end signal with the fourth length of the second number.

7. An audio data processing apparatus, comprising:

the signal acquisition module is used for acquiring a voice signal and determining a far-end signal according to frame shift, wherein the frame shift is smaller than a block length, and the block length is the block length of a block in filtering;

a signal processing module, configured to determine a preset number of target far-end signals according to the far-end signals, where a part of the target far-end signals is the same as a part of target far-end signals of a previous set frame, the preset number is related to a frame length and a block length, and the set frame is related to a frame shift and a block length;

an echo cancellation module, configured to perform echo cancellation processing according to the voice signal and a target far-end signal to obtain a target signal subjected to echo cancellation, where a frequency domain result of a part of the target far-end signal of the previous set frame is multiplexed in the echo cancellation processing;

the signal acquisition module comprises:

a far-end acquisition submodule for determining a far-end signal of a first length according to a frame shift, wherein the first length is associated with the frame shift;

and the splicing submodule is used for splicing the far-end signals according to the frame length and the frame shift to obtain the far-end signals with the second length, and the second length is related to the frame length.

8. The apparatus of claim 7,

the splicing submodule is used for determining a far-end signal with a third length before the far-end signal with the first length according to the frame length, wherein the third length is the difference value between the first length and the second length; and splicing the remote signal with the third length and the remote signal with the first length to obtain a remote signal with a second length.

9. The apparatus of claim 7, wherein the signal processing module comprises:

the target determining submodule is used for determining a first number of target far-end signals with a fourth length according to the far-end signals with the second length;

and the cache obtaining sub-module is used for obtaining a target far-end signal with a fourth length of a second number stored in a preset frame, wherein the sum of the first number and the second number is a preset number, and the fourth length is related to the block length.

10. The apparatus of claim 9, wherein the echo cancellation module comprises:

the echo determining submodule is used for determining the echo signal to be eliminated according to the preset number of the target far-end signals with the fourth length;

and the eliminating submodule is used for subtracting the echo signal to be eliminated from the voice signal to obtain a target signal for eliminating the echo.

11. The apparatus of claim 10,

the echo determining submodule is used for processing the first number of target far-end signals with the fourth length to obtain a first far-end signal of a frequency domain; acquiring a second number of target far-end signals with a fourth length, which correspond to the second far-end signals of the frequency domain; and processing the first far-end signal and the second far-end signal of the frequency domain with the spatial impulse response to obtain an echo signal to be eliminated.

12. The apparatus of claim 9,

the echo cancellation module is further configured to update the frame number corresponding to the target far-end signal of the second number and the fourth length.

13. An electronic device comprising a memory, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors the one or more programs including instructions for:

acquiring a voice signal, and determining a far-end signal with a first length according to frame shift, wherein the frame shift is smaller than a block length, the block length is the block length of a block in filtering, and the first length is related to the frame shift;

splicing the far-end signals according to the frame length and the frame shift to obtain a far-end signal with a second length, wherein the second length is related to the frame length;

14. The electronic device of claim 13, wherein splicing the far-end signals according to frame length and frame shift to obtain a far-end signal of a second length comprises:

15. The electronic device of claim 13, wherein said determining a preset number of target far-end signals from said far-end signals comprises:

16. The electronic device of claim 15, wherein performing echo cancellation processing according to the voice signal and a target far-end signal to obtain an echo-cancelled target signal comprises:

determining echo signals to be eliminated correspondingly according to the preset number of target far-end signals with the fourth length;

17. The electronic device according to claim 16, wherein said determining the echo signal to be cancelled according to the preset number of the target far-end signals with the fourth length comprises:

processing the first number of target far-end signals with the fourth length to obtain a first far-end signal of a frequency domain;

18. The electronic device of claim 15, further comprising instructions to:

19. A readable storage medium, characterized in that instructions in the storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the audio data processing method of any of claims 1-6.