CN113362847A - Audio signal processing method and device and storage medium - Google Patents

Audio signal processing method and device and storage medium

Info

Publication number
CN113362847A
Authority
CN
China
Prior art keywords
sound source
signals
signal
determining
microphones
Prior art date
Legal status
Pending
Application number
CN202110582749.1A
Other languages
Chinese (zh)
Inventor
Hou Haining (侯海宁)
Current Assignee
Beijing Xiaomi Mobile Software Co Ltd
Beijing Xiaomi Pinecone Electronic Co Ltd
Original Assignee
Beijing Xiaomi Mobile Software Co Ltd
Beijing Xiaomi Pinecone Electronic Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Xiaomi Mobile Software Co Ltd, Beijing Xiaomi Pinecone Electronic Co Ltd filed Critical Beijing Xiaomi Mobile Software Co Ltd
Priority to CN202110582749.1A
Publication of CN113362847A

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272 - Voice signal separating
    • G10L21/0208 - Noise filtering
    • G10L21/0216 - Noise filtering characterised by the method used for estimating noise
    • G10L2021/02161 - Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02165 - Two microphones, one receiving mainly the noise signal and the other one mainly the speech signal

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The disclosure relates to an audio signal processing method and apparatus, and a storage medium. The method includes: acquiring original noisy signals collected by at least two microphones for at least two sound sources; performing sound source separation on the original noisy signals of the at least two microphones to obtain the respective frequency domain estimation signals of the at least two sound sources; determining, based on the frequency domain estimation signals, the observed estimation signals of each sound source at the at least two microphones; determining an enhanced output signal for each sound source based on the observed estimation signals corresponding to the at least two sound sources; and determining the audio signals emitted by the at least two sound sources according to the filtered enhanced output signals of each sound source. With the scheme of the embodiments of the disclosure, interference can be reduced and the speech quality of the audio signal improved.

Description

Audio signal processing method and device and storage medium
Technical Field
The present disclosure relates to the field of signal processing, and in particular, to an audio signal processing method and apparatus, and a storage medium.
Background
In the related art, smart devices mostly pick up sound with a microphone array and apply microphone beamforming to improve the quality of the processed speech signal, and hence the speech recognition rate, in real environments. However, multi-microphone beamforming is sensitive to microphone position errors, which strongly affects performance, and a larger number of microphones also raises product cost.
Therefore, more and more smart devices are now equipped with only two microphones. Two-microphone systems often enhance speech using blind source separation, a technique entirely different from multi-microphone beamforming. However, the speech signal after blind source separation often contains residual noise, resulting in a low signal-to-noise ratio.
Disclosure of Invention
The present disclosure provides an audio signal processing method and apparatus, and a storage medium.
According to a first aspect of embodiments of the present disclosure, there is provided an audio signal processing method, including:
acquiring original noisy signals collected by at least two microphones for at least two sound sources respectively;
carrying out sound source separation on original noisy signals of the at least two microphones to obtain frequency domain estimation signals of the at least two sound sources;
determining observed estimated signals of each sound source at the at least two microphones respectively based on the respective frequency domain estimated signals of the at least two sound sources;
determining an enhanced output signal for each sound source based on the observed estimated signals corresponding to the at least two sound sources;
and determining the audio signals emitted by the at least two sound sources respectively according to the enhanced output signals of each sound source after filtering processing.
In some embodiments, the performing sound source separation on the original noisy signals of each of the at least two microphones to obtain frequency domain estimation signals of each of the at least two sound sources includes:
performing sound source separation on the original noisy signals by using the deblurred separation matrix of each frame of signals, to obtain the respective frequency domain estimation signals of the at least two sound sources; wherein the frequency domain estimation signals carry phase information of the audio signals emitted by the sound sources.
In some embodiments, the method further comprises:
and determining the deblurred separation matrix by using the separation matrix and the inverse matrix of the separation matrix.
In some embodiments, the method further comprises:
when the current frame is not the first frame, determining the separation matrix of the current frame based on the separation matrix of the previous frame of the current frame and the original noisy signal of the current frame; or
When the current frame is the first frame, a separation matrix of the current frame is determined based on a predetermined identity matrix and an original noisy signal of the current frame.
In some embodiments, the observation estimation signal carries phase information of the audio signal emitted by the sound source; determining an enhanced output signal for each sound source based on the observed estimated signals corresponding to the at least two sound sources, comprising:
determining estimated coordinate information of the at least two sound sources according to the observation estimation signals;
determining time delay differences from the at least two sound sources to the at least two microphones according to the estimated coordinate information and the coordinate information of the at least two microphones;
determining the enhanced output signal of each sound source according to the time delay difference.
In some embodiments, said determining a delay difference from said at least two sound sources to said at least two microphones based on said estimated coordinate information and coordinate information of said at least two microphones comprises:
determining the distance from each sound source to the at least two microphones respectively according to the estimated coordinate information and the coordinate information of the at least two microphones;
and determining the time delay difference according to the distance and the sound velocity.
In some embodiments, said determining said enhanced output signal for each sound source from said time delay difference comprises:
and determining an enhanced output signal of each sound source according to the time delay difference and the observation estimation signals of each sound source at each microphone.
In some embodiments, the method further comprises:
and performing the filtering processing on the enhanced output signal of each sound source according to the observation estimation signal.
In some embodiments, said performing said filtering process on said enhanced output signal of each sound source based on said observed estimated signal comprises:
determining an interference signal of the enhanced output signal according to the observation estimation signal;
and carrying out the filtering processing on the enhanced output signal according to the interference signal.
According to a second aspect of the embodiments of the present disclosure, there is provided an audio signal processing apparatus including:
the first acquisition module is used for acquiring original noisy signals acquired by at least two microphones for at least two sound sources respectively;
the separation module is used for carrying out sound source separation on the original noisy signals of the at least two microphones to obtain frequency domain estimation signals of the at least two sound sources;
a first determining module, configured to determine, based on respective frequency domain estimated signals of the at least two sound sources, observed estimated signals of each of the sound sources at the at least two microphones, respectively;
a second determining module for determining an enhanced output signal for each sound source based on the observed estimated signals corresponding to the at least two sound sources;
and the third determining module is used for determining the audio signals emitted by the at least two sound sources respectively according to the enhanced output signals of the sound sources after filtering processing.
In some embodiments, the separation module comprises:
the separation submodule is used for performing sound source separation on the original noisy signals by using the deblurred separation matrix of each frame of signals, to obtain the respective frequency domain estimation signals of the at least two sound sources; wherein the frequency domain estimation signals carry phase information of the audio signals emitted by the sound sources.
In some embodiments, the apparatus further comprises:
and the fourth determining module is used for determining the deblurred separation matrix by using the separation matrix and the inverse matrix of the separation matrix.
In some embodiments, the apparatus further comprises:
a fifth determining module, configured to determine, when the current frame is not the first frame, a separation matrix of the current frame based on the separation matrix of the previous frame of the current frame and an original noisy signal of the current frame; or
And a sixth determining module, configured to determine, when the current frame is the first frame, a separation matrix of the current frame based on the predetermined identity matrix and the original noisy signal of the current frame.
In some embodiments, the observation estimation signal carries phase information of the audio signal emitted by the sound source; the second determining module includes:
a first determining submodule, configured to determine estimated coordinate information of the at least two sound sources according to the observation estimation signal;
a second determining submodule, configured to determine, according to the estimated coordinate information and the coordinate information of the at least two microphones, a delay difference from the at least two sound sources to the at least two microphones;
a third determining submodule, configured to determine the enhanced output signal of each sound source according to the time delay difference.
In some embodiments, the first determining submodule is specifically configured to:
determining the distance from each sound source to the at least two microphones respectively according to the estimated coordinate information and the coordinate information of the at least two microphones;
and determining the time delay difference according to the distance and the sound velocity.
In some embodiments, the third determining submodule is specifically configured to:
and determining an enhanced output signal of each sound source according to the time delay difference and the observation estimation signals of each sound source at each microphone.
In some embodiments, the apparatus further comprises:
and the filtering module is used for carrying out filtering processing on the enhanced output signal of each sound source according to the observation estimation signal.
In some embodiments, the filtering module comprises:
a fourth determining submodule, configured to determine an interference signal of the enhanced output signal according to the observation estimation signal;
and the filtering submodule is used for carrying out filtering processing on the enhanced output signal according to the interference signal.
According to a third aspect of embodiments of the present disclosure, there is provided an audio signal processing apparatus, the apparatus comprising at least: a processor and a memory for storing executable instructions operable on the processor, wherein:
the processor is configured to execute the executable instructions, and the executable instructions perform the steps of any of the audio signal processing methods described above.
According to a fourth aspect of embodiments of the present disclosure, there is provided a non-transitory computer-readable storage medium having stored therein computer-executable instructions that, when executed by a processor, implement the steps in any of the audio signal processing methods described above.
The technical solutions provided by the embodiments of the disclosure can have the following beneficial effects: after the audio signals are separated into the respective frequency domain estimation signals of each sound source, the observed estimation signals of each sound source at the plurality of microphones are further determined from the frequency domain estimation signals, and the audio signal of each sound source is then enhanced and filtered, which improves the signal-to-noise ratio of the separated signals and thereby the signal quality.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is a flowchart I illustrating an audio signal processing method according to an exemplary embodiment;
FIG. 2 is a flowchart II illustrating an audio signal processing method according to an exemplary embodiment;
FIG. 3 is a block diagram illustrating an application scenario of an audio signal processing method according to an exemplary embodiment;
FIG. 4 is a flowchart III illustrating an audio signal processing method according to an exemplary embodiment;
FIG. 5 is a block diagram illustrating the structure of an audio signal processing apparatus according to an exemplary embodiment;
FIG. 6 is a block diagram illustrating the physical structure of an audio signal processing apparatus according to an exemplary embodiment.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.
Fig. 1 is a flowchart illustrating an audio signal processing method according to an exemplary embodiment, as shown in fig. 1, including the steps of:
s101, acquiring original noisy signals acquired by at least two microphones for at least two sound sources respectively;
step S102, carrying out sound source separation on original noisy signals of the at least two microphones to obtain frequency domain estimation signals of the at least two sound sources;
step S103, determining observation estimated signals of each sound source at the at least two microphones respectively based on respective frequency domain estimated signals of the at least two sound sources;
step S104, determining an enhanced output signal of each sound source based on the observation estimated signals corresponding to the at least two sound sources;
and step S105, determining the audio signals emitted by the at least two sound sources according to the enhanced output signals of each sound source after filtering processing.
The method disclosed by the embodiment of the disclosure is applied to the terminal. Here, the terminal is an electronic device into which two or more microphones are integrated. For example, the terminal may be a vehicle-mounted terminal, a computer, a server, or the like.
In an embodiment, the terminal may further be: an electronic device connected to a predetermined device in which two or more microphones are integrated; and the electronic equipment receives the audio signal collected by the predetermined equipment based on the connection and sends the processed audio signal to the predetermined equipment based on the connection. For example, the predetermined device is a sound box or the like.
In practical application, the terminal comprises at least two microphones, and the at least two microphones simultaneously detect audio signals sent by at least two sound sources respectively to obtain original noisy signals of the at least two microphones respectively. Here, it is understood that in the present embodiment, the at least two microphones detect the audio signals emitted by the two sound sources synchronously.
In the embodiment of the present disclosure, the number of the microphones is 2 or more, and the number of the sound sources is 2 or more.
In the embodiment of the present disclosure, the original noisy signal is a mixed signal comprising the sounds emitted by the at least two sound sources. For example, if the number of microphones is 2, namely microphone 1 and microphone 2, and the number of sound sources is 2, namely sound source 1 and sound source 2, then the original noisy signal of microphone 1 is an audio signal containing both sound source 1 and sound source 2, and the original noisy signal of microphone 2 likewise contains both sound source 1 and sound source 2.
As another example, the number of the microphones is 3, which are respectively the microphone 1, the microphone 2 and the microphone 3; the number of the sound sources is 3, namely a sound source 1, a sound source 2 and a sound source 3; the original noisy signal of the microphone 1 is an audio signal comprising a sound source 1, a sound source 2 and a sound source 3; the original noisy signals of said microphone 2 and said microphone 3 are likewise audio signals each comprising a sound source 1, a sound source 2 and a sound source 3.
It is understood that, for a given microphone, the sound emitted by the target sound source is the desired audio signal, while the signals of the other sound sources at that microphone are noise signals. The embodiments of the disclosure need to recover the audio signals emitted by the at least two sound sources from the signals of the at least two microphones.
It will be appreciated that the number of sound sources is generally the same as the number of microphones. In embodiments where the number of microphones is smaller than the number of sound sources, the sound sources may be reduced in dimension to match the number of microphones.
It will be understood that when the microphones collect the audio signals from the sound sources, the audio signal of at least one audio frame may be collected, and the collected audio signals are the original noisy signals of each microphone. The original noisy signal may be either a time domain signal or a frequency domain signal; if it is a time domain signal, it can be converted into a frequency domain signal by a time-frequency transform.
Here, the time domain signal may be transformed into a frequency domain signal based on the Fast Fourier Transform (FFT), based on the short-time Fourier transform (STFT), or based on other Fourier transforms.
For example, if the time domain signal of the p-th microphone in the n-th frame is $x_p^n(m)$, transforming the time domain signal of the n-th frame into a frequency domain signal, the original noisy signal of the n-th frame is determined as

$$X_p(k, n) = \mathrm{FFT}\big(x_p^n(m)\big),$$

where m is the discrete time index within the n-th frame of the time domain signal and k is the frequency bin. Thus, the present embodiment can obtain the original noisy signal of each frame through the time-to-frequency-domain transform. Of course, the original noisy signal of each frame may also be obtained based on other fast Fourier transform formulations, which is not limited here.
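To make the framing and windowed transform concrete, the following is a minimal Python/NumPy sketch of this step; the Hann window, 1024-sample frame length, and 50% hop are illustrative assumptions, not values fixed by the disclosure.

```python
import numpy as np

def stft_frames(x, nfft=1024, hop=512):
    """Windowed Nfft-point FFT per frame; the result plays the role of X_p(k, n).

    Returns an array of shape (num_frames, nfft // 2 + 1), i.e. K = Nfft/2 + 1
    frequency bins per frame, matching the bin count used later in the text.
    """
    window = np.hanning(nfft)
    num_frames = 1 + (len(x) - nfft) // hop
    frames = np.stack([x[i * hop:i * hop + nfft] * window
                       for i in range(num_frames)])
    return np.fft.rfft(frames, n=nfft, axis=1)

# x_p: one microphone's original noisy time-domain signal (assumed mono float array)
x_p = np.random.randn(16000)
X_p = stft_frames(x_p)   # X_p[n, k] ~ original noisy signal of frame n at bin k
```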
According to the original noisy signal of the frequency domain, an initial frequency domain estimation signal can be obtained in a priori estimation mode.
Illustratively, the original noisy signal may be separated based on an initialized separation matrix, such as an identity matrix, or based on the separation matrix obtained for the previous frame, to obtain the frequency domain estimation signal of each frame of each sound source. This provides a basis for separating the audio signals of the sound sources using the frequency domain estimation signals and the separation matrix.
In the embodiment of the present disclosure, after sound source separation, the frequency domain estimation signal of each sound source is obtained, but noise residue may still exist in the frequency domain estimation signal of each sound source. Therefore, in order to reduce the noise residual and further improve the signal-to-noise ratio of the signal, post-processing is also required to be performed on the frequency domain estimation signal after separation.
Here, the observed signals of the respective sound sources at the respective microphones, i.e., the above-described observed estimated signals, may be estimated using the frequency domain estimated signals. The frequency domain estimation signal of each sound source can be enhanced, filtered and the like through the observation estimation signal, and finally, the audio signal sent by each enhanced sound source is obtained.
Therefore, through the embodiment of the disclosure, the audio signal after the blind source separation is further post-processed, and signal enhancement and filtering are realized, so that the signal-to-noise ratio of the signal is improved, the residual noise is reduced, and the signal quality is improved.
In some embodiments, as shown in fig. 2, in the step S102, the performing sound source separation on the original noisy signals of each of the at least two microphones to obtain frequency domain estimation signals of each of the at least two sound sources includes:
step S202, performing sound source separation on the original noisy signals by using the deblurred separation matrix of each frame of signals, to obtain the respective frequency domain estimation signals of the at least two sound sources; wherein the frequency domain estimation signals carry phase information of the audio signals emitted by the sound sources.
In the embodiment of the present disclosure, the separation matrix may be used to perform sound source separation on the original noisy signal, and after separating the original noisy signal of each frame, the separation matrix may be updated, and the updated separation matrix is used to separate the signal of the next frame.
For example, after the original noisy signal of each frame is obtained, the separated signal of the current frame may be obtained based on the separation matrix and the original noisy signal of the current frame, namely by multiplying the original noisy signal of the current frame by the separation matrix. For example, if the separation matrix is $W(k)$ and the original noisy signal of the current frame is $X(k, n)$, the separated signal of the current frame is $Y(k, n) = W(k)X(k, n)$.
In the embodiment of the present disclosure, the deblurred separation matrix is the separation matrix after amplitude deblurring. Here, the deblurring processing may include adjusting the amplitude with the MDP (Minimal Distortion Principle) algorithm. The separated signals obtained with the deblurred separation matrix can recover the estimates of each sound source's observed data at the microphones; therefore, in the embodiment of the present disclosure, the observed estimation signals of each sound source at the at least two microphones can be determined from the frequency domain estimation signals.
Exemplarily, for the case of two sound sources s1 and s2 and two microphones mic1 and mic2, with the separated signal $Y(k,\tau) = [Y_1(k,\tau), Y_2(k,\tau)]^T$, the observed estimation signal of each sound source can be recovered:

The observed estimation signal of sound source s1 at mic1 is $Y_1(k,\tau) = h_{11} s_1(k,\tau)$, i.e. $Y_{11}(k,\tau) = Y_1(k,\tau)$, where $h_{11}$ is a transfer function and $s_1(k,\tau)$ is the signal of sound source s1.

The observed estimation signal of sound source s2 at mic2 is $Y_2(k,\tau) = h_{22} s_2(k,\tau)$, i.e. $Y_{22}(k,\tau) = Y_2(k,\tau)$, where $h_{22}$ is a transfer function and $s_2(k,\tau)$ is the signal of sound source s2.

Since the observed signal at each microphone is a superposition of the two sources' observations, the observed estimation signal of sound source s2 at mic1 is $Y_{12}(k,\tau) = X_1(k,\tau) - Y_{11}(k,\tau)$, and the observed estimation signal of sound source s1 at mic2 is $Y_{21}(k,\tau) = X_2(k,\tau) - Y_{22}(k,\tau)$, where k denotes the frequency bin and τ denotes the frame index of the audio signal.
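Once the separated spectra and the microphone mixtures are available as arrays, the bookkeeping above reduces to two copies and two subtractions. A minimal sketch, assuming complex STFT arrays of shape (K, T) and hypothetical variable names:

```python
import numpy as np

def observed_estimates(Y, X):
    """Recover per-microphone observation estimates for 2 sources / 2 mics.

    Y: separated signals after MDP deblurring, shape (2, K, T); Y[0] = Y1, Y[1] = Y2.
    X: original noisy STFTs, shape (2, K, T); X[0] at mic1, X[1] at mic2.
    Returns Y11, Y12, Y21, Y22 with Yij = estimate of source j observed at mic i.
    """
    Y11 = Y[0]          # source s1 at mic1
    Y22 = Y[1]          # source s2 at mic2
    Y12 = X[0] - Y11    # source s2 at mic1: residual of mic1's mixture
    Y21 = X[1] - Y22    # source s1 at mic2: residual of mic2's mixture
    return Y11, Y12, Y21, Y22
```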
In some embodiments, as shown in fig. 2, the method further comprises:
step S201, determining the deblurred separation matrix by using the separation matrix and an inverse matrix of the separation matrix.
In the embodiment of the present disclosure, the separation matrix may be deblurred by using an MDP algorithm, that is, the deblurred separation matrix is determined by using the separation matrix and an inverse matrix of the separation matrix.
Illustratively, the separation matrix $W(k,\tau)$ is amplitude-deblurred as $W(k,\tau) = \mathrm{diag}(\mathrm{inv}W(k,\tau)) \cdot W(k,\tau)$, where $\mathrm{inv}W(k,\tau)$ is the inverse of $W(k,\tau)$ and $\mathrm{diag}(\mathrm{inv}W(k,\tau))$ sets the off-diagonal elements of $\mathrm{inv}W(k,\tau)$ to 0.
Therefore, separating the frequency domain estimation signals with the separation matrix deblurred by the MDP algorithm recovers the observed signal of each sound source at each microphone while retaining the original phase information, which facilitates determining the direction of each sound source and enhancing the separated signal of each sound source.
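As a sketch of this amplitude deblurring at a single frequency bin (the 2x2 shape and NumPy types are assumptions):

```python
import numpy as np

def mdp_deblur(W):
    """Minimal Distortion Principle rescaling: W <- diag(inv(W)) @ W."""
    invW = np.linalg.inv(W)
    # np.diag(np.diag(...)) keeps the main diagonal of inv(W) and zeroes the rest
    return np.diag(np.diag(invW)) @ W
```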
In some embodiments, the method further comprises:
when the current frame is not the first frame, determining the separation matrix of the current frame based on the separation matrix of the previous frame of the current frame and the original noisy signal of the current frame; or
When the current frame is the first frame, a separation matrix of the current frame is determined based on a predetermined identity matrix and an original noisy signal of the current frame.
In an embodiment, if the separation matrix is the separation matrix of the current frame, the separation signal of the current frame is obtained based on the separation matrix of the current frame and the original noisy signal of the current frame.
In another embodiment, if the separation matrix is the separation matrix of the previous frame of the current frame, the separation signal of the current frame is obtained based on the separation matrix of the previous frame and the original noisy signal of the current frame.
In an embodiment, if the frame index of the audio signal collected by a microphone is n, where n is a natural number greater than or equal to 1, then n = 1 is the first frame. If the current frame is the first frame, the separation matrix of the first frame is an identity matrix. For example, if the number of microphones is 2, the identity matrix is

$$I_2 = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix};$$

if the number of microphones is 3, the identity matrix is

$$I_3 = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix};$$

and by analogy, if the number of microphones is N, the identity matrix is the $N \times N$ identity matrix $I_N$.
In other embodiments, if the current frame is an audio frame after the first frame, the separation matrix of the current frame is determined based on the separation matrix of the previous frame of the current frame and the original noisy signal of the current frame.
In one embodiment, an audio frame may be an audio segment of a predetermined duration.
For example, the separation matrix of the current frame may be determined based on the separation matrix of the previous frame and the original noisy signal of the current frame as follows: the covariance matrix of the current frame is calculated from the original noisy signal and the covariance matrix of the previous frame, and the separation matrix of the current frame is then calculated based on the covariance matrix of the current frame and the separation matrix of the previous frame.
In some embodiments, the observation estimation signal carries phase information of the audio signal emitted by the sound source; determining an enhanced output signal for each sound source based on the observed estimated signals corresponding to the at least two sound sources, comprising:
determining estimated coordinate information of the at least two sound sources according to the observation estimation signals;
determining time delay differences from the at least two sound sources to the at least two microphones according to the estimated coordinate information and the coordinate information of the at least two microphones;
determining the enhanced output signal of each sound source according to the time delay difference.
Here, the observed estimation signals retain the original phase information, so the azimuth of each sound source can be determined from them. That is, direction tracking and delay-and-sum beamforming are applied to the separated signals to obtain the main-path signal of each sound source, thereby achieving signal enhancement.
In the embodiment of the present disclosure, an SRP-PHAT (Steered Response Power with Phase Transform) direction-finding algorithm may be applied to the observed estimation signals to determine the position of each sound source. For example, for the case of two sound sources s1 and s2, the direction-finding algorithm may determine the estimated coordinate information $(\hat{x}_{s1}, \hat{y}_{s1}, \hat{z}_{s1})$ of sound source s1 and $(\hat{x}_{s2}, \hat{y}_{s2}, \hat{z}_{s2})$ of sound source s2, where x, y, z denote the three coordinate axes.

Since the positions of the microphones are known, for example, in the case of two microphones with coordinate information $m_1 = (x_{m1}, y_{m1}, z_{m1})$ and $m_2 = (x_{m2}, y_{m2}, z_{m2})$, the distance from each sound source to each microphone, and hence the delay difference of the sound propagation, can be determined based on the sound source position and the microphone positions.
Here, the delay difference refers to the difference between the propagation delays from a sound source to the first microphone and to the second microphone.
By using the time delay difference, the interference signal in the separated audio signal can be estimated, and the enhanced output signal of each sound source can be further determined.
In some embodiments, said determining a delay difference from said at least two sound sources to said at least two microphones based on said estimated coordinate information and coordinate information of said at least two microphones comprises:
determining the distance from each sound source to the at least two microphones respectively according to the estimated coordinate information and the coordinate information of the at least two microphones;
and determining the time delay difference according to the distance and the sound velocity.
Here, the distance between each sound source and each microphone can be determined using the estimated coordinate information of each sound source and the coordinate information of each microphone, and further the distance difference between the same sound source and the different microphones:

$$d_1 = \|\hat{s}_1 - m_1\| - \|\hat{s}_1 - m_2\|, \qquad d_2 = \|\hat{s}_2 - m_1\| - \|\hat{s}_2 - m_2\|,$$

where $d_1$ is the difference between the distance from sound source s1 to the first microphone and its distance to the second microphone, and $d_2$ is the corresponding difference for sound source s2.

Using the distance difference and the sound velocity, the delay differences are obtained: the delay difference corresponding to sound source s1 is $\tau_1 = f_s d_1 / c$, and the delay difference corresponding to sound source s2 is $\tau_2 = f_s d_2 / c$, where $f_s$ is the sampling frequency and c is the speed of sound.
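A minimal sketch of these two formulas, assuming the source and microphone positions are 3-D NumPy vectors and using illustrative values for the sampling rate and sound speed:

```python
import numpy as np

FS = 16000               # sampling frequency f_s in Hz (assumed)
SPEED_OF_SOUND = 343.0   # c in m/s (assumed)

def delay_difference(src, mic1, mic2, fs=FS, c=SPEED_OF_SOUND):
    """Delay difference (in samples) of one source between the two microphones."""
    d = np.linalg.norm(src - mic1) - np.linalg.norm(src - mic2)  # distance difference
    return fs * d / c                                            # tau = f_s * d / c
```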
In some embodiments, said determining said enhanced output signal for each sound source from said time delay difference comprises:
and determining an enhanced output signal of each sound source according to the time delay difference and the observation estimation signals of each sound source at each microphone.
In the embodiment of the present disclosure, delay-and-sum beamforming may be performed using the delay difference and the observed estimation signals of each sound source at the different microphones, to obtain the enhanced output of the main path, that is, the enhanced output signal.
Illustratively, for sound source s1, its observed estimation signal $Y_{11}(k,\tau)$ at the first microphone, its observed estimation signal $Y_{21}(k,\tau)$ at the second microphone, and the delay difference $\tau_1$ may be used for delay-and-sum beamforming, yielding the enhanced output of the main path:

$$Y_{M1}(k,\tau) = \frac{1}{2}\Big(Y_{11}(k,\tau) + e^{j 2\pi (k-1)\tau_1 / N_{fft}}\, Y_{21}(k,\tau)\Big),$$

where $k = 1, 2, \dots, K$ is the frequency bin index, K is the total number of frequency bins in one frame, and $N_{fft}$ is the system frame length; exp denotes the exponential function, j the imaginary unit, and π the circular constant.

Accordingly, for sound source s2, its observed estimation signal $Y_{12}(k,\tau)$ at the first microphone, its observed estimation signal $Y_{22}(k,\tau)$ at the second microphone, and the delay difference $\tau_2$ may be used for delay-and-sum beamforming, yielding the enhanced output of its main path:

$$Y_{M2}(k,\tau) = \frac{1}{2}\Big(Y_{12}(k,\tau) + e^{j 2\pi (k-1)\tau_2 / N_{fft}}\, Y_{22}(k,\tau)\Big).$$
Therefore, the audio signal obtained by blind source separation with the separation matrix is further enhanced, the noise is reduced, and the signal quality is improved.
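The main-path computation can be sketched as below; treating the first-microphone observation as the reference and the sign of the steering phase are assumptions consistent with the formulas above:

```python
import numpy as np

def delay_and_sum(Y_ref, Y_other, tau, nfft=1024):
    """Align the second observation estimate to the reference one and average.

    Y_ref, Y_other: observation estimates of ONE source at the two mics, shape (K, T).
    tau: delay difference in samples for this source.
    """
    K = Y_ref.shape[0]
    k = np.arange(K)[:, None]                        # bin indices, i.e. (k - 1) in the text
    phase = np.exp(1j * 2 * np.pi * k * tau / nfft)  # per-bin steering phase
    return 0.5 * (Y_ref + phase * Y_other)           # enhanced main-path output Y_M
```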
In some embodiments, the method further comprises:
and performing the filtering processing on the enhanced output signal of each sound source according to the observation estimation signal.
In the embodiment of the present disclosure, after the separated audio signal is enhanced to obtain an enhanced output signal, further noise reduction filtering may be performed in a manner of adaptive filtering or the like.
In an embodiment, the observation estimation signal may be used as a reference signal, and an adaptive filter is used to determine an interference residue in the enhanced output signal, so as to filter and remove the interference residue.
Therefore, the enhanced signals are further filtered, the signal-to-noise ratio of the signals is improved, and the signal quality and the separation effect of the signals after blind source separation are improved.
In some embodiments, said performing said filtering process on said enhanced output signal of each sound source based on said observed estimated signal comprises:
determining an interference signal of the enhanced output signal according to the observation estimation signal;
and carrying out the filtering processing on the enhanced output signal according to the interference signal.
Illustratively, for sound source s1, the observed estimation signal $Y_{22}(k,\tau)$ or $Y_{12}(k,\tau)$ of sound source s2 can be used as the reference signal and passed through an adaptive filter $\hat{w}_1(k) = [\hat{w}_1^{(0)}(k), \dots, \hat{w}_1^{(L-1)}(k)]^T$ of length L to estimate the interference residue in $Y_{M1}(k,\tau)$, i.e., the interference signal of the enhanced output signal described above. Denoting the interference residue in $Y_{M1}(k,\tau)$ as $Y_{C1}(k,\tau)$, then

$$Y_{C1}(k,\tau) = \hat{w}_1^T(k,\tau)\, u_1(k,\tau).$$

Thus, the interference signal can be used to perform adaptive cancellation, i.e., filtering, on the enhanced output signal, so as to obtain the filtered output result and complete the noise reduction processing.

It should be noted that the above adaptive filter $\hat{w}_1(k)$ can be updated by

$$\hat{w}_1(k, n+1) = \hat{w}_1(k, n) + \mu\, \frac{e_1(k, n)\, u_1(k, n)}{u_1^T(k, n)\, u_1(k, n)},$$

where $u_1(k,n) = [|Y_{22}(k,n)|^2, \dots, |Y_{22}(k,n-L+1)|^2]^T$ is the reference input vector, μ is the step size, and $e_1(k,n)$ is the estimation error.
Similarly, for sound source s2, the observed estimation signal $Y_{11}(k,\tau)$ or $Y_{21}(k,\tau)$ can be used as the reference signal, the adaptive filter is used to estimate the interference signal, and the enhanced output signal is then filtered with it to obtain the noise-reduced output.
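One plausible realization of this adaptive cancellation is sketched below. Because the reference vector in the text is built from magnitude-squared values, the sketch cancels in the power domain and maps the result back to the complex spectrum as a gain; that mapping, the step size mu, and the filter length L are assumptions, not details fixed by the disclosure.

```python
import numpy as np

def cancel_interference(Y_M, ref, L=4, mu=0.1, eps=1e-8):
    """Per-bin adaptive cancellation of interference residue in Y_M.

    Y_M: beamformed output of one source, shape (K, T), complex.
    ref: observation estimate of the competing source (reference), same shape.
    """
    K, T = Y_M.shape
    w = np.zeros((K, L))        # real filter taps acting on the power reference
    u = np.zeros((K, L))        # |ref|^2 over the last L frames, per bin
    Y_E = np.empty_like(Y_M)
    for n in range(T):
        u = np.roll(u, 1, axis=1)
        u[:, 0] = np.abs(ref[:, n]) ** 2
        Y_C = np.sum(w * u, axis=1)              # estimated interference power
        power = np.abs(Y_M[:, n]) ** 2
        e = power - Y_C                          # estimation error (power domain)
        # NLMS-style update, normalized by the reference vector energy
        w += mu * e[:, None] * u / (np.sum(u * u, axis=1, keepdims=True) + eps)
        # map the cancelled power back to the complex spectrum as a gain (assumption)
        gain = np.sqrt(np.maximum(e, 0.0) / (power + eps))
        Y_E[:, n] = gain * Y_M[:, n]
    return Y_E
```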
Furthermore, the frequency domain output signal may be transformed back to the time domain by an inverse Fourier transform. For example, the noise-reduced frequency domain signal may be subjected to the ISTFT (inverse short-time Fourier transform) and overlap-add to obtain the separated and enhanced time domain signal, thereby restoring the audio signal emitted by each sound source.
Embodiments of the present disclosure also provide the following examples:
FIG. 4 is a flowchart III illustrating an audio signal processing method according to an exemplary embodiment. In this method, as shown in fig. 3, the sound sources include sound source 1 and sound source 2, and the microphones include microphone 1 and microphone 2. Based on the audio signal processing method, the audio signals of sound source 1 and sound source 2 are restored from the original noisy signals of microphone 1 and microphone 2. As shown in fig. 4, the method includes the following steps:
step S401: initialize W(k) and $V_p(k)$.

If the system frame length is $N_{fft}$, the number of frequency bins in one frame is $K = N_{fft}/2 + 1$.

1) Initialize the separation matrix at each frequency bin to the identity matrix:

$$W(k) = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix},$$

where k is the frequency bin index, taking values k = 1, 2, ..., K.

2) Initialize the weighted covariance matrix of each sound source at each frequency bin to the zero matrix:

$$V_p(k) = \begin{pmatrix} 0 & 0 \\ 0 & 0 \end{pmatrix},$$

where p = 1, 2 indexes the microphones.
Step S402: obtaining an original noisy signal of a p microphone in an n frame;
to pair
Figure BDA0003084556490000133
Windowing and Nfft point obtaining corresponding frequency domain signals:
Figure BDA0003084556490000134
wherein m is the number of points selected by Fourier transform; wherein the STFT is a short-time Fourier transform; the above-mentioned
Figure BDA0003084556490000135
Time domain signals of the nth frame of the p microphone; here, the time domain signal is an original noisy signal.
Then the X ispThe observed signal for (k,) n is: x (k, n) ═ X1(k,n),X2(k,n)]T(ii) a Wherein, [ X ]1(k,n),X2(k,n)]TIs a transposed matrix.
Step S403: obtaining frequency domain estimation signals of two sound source signals by using W (k) of the previous frame;
let the a priori frequency domain estimates of the two source signals Y (k, n) be [ Y [ [ Y ]1(k,n),Y2(k,n)]TWherein Y is1(k,n),Y2(k, n) are estimated values of the sound source 1 and the sound source 2 at the time frequency points (k, n), respectively.
The observation matrix X (k, n) is separated by a separation matrix W (k) to obtain: y (k, n) ═ w (k)' X (k, n); where W' (k) is the separation matrix of the previous frame (i.e., the frame previous to the current frame).
Then the prior frequency domain estimation of the p sound source in the n frame is:
Figure BDA0003084556490000136
step S404: update the weighted covariance matrix $V_p(k, n)$.

The updated weighted covariance matrix is calculated as

$$V_p(k, n) = \beta V_p(k, n-1) + (1 - \beta)\, \varphi_p(n)\, X(k, n) X^H(k, n),$$

where β is a smoothing coefficient (in one embodiment, β = 0.98); $V_p(k, n-1)$ is the weighted covariance matrix of the previous frame; $X^H(k, n)$ is the conjugate transpose of X(k, n); $\varphi_p(n) = \frac{G'(r_p(n))}{r_p(n)}$ is the weighting coefficient, in which $r_p(n) = \sqrt{\sum_{k=1}^{K} |Y_p(k, n)|^2}$ is the auxiliary variable and $G(r_p(n))$ is the contrast function.
Step S405: update separation matrix W (k, τ):
wi(k,τ)=(W(k,τ-1)Vi(k,τ))-1ei
Figure BDA00030845564900001312
W(k,τ)=[w1(k,τ),w2(k,τ)]H(ii) a i is 1, 2. Wherein e isiIs a feature vector.
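Steps S404 and S405 together form one frame of an auxiliary-function-based update. The sketch below assumes the contrast function G(r) = r (so the weight is 1/r), a common but not mandated choice, and operates at a single frequency bin for two sources and two microphones:

```python
import numpy as np

def auxiva_update(W_prev, V_prev, X_n, Y_n, beta=0.98, eps=1e-8):
    """One-frame update at a single frequency bin (2 sources, 2 mics).

    W_prev: previous separation matrix, shape (2, 2), complex.
    V_prev: previous weighted covariances, shape (2, 2, 2); V_prev[p] for source p.
    X_n: observation vector at this bin/frame, shape (2,), complex.
    Y_n: prior separated estimates over ALL bins of this frame, shape (2, K),
         used only through the auxiliary variable r_p.
    """
    V = np.empty_like(V_prev)
    rows = []
    for p in range(2):
        r_p = np.sqrt(np.sum(np.abs(Y_n[p]) ** 2)) + eps   # auxiliary variable
        phi = 1.0 / r_p                                    # G(r) = r  =>  G'(r)/r = 1/r
        V[p] = beta * V_prev[p] + (1 - beta) * phi * np.outer(X_n, X_n.conj())
        e_p = np.eye(2)[:, p]                              # p-th basis vector
        w_p = np.linalg.solve(W_prev @ V[p], e_p)          # (W V_p)^{-1} e_p
        w_p /= np.sqrt(np.real(w_p.conj() @ V[p] @ w_p))   # normalize w^H V w = 1
        rows.append(w_p)
    W = np.stack(rows).conj()                              # W = [w_1, w_2]^H
    return W, V
```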
Step S406: and (3) carrying out amplitude deblurring processing on W (k, tau) by using an MDP algorithm:
w (k, τ) ═ diag (invW (k, τ)) · W (k, τ), where invW (k, τ) is the inverse matrix of W (k, τ). diag (invW (k, τ)) indicates that the non-dominant diagonal element is set to 0.
In some embodiments, the original noisy signals may be separated by using the processed separation matrix to obtain respective frequency domain estimation signals of each sound source, but a certain noise residual may still exist in the separated signals. Therefore, in order to improve the signal-to-noise ratio, the separated signals may be post-processed here. The method specifically comprises the following steps:
step S407: determine the observed signal of each sound source at each microphone using the separation matrix.

The original noisy signal is separated with the MDP-processed W(k, τ), giving $Y(k, \tau) = [Y_1(k, \tau), Y_2(k, \tau)]^T$. By the property of the MDP algorithm, the recovered frequency domain estimate Y(k, τ) is exactly the estimate of each sound source's observed data at the corresponding microphone, i.e., the estimate of the observed signal of sound source s1 at mic1 is $Y_1(k, \tau) = h_{11} s_1(k, \tau)$, denoted $Y_{11}(k, \tau) = Y_1(k, \tau)$, and the estimate of the observed signal of sound source s2 at mic2 is $Y_2(k, \tau) = h_{22} s_2(k, \tau)$, denoted $Y_{22}(k, \tau) = Y_2(k, \tau)$.

Since the observed signal at each microphone is a superposition of the two sources' observations, the estimate of the observed data of sound source s2 at mic1 is $Y_{12}(k, \tau) = X_1(k, \tau) - Y_{11}(k, \tau)$, and the estimate of the observed data of sound source s1 at mic2 is $Y_{21}(k, \tau) = X_2(k, \tau) - Y_{22}(k, \tau)$.
In this way, based on the MDP algorithm, the observed signals of each sound source at each microphone are completely recovered, and the original phase information is retained, so that the azimuth of each sound source can be further estimated based on the signals.
Step S408: and (3) estimating the azimuth of each sound source by using the observation signal estimation of each sound source at each mic position respectively by using an SRP-PHAT algorithm:
y pair by using SRP-PAHT algorithm11(k, τ) and Y21(k, τ) process to estimate the bearing of the sound source s 1:
the above-mentioned observation estimation signal of the sound source s1 is traversed:
Figure BDA0003084556490000141
wherein, Yi(τ)=[Yi(1,τ),…,Yi(K,τ)]TThe signal is estimated for the observations of the ith microphone's frame τ. K is the length of the system frame; means that the two vector correspondences are multiplied; denotes the adjoint.
Similarly, the sound source s2 is also processed by the algorithm.
For any point s on the unit sphere with coordinates $(s_x, s_y, s_z)$, satisfying $s_x^2 + s_y^2 + s_z^2 = 1$, the delay difference between any two microphones is calculated as

$$\tau(s) = \frac{f_s}{c}\big(\|s - m_1\| - \|s - m_2\|\big),$$

where $f_s$ is the sampling frequency of the system, c is the speed of sound, and $m_1$, $m_2$ are the microphone coordinates.

From τ(s), the corresponding SRP (Steered Response Power) is found:

$$\mathrm{SRP}(s) = \mathrm{Re}\left\{\sum_{k=1}^{K} R(k, \tau)\, e^{j 2\pi (k-1)\tau(s)/N_{fft}}\right\}.$$

Traversing all points s on the unit sphere, the point with the maximum SRP value gives the estimated sound source position:

$$\hat{s} = \arg\max_{s} \mathrm{SRP}(s).$$
In the embodiment of the present disclosure, the estimated coordinate information of the sound sources may be obtained by the above method, for example: the coordinates of sound source s1 are $\hat{s}_1 = (\hat{x}_{s1}, \hat{y}_{s1}, \hat{z}_{s1})$ and the coordinates of sound source s2 are $\hat{s}_2 = (\hat{x}_{s2}, \hat{y}_{s2}, \hat{z}_{s2})$.
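A grid-search sketch of the localization just described; the candidate grid, the PHAT weighting, and the real-part SRP objective follow the standard SRP-PHAT algorithm and should be read as assumptions where the text leaves details open:

```python
import numpy as np

def srp_phat_direction(Y1, Y2, mic1, mic2, candidates, fs=16000, c=343.0, nfft=1024):
    """Pick the unit-sphere candidate s that maximizes the PHAT-weighted SRP.

    Y1, Y2: observation estimates of ONE source at the two mics for one frame, shape (K,).
    candidates: array of unit-sphere points, shape (S, 3).
    """
    K = Y1.shape[0]
    cross = Y1 * Y2.conj()
    R = cross / (np.abs(cross) + 1e-12)     # PHAT weighting: keep phase only
    k = np.arange(K)
    best_s, best_val = None, -np.inf
    for s in candidates:
        # delay difference (in samples) for this candidate position
        tau = fs * (np.linalg.norm(s - mic1) - np.linalg.norm(s - mic2)) / c
        srp = np.real(np.sum(R * np.exp(1j * 2 * np.pi * k * tau / nfft)))
        if srp > best_val:
            best_s, best_val = s, srp
    return best_s
```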
Step S409: and (3) obtaining main path signals of each sound source by using a delay-sum beam forming technology, and improving the signal-to-noise ratio:
determining the time delay difference from each sound source to each microphone based on the estimated azimuth information:
Figure BDA0003084556490000158
Figure BDA0003084556490000159
Figure BDA00030845564900001510
Figure BDA00030845564900001511
using Y11(k, τ) and Y21(k, τ) to the sound source s1Performing beam delay and sum beam forming to obtain the enhanced output of the main channel
Figure BDA00030845564900001512
Using Y12(k, τ) and Y22(k, τ) to the sound source s2Performing beam delay and sum beamforming to obtain an enhanced output signal of a main path thereof:
Figure BDA0003084556490000161
where K is 1,2, …, K.
Step S410, removing interference residue from the enhanced output signal:
for sound source s1Further removing YM1Interference residue in (k, τ), using Y22(k, τ) or Y12(k, τ) as reference signal, in this case Y22(k, τ) through an adaptive filter
Figure BDA0003084556490000162
To estimate YM1Interference remains in (k, τ). Note YM1The interference residue in (k, τ) is YC1(k, τ), then:
Figure BDA0003084556490000163
Figure BDA0003084556490000164
the updating method comprises the following steps:
Figure BDA0003084556490000165
wherein the content of the first and second substances,
u1(k,n)=[|Y22(k,n)|2,...,|Y22(k,n-L+1)|2]is a reference input vector;
Figure BDA0003084556490000166
to estimate the error.
The output after adaptive noise cancellation is:
Figure BDA0003084556490000167
the YM is further removed from the sound source s2 in the same manner2Interference residue in (k, τ), using Y11(k, τ) or Y21(k, τ) as reference signal, in this case Y11(k, τ) through an adaptive filter
Figure BDA0003084556490000168
L-1, to estimate YM2Interference remains in (k, τ). Note YM2The interference residue in (k, τ) is YC2(k, τ), then:
Figure BDA0003084556490000169
Figure BDA00030845564900001610
the updating method comprises the following steps:
Figure BDA00030845564900001611
wherein u is2(k,n)=[|Y11(k,n)|2,...,|Y11(k,n-L+1)|2]Is a reference input vector;
Figure BDA0003084556490000171
to estimate the error;
the output after adaptive noise cancellation is:
Figure BDA0003084556490000172
in this way, a separated signal with reduced residual interference can be obtained.
Step S411: and performing time-frequency conversion on the separated signals to obtain time-domain audio signals emitted by each sound source.
Are respectively paired with YE1(τ)=[YE1(1,τ),...,YE1(K,τ)]And YE2(τ)=[YE2(1,τ),...,YE2(K,τ)]K is 1, K is subjected to ISTFT and overlap addition to obtain a time domain sound source signal which is subjected to separation and post-processing enhancement and is recorded as
Figure BDA0003084556490000173
Wherein m is 1, …, Nfft; i is 1, 2.
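A minimal overlap-add synthesis sketch matching this step; the Hann synthesis window and the window-squared normalization are assumptions about the analysis/synthesis pairing:

```python
import numpy as np

def istft_overlap_add(Y, nfft=1024, hop=512):
    """Inverse STFT with overlap-add; Y has shape (T, K) with K = nfft // 2 + 1."""
    window = np.hanning(nfft)
    T = Y.shape[0]
    length = hop * (T - 1) + nfft
    out = np.zeros(length)
    norm = np.zeros(length)
    for n in range(T):
        out[n * hop:n * hop + nfft] += np.fft.irfft(Y[n], n=nfft) * window
        norm[n * hop:n * hop + nfft] += window ** 2   # accumulate for normalization
    return out / np.maximum(norm, 1e-12)
```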
Fig. 5 is a block diagram illustrating an apparatus for processing an audio signal according to an exemplary embodiment. Referring to fig. 5, the apparatus 500 includes a first obtaining module 501, a separating module 502, a first determining module 503, a second determining module 504, and a third determining module 505.
A first obtaining module 501, configured to obtain original noisy signals that are collected by at least two microphones for at least two sound sources respectively;
a separation module 502, configured to perform sound source separation on original noisy signals of the at least two microphones, to obtain frequency domain estimation signals of the at least two sound sources;
a first determining module 503, configured to determine the observed estimation signals of each sound source at the at least two microphones based on the respective frequency domain estimation signals of the at least two sound sources;
a second determining module 504, configured to determine an enhanced output signal for each sound source based on the observed estimated signals corresponding to the at least two sound sources;
a third determining module 505, configured to determine, according to the enhanced output signal of each sound source after the filtering processing, audio signals emitted by the at least two sound sources respectively.
In some embodiments, the separation module comprises:
the separation submodule is used for performing sound source separation on the original noisy signals by using the deblurred separation matrix of each frame of signals, to obtain the respective frequency domain estimation signals of the at least two sound sources; wherein the frequency domain estimation signals carry phase information of the audio signals emitted by the sound sources.
In some embodiments, the apparatus further comprises:
and the fourth determining module is used for determining the deblurred separation matrix by using the separation matrix and the inverse matrix of the separation matrix.
In some embodiments, the apparatus further comprises:
a fifth determining module, configured to determine, when the current frame is not the first frame, a separation matrix of the current frame based on the separation matrix of the previous frame of the current frame and an original noisy signal of the current frame; or
And a sixth determining module, configured to determine, when the current frame is the first frame, a separation matrix of the current frame based on the predetermined identity matrix and the original noisy signal of the current frame.
In some embodiments, the observation estimation signal carries phase information of the audio signal emitted by the sound source; the second determining module includes:
a first determining submodule, configured to determine estimated coordinate information of the at least two sound sources according to the observation estimation signal;
a second determining submodule, configured to determine, according to the estimated coordinate information and the coordinate information of the at least two microphones, a delay difference from the at least two sound sources to the at least two microphones;
a third determining submodule, configured to determine the enhanced output signal of each sound source according to the time delay difference.
In some embodiments, the first determining submodule is specifically configured to:
determining the distance from each sound source to the at least two microphones respectively according to the estimated coordinate information and the coordinate information of the at least two microphones;
and determining the time delay difference according to the distance and the sound velocity.
In some embodiments, the third determining submodule is specifically configured to:
and determining an enhanced output signal of each sound source according to the time delay difference and the observation estimation signals of each sound source at each microphone.
In some embodiments, the apparatus further comprises:
and the filtering module is used for carrying out filtering processing on the enhanced output signal of each sound source according to the observation estimation signal.
In some embodiments, the filtering module comprises:
a fourth determining submodule, configured to determine an interference signal of the enhanced output signal according to the observation estimation signal;
and the filtering submodule is used for carrying out filtering processing on the enhanced output signal according to the interference signal.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Fig. 6 is a block diagram illustrating a physical structure of an audio signal processing apparatus 600 according to an exemplary embodiment. For example, the apparatus 600 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and so forth.
Referring to fig. 6, apparatus 600 may include one or more of the following components: a processing component 601, a memory 602, a power component 603, a multimedia component 604, an audio component 605, an input/output (I/O) interface 606, a sensor component 607, and a communication component 608.
The processing component 601 generally controls the overall operation of the device 600, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 601 may include one or more processors 610 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 601 may also include one or more modules that facilitate interaction between the processing component 601 and other components. For example, the processing component 601 may include a multimedia module to facilitate interaction between the multimedia component 604 and the processing component 601.
The memory 602 is configured to store various types of data to support operations at the apparatus 600. Examples of such data include instructions for any application or method operating on the apparatus 600, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 602 may be implemented by any type or combination of volatile or non-volatile storage devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The power supply component 603 provides power to the various components of the device 600. The power supply component 603 may include: a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the device 600.
The multimedia component 604 includes a screen that provides an output interface between the device 600 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 604 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the device 600 is in an operating mode, such as a shooting mode or a video mode. Each front camera and/or rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
Audio component 605 is configured to output and/or input audio signals. For example, audio component 605 includes a Microphone (MIC) configured to receive external audio signals when apparatus 600 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 602 or transmitted via the communication component 608. In some embodiments, audio component 605 also includes a speaker for outputting audio signals.
The I/O interface 606 provides an interface between the processing component 601 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor component 607 includes one or more sensors for providing various aspects of status assessment for the apparatus 600. For example, the sensor component 607 may detect the open/closed state of the apparatus 600 and the relative positioning of components, such as the display and keypad of the apparatus 600. The sensor component 607 may also detect a change in the position of the apparatus 600 or a component of the apparatus 600, the presence or absence of user contact with the apparatus 600, the orientation or acceleration/deceleration of the apparatus 600, and a change in the temperature of the apparatus 600. The sensor component 607 may include a proximity sensor configured to detect the presence of nearby objects in the absence of any physical contact. The sensor component 607 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 607 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 608 is configured to facilitate wired or wireless communication between the apparatus 600 and other devices. The apparatus 600 may access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In an exemplary embodiment, the communication component 608 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 608 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, or other technologies.
In an exemplary embodiment, the apparatus 600 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer readable storage medium comprising instructions, such as the memory 602 comprising instructions, executable by the processor 610 of the apparatus 600 to perform the above-described method is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
A non-transitory computer readable storage medium having instructions therein, which when executed by a processor of a mobile terminal, enable the mobile terminal to perform any of the methods provided in the above embodiments.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

Claims (20)

1. An audio signal processing method, comprising:
acquiring original noisy signals collected by at least two microphones for at least two sound sources respectively;
performing sound source separation on the original noisy signals of the at least two microphones to obtain respective frequency domain estimation signals of the at least two sound sources;
determining observation estimation signals of each sound source at the at least two microphones respectively, based on the respective frequency domain estimation signals of the at least two sound sources;
determining an enhanced output signal of each sound source based on the observation estimation signals corresponding to the at least two sound sources;
and determining, according to the enhanced output signal of each sound source after filtering processing, the audio signals emitted by the at least two sound sources respectively.
2. The method according to claim 1, wherein the performing sound source separation on the original noisy signals of the at least two microphones to obtain respective frequency domain estimation signals of the at least two sound sources comprises:
performing sound source separation on the original noisy signals by using the deblurred separation matrix of each frame of signals to obtain respective frequency domain estimation signals of the at least two sound sources; wherein the frequency domain estimation signal carries phase information of the audio signal emitted by the sound source.
3. The method of claim 2, further comprising:
and determining the deblurred separation matrix by using the separation matrix and the inverse matrix of the separation matrix.
4. The method of claim 2, further comprising:
when the current frame is not the first frame, determining the separation matrix of the current frame based on the separation matrix of the previous frame of the current frame and the original noisy signal of the current frame; or
when the current frame is the first frame, determining the separation matrix of the current frame based on a predetermined identity matrix and the original noisy signal of the current frame.
5. The method according to claim 1, wherein the observation estimation signal carries phase information of the audio signal emitted by the sound source; and the determining an enhanced output signal of each sound source based on the observation estimation signals corresponding to the at least two sound sources comprises:
determining estimated coordinate information of the at least two sound sources according to the observation estimation signals;
determining time delay differences from the at least two sound sources to the at least two microphones according to the estimated coordinate information and the coordinate information of the at least two microphones;
determining the enhanced output signal of each sound source according to the time delay difference.
6. The method of claim 5, wherein the determining time delay differences from the at least two sound sources to the at least two microphones according to the estimated coordinate information and the coordinate information of the at least two microphones comprises:
determining the distance from each sound source to the at least two microphones respectively according to the estimated coordinate information and the coordinate information of the at least two microphones;
and determining the time delay difference according to the distance and the sound velocity.
7. The method of claim 5, wherein the determining the enhanced output signal of each sound source according to the time delay difference comprises:
and determining an enhanced output signal of each sound source according to the time delay difference and the observation estimation signals of each sound source at each microphone.
8. The method of claim 1, further comprising:
and performing the filtering processing on the enhanced output signal of each sound source according to the observation estimation signal.
9. The method of claim 8, wherein said performing said filtering process on said enhanced output signal of each sound source based on said observed estimated signal comprises:
determining an interference signal of the enhanced output signal according to the observation estimation signal;
and carrying out the filtering processing on the enhanced output signal according to the interference signal.
10. An audio signal processing apparatus, comprising:
a first acquisition module, configured to acquire original noisy signals collected by at least two microphones for at least two sound sources respectively;
a separation module, configured to perform sound source separation on the original noisy signals of the at least two microphones to obtain respective frequency domain estimation signals of the at least two sound sources;
a first determining module, configured to determine, based on the respective frequency domain estimation signals of the at least two sound sources, observation estimation signals of each sound source at the at least two microphones respectively;
a second determining module, configured to determine an enhanced output signal of each sound source based on the observation estimation signals corresponding to the at least two sound sources;
and a third determining module, configured to determine, according to the enhanced output signal of each sound source after filtering processing, the audio signals emitted by the at least two sound sources respectively.
11. The apparatus of claim 10, wherein the separation module comprises:
a separation submodule, configured to perform sound source separation on the original noisy signals by using the deblurred separation matrix of each frame of signals, so as to obtain respective frequency domain estimation signals of the at least two sound sources; wherein the frequency domain estimation signal carries phase information of the audio signal emitted by the sound source.
12. The apparatus of claim 11, further comprising:
a fourth determining module, configured to determine the deblurred separation matrix by using the separation matrix and the inverse matrix of the separation matrix.
13. The apparatus of claim 11, further comprising:
a fifth determining module, configured to determine, when the current frame is not the first frame, a separation matrix of the current frame based on the separation matrix of the previous frame of the current frame and an original noisy signal of the current frame; or
a sixth determining module, configured to determine, when the current frame is the first frame, the separation matrix of the current frame based on a predetermined identity matrix and the original noisy signal of the current frame.
14. The apparatus according to claim 10, wherein the observation estimation signal carries phase information of the audio signal emitted by the sound source; the second determining module includes:
a first determining submodule, configured to determine estimated coordinate information of the at least two sound sources according to the observation estimation signal;
a second determining submodule, configured to determine, according to the estimated coordinate information and the coordinate information of the at least two microphones, time delay differences from the at least two sound sources to the at least two microphones;
a third determining submodule, configured to determine the enhanced output signal of each sound source according to the time delay difference.
15. The apparatus according to claim 14, wherein the second determining submodule is specifically configured to:
determining the distance from each sound source to the at least two microphones respectively according to the estimated coordinate information and the coordinate information of the at least two microphones;
and determining the time delay difference according to the distance and the sound velocity.
16. The apparatus according to claim 14, wherein the third determining submodule is specifically configured to:
and determining an enhanced output signal of each sound source according to the time delay difference and the observation estimation signals of each sound source at each microphone.
17. The apparatus of claim 10, further comprising:
a filtering module, configured to perform filtering processing on the enhanced output signal of each sound source according to the observation estimation signals.
18. The apparatus of claim 17, wherein the filtering module comprises:
a fourth determining submodule, configured to determine an interference signal of the enhanced output signal according to the observation estimation signal;
and a filtering submodule, configured to perform filtering processing on the enhanced output signal according to the interference signal.
19. An audio signal processing apparatus, characterized by comprising at least: a processor and a memory for storing executable instructions operable on the processor, wherein:
the processor is configured to execute the executable instructions to perform the steps of the audio signal processing method provided in any one of claims 1 to 9.
20. A non-transitory computer-readable storage medium having stored therein computer-executable instructions that, when executed by a processor, implement the steps in the audio signal processing method provided in any one of claims 1 to 9.
Priority Applications (1)

Application number: CN202110582749.1A; priority date: 2021-05-26; filing date: 2021-05-26; title: Audio signal processing method and device and storage medium; status: pending

Publications (1)

Publication number: CN113362847A; publication date: 2021-09-07

Family

ID=77527795

Citations (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090012779A1 (en) * 2007-03-05 2009-01-08 Yohei Ikeda Sound source separation apparatus and sound source separation method
CN101911724A (en) * 2008-03-18 2010-12-08 高通股份有限公司 Speech enhancement using multiple microphones on multiple devices
CN106531156A (en) * 2016-10-19 2017-03-22 兰州交通大学 Speech signal enhancement technology method based on indoor multi-mobile source real-time processing
CN106887239A (en) * 2008-01-29 2017-06-23 高通股份有限公司 For the enhanced blind source separation algorithm of the mixture of height correlation
US20170193975A1 (en) * 2015-12-31 2017-07-06 Harman International Industries, Inc. Active noise-control system with source-separated reference signal
CN107919133A (en) * 2016-10-09 2018-04-17 赛谛听股份有限公司 For the speech-enhancement system and sound enhancement method of destination object
CN108463848A (en) * 2016-03-23 2018-08-28 谷歌有限责任公司 Adaptive audio for multichannel speech recognition enhances
CN108962276A (en) * 2018-07-24 2018-12-07 北京三听科技有限公司 A kind of speech separating method and device
CN109243483A (en) * 2018-10-17 2019-01-18 西安交通大学 A kind of noisy frequency domain convolution blind source separation method
CN109584900A (en) * 2018-11-15 2019-04-05 昆明理工大学 A kind of blind source separation algorithm of signals and associated noises
CN110010148A (en) * 2019-03-19 2019-07-12 中国科学院声学研究所 A kind of blind separation method in frequency domain and system of low complex degree
US20190355374A1 (en) * 2018-05-16 2019-11-21 Nanjing Horizon Robotics Technology Co., Ltd. Method and apparatus for reducing noise of mixed signal
CN111009257A (en) * 2019-12-17 2020-04-14 北京小米智能科技有限公司 Audio signal processing method and device, terminal and storage medium
CN111128221A (en) * 2019-12-17 2020-05-08 北京小米智能科技有限公司 Audio signal processing method and device, terminal and storage medium
CN111402917A (en) * 2020-03-13 2020-07-10 北京松果电子有限公司 Audio signal processing method and device and storage medium
CN111429939A (en) * 2020-02-20 2020-07-17 西安声联科技有限公司 Sound signal separation method of double sound sources and sound pickup
CN111435598A (en) * 2019-01-15 2020-07-21 北京地平线机器人技术研发有限公司 Voice signal processing method and device, computer readable medium and electronic equipment
CN112349292A (en) * 2020-11-02 2021-02-09 深圳地平线机器人科技有限公司 Signal separation method and device, computer readable storage medium, electronic device
CN112565119A (en) * 2020-11-30 2021-03-26 西北工业大学 Broadband DOA estimation method based on time-varying mixed signal blind separation

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination