CN114863944B - Low-delay audio signal overdetermined blind source separation method and separation device - Google Patents

Low-delay audio signal overdetermined blind source separation method and separation device

Info

Publication number
CN114863944B
CN114863944B (application CN202210174605.7A)
Authority
CN
China
Prior art keywords: separated, sound source, time, omega, signal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210174605.7A
Other languages
Chinese (zh)
Other versions
CN114863944A (en)
Inventor
王泰辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Acoustics CAS
Original Assignee
Institute of Acoustics CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Acoustics CAS filed Critical Institute of Acoustics CAS
Priority to CN202210174605.7A priority Critical patent/CN114863944B/en
Publication of CN114863944A publication Critical patent/CN114863944A/en
Application granted granted Critical
Publication of CN114863944B publication Critical patent/CN114863944B/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272 Voice signal separating
    • G10L21/0208 Noise filtering
    • G10L2021/02082 Noise filtering, the noise being echo or reverberation of the speech
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D30/70 Reducing energy consumption in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The invention belongs to the technical field of frequency-domain blind source separation and audio signal processing, and in particular relates to a low-delay overdetermined blind source separation method for audio signals, comprising the following steps: each microphone array element in a microphone array picks up the sound signals of N sound sources to be separated in the target environment and converts them into corresponding digital signals, which are then short-time Fourier transformed to obtain the corresponding time-frequency domain observation signals; the obtained time-frequency domain observation signals are iteratively updated until convergence, yielding the variance and demixing vectors of each sound source to be separated; a demixing matrix is constructed from the obtained demixing vectors; the demixing matrix is inverted to obtain an estimate of the mixing matrix; for each sound source to be separated, a multi-channel wiener filter is constructed based on the mixing matrix and applied, giving the time-frequency domain signal to be separated; the inverse short-time Fourier transform then yields the time-domain waveform of the separated signal.

Description

Low-delay audio signal overdetermined blind source separation method and separation device
Technical Field
The invention belongs to the technical field of frequency domain blind source separation (Blind source separation, BSS) and audio signal processing, and particularly relates to a low-delay audio signal overdetermined blind source separation method and a separation device.
Background
In a scenario where multiple speakers talk simultaneously, a listener can focus on the voice of one speaker of interest while automatically ignoring the voices of the others, the well-known "cocktail party" problem. The problem was first studied by the British cognitive scientist Professor Cherry in the 1950s, yet it long remained unsolved. Blind source separation is a field that emerged to address it. Blind source separation of audio signals has broad application prospects, including human-machine voice interaction, automatic meeting minutes, music separation, and so on.
Frequency domain blind source separation technology has evolved rapidly over the last two decades as a representative class of audio separation solutions, with representative algorithms including independent component analysis (ICA), independent vector analysis (IVA), independent low-rank matrix analysis (ILRMA), and so on. These algorithms essentially exploit the higher-order statistics of the signals. To achieve good separation performance, enough data must be accumulated for accurate higher-order statistic estimation. In an off-line implementation, the required statistics can be estimated from a long stretch of already-collected data, so these algorithms achieve good performance. Many practical systems, however, require blind source separation to run online, with as little delay as possible between system input and output. For example, high-end hearing aids require a system delay of less than 5 milliseconds. This is a demanding requirement for current blind source separation algorithms.
Most current blind source separation algorithms rely on a so-called narrowband assumption, which requires the short-time Fourier transform window to be much longer than the mixing filter of the system. In a conference room, a typical reverberation time is 600 milliseconds, which would require a short-time Fourier transform window longer than 600 milliseconds. The resulting system delay is clearly too large for many applications, and existing real-time blind source separation algorithms cannot significantly reduce it. A low-delay blind source separation technology for audio signals is therefore urgently needed to meet the requirements of real-time processing.
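As a rough illustration of the delay arithmetic above, the sketch below converts STFT window lengths into buffered samples; the 16 kHz sampling rate is an assumption for illustration only and does not come from the patent:

```python
# Hypothetical illustration: the STFT analysis-window length sets the minimum
# input-output delay of a frame-based real-time system.
FS = 16000  # assumed sampling rate in Hz (not specified by the patent)

def window_samples(window_ms: float, fs: int = FS) -> int:
    """Number of samples buffered for an STFT window of the given duration."""
    return int(round(window_ms * fs / 1000))

# Narrowband assumption: the window must cover a ~600 ms reverberation tail.
narrowband = window_samples(600)   # samples buffered before a frame is ready
# Window length used in the patent's later example: 128 ms.
low_delay = window_samples(128)

print(narrowband, low_delay)  # 9600 2048
```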
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a low-delay audio signal overdetermined blind source separation method, which comprises the following steps:
each microphone array element in the microphone array picks up the acoustic signals of N sound sources to be separated in the target environment, converts the acoustic signals into corresponding digital signals, and then performs short-time Fourier transform on the digital signals to obtain corresponding time-frequency domain observation signals;
repeatedly iterating and updating on the obtained time-frequency domain observation signals until convergence, obtaining the variance and demixing vectors of each sound source to be separated; constructing a demixing matrix from the obtained demixing vectors; inverting the demixing matrix to obtain an estimate of the mixing matrix; for each sound source to be separated, constructing a multi-channel wiener filter based on the mixing matrix and performing filtering to obtain the time-frequency domain signal to be separated; and then performing the inverse short-time Fourier transform to obtain the time-domain waveform of the separated signal.
The invention also provides a device for separating the overdetermined blind source of the low-delay audio signal, which comprises:
the microphone array comprises M microphone array elements and is used for picking up acoustic signals of N sound sources to be separated in a target environment; wherein M > N;
the A/D module is used for converting the sound signals of the N sound sources to be separated picked up by the microphone array into corresponding digital signals;
the short-time Fourier transform module is used for caching the signals acquired by the microphone array and performing short-time Fourier transform to obtain corresponding time-frequency domain signals;
the sound source variance and demixing matrix estimation module, used to iterate continuously on the obtained time-frequency domain observation signals until convergence, estimating the variance and demixing vectors of the nth sound source to be separated, constructing a demixing matrix from the obtained demixing vectors, and updating the demixing matrix;
a mixing matrix estimation module, used to invert the demixing matrix to obtain the mixing matrix;
the multi-channel wiener filtering module, used to construct, based on the mixing matrix, the multi-channel wiener filter of the nth sound source to be separated and to perform filtering, obtaining the time-frequency domain signal of the nth sound source to be separated; and
the inverse short-time Fourier transform module, used to transform the N separated time-frequency domain sound source signals into time-domain waveforms, taken as the sound signals of the real sound sources to be separated, thereby completing the low-delay overdetermined blind source separation of the audio signals.
As one of the improvements of the above technical solutions, the apparatus further includes: a D/A module and a speaker array module;
the D/A module is used for converting the separated time domain digital signals of each channel output by the short-time inverse Fourier transform module into analog signals;
and the loudspeaker array module plays the analog separation signal through the loudspeaker array and sends the separation signal to the post-processing module for further processing.
Compared with the prior art, the invention has the beneficial effects that:
1. the invention provides a low-delay blind source separation method for audio signals, suitable for real-time processing systems requiring short delay, such as remote online conference systems;
2. the audio signal obtained by the separation contains only the direct sound and early reflected sound portions, so the method performs both signal separation and dereverberation.
Drawings
FIG. 1 is a schematic diagram of the method for separating the overdetermined blind source of a low-delay audio signal;
FIG. 2 is a method flow chart of a low-delay audio signal overdetermined blind source separation method of the present invention;
FIG. 3 is a specific flowchart of step 2) of a low-delay audio signal overdetermined blind source separation method of the present invention;
fig. 4 is a schematic structural diagram of a low-delay audio signal overdetermined blind source separation device according to the present invention.
Detailed Description
The invention will now be further described with reference to the accompanying drawings.
The invention provides a low-delay overdetermined blind source separation method for audio signals; it addresses the overdetermined blind source separation problem and therefore requires more microphones than sound sources. The method needs a short-time Fourier transform window shorter than the reverberation time of the room, thereby reducing the delay between the input and output of a real-time processing system.
The method comprises the following steps:
each microphone array element in the microphone array picks up the acoustic signals of N sound sources to be separated in the target environment, converts the acoustic signals into corresponding digital signals, and then performs short-time Fourier transform on the digital signals to obtain corresponding time-frequency domain observation signals;
repeatedly iterating and updating on the obtained time-frequency domain observation signals until convergence, obtaining the variance and demixing vectors of each sound source to be separated; constructing a demixing matrix from the obtained demixing vectors; inverting the demixing matrix to obtain an estimate of the mixing matrix; constructing a multi-channel wiener filter for each sound source to be separated and performing filtering to obtain the time-frequency domain signal to be separated; and performing the inverse short-time Fourier transform to obtain the time-domain waveform of the separated signal.
As shown in FIG. 1, there are N sound sources to be separated in a target environment, with sound signals s_n(t), where 1 ≤ n ≤ N and t is discrete time. The sound signals s_n(t) are received simultaneously by each microphone array element in a microphone array comprising M microphones; the signal received by the mth microphone is denoted x_m(t), 1 ≤ m ≤ M. The method of the invention is restricted to overdetermined blind source separation, i.e. the total number of microphone array elements must be greater than the number of sound sources. Denoting the time-domain transfer function from the nth sound source to be separated to the mth microphone array element as h_nm(t), the signal received by the mth microphone array element is expressed as

x_m(t) = Σ_{n=1}^{N} h_nm(t) * s_n(t)

where * represents the convolution operation.
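The convolutive mixing model above can be sketched in numpy as follows; all sizes and signals here are illustrative placeholders, not values from the patent:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, T, Lh = 2, 4, 1000, 64   # sources, mics, signal length, RIR length (illustrative)

s = rng.standard_normal((N, T))             # sound-source signals s_n(t)
h = rng.standard_normal((N, M, Lh)) * 0.1   # time-domain transfer functions h_nm(t)

# x_m(t) = sum_n h_nm(t) * s_n(t): discrete linear convolution, truncated to T samples
x = np.zeros((M, T + Lh - 1))
for n in range(N):
    for m in range(M):
        x[m] += np.convolve(s[n], h[n, m])
x = x[:, :T]
print(x.shape)  # (4, 1000)
```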
In the method of the present invention, as shown in FIG. 1, blind source separation 101 is performed using only the signals x_m(t), 1 ≤ m ≤ M, received by the microphone array elements, in order to recover estimates of the true sound source signals to be separated.
In practice, however, it is difficult to obtain a clean sound source signal, and the method of the invention does not seek an exact estimate of the sound source signal to be separated; rather, it estimates the direct-sound and early-reflected-sound portion of the sound source signal received by the microphone array elements, or its mirror image at the microphone array elements.
It is difficult to perform the separation task directly in the time domain because the reverberation time in a closed space can be relatively long, resulting in slow convergence of the blind source separation algorithm in the time domain and unsatisfactory performance after convergence. The method of the invention obtains the corresponding time-frequency domain signal after the time domain signal is subjected to the short-time Fourier transform, so that the blind source separation of the audio signal can be more efficiently executed in the time-frequency domain.
As shown in fig. 2, the method specifically includes:
step 1) the mth microphone array element in the microphone array picks up the sound signal s_n(t) of the nth sound source to be separated in the target environment and converts it into a corresponding digital signal, denoted the mth microphone signal x_m(t); a short-time Fourier transform of this signal yields the corresponding time-frequency domain observation signal X_m(ω, k), where 1 ≤ n ≤ N; t is discrete time; 1 ≤ m ≤ M; M is the total number of microphone array elements in the microphone array; k is the frame index; and ω is the frequency. The sound signal s_n(t) of the nth sound source to be separated is an analog signal;
the microphone array comprises M microphone array elements, where the number M of array elements is greater than the total number N of sound sources to be separated, written M > N; i.e. overdetermined source separation.
Step 2) using the obtained time-frequency domain observation signal X m (omega, k) performing continuous iterative updating until convergence is reached, and estimating the variance lambda of the nth sound source to be separated n (omega, k-l) and the unmixed vector w n,l (omega) constructing a de-mixing matrix by using the obtained de-mixing vector, and updating the de-mixing matrix W (omega), wherein N is more than or equal to 1 and less than or equal to N; l is more than or equal to 0 and less than or equal to L n ;L n Representing the number of reflected sounds to be estimated of the nth sound source to be separated, wherein N represents the number of sound sources to be estimated;
specifically, step 2) includes:
step 201) using the obtained time-frequency domain observation signals X_m(ω, k), update the variance λ_n(ω, k-l) of the nth sound source to be separated over the most recent L_n frames:

λ_n(ω, k-l) = (1/F) Σ_{ω′} |w_{n,l}^H(ω′) x(ω′, k-l)|²

where F is the window length of the short-time Fourier transform, the sum runs over the F frequency bins, and x(ω, k) = [X_1(ω, k), …, X_M(ω, k)]^T.
Step 202) utilize lambda n (omega, k-L), updating the nth sound source to be separated at the nearest L n Weighted covariance matrix V of frame n,l (ω,k):
Figure GDA0003687940230000052
Where α is a smoothing factor very close to 1; v (V) n,l (ω, k-1) is a weighted covariance matrix of the (k-1) th frame; h is conjugate transpose;
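The recursive weighted-covariance update of step 202) can be sketched per frequency bin as follows; shapes and names are illustrative:

```python
import numpy as np

def update_weighted_cov(V_prev, x_frame, lam, alpha=0.98):
    """V_{n,l}(omega,k) = alpha * V_{n,l}(omega,k-1)
                          + (1 - alpha) * x(omega,k-l) x^H(omega,k-l) / lambda_n(omega,k-l).

    V_prev:  (M, M) previous weighted covariance matrix.
    x_frame: (M,)   microphone observation vector x(omega, k-l).
    lam:     scalar variance lambda_n(omega, k-l)."""
    outer = np.outer(x_frame, x_frame.conj())
    return alpha * V_prev + (1.0 - alpha) * outer / lam
```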
step 203) using V_{n,l}(ω, k), update the L_n demixing vectors w_{n,l}(ω) corresponding to the nth sound source to be separated:

w_{n,l}(ω) = (W(ω) V_{n,l}(ω, k))^{-1} e_{n,l}

with the convention L_0 = 0, where e_{n,l} is the column vector whose ((L_0 + … + L_{n-1}) + l)-th element is 1 and whose remaining elements are all 0, and W(ω) = [w_{1,0}(ω), …, w_{1,L_1-1}(ω), …, w_{N,0}(ω), …, w_{N,L_N-1}(ω)]^H is the demixing matrix.
Step 204) for the updated L corresponding to the nth sound source to be separated n Individual de-mixing vectors w n,l (omega) performing normalization operation to obtain a normalized solution mixing vector;
Figure GDA0003687940230000055
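Steps 203) and 204) together amount to one iterative-projection update followed by a normalization. A sketch at a single frequency bin, under the reconstructed equations above (function and variable names are illustrative):

```python
import numpy as np

def ip_update(W, V, idx):
    """One iterative-projection update of a demixing vector at one frequency bin.

    W:   (K, K) current demixing matrix, K = L_1 + ... + L_N.
    V:   (K, K) weighted covariance matrix V_{n,l}(omega, k).
    idx: position of the 1 in the selection vector e_{n,l}.
    Returns the updated demixing vector, normalized so that w^H V w = 1."""
    K = W.shape[0]
    e = np.zeros(K)
    e[idx] = 1.0
    w = np.linalg.solve(W @ V, e)                  # w = (W V)^{-1} e
    denom = np.sqrt(np.real(w.conj() @ V @ w))     # sqrt(w^H V w)
    return w / denom
```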
step 205) use the demixing vectors w_{n,l}(ω) obtained in step 204) to construct the demixing matrix W(ω).
Steps 201) to 205) are repeated as continuous iterative updates:
if the number of iterations reaches a preset value P and convergence is reached, the iteration ends and the demixing matrix is obtained;
otherwise, steps 201) to 205) are performed again.
Step 3) inverting the unmixed matrix W (omega) to obtain a mixed matrix H (omega);
specifically, the step 3) specifically includes:
inverting the unmixed matrix W (omega) to obtain a mixed matrix H (omega);
H(ω)=[H 1 (ω),…,H N (ω)]=W -1 (ω)
wherein,,
Figure GDA0003687940230000056
is of dimension M x L n Matrix of (h), h n,l Is a column vector of dimension mx1.
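Step 3) is a plain matrix inversion followed by slicing the result into per-source blocks of width L_n; a sketch with illustrative shapes:

```python
import numpy as np

def estimate_mixing(W, L_list):
    """Invert the demixing matrix and split it into per-source blocks H_n.

    W:      (K, K) demixing matrix at one frequency bin, K = sum(L_list).
    L_list: [L_1, ..., L_N] numbers of reflected sounds per source.
    Returns (H, blocks) where blocks[n] is the M x L_n matrix H_n(omega)."""
    H = np.linalg.inv(W)          # H(omega) = W^{-1}(omega)
    blocks, start = [], 0
    for Ln in L_list:
        blocks.append(H[:, start:start + Ln])
        start += Ln
    return H, blocks

H, blocks = estimate_mixing(np.eye(4), [2, 2])
print([b.shape for b in blocks])  # [(4, 2), (4, 2)]
```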
Step 4) constructing a multichannel wiener filter omega of the nth sound source to be separated based on the mixed matrix H (omega) aiming at the nth sound source to be separated n (omega, k) and performing filtering to obtain the time-frequency domain signal of the nth sound source to be separated
Figure GDA0003687940230000061
Wherein the sum of the number of all reflected sounds to be estimated is equal to the total number of microphone elements, i.e. there is a constraint +.>
Figure GDA0003687940230000062
And suggest L n The value of N is more than or equal to 1 and less than or equal to N is as close as possible;
specifically, step 4) includes:
for the nth sound source to be separated, constructing the multichannel wiener filter Ω_n(ω, k) of the nth sound source to be separated based on the mixing matrix H(ω):

Ω_n(ω, k) = λ_n(ω, k) h_{n,0}(ω) h_{n,0}^H(ω) Σ_x^{-1}(ω, k)

where

Σ_x(ω, k) = Σ_{n=1}^{N} Σ_{l=0}^{L_n-1} λ_n(ω, k-l) h_{n,l}(ω) h_{n,l}^H(ω)

is the covariance matrix of the current-frame frequency-domain microphone signal vector.
Using the Ω_n(ω, k) thus obtained, the current-frame frequency-domain microphone received-signal vector x(ω, k) = [X_1(ω, k), …, X_M(ω, k)]^T is filtered to obtain the filtered signal c_{n,0}(ω, k):

c_{n,0}(ω, k) = Ω_n(ω, k) x(ω, k)

From the resulting filtered signal c_{n,0}(ω, k), the time-frequency domain signal of the nth sound source to be separated is obtained; that is, the estimated time-frequency domain signal of the nth sound source to be separated is c_{n,0}(ω, k).
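A sketch of the multichannel wiener filtering for the first reflected-sound portion, under the reconstructed filter expression and the assumed model form of Σ_x above; all function names are illustrative:

```python
import numpy as np

def model_covariance(lams, H_blocks):
    """Sigma_x = sum_n sum_l lambda_n(k-l) h_{n,l} h_{n,l}^H (assumed model form).

    lams:     per source, a list of variances [lambda_n(k-0), ..., lambda_n(k-L_n+1)].
    H_blocks: per source, the M x L_n mixing block H_n(omega)."""
    M = H_blocks[0].shape[0]
    S = np.zeros((M, M), dtype=complex)
    for lam_nl, Hn in zip(lams, H_blocks):
        for l in range(Hn.shape[1]):
            S += lam_nl[l] * np.outer(Hn[:, l], Hn[:, l].conj())
    return S

def mwf_first_reflection(lam_n, h_n0, Sigma_x, x_frame):
    """Omega_n = lambda_n * h_{n,0} h_{n,0}^H * Sigma_x^{-1};  c_{n,0} = Omega_n x."""
    Omega = lam_n * np.outer(h_n0, h_n0.conj()) @ np.linalg.inv(Sigma_x)
    return Omega @ x_frame  # (M,) filtered signal c_{n,0}(omega, k)
```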
In other specific embodiments, step 4) may instead include:
for the nth sound source to be separated, constructing the multichannel wiener filter Ω_n(ω, k) of the nth sound source to be separated based on the mixing matrix H(ω):

Ω_n(ω, k) = [Σ_{l=0}^{L_n-1} λ_n(ω, k-l) h_{n,l}(ω) h_{n,l}^H(ω)] Σ_x^{-1}(ω, k)

Using the multichannel wiener filter thus obtained, the received-signal vector x(ω, k) = [X_1(ω, k), …, X_M(ω, k)]^T is filtered to obtain the filtered signal c_n(ω, k):

c_n(ω, k) = Ω_n(ω, k) x(ω, k)

From the resulting filtered signal c_n(ω, k), the time-frequency domain signal of the nth sound source to be separated is obtained; that is, the estimated time-frequency domain signal of the nth sound source to be separated is c_n(ω, k).
Step 5) for the nth time-frequency domain signal of the sound source to be separated
Figure GDA00036879402300000610
Performing short-time inverse Fourier transform to obtain corresponding time domain waveform +.>
Figure GDA00036879402300000611
And the sound source is used as a real sound signal of a sound source to be separated, so that the ultra-stationary blind source separation of the low-delay audio signal is completed.
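The final time-domain reconstruction of step 5) is completed by the overlap-add method (as the embodiment's inverse short-time Fourier transform 205 mentions). A minimal sketch of the overlap-add stage, with the per-frame inverse DFT assumed already done; frame sizes are illustrative:

```python
import numpy as np

def istft_overlap_add(frames, hop):
    """Reconstruct a time-domain waveform from time-domain frames by overlap-add.

    frames: (K, F) real frames (each already inverse-DFT'd and windowed).
    hop:    frame shift in samples."""
    K, F = frames.shape
    out = np.zeros(hop * (K - 1) + F)
    for k in range(K):
        out[k * hop:k * hop + F] += frames[k]
    return out

y = istft_overlap_add(np.ones((3, 4)), hop=2)
print(y)  # overlapping halves of adjacent frames sum together
```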
Example 1.
Fig. 2 is a system block diagram of the real-time blind separation method for audio signals according to the present invention, comprising a short-time Fourier transform 201, sound source variance and demixing matrix estimation 202, a mixing matrix estimation module 203, multi-channel wiener filtering 204, and an inverse short-time Fourier transform 205.
The invention provides a low-delay audio signal overdetermined blind source separation method, which comprises the following steps:
short-time fourier transform 201
A short-time Fourier transform is performed on each time-domain signal acquired by the microphone array in the target environment to obtain the corresponding current-frame time-frequency domain observation signals. Specifically, the short-time Fourier transform 201 is applied to each signal x_m(t) received by a microphone array element, yielding X_m(ω, k), where k is the frame index and ω is the frequency; the window length of the short-time Fourier transform is F. Unlike other existing real-time processing algorithms, the short-time Fourier transform window used in the invention can be much shorter than the reverberation time, thereby reducing the delay between the input and output of the real-time system.
Sound source variance and demixing matrix estimation 202
Using the short-time Fourier transform signals X_m(ω, k), the variance λ_n(ω, k) and demixing vectors w_{n,l}(ω) of each sound source to be separated in the current frame are computed; the variance of each sound source to be separated and the corresponding demixing vectors are iteratively updated, and the demixing matrix is updated.
Define the M × 1 frequency-domain microphone received-signal vector

x(ω, k) = [X_1(ω, k), …, X_M(ω, k)]^T. (2)

Define the frequency-domain signal vector of the sound signal of the nth sound source to be separated, as received by the microphone array, as

c_n(ω, k) = [c_{n1}(ω, k), …, c_{nM}(ω, k)]^T (3)
This vector is also called the mirror image of the nth sound source to be separated. The invention models the mirror image c_n(ω, k) as the sum of a series of reflected sounds:

c_n(ω, k) = Σ_{l=0}^{L_n-1} c_{n,l}(ω, k) (4)

where L_n is the number of reflected sounds to be estimated for the nth sound source to be separated, c_{n,0}(ω, k) is the first reflected-sound portion (including the direct sound) of the nth sound source to be separated, c_{n,1}(ω, k) is the second reflected-sound portion of the nth sound source to be separated, and so on. The technique disclosed in this patent realizes estimation of both the first reflected-sound portion and the mirror image c_n(ω, k).
To ensure that the method works well, the numbers of reflected-sound portions of all recovered sound sources to be separated must satisfy the constraint:

L_1 + L_2 + … + L_N = M (5)

In addition, in practice the values L_n, 1 ≤ n ≤ N, should be kept as close to one another as possible. For example, if the number of sound sources to be separated is N = 2 and the total number of microphones is M = 4, it is preferable to set L_1 = L_2 = 2; if the number of sound sources to be separated is N = 2 and the total number of microphone array elements is M = 5, it is preferable to set L_1 = 2, L_2 = 3 or L_1 = 3, L_2 = 2, while L_1 = 1, L_2 = 4 and L_1 = 4, L_2 = 1 are not recommended.
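The recommendation that the L_n sum to M and be as close to one another as possible amounts to an even split of the microphones among the sources; a tiny sketch (the function name is illustrative):

```python
def allocate_reflections(M: int, N: int):
    """Split M microphones into N reflection counts L_n that sum to M
    and differ from one another by at most 1 (the patent's recommendation)."""
    base, rem = divmod(M, N)
    return [base + 1 if n < rem else base for n in range(N)]

print(allocate_reflections(4, 2))  # [2, 2]
print(allocate_reflections(5, 2))  # [3, 2]
```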
Currently existing real-time blind source separation methods are mostly based on the so-called narrowband assumption, which requires a very long short-time Fourier transform window to cover the main energy of the mixing filter. The invention divides the whole mixing impulse response into several parts and separates the first L_n reflected-sound portions. For example, suppose two sound sources are to be separated in a room with a reverberation time of 470 milliseconds; with the short-time Fourier transform window length set to 128 milliseconds and the number of reflected sounds to be separated per source set to L_n = 2, good performance can be achieved with only 4 microphones, whereas existing blind source separation methods would require a short-time Fourier transform window length approaching 470 milliseconds. The real-time blind source separation method provided by the invention therefore greatly reduces the delay of a real-time processing system, which is a great advantage for an online system.
To realize the separation of all L_n reflected-sound portions of the nth sound source to be separated, L_n demixing vectors w_{n,l}(ω), 0 ≤ l ≤ L_n - 1, are required. It will be apparent to those skilled in the art, however, that the w_{n,l}(ω) cannot be used directly to separate the L_n reflected-sound portions.
With this background, a specific flow chart implementing the sound source variance and demixing matrix estimation 202 is shown in fig. 3.
More specifically, the sound source variance and demixing matrix estimation 202 is implemented by iteration, with the number of iterations set to P. For example, setting P = 2 can already achieve good separation performance. In each iteration, the following 5 steps are performed in sequence:
step 202-1) using the current time-frequency domain observation signals X_m(ω, k), update the variance λ_n(ω, k-l) of the nth sound source to be separated over the most recent L_n frames; the computational expression is

λ_n(ω, k-l) = (1/F) Σ_{ω′} |w_{n,l}^H(ω′) x(ω′, k-l)|² (6)
Step 202-2) utilize lambda n (omega, k-L), updating the nth sound source to be separated at the nearest L n Weighted covariance matrix V of frame n,l (ω, k) by
Figure GDA0003687940230000083
Where α is a smoothing factor very close to 1.
Step 202-3) utilizing V n,l (omega, k) updating the L corresponding to the nth sound source to be separated n Individual de-mixing vectors w n,l (ω)
Figure GDA0003687940230000091
Upper contract L 0 =0, column vector
Figure GDA0003687940230000092
(L) 0 +…+L n-1 ) +l elements are 1 and the other elements are all 0, w (ω) = [ w) 1,0 (ω),…,w 1,L-1 (ω),…,w N,0 (ω),…,w N,L-1 (ω)] H Is a de-mixing matrix;
step 202-4) normalize the L_n demixing vectors w_{n,l}(ω) of the nth sound source to be separated:

w_{n,l}(ω) ← w_{n,l}(ω) / sqrt(w_{n,l}^H(ω) V_{n,l}(ω, k) w_{n,l}(ω)) (9)
step 202-5) use the demixing vectors w_{n,l}(ω) obtained in step 202-4) to construct the demixing matrix W(ω).
If the number of iterations reaches the preset value P, the iterative updating process ends; otherwise, steps 202-1 to 202-5 shown in fig. 3 are performed again.
Mixing matrix estimation 203
The demixing matrix is inverted to obtain the mixing matrix;
specifically, the inverse of the demixing matrix is used to construct the mixing matrix H(ω) of dimension M × M:

H(ω) = [H_1(ω), …, H_N(ω)] = W^{-1}(ω) (10)

where

H_n(ω) = [h_{n,0}(ω), …, h_{n,L_n-1}(ω)] (11)

is a matrix of dimension M × L_n, and h_{n,l} is a column vector of dimension M × 1.
Multi-channel wiener filtering 204
Constructing a multi-channel wiener filter aiming at each sound source to be separated to obtain estimation of time-frequency domain signals of the sound source to be separated;
specifically, the variances λ_n(ω, k) of all N sound sources obtained by the sound source variance and demixing matrix estimation 202, together with the mixing matrix H(ω) estimated by 203, are used to construct N multichannel wiener filters. The multichannel wiener filter Ω_n(ω, k) for the nth sound source is

Ω_n(ω, k) = λ_n(ω, k) h_{n,0}(ω) h_{n,0}^H(ω) Σ_x^{-1}(ω, k)

where

Σ_x(ω, k) = Σ_{n=1}^{N} Σ_{l=0}^{L_n-1} λ_n(ω, k-l) h_{n,l}(ω) h_{n,l}^H(ω)

is the covariance matrix of the microphone signal.
Furthermore, the multichannel wiener filter Ω_n(ω, k) is used to filter the current-frame frequency-domain microphone received-signal vector x(ω, k), giving

c_{n,0}(ω, k) = Ω_n(ω, k) x(ω, k) (12)

Equation (12) outputs the first reflected-sound portion of the nth sound source to be separated. The invention therefore performs both signal separation and dereverberation, which improves the speech quality of the separated speech.
Alternatively, the mirror image c_n(ω, k) containing all reflected-sound portions can be output; in this case the multichannel wiener filter corresponding to the nth sound source to be separated is

Ω_n(ω, k) = [Σ_{l=0}^{L_n-1} λ_n(ω, k-l) h_{n,l}(ω) h_{n,l}^H(ω)] Σ_x^{-1}(ω, k) (13)

and the mirror image recovered with the estimated multichannel wiener filter of (13) is

c_n(ω, k) = Ω_n(ω, k) x(ω, k) (14)
The vector c_{n,0}(ω,k) or c_n(ω,k) obtained by formula (12) or (14) contains M channel signals of the nth sound source to be separated; in practice, each sound source needs only one output signal.
For convenience, the present patent uniformly selects the mirror image at the first microphone or its first reflected sound portion as the output, i.e. ĉ_n(ω,k) or ĉ_{n,0}(ω,k), where ĉ_n(ω,k) and ĉ_{n,0}(ω,k) are the first elements of the vectors c_n(ω,k) and c_{n,0}(ω,k), respectively.
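For the mirror-image variant of equations (13)-(14), a sketch under the assumption that the filter uses the full block H_n(ω) with a diagonal variance matrix, consistent with the first-reflection filter of equation (11):

```python
import numpy as np

def mwf_mirror_image(lams, H_n, Sigma_x, x):
    """Multi-channel Wiener filter keeping all L_n reflected sound parts.

    lams    : (L_n,) variances lambda_n(omega, k), ..., lambda_n(omega, k-L_n+1)
    H_n     : (M, L_n) mixing block H_n(omega)
    Sigma_x : (M, M) microphone covariance Sigma_x(omega, k)
    x       : (M,) current-frame microphone vector
    """
    Lambda_n = np.diag(lams)  # diagonal variance matrix, assumed form of eq. (13)
    Omega = H_n @ Lambda_n @ H_n.conj().T @ np.linalg.inv(Sigma_x)
    c_n = Omega @ x           # mirror image c_n(omega, k), equation (14)
    return c_n[0]             # output: first element (mirror at microphone 1)
```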
Inverse short-time Fourier transform 205
An inverse short-time Fourier transform is performed on the time-frequency domain signal of each sound source to be separated to obtain the corresponding time-domain waveform, which is taken as the real sound signal of the sound source to be separated, completing the low-delay overdetermined blind source separation of the audio signals.
Specifically, the estimate ĉ_{n,0}(ω,k) or ĉ_n(ω,k) of the sound source signal to be separated output by the multi-channel Wiener filtering 204 is inverse short-time Fourier transformed, and the corresponding time-domain signal ĉ_{n,0}(t) or ĉ_n(t) is obtained by the overlap-add method.
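A minimal overlap-add analysis/synthesis pair illustrating block 205; the 50%-overlap square-root Hann window is a common choice assumed here, since the patent only requires the window length F to be much shorter than the reverberation time.

```python
import numpy as np

def stft(x, F):
    """Analysis: frame the signal with hop F//2, window, and take the rFFT."""
    hop = F // 2
    # periodic square-root Hann analysis window (assumed, common choice)
    win = np.sqrt(0.5 * (1 - np.cos(2 * np.pi * np.arange(F) / F)))
    n_frames = (len(x) - F) // hop + 1
    X = np.stack([np.fft.rfft(win * x[k * hop:k * hop + F])
                  for k in range(n_frames)])
    return X, win

def istft_overlap_add(X, win):
    """Synthesis: windowed inverse FFT of each frame, accumulated by overlap-add."""
    F, hop = len(win), len(win) // 2
    y = np.zeros((X.shape[0] - 1) * hop + F)
    for k, Xk in enumerate(X):
        y[k * hop:k * hop + F] += win * np.fft.irfft(Xk, F)
    return y
```

With this window pair, the shifted squared windows sum to one, so interior samples are reconstructed exactly; only the first and last half-frames lack full overlap.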
Embodiment 2.
As shown in fig. 4, the present invention further provides a low-delay audio signal overdetermined blind source separation device, which includes:
the microphone array 401 includes M microphone array elements for picking up acoustic signals of N sound sources to be separated in the target environment, where the total number of microphone array elements is required to be greater than the number of sound sources to be separated, i.e., M > N; M ≥ 3; N ≥ 2;
an A/D module 402, configured to convert the acoustic signals (analog signals) of the N sound sources to be separated picked up by the microphone array 401 into corresponding digital signals, so that they can be sent to a processor or other device executing the separation algorithm; in a MEMS microphone, the A/D module 402 may be integrated into the microphone.
The short-time Fourier transform module 403 is configured to buffer the signals collected by the microphone array and perform a short-time Fourier transform to obtain the corresponding time-frequency domain signals; the real-time blind source separation method operates in the time-frequency domain. The window length of the short-time Fourier transform required by the present invention may be much shorter than the reverberation time of the space in which the microphone array is located.
The sound source variance and demixing matrix estimation module 404 is configured to perform continuous iterative updating using the obtained time-frequency domain observation signals until convergence is reached, estimate the variance and demixing vectors of the nth sound source to be separated, construct a demixing matrix from the obtained demixing vectors, and update the demixing matrix; the specific iteration process comprises the following steps:
1) Respectively calculating the variances of all sound sources to be separated; specifically, the variance of the nth sound source to be separated is calculated using the obtained M demixing vectors;
2) Updating the weighted covariance matrix of the nth sound source to be separated;
3) Updating all the demixing vectors of the nth sound source to be separated;
4) Normalizing all the demixing vectors of the nth sound source to be separated;
5) Constructing the demixing matrix using the demixing vectors obtained in the previous step;
a mixing matrix estimation module 405, configured to invert the demixing matrix to obtain the mixing matrix;
The multi-channel Wiener filtering module 406 is configured to construct a multi-channel Wiener filter for the nth sound source to be separated based on the mixing matrix and perform filtering to obtain the time-frequency domain signal of the nth sound source to be separated; that is, it computes the multi-channel Wiener filter corresponding to each sound source to be separated, multiplies it with the microphone time-frequency domain vector to obtain the mirror image of the sound source to be separated or the first early reflected sound part of that mirror image, and takes the first element of the resulting vector as the sound source separation output signal.
The short-time inverse Fourier transform module 407 is configured to transform the N separated time-frequency domain sound source signals into time-domain waveforms, which are taken as the real sound signals of the sound sources to be separated, completing the low-delay overdetermined blind source separation of the audio signals.
Wherein the apparatus further comprises: a D/a module 408, a speaker array module 409, and a post-processing module 410;
the D/a module 408 is configured to convert the separated time domain digital signals of each channel output by the inverse short-time fourier transform module 407 into analog signals;
the speaker array module 409 plays the analog split signal through the speaker array and sends the split signal to the post-processing module 410 (e.g., a speech recognition engine, keyword recognition engine, etc.) for further processing.
It should be noted that the real-time blind source separation method described in the present invention can be implemented in various ways, such as hardware, software, or a combination of hardware and software. The hardware platform may be an FPGA, a PLD, or another application-specific integrated circuit (ASIC). The software platform may include a DSP, an ARM core, or another microprocessor. In a combination of software and hardware, for example, some modules are implemented in DSP software and others in hardware accelerators.
Finally, it should be noted that the above embodiments are only for illustrating the technical solution of the present invention and are not limiting. Although the present invention has been described in detail with reference to the embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made thereto without departing from the spirit and scope of the present invention, which is intended to be covered by the appended claims.

Claims (6)

1. A method for overdetermined blind source separation of low-delay audio signals, the method comprising:
each microphone array element in the microphone array picks up the acoustic signals of N sound sources to be separated in the target environment, converts the acoustic signals into corresponding digital signals, and then performs short-time Fourier transform on the digital signals to obtain corresponding time-frequency domain observation signals;
repeatedly iterating and updating with the obtained time-frequency domain observation signals until convergence is reached, obtaining the variance and demixing vectors of each sound source to be separated; constructing a demixing matrix from the obtained demixing vectors; inverting the demixing matrix to obtain an estimate of the mixing matrix; constructing, for each sound source to be separated, a multi-channel Wiener filter based on the mixing matrix and performing filtering to obtain the time-frequency domain signal to be separated; and then performing an inverse short-time Fourier transform to obtain the time-domain waveform of the signal to be separated;
the method specifically comprises the following steps:
step 1) the mth microphone array element in the microphone array picks up the acoustic signal s_n(t) of the nth sound source to be separated in the target environment and converts it into a corresponding digital signal, denoted as the mth microphone signal x_m(t), and a short-time Fourier transform is performed to obtain the corresponding time-frequency domain observation signal X_m(ω,k), wherein 1 ≤ n ≤ N; t is the discrete time; 1 ≤ m ≤ M; M is the total number of microphone array elements in the microphone array, k is the frame index, and ω is the frequency;
step 2) performing continuous iterative updating with the obtained time-frequency domain observation signals X_m(ω,k) until convergence is reached, estimating the variance λ_n(ω,k−l) and the demixing vectors w_{n,l}(ω) of the nth sound source to be separated, and constructing a demixing matrix from the obtained demixing vectors w_{n,l}(ω); and updating the demixing matrix W(ω), wherein 1 ≤ n ≤ N; 0 ≤ l ≤ L_n − 1; L_n represents the number of reflected sounds to be estimated of the nth sound source to be separated, and N represents the number of sound sources to be estimated;
step 3) inverting the demixing matrix W(ω) to obtain the mixing matrix H(ω);
step 4) for the nth sound source to be separated, constructing a multi-channel Wiener filter Ω_n(ω,k) of the nth sound source to be separated based on the mixing matrix H(ω) and performing filtering to obtain the time-frequency domain signal ĉ_n(ω,k) of the nth sound source to be separated;
step 5) performing an inverse short-time Fourier transform on the time-frequency domain signal ĉ_n(ω,k) of the nth sound source to be separated to obtain the corresponding time-domain waveform ĉ_n(t), which is taken as the sound signal of the real sound source to be separated, completing the low-delay overdetermined blind source separation of the audio signal;
the microphone array comprises M microphone array elements, wherein the number M of microphone array elements is greater than the total number N of sound sources to be separated, denoted as M > N;
the step 2) specifically comprises the following steps:
step 201) updating the variance λ_n(ω,k−l) of the (k−l)th frame of the nth sound source to be separated using the obtained time-frequency domain observation signal x(ω,k):

λ_n(ω,k−l) = (1/F) Σ_{ω'=1}^{F} |w_{n,l}^H(ω') x(ω',k)|²

wherein F is the window length of the short-time Fourier transform; x(ω,k) = [X_1(ω,k), …, X_M(ω,k)]^T; and w_{n,l}(ω) denotes the lth of the L_n demixing vectors corresponding to the nth sound source to be separated;
step 202) using λ_n(ω,k−l), updating the weighted covariance matrix V_{n,l}(ω,k) of the nth sound source to be separated over the most recent L_n frames:

V_{n,l}(ω,k) = α V_{n,l}(ω,k−1) + (1 − α) x(ω,k) x^H(ω,k) / λ_n(ω,k−l)

wherein α is a smoothing factor close to 1; V_{n,l}(ω,k−1) is the weighted covariance matrix of the (k−1)th frame; and H denotes the conjugate transpose;
step 203) using V_{n,l}(ω,k), updating the L_n demixing vectors w_{n,l}(ω) corresponding to the nth sound source to be separated:

w_{n,l}(ω) = (W(ω) V_{n,l}(ω,k))^{-1} e_{(L_0+…+L_{n−1})+l}

wherein, with the convention L_0 = 0, e_{(L_0+…+L_{n−1})+l} is the column vector whose (L_0+…+L_{n−1})+l-th element is 1 and whose remaining elements are all 0, and W(ω) = [w_{1,0}(ω), …, w_{1,L_1−1}(ω), …, w_{N,0}(ω), …, w_{N,L_N−1}(ω)]^H is the demixing matrix;
step 204) normalizing the updated L_n demixing vectors w_{n,l}(ω) corresponding to the nth sound source to be separated to obtain the normalized demixing vectors:

w_{n,l}(ω) ← w_{n,l}(ω) / √(w_{n,l}^H(ω) V_{n,l}(ω,k) w_{n,l}(ω))
step 205) constructing the demixing matrix W(ω) from the demixing vectors w_{n,l}(ω) obtained in step 204);
repeating steps 201) to 205) for continuous iterative updating;
if the number of iterations reaches a preset value P and convergence is reached, ending the iteration to obtain the demixing matrix;
otherwise, re-executing steps 201) to 205);
the step 4) specifically comprises:
for the nth sound source to be separated, constructing the multi-channel Wiener filter Ω_n(ω,k) of the nth sound source to be separated based on the mixing matrix H(ω):

Ω_n(ω,k) = λ_n(ω,k) h_{n,0}(ω) h_{n,0}^H(ω) Σ_x^{-1}(ω,k)

wherein

Σ_x(ω,k) = Σ_{n=1}^{N} Σ_{l=0}^{L_n−1} λ_n(ω,k−l) h_{n,l}(ω) h_{n,l}^H(ω)

is the covariance matrix of the current-frame frequency-domain microphone received signal vector x(ω,k);
filtering the current-frame frequency-domain microphone received signal vector x(ω,k) = [X_1(ω,k), …, X_M(ω,k)]^T with the obtained Ω_n(ω,k) to obtain the filtered signal c_{n,0}(ω,k):

c_{n,0}(ω,k) = Ω_n(ω,k) x(ω,k)

obtaining the time-frequency domain signal ĉ_{n,0}(ω,k) of the nth sound source to be separated from the resulting filtered signal c_{n,0}(ω,k), wherein the time-frequency domain signal ĉ_{n,0}(ω,k) of the nth sound source to be separated is the first element of c_{n,0}(ω,k).
2. The method of claim 1, wherein the sum of the numbers of all reflected sounds to be estimated is equal to the total number of microphone array elements, denoted as Σ_{n=1}^{N} L_n = M.
3. The method for overdetermined blind source separation of low-delay audio signals according to claim 1, wherein the step 3) specifically comprises:
inverting the demixing matrix W(ω) to obtain the mixing matrix H(ω):

H(ω) = [H_1(ω), …, H_N(ω)] = W^{-1}(ω)

wherein H_n(ω) = [h_{n,0}(ω), …, h_{n,L_n−1}(ω)] is a matrix of dimension M×L_n, and h_{n,l} is a column vector of dimension M×1.
4. The method for overdetermined blind source separation of low-delay audio signals according to claim 1, wherein the step 4) specifically comprises:
for the nth sound source to be separated, constructing the multi-channel Wiener filter Ω_n(ω,k) of the nth sound source to be separated based on the mixing matrix H(ω):

Ω_n(ω,k) = H_n(ω) Λ_n(ω,k) H_n^H(ω) Σ_x^{-1}(ω,k),  Λ_n(ω,k) = diag(λ_n(ω,k), …, λ_n(ω,k−L_n+1))

filtering the received signal vector x(ω,k) = [X_1(ω,k), …, X_M(ω,k)]^T with the obtained multi-channel Wiener filter to obtain the filtered signal c_n(ω,k):

c_n(ω,k) = Ω_n(ω,k) x(ω,k)

obtaining the time-frequency domain signal ĉ_n(ω,k) of the nth sound source to be separated from the resulting filtered signal c_n(ω,k), wherein the time-frequency domain signal ĉ_n(ω,k) of the nth sound source to be separated is the first element of c_n(ω,k).
5. A low-delay audio signal overdetermined blind source separation device, characterized in that the device comprises:
the microphone array (401) comprises M microphone array elements, and is used for picking up acoustic signals of N sound sources to be separated in a target environment; wherein M is greater than N;
an A/D module (402) for converting the acoustic signals of N sound sources to be separated picked up by the microphone array (401) into corresponding digital signals;
the short-time Fourier transform module (403) is used for buffering the signals acquired by the microphone array and performing short-time Fourier transform to obtain corresponding time-frequency domain signals;
the sound source variance and demixing matrix estimation module (404) is used for performing continuous iterative updating with the obtained time-frequency domain observation signals until convergence is reached, estimating the variance and demixing vectors of the nth sound source to be separated, constructing a demixing matrix from the obtained demixing vectors, and updating the demixing matrix;
a mixing matrix estimation module (405) for inverting the demixing matrix to obtain the mixing matrix;
the multi-channel Wiener filtering module (406) is used for constructing a multi-channel Wiener filter of the nth sound source to be separated based on the mixing matrix and performing filtering to obtain the time-frequency domain signal of the nth sound source to be separated; and
the short-time inverse Fourier transform module (407) is used for transforming the N separated time-frequency domain sound source signals into time-domain waveforms, which are taken as the sound signals of the real sound sources to be separated, completing the low-delay overdetermined blind source separation of the audio signals;
the method for performing overdetermined source separation on the low-delay audio signal by the device comprises the following steps:
step 1) the mth microphone array element in the microphone array picks up the acoustic signal s_n(t) of the nth sound source to be separated in the target environment and converts it into a corresponding digital signal, denoted as the mth microphone signal x_m(t), and a short-time Fourier transform is performed on it to obtain the corresponding time-frequency domain observation signal X_m(ω,k), wherein 1 ≤ n ≤ N; t is the discrete time; 1 ≤ m ≤ M; M is the total number of microphone array elements in the microphone array, k is the frame index, and ω is the frequency;
step 2) performing continuous iterative updating with the obtained time-frequency domain observation signals X_m(ω,k) until convergence is reached, estimating the variance λ_n(ω,k−l) and the demixing vectors w_{n,l}(ω) of the nth sound source to be separated, and constructing a demixing matrix from the obtained demixing vectors w_{n,l}(ω); and updating the demixing matrix W(ω), wherein 1 ≤ n ≤ N; 0 ≤ l ≤ L_n − 1; L_n represents the number of reflected sounds to be estimated of the nth sound source to be separated, and N represents the number of sound sources to be estimated;
step 3) inverting the demixing matrix W(ω) to obtain the mixing matrix H(ω);
step 4) for the nth sound source to be separated, constructing a multi-channel Wiener filter Ω_n(ω,k) of the nth sound source to be separated based on the mixing matrix H(ω) and performing filtering to obtain the time-frequency domain signal ĉ_n(ω,k) of the nth sound source to be separated;
step 5) performing an inverse short-time Fourier transform on the time-frequency domain signal ĉ_n(ω,k) of the nth sound source to be separated to obtain the corresponding time-domain waveform ĉ_n(t), which is taken as the sound signal of the real sound source to be separated, completing the low-delay overdetermined blind source separation of the audio signal;
the microphone array comprises M microphone array elements, wherein the number M of microphone array elements is greater than the total number N of sound sources to be separated, denoted as M > N;
the step 2) specifically comprises the following steps:
step 201) updating the variance λ_n(ω,k−l) of the (k−l)th frame of the nth sound source to be separated using the obtained time-frequency domain observation signal x(ω,k):

λ_n(ω,k−l) = (1/F) Σ_{ω'=1}^{F} |w_{n,l}^H(ω') x(ω',k)|²

wherein F is the window length of the short-time Fourier transform; x(ω,k) = [X_1(ω,k), …, X_M(ω,k)]^T; and w_{n,l}(ω) denotes the lth of the L_n demixing vectors corresponding to the nth sound source to be separated;
step 202) using λ_n(ω,k−l), updating the weighted covariance matrix V_{n,l}(ω,k) of the nth sound source to be separated over the most recent L_n frames:

V_{n,l}(ω,k) = α V_{n,l}(ω,k−1) + (1 − α) x(ω,k) x^H(ω,k) / λ_n(ω,k−l)

wherein α is a smoothing factor close to 1; V_{n,l}(ω,k−1) is the weighted covariance matrix of the (k−1)th frame; and H denotes the conjugate transpose;
step 203) using V_{n,l}(ω,k), updating the L_n demixing vectors w_{n,l}(ω) corresponding to the nth sound source to be separated:

w_{n,l}(ω) = (W(ω) V_{n,l}(ω,k))^{-1} e_{(L_0+…+L_{n−1})+l}

wherein, with the convention L_0 = 0, e_{(L_0+…+L_{n−1})+l} is the column vector whose (L_0+…+L_{n−1})+l-th element is 1 and whose remaining elements are all 0, and W(ω) = [w_{1,0}(ω), …, w_{1,L_1−1}(ω), …, w_{N,0}(ω), …, w_{N,L_N−1}(ω)]^H is the demixing matrix;
step 204) normalizing the updated L_n demixing vectors w_{n,l}(ω) corresponding to the nth sound source to be separated to obtain the normalized demixing vectors:

w_{n,l}(ω) ← w_{n,l}(ω) / √(w_{n,l}^H(ω) V_{n,l}(ω,k) w_{n,l}(ω))
step 205) constructing the demixing matrix W(ω) from the demixing vectors w_{n,l}(ω) obtained in step 204);
repeating steps 201) to 205) for continuous iterative updating;
if the number of iterations reaches a preset value P and convergence is reached, ending the iteration to obtain the demixing matrix;
otherwise, re-executing steps 201) to 205);
the step 4) specifically comprises:
for the nth sound source to be separated, constructing the multi-channel Wiener filter Ω_n(ω,k) of the nth sound source to be separated based on the mixing matrix H(ω):

Ω_n(ω,k) = λ_n(ω,k) h_{n,0}(ω) h_{n,0}^H(ω) Σ_x^{-1}(ω,k)

wherein

Σ_x(ω,k) = Σ_{n=1}^{N} Σ_{l=0}^{L_n−1} λ_n(ω,k−l) h_{n,l}(ω) h_{n,l}^H(ω)

is the covariance matrix of the current-frame frequency-domain microphone received signal vector x(ω,k);
filtering the current-frame frequency-domain microphone received signal vector x(ω,k) = [X_1(ω,k), …, X_M(ω,k)]^T with the obtained Ω_n(ω,k) to obtain the filtered signal c_{n,0}(ω,k):

c_{n,0}(ω,k) = Ω_n(ω,k) x(ω,k)

obtaining the time-frequency domain signal ĉ_{n,0}(ω,k) of the nth sound source to be separated from the resulting filtered signal c_{n,0}(ω,k), wherein the time-frequency domain signal ĉ_{n,0}(ω,k) of the nth sound source to be separated is the first element of c_{n,0}(ω,k).
6. The low-delay audio signal overdetermined blind source separation device of claim 5, further comprising: a D/A module (408), a speaker array module (409), and a post-processing module (410);
the D/A module (408) is used for converting the separated time domain digital signals of each channel output by the short-time inverse Fourier transform module (407) into analog signals;
the speaker array module (409) plays the analog split signal through the speaker array and sends the split signal to the post-processing module (410) for further processing.
CN202210174605.7A 2022-02-24 2022-02-24 Low-delay audio signal overdetermined blind source separation method and separation device Active CN114863944B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210174605.7A CN114863944B (en) 2022-02-24 2022-02-24 Low-delay audio signal overdetermined blind source separation method and separation device


Publications (2)

Publication Number Publication Date
CN114863944A CN114863944A (en) 2022-08-05
CN114863944B true CN114863944B (en) 2023-07-14

Family

ID=82627900


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117202077B (en) * 2023-11-03 2024-03-01 恩平市海天电子科技有限公司 Microphone intelligent correction method

Citations (4)

Publication number Priority date Publication date Assignee Title
EP2437517A1 (en) * 2010-09-30 2012-04-04 Nxp B.V. Sound scene manipulation
CN102568493A (en) * 2012-02-24 2012-07-11 大连理工大学 Underdetermined blind source separation (UBSS) method based on maximum matrix diagonal rate
CN105355212A (en) * 2015-10-14 2016-02-24 天津大学 Firm underdetermined blind separation source number and hybrid matrix estimating method and device
CN111986695A (en) * 2019-05-24 2020-11-24 中国科学院声学研究所 Non-overlapping sub-band division fast independent vector analysis voice blind separation method and system

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
US8880395B2 (en) * 2012-05-04 2014-11-04 Sony Computer Entertainment Inc. Source separation by independent component analysis in conjunction with source direction information
US10770091B2 (en) * 2016-12-28 2020-09-08 Google Llc Blind source separation using similarity measure


Non-Patent Citations (1)

Title
Taihui Wang et al., "Convolutive Transfer Function-Based Multichannel Nonnegative Matrix Factorization for Overdetermined Blind Source Separation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, pp. 802-815. *


Similar Documents

Publication Publication Date Title
Zhang et al. ADL-MVDR: All deep learning MVDR beamformer for target speech separation
CN111133511B (en) sound source separation system
Erdogan et al. Improved MVDR beamforming using single-channel mask prediction networks.
Xiao et al. Deep beamforming networks for multi-channel speech recognition
US9668066B1 (en) Blind source separation systems
CN107393550B (en) Voice processing method and device
JP5124014B2 (en) Signal enhancement apparatus, method, program and recording medium
Krueger et al. Model-based feature enhancement for reverberant speech recognition
CN109427328B (en) Multichannel voice recognition method based on filter network acoustic model
JP2002510930A (en) Separation of unknown mixed sources using multiple decorrelation methods
CN108109617A (en) A kind of remote pickup method
JP2007526511A (en) Method and apparatus for blind separation of multipath multichannel mixed signals in the frequency domain
GB2548325A (en) Acoustic source seperation systems
Zhang et al. Multi-channel multi-frame ADL-MVDR for target speech separation
Bertrand et al. Adaptive distributed noise reduction for speech enhancement in wireless acoustic sensor networks
CN114863944B (en) Low-delay audio signal overdetermined blind source separation method and separation device
KR20220022286A (en) Method and apparatus for extracting reverberant environment embedding using dereverberation autoencoder
Li et al. Taylorbeamformer: Learning all-neural beamformer for multi-channel speech enhancement from taylor's approximation theory
CN113823316B (en) Voice signal separation method for sound source close to position
CN108962276B (en) Voice separation method and device
Giacobello et al. Speech dereverberation based on convex optimization algorithms for group sparse linear prediction
CN113409804A (en) Multichannel frequency domain speech enhancement algorithm based on variable-span generalized subspace
Aroudi et al. Cognitive-driven convolutional beamforming using EEG-based auditory attention decoding
CN109243476B (en) Self-adaptive estimation method and device for post-reverberation power spectrum in reverberation voice signal
Yoshioka et al. Dereverberation by using time-variant nature of speech production system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CB03 Change of inventor or designer information

Inventor after: Wang Taihui

Inventor after: Yang Feiran

Inventor after: Sun Guohua

Inventor after: Yang Jun

Inventor before: Wang Taihui