CN113628634A

CN113628634A - Real-time voice separation method and device guided by pointing information

Info

Publication number: CN113628634A
Application number: CN202110963498.1A
Authority: CN
Inventors: 何平; 蒋升
Original assignee: Suirui Technology Group Co Ltd
Current assignee: Suirui Technology Group Co Ltd
Priority date: 2021-08-20
Filing date: 2021-08-20
Publication date: 2021-11-09
Anticipated expiration: 2041-08-20
Also published as: CN113628634B

Abstract

The invention discloses a real-time voice separation method and a device for pointing information guidance, belonging to the field of information processing, wherein the method comprises the following steps: s1: initializing a guide vector and a directional filter for a time domain signal of each microphone; s2: performing time-frequency decomposition on the initialized signal to finish the transformation from the time domain signal to the time-frequency domain signal; s3: performing separation filter calculation on the time-frequency domain signal to obtain a filter for separating the target voice from the residual signal; s4: and according to the obtained filter, obtaining a time-frequency domain signal of the target voice, and further obtaining a time-domain signal of the target voice. The method constructs the initial estimation of the real-time IVA based on the super-directional filter, and modifies the optimization function of the IVA, thereby ensuring that the separation algorithm can be rapidly converged and accurately extracting the target voice signal.

Description

Real-time voice separation method and device guided by pointing information

Technical Field

The invention belongs to the field of information processing, and particularly relates to a real-time voice separation method and device guided by pointing information.

Background

At present, a microphone array beam forming technology is widely applied to the fields of online conference systems, vehicle-mounted human-computer interaction, smart home and the like. In an actual environment, interference such as obvious noise, competing speakers and the like exists, and the listening feeling of conference communication and the accuracy of subsequent voice recognition can be obviously reduced. The most common method for reducing signal noise and improving communication quality is to perform wave beam generation based on multiple array elements of a microphone array. How to pertinently extract the voice signal of a certain direction, other noises are obviously suppressed simultaneously, and the method has important significance for improving conference communication quality, improving voice recognition rate and the like.

Independent Vector Analysis (IVA) based speech separation/picking technology is currently the most commonly used technique. Firstly, time domain signals picked up by all array elements are converted into time-frequency domains through short-time Fourier transformation, then an optimization function is constructed based on the principle that the cross entropy of separated voice is minimum, a separation matrix is updated iteratively based on the optimization function, after the separation matrix is estimated, frequency domain estimation of a target signal can be obtained, and finally time domain estimation is obtained based on inverse Fourier transformation. In some latest IVA methods, the target speech is extracted in real time by adding the distance constraint between the separation matrix and the target direction guide vector.

The main disadvantages of the prior art are as follows:

1) the existing directional IVA is restricted by directly increasing the distance between a separation matrix and a guide vector, and the accuracy of the guide vector is greatly reduced in a reverberation scene, so that the performance is obviously insufficient in the reverberation scene.

2) The directional IVA technique is not constrained in the initial estimation, resulting in too long convergence time, and if the environment changes, such as an interfering speaker moving around, the convergence speed of the IVA separation matrix cannot keep up with the speed of the change of the acoustic environment.

In view of the above, the present invention is particularly proposed.

Disclosure of Invention

The invention aims to provide a real-time voice separation method and device guided by pointing information, which are used for constructing initial estimation of a real-time IVA based on a super-pointing filter, correcting an optimization function of the IVA, ensuring that a separation algorithm can be rapidly converged and accurately extracting a target voice signal.

In order to achieve the above object, the present invention provides a real-time voice separation method guided by directional information, which is applied to a system based on a microphone array, and comprises the following steps:

s1: initializing a guide vector and a directional filter for a time domain signal of each microphone;

s2: performing time-frequency decomposition on the initialized signal to finish the transformation from the time domain signal to the time-frequency domain signal;

s3: performing separation filter calculation on the time-frequency domain signal to obtain a filter for separating the target voice from the residual signal;

s4: and according to the obtained filter, obtaining a time-frequency domain signal of the target voice, and further obtaining a time-domain signal of the target voice.

Further, step S1 is preceded by: obtaining a time-domain signal x for each microphone_m(n)；

In step S1, the method for performing the steering vector is as follows: for each frequency band k, a steering vector u (k) is calculated,

q(θ)＝[cos(θ)，sin(θ)]

wherein ,f_kK is the frequency of the kth band, K being 1,2,. K; c is sound speed, and c is 340 m/s; d_mIs the two-dimensional coordinate value of the mth microphone; q (theta) is a direction vector, omega_kIs the frequency band circle frequency;

the method for initializing the directional filter is as follows: calculating a super-directional filter h (k) for each frequency band k:

where r (k) represents the autocorrelation coefficients of each microphone of the uniform scattered field normalized with respect to the picked-up signal.

Further, the step S2 includes:

s201: for time domain signal x_m(n) performing short-time Fourier transform to obtain a time-frequency domain expression:

wherein, N is the frame length, and N is 512; w (n) is a Hamming window of length 512, l is a time frame number, and k is a frequency number. X_m(l, k) is the spectrum of the mth microphone signal, in the mth frame, in the kth frequency band;

s202: for each frequency band k, a frequency domain original vector X (l, k) is constructed:

X(l，k)＝[X₁(l，k)，X₂(l，k)，...，X_M(l，k)]^T。

further, the step S3 includes:

s301: calculating a frame-level separation guidance factor:

wherein ,r₁(l) and r₂(l) Respectively used for guiding the target voice and the residual signal;

s302: computing a separate steering matrix for each band:

ψ₁(k)＝αψ₁(k)+(1-α)r₁(l)X(l，k)X^H(l，k)

ψ₂(k)＝αψ₁(k)+(1-α)r₁(l)X(l，k)X^H(l，k)

wherein ,ψ₁(k) and ψ₂(k) A steering matrix representing the target speech and the residual signal, respectively; alpha is a smoothing factor and has a value range of 0 to 1;

s303: constructing a new optimization function for the filter separating the target speech and the residual signal, the optimization function being as follows:

wherein ,G₁(k) and G₂(k) Filters for separating target speech and residual signal respectively

S304: and minimizing the optimization function to obtain an optimal filter.

The process of minimizing the optimization function is to solve the following equation:

Ψ(k)G(k)＝ρ(k)

wherein ,

the optimal filter g (k) can be solved as:

G(k)＝Ψ^-1(k)ρ(k)。

further, the step S4 includes:

s401: and according to the optimal filter obtained by solving, further obtaining the frequency domain estimation of the target voice:

s402: performing inverse Fourier transform to obtain target voice time domain estimation:

the invention also provides a real-time voice separation device guided by the pointing information, which is applied to a system based on a microphone array and comprises an initialization module, a signal decomposition module, a separation filter calculation module and a target voice estimation module:

the initialization module is used for initializing a guide vector and a directional filter for the time domain signal of each microphone;

the signal decomposition module is used for performing time-frequency decomposition on the initialized signal to finish the transformation from a time domain signal to a time-frequency domain signal;

the separation filter calculation module is used for performing separation filter calculation on the time-frequency domain signals to obtain a filter for separating target voice and residual signals;

and the target voice estimation module is used for obtaining a time-frequency domain signal of the target voice according to the obtained filter so as to obtain a time-domain signal of the target voice.

Further, the initialization module is further configured to obtain a time domain signal x of each microphone_m(n)；

The method for the signal decomposition module to perform the steering vector is as follows: for each frequency band k, a steering vector u (k) is calculated,

q(θ)＝[cos(θ)，sin(θ)]

the method for initializing the directional filter by the signal decomposition module is as follows: calculating a super-directional filter h (k) for each frequency band k:

Further, the signal decomposition module comprises the following steps:

first, for a time domain signal x_m(n) performing short-time Fourier transform to obtain a time-frequency domain expression:

secondly, for each frequency band k, a frequency domain original vector X (l, k) is constructed:

X(l，k)＝[X₁(l，k)，X₂(l，k)，...，X_M(l，k)]^T。

further, the separation filter calculation module operates as follows:

first, a frame-level separation guidance factor is calculated:

next, a separate steering matrix for each frequency band is calculated:

ψ₁(k)＝αψ₁(k)+(1-α)r₁(l)X(l，k)X^H(l，k)

ψ₂(k)＝αψ₁(k)+(1-α)r₁(l)X(l，k)X^H(l，k)

wherein ,ψ₁(k) and ψ₂(k) A steering matrix representing the target speech and the residual signal, respectively; alpha is a smoothing factor and has a value ranging from 0 to1；

Then, a new optimization function is constructed for the filter separating the target speech and the residual signal, the optimization function is as follows:

And finally, minimizing the optimization function to obtain an optimal filter.

Ψ(k)G(k)＝ρ(k)

wherein ,

the optimal filter g (k) can be solved as:

G(K)＝Ψ^-1(k)ρ(k)。

further, the operation steps of the speech estimation module are as follows:

firstly, according to the optimal filter obtained by solving, further obtaining the frequency domain estimation of the target voice:

secondly, performing inverse Fourier transform to obtain a target voice time domain estimation:

the invention provides a real-time voice separation method and a device guided by pointing information, which have the following beneficial effects:

1. compared with the traditional IVA, the invention uses the super-directional filter to calculate the guide factor, so the convergence is faster, and the invention can adapt to the scene of the change of the acoustic environment.

2. The target function designed by the invention not only considers the difference between signals, but also increases ambiguity constraint, can obtain an optimal solution for analysis, and does not need iteration, so that the robustness is stronger, and the separation effect is more stable and reliable.

Drawings

Fig. 1 is a flowchart of a method for separating real-time voice guided by directional information according to the present embodiment.

Fig. 2 is a diagram of a hamming window function used in this embodiment.

Fig. 3 is a schematic diagram of a real-time voice separating apparatus guided by directional information according to this embodiment.

Detailed Description

The present invention will be described in further detail with reference to specific embodiments in order to make the technical field better understand the scheme of the present invention.

As shown in fig. 1, an embodiment of the present invention is a real-time voice separation method guided by directional information, which can be applied to a system based on a microphone array, such as a voice conference system, an on-vehicle voice communication system, and a human-computer interaction system, and can extract a target voice signal in real time, thereby facilitating improvement of communication quality of an on-line voice conference and improving accuracy of subsequent voice recognition.

The method specifically comprises the following four implementation steps:

s1: and initializing a guide vector and a directional filter for the time domain signal of each microphone.

Before step S1, the method further includes obtaining a voice signal of each microphone, where the obtained voice signals are as follows: let x be_m(n) represents the original time domain signal picked up by M microphone elements in real time, wherein M represents a microphone serial number label, the value of the microphone serial number label is from 1 to M, and n represents a time scaleAnd the direction of the target voice relative to the microphone array is theta.

The target voice is a voice signal corresponding to a target direction, and for a voice separation task, the target direction is known in advance according to the extracted signal, for example, for a large-screen voice communication device, a voice signal in a 90-degree direction is expected to be separated.

Specifically, the method of performing the steering vector is as follows:

for each frequency band k, a steering vector u (k) is calculated, wherein a frequency band refers to a signal component corresponding to a certain frequency:

q(θ)＝[cos(θ)，sin(θ)]

wherein ,f_kK, where K is determined according to subsequent fourier transform, and if the frame length is 512, the value of K is half of the frame length; c is sound speed, and c is 340 m/s; d_mIs the two-dimensional coordinate value of the mth microphone; superscript H represents the conjugate transpose operator; j represents an imaginary unit

q (theta) is a direction vector, omega_kIs the frequency band circle frequency.

The method for initializing the directional filter is as follows:

calculating a super-directional filter h (k) for each frequency band k:

where r (k) represents the autocorrelation coefficients of each microphone of the uniform scattered field normalized with respect to the picked up signal, with the superscript-1 representing the inverse of the matrix.

The initialization of the filter is completed through the two steps to calculate the subsequent spatial distinguishability information.

S2: and performing time-frequency decomposition on the initialized signal to finish the transformation from the time domain signal to the time-frequency domain signal.

Specifically, the method comprises the following steps:

wherein, N is the frame length, and N is 512; w (n) is a Hamming window of length 512, where n represents the number of times, and thus w (n) represents the value of each corresponding time number n; l is a time frame sequence number and takes a frame as a unit; k is a frequency number. X_m(l, k) is the spectrum of the mth microphone signal, in the mth frame, the kth frequency band. The hamming window function used in the present invention is shown in fig. 2.

X(l，k)＝[X₁(l，k)，X₂(l，k)，...，X_M(l，k)]^T

wherein, the superscript T represents the transpose operator, and the obtained original vector is an M-dimension column vector.

The transformation from the time domain signal to the time-frequency domain can be completed through the steps.

S3: and performing separation filter calculation on the time-frequency domain signal to obtain a filter for separating the target voice and the residual signal.

Specifically, the method comprises the following steps:

s301: calculating a frame-level separation guidance factor:

wherein, | represents a modulus of a complex number, r₁(l) and r₂(l) For guiding the target speech and the residual signal, respectively. The target direction refers to a direction in which the user is interested, and the target voice is a voice signal from the direction; the residual signal refers to sound and environmental noise from other directions except the target speech signal, and the residual signal can be understood as the residual signal obtained by subtracting the target speech signal from the total signal acquired by the microphone. This step is used as the next step to calculate the steering matrix, providing the priors.

S302: computing a separate steering matrix for each band:

ψ₁(k)＝αψ₁(k)+(1-α)r₁(l)X(l，k)X^H(l，k)

ψ₂(k)＝αψ₁(k)+(1-α)r₁(l)X(l，k)X^H(l，k)

wherein ,ψ₁(k) and ψ₂(k) A steering matrix representing the target speech and the residual signal, respectively; alpha is a smoothing factor, the value range is 0 to 1, the preferred value of the invention is alpha-0.85, the adoption of the value can avoid excessive dependence on historical information, and the spatial information of the signal can be fully mined.

This step leads to the calculation of a matrix directly for the updating of the subsequent separation filter.

wherein ,G₁(k) and G₂(k) Respectively, filters that separate the target speech from the residual signal.

The first term of the optimization function maximizes the difference between the split signals, and the second term can avoid ambiguity in filter estimation, ensuring that the sum of the split results is as consistent as possible with the value of the microphone signal.

S304: and minimizing the optimization function to obtain an optimal filter.

This minimization process is equivalent to solving the following equation:

Ψ(k)G(k)＝ρ(k)

wherein ,

the superscript denotes the conjugate operator.

The optimal filter g (k) can be solved as:

G(k)＝Ψ^-1(k)ρ(k)。

after solving to obtain G (k), according to the corresponding relation of the vector, the filter G for separating the target voice and the residual signal is obtained₁(k) and G₂(k)。

In step S301, a super-directional filter is used to calculate a voice separation guidance factor; in step S302, a separation guide matrix is calculated based on the guide factor; in step S303, a designed speech separation optimization function; in step S304, the formula adopted is the invented optimal separation filter calculation method, which adopts the principle of minimum mean square error, and is derived from the formula ψ G ρ, and the solution of the minimum mean square error is G ψ^-1ρ, which in turn guarantees the minimization of the defined J.

Therefore, step S3 enables the calculation of the frequency domain separation filter.

The method specifically comprises the following steps:

s402: performing inverse Fourier transform to obtain target voice time domain estimation, and further obtaining a target voice time domain signal:

the step realizes the acquisition of the target voice time domain signal.

Through the steps of the invention, the initialization, the signal decomposition, the separation filter calculation and the target voice estimation of the microphone matrix signal can be realized.

As shown in fig. 3, an embodiment of the present invention is a directional information guided real-time speech separation apparatus applied to a microphone array based system, which includes an initialization module 1, a signal decomposition module 2, a separation filter calculation module 3, and a target speech estimation module 4.

The initialization module 1 is used for initializing a steering vector and a directional filter for the voice signal of each microphone.

The initialization module 1 can also be used to obtain the voice signal of each microphone, the obtained voice signal is as follows: let x be_mAnd (n) represents original signals picked up by the M microphone elements in real time, wherein M represents a microphone serial number label, the value of the microphone serial number label is from 1 to M, n represents a time label, and the direction of the target voice relative to the microphone array is theta.

Specifically, the method of performing the steering vector is as follows:

q(θ)＝[cos(θ)，sin(θ)]

wherein ,f_kK is the frequency of the kth band, K being 1,2,. K; c is sound speed, and c is 340 m/s; d_mIs the two-dimensional coordinate value of the mth microphone; superscript H represents the conjugate transpose operator; j represents an imaginary unit

q (theta) is a direction vector, omega_kIs the frequency band circle frequency.

The method for initializing the directional filter is as follows:

calculating a super-directional filter h (k) for each frequency band k:

The signal decomposition module 2 is configured to perform time-frequency decomposition on the initialized signal, and complete the transformation from the time domain signal to the time-frequency domain signal.

Specifically, the signal decomposition module 2 operates as follows:

firstly, to time domain signals_xm (n) performing short-time Fourier transform to obtain a time-frequency domain expression:

wherein, N is the frame length, and N is 512; w (n) is a Hamming window of length 512, l is a time frame number, and k is a frequency number. X_m(l, k) is the mth microphoneSignal, in the l-th frame, the frequency spectrum of the k-th band. The hamming window function is shown in fig. 2.

X(l，k)＝[X₁(l，k)，X₂(l，k)，...，X_M(l，k)]^T

The separation filter calculation module 3 is configured to perform separation filter calculation on the time-frequency domain signal, and obtain a filter for separating the target speech and the residual signal.

Specifically, the operation steps of the separation filter calculation module 3 are as follows:

first, a frame-level separation guidance factor is calculated:

wherein, | represents a modulus of a complex number, r₁(l) and r₂(l) The target speech and the residual signal are guided separately. The above operation is used for the next step of calculating the steering matrix, providing a priori.

Next, a separate steering matrix for each frequency band is calculated:

ψ₁(k)＝αψ₁(k)+(1-α)r₁(l)X(l，k)X^H(l，k)

ψ₂(k)＝αψ₁(k)+(1-α)r₁(l)X(l，k)X^H(l，k)

wherein ,ψ₁(k) and ψ₂(k) A steering matrix representing the target speech and the residual signal, respectively; alpha is a smoothing factor, the value range is 0 to 1, and the preferred value adopted by the inventionThe value of α is 0.85, so that excessive dependence on historical information can be avoided, and spatial information of the signal can be sufficiently mined.

This operation leads to the calculation of a matrix directly for the updating of the subsequent separation filter.

And finally, minimizing the optimization function to obtain an optimal filter.

This minimization process is equivalent to solving the following equation:

Ψ(k)G(k)＝ρ(k)

wherein ,

the superscript denotes the conjugate operator.

The optimal filter g (k) can be solved as:

G(k)＝Ψ^-1(k)ρ(k)

the calculation of the frequency domain separation filter can be realized by the above operation.

And the target voice estimation module 4 is configured to obtain a time-frequency domain signal of the target voice according to the obtained filter, and further obtain a time-domain signal of the target voice.

Specifically, the target speech estimation module 4 operates as follows:

the target speech estimation module 4 can achieve the acquisition of the target speech time domain signal.

In the above embodiment, the 4 modules including the initialization module 1, the signal decomposition module 2, the separation filter calculation module 3, and the target speech estimation module 4 are all absent, and the absence of any one module can result in that the target speech cannot be extracted.

The inventive concept is explained in detail herein using specific examples, which are given only to aid in understanding the core concepts of the invention. It should be understood that any obvious modifications, equivalents and other improvements made by those skilled in the art without departing from the spirit of the present invention are included in the scope of the present invention.

Claims

1. A real-time voice separation method guided by directional information is applied to a system based on a microphone array, and is characterized by comprising the following steps:

2. The method for separating real-time voice guided by pointing information according to claim 1, wherein said step S1 is preceded by the step of: obtaining a time-domain signal x for each microphone_m(n)；

q(θ)＝[cos(θ)，sin(θ)]

3. The direction information guided real-time speech separation method according to claim 2, wherein said step S2 comprises:

wherein, N is the frame length, and N is 512; w (n) is a Hamming window with a length of 512, l is a time frame sequence number, and k is a frequency sequence number; x_m(l, k) is the spectrum of the mth microphone signal, in the mth frame, in the kth frequency band;

X(l，k)＝[X₁(l，k)，X₂(l，k)，...，X_M(l，k)]^T。

4. the direction information guided real-time speech separation method according to claim 3, wherein said step S3 comprises:

s301: calculating a frame-level separation guidance factor:

s302: computing a separate steering matrix for each band:

ψ₁(k)＝αψ₁(k)+(1-α)r₁(l)X(l，k)X^H(l，k)

ψ₂(k)＝αψ₁(k)+(1-α)r₁(l)X(l，k)X^H(l，k)

wherein ,ψ₁(k) and ψ₂(k) Representing target speech and residual signal, respectivelyA steering matrix; alpha is a smoothing factor and has a value range of 0 to 1;

S304: minimizing an optimization function to obtain an optimal filter;

Ψ(k)G(k)＝ρ(k)

wherein ,

filter g (k) can be solved as:

G(k)＝Ψ^-1(k)ρ(k)。

5. the direction information guided real-time speech separation method according to claim 4, wherein said step S4 comprises:

s401: and according to the filter obtained by solving, further obtaining the frequency domain estimation of the target voice:

6. the real-time voice separation device guided by the directional information is applied to a system based on a microphone array, and is characterized by comprising an initialization module, a signal decomposition module, a separation filter calculation module and a target voice estimation module:

7. The direction-information guided real-time speech separation apparatus of claim 6, wherein the initialization module is further configured to obtain a time-domain signal x for each microphone_m(n)；

q(θ)＝[cos(θ)，sin(θ)]

wherein ,f_kFrequency of the k-th frequency bandA rate, K ═ 1,2,. K; c is sound speed, and c is 340 m/s; d_mIs the two-dimensional coordinate value of the mth microphone; q (theta) is a direction vector, omega_kIs the frequency band circle frequency;

8. The direction-information guided real-time speech separation apparatus of claim 7, wherein the signal decomposition module operates as follows:

X(l，k)＝[X₁(l，k)，X₂(l，k)，...，X_M(l，k)]^T。

9. the direction-information guided real-time speech separation apparatus of claim 8, wherein said separation filter computation module operates as follows:

first, a frame-level separation guidance factor is calculated:

next, a separate steering matrix for each frequency band is calculated:

ψ₁(k)＝αψ₁(k)+(1-α)r₁(l)X(l，k)X^H(l，k)

ψ₂(k)＝αψ₁(k)+(1-α)r₁(l)X(l，k)X^H(l，k)

wherein ,ψ₁(k) and ψ₂(k) A steering matrix representing the target party speech and the residual signal, respectively; alpha is a smoothing factor and has a value range of 0 to 1;

Finally, minimizing an optimization function to obtain an optimal filter;

Ψ(k)G(k)＝ρ(k)

wherein ,

the optimal filter g (k) can be solved as:

G(k)＝Ψ^-1(k)ρ(k)。

10. the direction information guided real-time speech separation apparatus of claim 9, wherein the speech estimation module operates as follows: