CN113628634B - Real-time voice separation method and device guided by directional information - Google Patents


Info

Publication number
CN113628634B
Authority
CN
China
Prior art keywords
time
signal
filter
frequency
target voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110963498.1A
Other languages
Chinese (zh)
Other versions
CN113628634A (en)
Inventor
何平
蒋升
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suirui Technology Group Co Ltd
Original Assignee
Suirui Technology Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suirui Technology Group Co Ltd filed Critical Suirui Technology Group Co Ltd
Priority to CN202110963498.1A priority Critical patent/CN113628634B/en
Publication of CN113628634A publication Critical patent/CN113628634A/en
Application granted granted Critical
Publication of CN113628634B publication Critical patent/CN113628634B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272: Voice signal separating
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00: Reducing energy consumption in communication networks
    • Y02D30/70: Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The invention discloses a real-time voice separation method and device guided by directional information, belonging to the field of information processing. The method comprises the following steps: S1: initializing a steering vector and a directional filter for the time-domain signal of each microphone; S2: performing time-frequency decomposition on the initialized signals to complete the transformation from time-domain signals to time-frequency-domain signals; S3: performing separation-filter calculation on the time-frequency-domain signals to obtain a filter for separating the target voice from the residual signal; S4: obtaining the time-frequency-domain signal of the target voice according to the obtained filter, and from it the time-domain signal of the target voice. The invention constructs the initial estimate of real-time IVA from a super-directive filter and corrects the IVA optimization function, so that the separation algorithm converges quickly and the target voice signal is extracted accurately.

Description

Real-time voice separation method and device guided by directional information
Technical Field
The invention belongs to the field of information processing, and particularly relates to a real-time voice separation method and device guided by directional information.
Background
At present, microphone-array beamforming technology is widely applied in online conference systems, vehicle-mounted human-machine interaction, smart homes, and other fields. In real environments, noise and interference from competing speakers are significant and can markedly degrade the listening quality of conference communication and the accuracy of subsequent speech recognition. Beamforming based on the multiple elements of a microphone array is the most commonly used method for reducing signal noise and improving communication quality. Extracting the voice signal from a given direction in a targeted way while strongly suppressing other noise is therefore of great significance for improving conference communication quality, raising the speech recognition rate, and so on.
Independent vector analysis (IVA) is currently the most common speech separation/pickup technique. First, the time-domain signals picked up by all array elements are converted to the time-frequency domain by the short-time Fourier transform; then an optimization function is constructed on the principle of minimum cross-entropy between the separated voices, and the separation matrix is iteratively updated based on this function. Once the separation matrix is estimated, the frequency-domain estimate of the target signal is obtained, and finally the time-domain estimate is obtained by the inverse Fourier transform. Some of the latest IVA methods add a distance constraint between the separation matrix and the target-direction steering vector, so that the IVA separation result can extract the target voice in real time.
The main disadvantages of the prior art are as follows:
1) Existing directional IVA directly adds a constraint on the distance between the separation matrix and the steering vector; because the accuracy of the steering vector drops sharply under reverberation, its performance in reverberant scenes is clearly insufficient.
2) The directional IVA technique does not constrain the initial estimate, so convergence takes too long; if the environment changes, for example an interfering speaker is walking, the convergence rate of the IVA separation matrix cannot keep up with the rate of change of the acoustic environment.
In view of this, the present invention has been made.
Disclosure of Invention
The invention aims to provide a real-time voice separation method and device guided by directional information, which construct the initial estimate of real-time IVA from a super-directive filter and correct the IVA optimization function, so that the separation algorithm converges quickly and the target voice signal is extracted accurately.
In order to achieve the above object, the present invention provides a real-time voice separation method guided by directional information, applied to a microphone-array-based system, comprising the following steps:
S1: initializing a steering vector and a directional filter for the time-domain signal of each microphone;
S2: performing time-frequency decomposition on the initialized signals to complete the transformation from time-domain signals to time-frequency-domain signals;
S3: performing separation-filter calculation on the time-frequency-domain signals to obtain a filter for separating the target voice from the residual signal;
S4: obtaining the time-frequency-domain signal of the target voice according to the obtained filter, and from it the time-domain signal of the target voice.
Further, before step S1, the method further includes: acquiring the time-domain signal x_m(n) of each microphone.
In step S1, the steering vector is initialized as follows: for each frequency band k, a steering vector u(k) is calculated:
u(k) = [exp(-jω_k d_1^T q(θ)/c), exp(-jω_k d_2^T q(θ)/c), ..., exp(-jω_k d_M^T q(θ)/c)]^T
q(θ) = [cos(θ), sin(θ)]
where f_k, k = 1, 2, ..., K, is the frequency of the k-th band; c is the speed of sound, c = 340 m/s; d_m is the two-dimensional coordinate vector of the m-th microphone; q(θ) is the direction vector; and ω_k = 2πf_k is the angular frequency of band k.
The directional filter is initialized as follows: a super-directive filter h(k) is calculated for each frequency band k:
h(k) = R^{-1}(k)u(k) / (u^H(k)R^{-1}(k)u(k))
where R(k) denotes the correlation matrix of the microphone signals under a uniformly scattered (diffuse) field, normalized with respect to the picked-up signal.
Further, step S2 includes:
S201: performing a short-time Fourier transform on the time-domain signal x_m(n) to obtain its time-frequency-domain representation:
X_m(l,k) = Σ_{n=0}^{N-1} w(n) x_m(n_l + n) exp(-j2πkn/N)
where N is the frame length, N = 512; w(n) is a Hamming window of length 512; n_l is the starting sample of the l-th frame; l is the time-frame index; k is the frequency index; and X_m(l,k) is the spectrum of the m-th microphone signal in the k-th frequency band of the l-th frame.
S202: for each frequency band k, constructing the frequency-domain original vector X(l,k):
X(l,k) = [X_1(l,k), X_2(l,k), ..., X_M(l,k)]^T
further, the step S3 includes:
s301: calculating a frame level separation guide factor:
wherein ,r1(l) and r2 (l) Respectively used for guiding the voice of the target party and the residual signal;
s302: calculating a separate steering matrix for each frequency band:
ψ 1 (k)=αψ 1 (k)+(1-α)r 1 (l)X(l,k)X H (l,k)
ψ 2 (k)=αψ 1 (k)+(1-α)r 1 (l)X(l,k)X H (l,k)
wherein ,ψ1(k) and ψ2 (k) A guide matrix representing the target voice and the residual signal respectively; alpha is a smoothing factor, and the value range is 0 to 1;
s303: a new optimization function is constructed for the filter separating the target speech from the residual signal, the optimization function being as follows:
wherein ,G1(k) and G2 (k) Filters for separating target speech from residual signal
S304: and minimizing the optimization function to obtain the optimal filter.
The process of minimizing the optimization function is to solve the following equation:
Ψ(k)G(k)=ρ(k)
wherein ,
the optimal filter G (k) can be solved as:
G(k)=Ψ -1 (k)ρ(k)。
further, the step S4 includes:
s401: according to the optimal filter obtained by solving, further obtaining the frequency domain estimation of the target voice:
s402: performing inverse Fourier transform to obtain target voice time domain estimation:
the invention also provides a real-time voice separation device guided by the pointing information, which is applied to a system based on a microphone array and comprises an initialization module, a signal decomposition module, a separation filter calculation module and a target voice estimation module:
the initialization module is used for initializing a steering vector and a directional filter for the time domain signal of each microphone;
the signal decomposition module is used for performing time-frequency decomposition on the initialized signals to finish the transformation from the time domain signals to the time-frequency domain signals;
the separation filter calculation module is used for carrying out separation filter calculation on the time-frequency domain signals to obtain a filter for separating target voice from residual signals;
the target voice estimation module is used for obtaining a time-frequency domain signal of the target voice according to the obtained filter, and further obtaining a time-domain signal of the target voice.
Further, the initialization module is further configured to acquire the time-domain signal x_m(n) of each microphone.
The initialization module initializes the steering vector as follows: for each frequency band k, a steering vector u(k) is calculated:
u(k) = [exp(-jω_k d_1^T q(θ)/c), exp(-jω_k d_2^T q(θ)/c), ..., exp(-jω_k d_M^T q(θ)/c)]^T
q(θ) = [cos(θ), sin(θ)]
where f_k, k = 1, 2, ..., K, is the frequency of the k-th band; c is the speed of sound, c = 340 m/s; d_m is the two-dimensional coordinate vector of the m-th microphone; q(θ) is the direction vector; and ω_k = 2πf_k is the angular frequency of band k.
The initialization module initializes the directional filter as follows: a super-directive filter h(k) is calculated for each frequency band k:
h(k) = R^{-1}(k)u(k) / (u^H(k)R^{-1}(k)u(k))
where R(k) denotes the correlation matrix of the microphone signals under a uniformly scattered field, normalized with respect to the picked-up signal.
Further, the operation steps of the signal decomposition module are as follows:
First, a short-time Fourier transform is performed on the time-domain signal x_m(n) to obtain its time-frequency-domain representation:
X_m(l,k) = Σ_{n=0}^{N-1} w(n) x_m(n_l + n) exp(-j2πkn/N)
where N is the frame length, N = 512; w(n) is a Hamming window of length 512; n_l is the starting sample of the l-th frame; l is the time-frame index; and k is the frequency index. X_m(l,k) is the spectrum of the m-th microphone signal in the k-th frequency band of the l-th frame.
Next, for each frequency band k, the frequency-domain original vector X(l,k) is constructed:
X(l,k) = [X_1(l,k), X_2(l,k), ..., X_M(l,k)]^T
further, the operation steps of the separation filter calculation module are as follows:
first, a frame level separation guide factor is calculated:
wherein ,r1(l) and r2 (l) Respectively used for guiding the target voice and the residual signal;
secondly, a separate steering matrix for each frequency band is calculated:
ψ 1 (k)=αψ 1 (k)+(1-α)r 1 (l)X(l,k)X H (l,k)
ψ 2 (k)=αψ 1 (k)+(1-α)r 1 (l)X(l,k)X H (l,k)
wherein ,ψ1(k) and ψ2 (k) Representing the target speech and the residual signal respectivelyIs a pilot matrix of (a); alpha is a smoothing factor, and the value range is 0 to 1;
then, a new optimization function is constructed for the filter for separating the target voice from the residual signal, and the optimization function is as follows:
wherein ,G1(k) and G2 (k) Filters for separating target speech from residual signal
And finally, minimizing an optimization function to obtain an optimal filter.
The process of minimizing the optimization function is to solve the following equation:
Ψ(k)G(k)=ρ(k)
wherein ,
the optimal filter G (k) can be solved as:
G(K)=Ψ -1 (k)ρ(k)。
further, the operation steps of the voice estimation module are as follows:
firstly, according to the optimal filter obtained by solving, further obtaining the frequency domain estimation of the target voice:
then, performing inverse Fourier transform to obtain target voice time domain estimation:
the real-time voice separation method and device for guiding the pointing information provided by the invention have the following beneficial effects:
1. compared with the traditional IVA, the invention calculates the guide factor by using the super-directional filter, so that the convergence is faster, and the invention can adapt to the scene of acoustic environment change.
2. The objective function designed by the invention considers the difference between signals, increases the ambiguity constraint, can obtain the analytic optimal solution, and does not need iteration, so that the robustness is stronger, and the separation effect is more stable and reliable.
Drawings
Fig. 1 is a flowchart of the real-time voice separation method guided by directional information in this embodiment.
Fig. 2 is a schematic diagram of the Hamming window function used in this embodiment.
Fig. 3 is a schematic diagram of the real-time voice separation device guided by directional information in this embodiment.
Detailed Description
In order that those skilled in the art may better understand the present invention, it is described in further detail below with reference to specific embodiments.
As shown in fig. 1, an embodiment of the present invention is a real-time voice separation method guided by directional information, which can be applied to microphone-array-based systems such as voice conference systems, vehicle-mounted voice communication systems and human-machine interaction systems. It can extract the target voice signal in real time, which helps improve the communication quality of online voice conferences and the accuracy of subsequent speech recognition.
The method specifically comprises the following four implementation steps:
s1: steering vector and directional filter initialization is performed on the time domain signal of each microphone.
Before step S1, the method further includes acquiring the voice signal of each microphone, as follows: let x_m(n) denote the original time-domain signal picked up in real time by the M microphone array elements, where m denotes the microphone index with values from 1 to M, n denotes the time index, and the direction of the target voice relative to the microphone array is θ.
The target voice refers to the voice signal coming from the target direction; for the voice separation task, the target direction of the signal to be extracted is known in advance. For example, for a large-screen voice communication device, it is desirable to separate the voice signal arriving from the 90-degree direction.
Specifically, the steering vector is initialized as follows:
For each frequency band k, a steering vector u(k) is calculated, where a frequency band refers to the signal component corresponding to a certain frequency:
u(k) = [exp(-jω_k d_1^T q(θ)/c), exp(-jω_k d_2^T q(θ)/c), ..., exp(-jω_k d_M^T q(θ)/c)]^T
q(θ) = [cos(θ), sin(θ)]
where f_k, k = 1, 2, ..., K, is the frequency of the k-th band, and K is determined by the subsequent Fourier transform: for a frame length of 512, K is half the frame length; c is the speed of sound, c = 340 m/s; d_m is the two-dimensional coordinate vector of the m-th microphone; the superscript H denotes the conjugate transpose operator; j denotes the imaginary unit; q(θ) is the direction vector; and ω_k = 2πf_k is the angular frequency of band k.
The directional filter is initialized as follows:
A super-directive filter h(k) is calculated for each frequency band k:
h(k) = R^{-1}(k)u(k) / (u^H(k)R^{-1}(k)u(k))
where R(k) denotes the correlation matrix of the microphone signals under a uniformly scattered (diffuse) field, normalized with respect to the picked-up signal, and the superscript -1 denotes the matrix inverse.
These two steps complete the initialization of the filter, which serves the subsequent calculation of the spatial discrimination information.
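As an illustration of this initialization, the following minimal Python sketch computes the steering vectors and super-directive filters for a two-dimensional array; the diffuse-field coherence model sinc(2 f_k ||d_m - d_n||/c) for R(k) and the diagonal loading term are common engineering assumptions, not values specified by the invention.

```python
# Illustrative sketch of step S1 (not taken verbatim from the patent): computing
# the steering vector u(k) and the super-directive filter h(k) per frequency band.
# The diffuse-field model R_mn(k) = sinc(2 f_k ||d_m - d_n|| / c) and the 1e-3
# diagonal loading are assumed engineering choices.
import numpy as np

def init_steering_and_filter(mic_xy, theta, fs=16000, frame_len=512, c=340.0):
    """mic_xy: (M, 2) microphone coordinates in meters; theta: target direction
    in radians. Returns steering vectors u and filters h, each of shape (K, M)."""
    M = mic_xy.shape[0]
    K = frame_len // 2                                 # K = half the frame length
    f = np.arange(1, K + 1) * fs / frame_len           # band frequencies f_k
    omega = 2.0 * np.pi * f                            # band angular frequencies w_k
    q = np.array([np.cos(theta), np.sin(theta)])       # direction vector q(theta)
    delays = mic_xy @ q / c                            # propagation delays d_m^T q / c
    u = np.exp(-1j * omega[:, None] * delays[None, :]) # steering vectors, (K, M)

    dist = np.linalg.norm(mic_xy[:, None, :] - mic_xy[None, :, :], axis=-1)
    h = np.empty_like(u)
    for k in range(K):
        R = np.sinc(2.0 * f[k] * dist / c) + 1e-3 * np.eye(M)  # diffuse-field R(k)
        Ru = np.linalg.solve(R, u[k])                          # R^{-1}(k) u(k)
        h[k] = Ru / (u[k].conj() @ Ru)                         # super-directive filter
    return u, h
```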
S2: and performing time-frequency decomposition on the initialized signal to complete the conversion from the time domain signal to the time-frequency domain signal.
Specifically, the method comprises the following steps:
S201: a short-time Fourier transform is performed on the time-domain signal x_m(n) to obtain its time-frequency-domain representation:
X_m(l,k) = Σ_{n=0}^{N-1} w(n) x_m(n_l + n) exp(-j2πkn/N)
where N is the frame length, N = 512; w(n) is a Hamming window of length 512, in which n denotes the sample index within the frame, so that w(n) is the window value at each sample index n; n_l is the starting sample of the l-th frame; l is the time-frame index, in frames; and k is the frequency index. X_m(l,k) is the spectrum of the m-th microphone signal in the k-th frequency band of the l-th frame. The Hamming window function used in the present invention is shown in fig. 2.
S202: for each frequency band k, the frequency-domain original vector X(l,k) is constructed:
X(l,k) = [X_1(l,k), X_2(l,k), ..., X_M(l,k)]^T
where the superscript T denotes the transpose operator; the resulting original vector is an M-dimensional column vector.
The conversion from the time domain signal to the time-frequency domain can be completed through the steps.
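A minimal sketch of this time-frequency decomposition is given below; the 50% overlap (hop of N/2) is an assumption, since the invention fixes N = 512 and the Hamming window but does not state the hop size.

```python
# Sketch of step S2: multichannel STFT with N = 512 and a Hamming window.
# The hop size of N/2 (50% overlap) is assumed, not specified by the patent.
import numpy as np

def stft_multichannel(x, frame_len=512, hop=256):
    """x: (M, n_samples) time-domain microphone signals.
    Returns X of shape (L, K, M): frame l, band k, microphone m."""
    M, n_samples = x.shape
    w = np.hamming(frame_len)                        # analysis window w(n)
    L = (n_samples - frame_len) // hop + 1           # number of frames
    K = frame_len // 2                               # keep bands k = 1..N/2
    X = np.empty((L, K, M), dtype=complex)
    for l in range(L):
        frame = x[:, l * hop : l * hop + frame_len] * w  # windowed frame per mic
        spec = np.fft.rfft(frame, axis=-1)               # per-mic spectrum, (M, N/2+1)
        X[l] = spec[:, 1 : K + 1].T                      # X(l,k) as an M-vector per band
    return X
```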
S3: and performing separation filter calculation on the time-frequency domain signal to obtain a filter for separating the target voice from the residual signal.
Specifically, the method comprises the following steps:
S301: the frame-level separation guide factors r_1(l) and r_2(l) are calculated from the output of the super-directive filter,
where |·| denotes the modulus of a complex number, and r_1(l) and r_2(l) are used to guide the target voice and the residual signal, respectively. The target direction is the direction of interest to the user, and the target voice is the voice signal coming from that direction; the residual signal comprises the sounds from other directions and the environmental noise, i.e., the residual signal equals the total signal picked up by the microphones minus the target voice signal. This step provides the prior for the steering-matrix calculation in the next step.
S302: the steering matrices for each frequency band are calculated:
ψ_1(k) = αψ_1(k) + (1-α) r_1(l) X(l,k)X^H(l,k)
ψ_2(k) = αψ_2(k) + (1-α) r_2(l) X(l,k)X^H(l,k)
where ψ_1(k) and ψ_2(k) are the steering matrices of the target voice and the residual signal, respectively; α is a smoothing factor with value range 0 to 1. The invention preferably takes α = 0.85, which avoids over-reliance on historical information while fully exploiting the spatial information of the signal.
This step computes the steering matrices, which are used directly in the subsequent update of the separation filter.
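The recursion above can be sketched for one incoming frame as follows; the guide factors r_1(l) and r_2(l) are taken as given inputs (step S301), since their closed-form expression is not reproduced here.

```python
# Sketch of the steering-matrix recursion of step S302 for one frame l.
# psi1/psi2 hold the running steering matrices; r1/r2 come from step S301.
import numpy as np

def update_steering_matrices(psi1, psi2, X_l, r1, r2, alpha=0.85):
    """psi1, psi2: (K, M, M) steering matrices of target and residual.
    X_l: (K, M) frequency-domain vectors X(l,k) of the current frame."""
    outer = X_l[:, :, None] * X_l[:, None, :].conj()  # X(l,k) X^H(l,k) per band
    psi1 = alpha * psi1 + (1.0 - alpha) * r1 * outer  # target steering matrix
    psi2 = alpha * psi2 + (1.0 - alpha) * r2 * outer  # residual steering matrix
    return psi1, psi2
```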
S303: a new optimization function is constructed for the filter separating the target speech from the residual signal, the optimization function being as follows:
where G_1(k) and G_2(k) are the filters separating the target speech and the residual signal, respectively.
The first term of the optimization function maximizes the difference between the separated signals; the second term avoids ambiguity in the filter estimate and ensures that the sum of the separation results stays as consistent as possible with the microphone signal.
S304: and minimizing the optimization function to obtain an optimal filter.
This minimization process is equivalent to solving the following equation:
Ψ(k)G(k) = ρ(k)
where Ψ(k) and ρ(k) are determined by the steering matrices and the constraint term of the optimization function; the superscript * denotes the conjugate operator.
The optimal filter G(k) is solved as:
G(k) = Ψ^{-1}(k)ρ(k).
After G(k) is solved, the filters G_1(k) and G_2(k) separating the target voice and the residual signal are obtained according to the correspondence of the vectors.
In step S301, the super-directive filter is used to calculate the speech separation guide factors; in step S302, the steering matrices are calculated from the guide factors; step S303 gives the designed speech-separation optimization function; and step S304 uses the invented optimal-separation-filter calculation, which follows the minimum-mean-square-error principle: from Ψ(k)G(k) = ρ(k), the minimum-mean-square-error solution is G(k) = Ψ^{-1}(k)ρ(k), which in turn minimizes the defined cost J.
Step S3 thus realizes the calculation of the frequency-domain separation filter.
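Since the closed forms of Ψ(k) and ρ(k) were given as figures in the original filing, the sketch below only shows the final per-band linear solve, with Ψ(k) and ρ(k) supplied by the caller; the stacked layout of G(k) (G_1 above G_2) is an assumption.

```python
# Hedged sketch of step S304: the per-band solve G(k) = Psi^{-1}(k) rho(k).
# Psi and rho must be assembled by the caller from psi1, psi2 and the
# constraint term; D = 2M stacking G1 over G2 is an assumed layout.
import numpy as np

def solve_separation_filters(Psi, rho):
    """Psi: (K, D, D); rho: (K, D). Returns G of shape (K, D);
    with the assumed layout, G1 = G[:, :D//2] and G2 = G[:, D//2:]."""
    return np.linalg.solve(Psi, rho)  # batched per-band linear solve
```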
S4: and obtaining a time-frequency domain signal of the target voice according to the obtained filter, and further obtaining a time-domain signal of the target voice.
The method specifically comprises the following steps:
s401: according to the optimal filter obtained by solving, further obtaining the frequency domain estimation of the target voice:
s402: performing inverse Fourier transform to obtain target voice time domain estimation, and further obtaining a target voice time domain signal:
this step achieves the acquisition of the target speech time domain signal.
Through the above steps of the invention, the initialization of the microphone-array signals, the signal decomposition, the calculation of the separation filter, and the estimation of the target voice are all realized.
As shown in fig. 3, an embodiment of the present invention is a real-time voice separation device guided by directional information, applied to a microphone-array-based system, comprising an initialization module 1, a signal decomposition module 2, a separation-filter calculation module 3, and a target voice estimation module 4.
The initialization module 1 is used for initializing steering vectors and directional filters for voice signals of each microphone.
The initialization module 1 can also be used to acquire the voice signal of each microphone, as follows: let x_m(n) denote the original signal picked up in real time by the M microphone array elements, where m denotes the microphone index with values from 1 to M, n denotes the time index, and the direction of the target voice relative to the microphone array is θ.
Specifically, the steering vector is initialized as follows:
For each frequency band k, a steering vector u(k) is calculated, where a frequency band refers to the signal component corresponding to a certain frequency:
u(k) = [exp(-jω_k d_1^T q(θ)/c), exp(-jω_k d_2^T q(θ)/c), ..., exp(-jω_k d_M^T q(θ)/c)]^T
q(θ) = [cos(θ), sin(θ)]
where f_k, k = 1, 2, ..., K, is the frequency of the k-th band; c is the speed of sound, c = 340 m/s; d_m is the two-dimensional coordinate vector of the m-th microphone; the superscript H denotes the conjugate transpose operator; j denotes the imaginary unit; q(θ) is the direction vector; and ω_k = 2πf_k is the angular frequency of band k.
The directional filter is initialized as follows:
A super-directive filter h(k) is calculated for each frequency band k:
h(k) = R^{-1}(k)u(k) / (u^H(k)R^{-1}(k)u(k))
where R(k) denotes the correlation matrix of the microphone signals under a uniformly scattered field, normalized with respect to the picked-up signal, and the superscript -1 denotes the matrix inverse.
These two steps complete the initialization of the filter, which serves the subsequent calculation of the spatial discrimination information.
The signal decomposition module 2 is configured to perform time-frequency decomposition on the initialized signal, and complete conversion from a time domain signal to a time-frequency domain signal.
Specifically, the operation steps of the signal decomposition module 2 are as follows:
first, a time domain signal x m (n) performs short-time Fourier transform to obtain a time-frequency domain expression:
where N is the frame length, n=512; w (n) is a Hamming window of length 512, and l is timeFrame number, k is frequency number. X is X m (l, k) is the spectrum of the mth microphone signal in the kth frequency band in the first frame. The hamming window function is shown in fig. 2.
Next, for each frequency band k, a frequency domain original vector X (l, k) is constructed:
X(l,k)=[X 1 (l,k),X 2 (l,k),...,X M (l,k)] T
wherein the superscript T represents the transpose operator, resulting in the original vector being an M-dimensional column vector.
The conversion from the time domain signal to the time-frequency domain can be completed through the steps.
The separation filter calculation module 3 is configured to perform separation filter calculation on the time-frequency domain signal, and obtain a filter for separating the target voice from the residual signal.
Specifically, the operation steps of the separation filter calculation module 3 are as follows:
first, a frame level separation guide factor is calculated:
wherein, I represents taking the modulus of complex numbers, r 1(l) and r2 (l) The target speech and the residual signal are directed separately. The above operation calculates the steering matrix with the next step, providing a priori.
Next, the steering matrices for each frequency band are calculated:
ψ_1(k) = αψ_1(k) + (1-α) r_1(l) X(l,k)X^H(l,k)
ψ_2(k) = αψ_2(k) + (1-α) r_2(l) X(l,k)X^H(l,k)
where ψ_1(k) and ψ_2(k) are the steering matrices of the target voice and the residual signal, respectively; α is a smoothing factor with value range 0 to 1. The invention preferably takes α = 0.85, which avoids over-reliance on historical information while fully exploiting the spatial information of the signal.
This operation computes the steering matrices, which are used directly in the subsequent update of the separation filter.
Then, a new optimization function is constructed for the filters separating the target voice from the residual signal, where G_1(k) and G_2(k) are the filters separating the target speech and the residual signal, respectively.
The first term of the optimization function maximizes the difference between the separated signals; the second term avoids ambiguity in the filter estimate and ensures that the sum of the separation results stays as consistent as possible with the microphone signal.
Finally, the optimization function is minimized to obtain the optimal filter.
This minimization process is equivalent to solving the following equation:
Ψ(k)G(k) = ρ(k)
where Ψ(k) and ρ(k) are determined by the steering matrices and the constraint term of the optimization function; the superscript * denotes the conjugate operator.
The optimal filter G(k) is solved as:
G(k) = Ψ^{-1}(k)ρ(k).
The above operations realize the calculation of the frequency-domain separation filter.
The target voice estimation module 4 is configured to obtain a time-frequency domain signal of the target voice according to the obtained filter, and further obtain a time-domain signal of the target voice.
Specifically, the operation steps of the target voice estimation module 4 are as follows:
First, the frequency-domain estimate of the target voice is obtained from the solved optimal filter;
then, the inverse Fourier transform is performed to obtain the time-domain estimate of the target voice.
The target voice estimation module 4 thus realizes the acquisition of the target voice time-domain signal.
In the above embodiment, the four modules, namely the initialization module 1, the signal decomposition module 2, the separation-filter calculation module 3, and the target voice estimation module 4, are all indispensable; if any one of them is missing, the target voice cannot be extracted.
Specific examples are set forth herein to illustrate the invention in detail; the description of the above examples is only intended to aid in understanding the core concept of the invention. It should be noted that any obvious modifications, equivalents, or other improvements made by those skilled in the art without departing from the inventive concept shall fall within the scope of the present invention.

Claims (4)

1. A real-time voice separation method guided by directional information, applied to a microphone-array-based system, comprising the following steps:
step S1: initializing a steering vector and a directional filter for the time domain signal of each microphone;
step S2: performing time-frequency decomposition on the initialized signals to finish the transformation from the time domain signals to the time-frequency domain signals;
step S3: performing separation filter calculation on the time-frequency domain signal to obtain a filter for separating target voice from residual signals;
step S4: obtaining a time-frequency domain signal of the target voice according to the obtained filter, and further obtaining a time-domain signal of the target voice;
the step S1 further includes: acquiring a time-domain signal x for each microphone m (n);
In the step S1, the method for performing the steering vector is as follows: for each frequency band k, a steering vector u (k) is calculated,
q(θ)=[cos(θ),sin(θ)]
wherein ,fk K=1, 2, for the frequency of the kth band; c is the speed of sound, c=340 m/s; d, d m Two-dimensional coordinate values for the mth microphone; q (θ) is a direction vector, ω k Is the frequency band circle frequency;
the method for initializing the directional filter is as follows: a super-steering filter h (k) is calculated for each frequency band k:
wherein R (k) represents the normalized autocorrelation coefficients of the individual microphones of the uniform scattered field with respect to the picked-up signal;
the step S2 includes:
s201: for time domain signal x m (n) performing a short-time fourier transform to obtain a time-frequency domain representation:
where N is the frame length, n=512; w (n) is a hamming window with length 512, 1 is a time frame number, and k is a frequency number; x is X m (l, k) is the spectrum of the mth microphone signal in the kth frequency band in frame 1;
s202: for each frequency band k, a frequency domain original vector X (l, k) is constructed:
X(l,k)=[X 1 (l,k)X 2 (l,k),...,X M (l,k)] T
the step S3 includes:
s301: calculating a frame level separation guide factor:
wherein ,r1(l) and r2 (l) Respectively used for guiding the target voice and the residual signal;
s302: calculating a separate steering matrix for each frequency band:
ψ 1 (k)=αψ 1 (k)+(1-α)r 1 (l)X(l,k)X H (l,k)
ψ 2 (k)=αψ 1 (k)+(1-α)r 1 (l)X(l,k)X H (l,k)
wherein ,ψ1(k) and ψ2 (k) A guide matrix representing the target voice and the residual signal respectively; alpha is a smoothing factor, and the value range is 0 to 1;
s303: a new optimization function is constructed for the filter separating the target speech from the residual signal, the optimization function being as follows:
wherein ,G1(k) and G2 (k) Filters for separating target speech from residual signal
S304: minimizing the optimization function to obtain an optimal filter;
the process of minimizing the optimization function is to solve the following equation:
Ψ(k)G(k)=ρ(k)
wherein ,
the filter G (k) is solved as:
G(k)=Ψ -1 (k) ρ (middle).
2. The real-time voice separation method guided by directional information according to claim 1, wherein the step S4 includes:
S401: obtaining the frequency-domain estimate of the target voice from the solved filter;
S402: performing the inverse Fourier transform to obtain the time-domain estimate of the target voice.
3. A real-time voice separation device guided by directional information, applied to a microphone-array-based system, characterized by comprising an initialization module, a signal decomposition module, a separation-filter calculation module and a target voice estimation module:
the initialization module is used for initializing a steering vector and a directional filter for the time domain signal of each microphone;
the signal decomposition module is used for performing time-frequency decomposition on the initialized signals to finish the transformation from the time domain signals to the time-frequency domain signals;
the separation filter calculation module is used for carrying out separation filter calculation on the time-frequency domain signals to obtain a filter for separating target voice from residual signals;
the target voice estimation module is used for obtaining a time-frequency domain signal of target voice according to the obtained filter, and further obtaining a target voice time domain signal;
the initialization module is further configured to obtain a time domain signal x of each microphone m (n);
The method for guiding the vector by the signal decomposition module is as follows: for each frequency band k, a steering vector u (k) is calculated,
q(θ)=[cos(θ),sin(θ)]
wherein ,fk K=1, 2, for the frequency of the kth band; c is the speed of sound, c=340 m/s; d, d m Two-dimensional coordinate values for the mth microphone; q (θ) is a direction vector, ω k Is the frequency band circle frequency;
the method for initializing the directional filter by the signal decomposition module is as follows: a super-steering filter h (k) is calculated for each frequency band k:
wherein R (k) represents the normalized autocorrelation coefficients of the individual microphones of the uniform scattered field with respect to the picked-up signal;
the operation steps of the signal decomposition module are as follows:
first, a time domain signal x m (n) performing a short-time fourier transform to obtain a time-frequency domain representation:
where N is the frame length, n=512; w (n) is a Hamming window of length 512, l is a time frame number, and k is a frequency number; x is X m (l, k) is the spectrum of the mth microphone signal in the kth frequency band in the first frame;
next, for each frequency band k, a frequency domain original vector X (l, k) is constructed:
X(l,k)=[X 1 (l,k),X 2 (l,k),...,X M (l,k)] T
the operation steps of the separation filter calculation module are as follows:
first, a frame level separation guide factor is calculated:
wherein ,r1(l) and r2 (l) Respectively used for guiding the voice of the target party and the residual signal;
secondly, a separate steering matrix for each frequency band is calculated:
ψ 1 (k)=αψ 1 (k)+(1-α)r 1 (l)X(l,k)X H (l,k)
ψ 2 (k)=αψ 1 (k)+(1-αr 1 (l)X(l,k)X H (l,k)
wherein ,ψ1(k) and ψ2 (k) The guiding matrixes respectively represent the target party voice and the residual signals; alpha is a smoothing factor, and the value range is 0 to 1;
then, a new optimization function is constructed for the filter for separating the target voice from the residual signal, and the optimization function is as follows:
wherein ,G1(k) and G2 (k) Filters for separating target speech from residual signal
Finally, minimizing an optimization function to obtain an optimal filter;
the process of minimizing the optimization function is to solve the following equation:
Ψ(k)G(k)=ρ(k)
wherein ,
the optimal filter G (k) is solved as:
G(k)=Ψ -1 (k)ρ(k)。
4. The real-time voice separation device guided by directional information according to claim 3, wherein the target voice estimation module operates as follows:
first, the frequency-domain estimate of the target voice is obtained from the solved optimal filter;
then, the inverse Fourier transform is performed to obtain the time-domain estimate of the target voice.
CN202110963498.1A 2021-08-20 2021-08-20 Real-time voice separation method and device guided by directional information Active CN113628634B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110963498.1A CN113628634B (en) 2021-08-20 2021-08-20 Real-time voice separation method and device guided by directional information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110963498.1A CN113628634B (en) 2021-08-20 2021-08-20 Real-time voice separation method and device guided by directional information

Publications (2)

Publication Number Publication Date
CN113628634A CN113628634A (en) 2021-11-09
CN113628634B true CN113628634B (en) 2023-10-03

Family

ID=78386993

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110963498.1A Active CN113628634B (en) 2021-08-20 2021-08-20 Real-time voice separation method and device guided by directional information

Country Status (1)

Country Link
CN (1) CN113628634B (en)


Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2009151578A2 (en) * 2008-06-09 2009-12-17 The Board Of Trustees Of The University Of Illinois Method and apparatus for blind signal recovery in noisy, reverberant environments
US9237391B2 (en) * 2012-12-04 2016-01-12 Northwestern Polytechnical University Low noise differential microphone arrays
JP2014219467A (en) * 2013-05-02 2014-11-20 ソニー株式会社 Sound signal processing apparatus, sound signal processing method, and program
US11694707B2 (en) * 2015-03-18 2023-07-04 Industry-University Cooperation Foundation Sogang University Online target-speech extraction method based on auxiliary function for robust automatic speech recognition
US10991362B2 (en) * 2015-03-18 2021-04-27 Industry-University Cooperation Foundation Sogang University Online target-speech extraction method based on auxiliary function for robust automatic speech recognition

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015196729A1 (en) * 2014-06-27 2015-12-30 中兴通讯股份有限公司 Microphone array speech enhancement method and device
CN104866866A (en) * 2015-05-08 2015-08-26 太原理工大学 Improved natural gradient variable step-size blind source separation algorithm
GB201602382D0 (en) * 2016-02-10 2016-03-23 Cedar Audio Ltd Acoustic source seperation systems
JP2019054344A (en) * 2017-09-13 2019-04-04 日本電信電話株式会社 Filter coefficient calculation device, sound pickup device, method thereof, and program
CN108831495A (en) * 2018-06-04 2018-11-16 桂林电子科技大学 A kind of sound enhancement method applied to speech recognition under noise circumstance
CN110427968A (en) * 2019-06-28 2019-11-08 武汉大学 A kind of binocular solid matching process based on details enhancing
CN110706719A (en) * 2019-11-14 2020-01-17 北京远鉴信息技术有限公司 Voice extraction method and device, electronic equipment and storage medium
CN112037813A (en) * 2020-08-28 2020-12-04 南京大学 Voice extraction method for high-power target signal
CN112996019A (en) * 2021-03-01 2021-06-18 军事科学院系统工程研究院网络信息研究所 Terahertz frequency band distributed constellation access control method based on multi-objective optimization
CN113096684A (en) * 2021-06-07 2021-07-09 成都启英泰伦科技有限公司 Target voice extraction method based on double-microphone array

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
"Speech Separation Using Partially Asynchronous Microphone Arrays Without Resampling"; R. M. Corey and A. C. Singer; 2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC); 1-9 *
"A Blind Speech Separation Algorithm for Strongly Reverberant Environments"; Gu Fan, Wang Huigang, Li Huxiong; Signal Processing (《信号处理》); Vol. 27, No. 04; 534-540 *
Suleiman Erateb et al.; "Enhanced Online IVA with Switched Source Prior for Speech Separation"; 2020 IEEE 11th Sensor Array and Multichannel Signal Processing Workshop (SAM); 2020, 1-5 *
"Research on Underdetermined Blind Source Separation Algorithms Based on Dual Microphones"; 黄曼露; China Master's Theses Full-text Database; Chapter 2, Section 2.2 to Chapter 3, Section 3 *

Also Published As

Publication number Publication date
CN113628634A (en) 2021-11-09

Similar Documents

Publication Publication Date Title
CN108831495B (en) Speech enhancement method applied to speech recognition in noise environment
CN108986838B (en) Self-adaptive voice separation method based on sound source positioning
CN110503972B (en) Speech enhancement method, system, computer device and storage medium
CN107221336B (en) Device and method for enhancing target voice
US9100734B2 (en) Systems, methods, apparatus, and computer-readable media for far-field multi-source tracking and separation
KR20200066366A (en) Method and apparatus for target speech acquisition based on microphone array
CN110148420A Speech recognition method suitable for noisy environments
CN105244036A (en) Microphone speech enhancement method and microphone speech enhancement device
CN113903353B (en) Directional noise elimination method and device based on space distinguishing detection
CN109285557B (en) Directional pickup method and device and electronic equipment
CN107346664A Binaural speech separation method based on critical bands
CN106653044B (en) Dual microphone noise reduction system and method for tracking noise source and target sound source
CN107369460B (en) Voice enhancement device and method based on acoustic vector sensor space sharpening technology
CN111816200B (en) Multi-channel speech enhancement method based on time-frequency domain binary mask
CN106847301A Binaural speech separation method based on compressed sensing and attitude information
CN111681665A (en) Omnidirectional noise reduction method, equipment and storage medium
Li et al. Online Directional Speech Enhancement Using Geometrically Constrained Independent Vector Analysis.
CN110875054A (en) Far-field noise suppression method, device and system
CN113628634B (en) Real-time voice separation method and device guided by directional information
CN113744752A (en) Voice processing method and device
CN113050035B (en) Two-dimensional directional pickup method and device
CN110890099A (en) Sound signal processing method, device and storage medium
CN113948101B (en) Noise suppression method and device based on space distinguishing detection
CN112863525B (en) Method and device for estimating direction of arrival of voice and electronic equipment
CN111650559B (en) Real-time processing two-dimensional sound source positioning method

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant