CN113823316B - Voice signal separation method for closely positioned sound sources - Google Patents

Voice signal separation method for closely positioned sound sources

Info

Publication number
CN113823316B
Authority
CN
China
Prior art keywords
signal
time
separation matrix
voice
separation
Prior art date
Legal status
Active
Application number
CN202111125927.4A
Other languages
Chinese (zh)
Other versions
CN113823316A (en
Inventor
廖乐乐
卢晶
陈锴
Current Assignee
Nanjing University
Original Assignee
Nanjing University
Priority date
Filing date
Publication date
Application filed by Nanjing University
Priority to CN202111125927.4A
Publication of CN113823316A
Application granted
Publication of CN113823316B
Legal status: Active

Links

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272 - Voice signal separating
    • G10L21/028 - Voice signal separating using properties of sound source
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 - Reducing energy consumption in communication networks
    • Y02D30/70 - Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The invention discloses a voice signal separation method for closely positioned sound sources. The method comprises the following steps: step 1, acquiring the mixed voice time-frequency domain signal to be processed; step 2, initializing the separation matrix of each frequency band; step 3, jointly optimizing the separation matrices of all frequency bands; step 4, performing magnitude normalization on the separation matrices; step 5, estimating the separated time-frequency domain voice signals; and step 6, recovering the time-domain voice signals from the separated time-frequency domain voice signals. The method helps the separation algorithm achieve a better voice signal separation effect under the unfavorable condition that the sound source positions are close to each other.

Description

Voice signal separation method for closely positioned sound sources
Technical Field
The invention relates to the technical field of voice processing, and in particular to voice signal separation technology.
Background
The voice separation technology separates the original source signals from a mixture of several sound sources. It is an important task in the field of voice signal processing and plays an important role in application scenarios such as smart-home systems, video-conference systems, and voice recognition systems.
Among multichannel speech signal processing methods, independent vector analysis (IVA) links the frequency components of each source signal through a joint probability distribution model and thereby constructs an overall cost function. Auxiliary-function-based IVA (AuxIVA) and independent low-rank matrix analysis (ILRMA) are considered the most advanced methods for separating convolutively mixed audio signals. The AuxIVA algorithm uses the majorization-minimization (MM) optimization technique to derive iterative projection (IP) update rules, which optimize the separation matrix quickly and stably. The optimization of AuxIVA can also be combined with other, more flexible signal models. ILRMA fuses the signal model of multichannel nonnegative matrix factorization (MNMF) with the optimization strategy of AuxIVA, exploiting the strong representation capability of MNMF while guaranteeing that the cost does not increase after each iteration.
In the ideal case, the separation performance of IVA is independent of the sound source positions. In practice, however, because of noise, the performance degrades significantly when the sound sources come close to each other, which greatly limits the practical application of separation algorithms. How to improve separation for closely positioned sound sources is therefore a technical problem worth addressing.
Disclosure of Invention
To solve this technical problem, the invention provides a voice signal separation method for closely positioned sound sources that can significantly improve the separation of voice signals.
The invention adopts the following technical scheme:
A voice signal separation method for closely positioned sound sources comprises the following steps:
step 1, acquiring a mixed voice time-frequency domain signal to be processed;
step 2, initializing a separation matrix of each frequency band for the mixed voice time-frequency domain signal;
step 3, jointly optimizing the separation matrices of all frequency bands to resolve the permutation ambiguity;
step 4, performing magnitude normalization on the optimized separation matrix;
step 5, estimating a time-frequency domain voice signal according to the separation matrix processed in the step 4;
and 6, recovering the time domain voice signal from the time-frequency domain voice signal estimated in the step 5.
Further, the specific steps of the step 1 are as follows: obtaining the time-domain signal of the mixed voice to be processed with a signal acquisition system, and performing a short-time Fourier transform on the time-domain signal to obtain the time-frequency domain signal of the mixed voice to be processed.
Further, in the step 2, the separation matrix of each frequency band is initialized by using an identity matrix, the diagonal element of the matrix is 1, and the rest elements are 0.
Further, in the step 3, the specific steps of performing joint optimization on the separation matrices of all frequency bands are as follows: (1) selecting a source signal distribution model to obtain a cost function; (2) selecting an optimization method for the cost function to obtain the update rule of the separation matrix; (3) iterating the separation matrix with the update rule until convergence, obtaining the optimized separation matrix of each frequency band.
Further, in the step 4, the separation matrix is subjected to magnitude normalization according to the minimum distortion principle.
Further, the specific steps of the step 5 are as follows: multiplying the separation matrix obtained in the step 4 with the mixed voice time-frequency domain signal to be processed, and estimating a separated time-frequency domain voice signal.
Further, the specific steps of the step 6 are as follows: performing an inverse short-time Fourier transform on the time-frequency domain voice signals estimated in step 5 to obtain the separated time-domain voice signals.
The invention realizes an improved voice signal separation method for closely positioned sound sources. The method significantly improves the separation effect when the sound sources are close to each other; meanwhile, it alleviates the block permutation problem of IVA under certain conditions and also improves separation when the sound sources are far apart.
Drawings
FIG. 1 is a flow chart of a method for separating speech signals according to the present invention;
FIG. 2 is a schematic diagram of a sound source approach scene to which the present invention is applicable;
FIG. 3 is a graph comparing the SDR improvement values at different reverberation times for the original AuxIVA method, the improved AuxIVA method of the present invention, the original ILRMA method, and the improved ILRMA method of the present invention;
FIG. 4 is a graph comparing the SIR improvement values at different reverberation times for the original AuxIVA method, the improved AuxIVA method of the present invention, the original ILRMA method, and the improved ILRMA method of the present invention.
Detailed Description
The invention is directed to a voice separation method for closely positioned sound sources, which mainly comprises the following parts:
1. signal acquisition
1) Convolve the clean source signals with the room impulse responses and add diffuse noise to obtain the mixed signals.
2) Perform a short-time Fourier transform on the signals.
Let the mixed signal acquired by the m-th microphone be $x_m(t)$. A short-time Fourier transform brings the signal into the time-frequency domain; ignoring the time-frame index $t$, the signal of the $k$-th frequency band is written $x_m^k$. The signals picked up by all $M$ microphones form the mixed signal vector $x^k = [x_1^k, x_2^k, \dots, x_M^k]^{\mathrm T}$, where the superscript T denotes the transpose operation.
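As a concrete illustration, the signal-acquisition step can be sketched with SciPy's STFT. The window length and overlap mirror the embodiment below (2048-point Hann window, 3/4 overlap); the function and variable names are illustrative, not part of the patent.

```python
import numpy as np
from scipy.signal import stft

def to_time_frequency(x_time, fs=16000, nfft=2048):
    """x_time: (M, T) array of microphone signals.
    Returns X: (K, L, M) complex array -- K frequency bands, L frames,
    M channels, so that X[k, t] is the mixed signal vector x^k of frame t."""
    _, _, X = stft(x_time, fs=fs, window="hann",
                   nperseg=nfft, noverlap=nfft * 3 // 4)
    # scipy returns shape (M, K, L); reorder channels to the last axis
    return np.transpose(X, (1, 2, 0))

# Example: a two-microphone mixture of one second of audio.
mix = np.random.randn(2, 16000)
X = to_time_frequency(mix)        # K = nfft // 2 + 1 = 1025 bands, 2 channels
```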
2. Iterative algorithm
The $n$-th source signal vector is denoted $s_n$, where $n$ is the source index, $n = 1, 2, \dots, N$, and $N$ is the total number of sources. The separation matrix of the $k$-th band is denoted $W^k$, whose $n$-th row is $(w_n^k)^{\mathrm H}$; the superscript H denotes the conjugate transpose, $k$ is the band index, $k = 1, 2, \dots, K$, and $K$ is the total number of bands. $W = \{W^1, \dots, W^K\}$ represents the set of separation matrices of all bands, and $\det W^k$ is the determinant of the separation matrix of the $k$-th band. The estimated signal corresponding to the source vector $s_n$ is denoted $y_n$, and $y_{t,n}^k$ denotes the $t$-th frame of the $n$-th estimated signal in the $k$-th band; ignoring the time-frame index, $y_n^k = (w_n^k)^{\mathrm H} x^k$. For separation, the estimated signals are made as mutually independent as possible, and the cost function is constructed using mutual information as the independence measure.
1) If the Laplace source signal distribution model is selected, the mutual-information cost function is suitably modified for the scene with closely positioned sound sources; the final cost function can be written as

$$J(W) = \sum_n E\big[G(\|y_n\|_2)\big] - \sum_k \log\big|\det W^k\big| \tag{1}$$

where $E[\cdot]$ denotes the sample average, $G(\cdot)$ is the contrast function taking $\|y_n\|_2$ as its argument, $G(\|y_n\|_2) = -\log f(y_n)$, and $f$ represents the probability density function of the source signal. Using the majorization-minimization (MM) optimization technique, an auxiliary function is constructed:

$$Q(W) = \sum_n \sum_k (w_n^k)^{\mathrm H} V_n^k w_n^k - \sum_k \log\big|\det W^k\big| + \text{const}, \qquad V_n^k = E\!\left[\frac{G'(r_n)}{2 r_n}\, x^k (x^k)^{\mathrm H}\right] \tag{2}$$

where $r_n$ is the auxiliary variable. Setting $\partial Q / \partial (w_n^k)^{*} = 0$ gives the optimality condition of the solution

$$(w_q^k)^{\mathrm H} V_n^k\, w_n^k = \delta_{qn} \tag{3}$$

where $q$ is another source index. The iteration rules are then

$$r_n = \sqrt{\sum_k \big|(w_n^k)^{\mathrm H} x^k\big|^2} \tag{4}$$
$$V_n^k = E\!\left[\frac{G'(r_n)}{2 r_n}\, x^k (x^k)^{\mathrm H}\right] \tag{5}$$
$$w_n^k \leftarrow (W^k V_n^k)^{-1} e_n \tag{6}$$
$$w_n^k \leftarrow w_n^k \Big/ \sqrt{(w_n^k)^{\mathrm H} V_n^k\, w_n^k} \tag{7}$$

$G'(\cdot)$ denotes the first derivative of $G(\cdot)$, and $e_n$ is the unit vector whose $n$-th element is 1 and whose remaining elements are 0. For the Laplace distribution, $G(\|y_n\|_2) = \|y_n\|_2$ and $G'(\|y_n\|_2) = 1$. The separation matrices are initialized to identity matrices and then iterated to convergence according to the rules of Eqs. (4)-(7), yielding the optimized separation matrix of each band.
2) If MNMF is selected as the source signal distribution model, the cost functions of IVA and MNMF are fused and suitably modified for the scene with closely positioned sound sources; the final cost function can be written as

$$J(W) = \sum_n \sum_k E_t\!\left[\frac{|y_{t,n}^k|^2}{\sum_l t_{kl,n} v_{lt,n}} + \log \sum_l t_{kl,n} v_{lt,n}\right] - 2 \sum_k \log\big|\det W^k\big| \tag{8}$$

where $t_{kl,n}$ and $v_{lt,n}$ are respectively the basis and activation parameters of the different sound sources, and $l$ is the basis index. Using the majorization-minimization (MM) optimization technique, the following iteration rules are obtained:

$$V_n^k = E_t\!\left[\frac{x_t^k (x_t^k)^{\mathrm H}}{\sum_l t_{kl,n} v_{lt,n}}\right] \tag{9}$$
$$w_n^k \leftarrow (W^k V_n^k)^{-1} e_n \tag{10}$$
$$w_n^k \leftarrow w_n^k \Big/ \sqrt{(w_n^k)^{\mathrm H} V_n^k\, w_n^k} \tag{11}$$
$$y_{t,n}^k = (w_n^k)^{\mathrm H} x_t^k \tag{12}$$

where the update rules of the model parameters $t_{kl,n}$ and $v_{lt,n}$ are, respectively,

$$t_{kl,n} \leftarrow t_{kl,n} \sqrt{\frac{\sum_t |y_{t,n}^k|^2 v_{lt,n} \big/ \big(\sum_{l'} t_{kl',n} v_{l't,n}\big)^2}{\sum_t v_{lt,n} \big/ \sum_{l'} t_{kl',n} v_{l't,n}}} \tag{13}$$
$$v_{lt,n} \leftarrow v_{lt,n} \sqrt{\frac{\sum_k |y_{t,n}^k|^2 t_{kl,n} \big/ \big(\sum_{l'} t_{kl',n} v_{l't,n}\big)^2}{\sum_k t_{kl,n} \big/ \sum_{l'} t_{kl',n} v_{l't,n}}} \tag{14}$$

where $E_t[\cdot]$ denotes the sample average over frames and $l'$ is another basis index. The separation matrices are initialized to identity matrices and then iterated to convergence according to the rules of Eqs. (9)-(14), yielding the optimized separation matrix of each band.
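The multiplicative updates of the basis and activation parameters can be sketched as follows. This follows the standard ILRMA parameter update; `P` plays the role of the separated power spectrogram |y^k_{t,n}|^2, and all names are illustrative assumptions. A useful sanity check is that one update never increases the Itakura-Saito-type cost sum(P/R + log R).

```python
import numpy as np

def update_nmf_params(P, T, V, eps=1e-12):
    """One multiplicative update of the NMF model parameters.
    P: (K, Tf, N) separated power spectrogram (nonnegative).
    T: (K, L, N) basis t_{kl,n};  V: (L, Tf, N) activations v_{lt,n}.
    The model variance is R[k, t, n] = sum_l T[k, l, n] * V[l, t, n]."""
    R = np.einsum("kln,ltn->ktn", T, V) + eps
    T = T * np.sqrt(np.einsum("ktn,ltn->kln", P / R**2, V)
                    / (np.einsum("ktn,ltn->kln", 1.0 / R, V) + eps))
    R = np.einsum("kln,ltn->ktn", T, V) + eps      # recompute with the new basis
    V = V * np.sqrt(np.einsum("ktn,kln->ltn", P / R**2, T)
                    / (np.einsum("ktn,kln->ltn", 1.0 / R, T) + eps))
    return T, V
```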
3. Magnitude normalization
To resolve the scale ambiguity of the recovered signals, the separation matrices obtained after convergence are magnitude-normalized. According to the minimum distortion principle (MDP), the optimized separation matrices are processed as follows:

$$W^k \leftarrow \big(W^k (W^k)^{\mathrm H}\big)^{-1/2} W^k \tag{15}$$
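Eq. (15) can be implemented per band with a Hermitian eigendecomposition. A convenient property to check: after this normalization every W^k satisfies W^k (W^k)^H = I. The function name is illustrative.

```python
import numpy as np

def normalize_magnitude(W):
    """Apply Eq. (15): W^k <- (W^k (W^k)^H)^(-1/2) W^k for every band.
    W: (K, N, M) stack of separation matrices (assumed nonsingular)."""
    out = np.empty_like(W)
    for k in range(W.shape[0]):
        A = W[k] @ W[k].conj().T               # Hermitian, positive definite
        lam, U = np.linalg.eigh(A)             # A = U diag(lam) U^H
        A_inv_sqrt = (U / np.sqrt(lam)) @ U.conj().T   # A^(-1/2)
        out[k] = A_inv_sqrt @ W[k]
    return out
```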
4. reconstructing a target signal
1) Estimating time-frequency domain target signals
With the final separation matrix obtained from Eq. (15), the separated speech signal of each frequency band can be estimated as

$$y^k = W^k x^k \tag{16}$$
2) Reconstructing a time domain target signal
Finally, the separated time-frequency domain voice signals are transformed back to the time domain by the inverse short-time Fourier transform, recovering the time-domain signals.
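The separation of Eq. (16) and the inverse transform can be sketched together. With identity separation matrices the pipeline reduces to an STFT/iSTFT round trip, which makes it easy to sanity-check; parameters mirror the embodiment (2048-point Hann window, 3/4 overlap), and the names are illustrative.

```python
import numpy as np
from scipy.signal import istft

def separate_and_reconstruct(X, W, fs=16000, nfft=2048):
    """X: (K, L, M) mixture STFT; W: (K, N, M) separation matrices
    (row n of W[k] holds (w_n^k)^H). Returns (N, samples) time signals."""
    Y = np.einsum("knm,klm->kln", W, X)      # Eq. (16): y^k = W^k x^k per frame
    # istft expects frequency on axis -2 and time on axis -1,
    # so move the source axis to the front
    _, y = istft(np.transpose(Y, (2, 0, 1)), fs=fs, window="hann",
                 nperseg=nfft, noverlap=nfft * 3 // 4)
    return y
```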
Examples
The technical solutions in the embodiments of the present invention are clearly and completely described below with reference to the accompanying drawings.
1. Test sample and objective evaluation criterion
The clean speech signals in this embodiment are selected from the TIMIT dataset (cut and spliced into segments of 10 s each) at a sampling rate of 16 kHz. The room impulse responses were generated with the image model (J. B. Allen and D. A. Berkley, "Image method for efficiently simulating small-room acoustics," J. Acoust. Soc. Am., vol. 65, pp. 943-950, 1979); the room size is 7 m x 5 m x 2.75 m, and the reverberation times were set to 0 ms, 100 ms, 300 ms, 500 ms, and 700 ms, respectively. As shown in Fig. 2, two microphones are used in this embodiment to receive signals from two sound sources. The distance between the two microphones is 2.5 cm, and the array center is at [4, 1, 1.5] (m). The sound sources and the microphones lie in the same horizontal plane; the two sources are located at 45° and 60°, respectively, at a distance of 1 m from the array center. The clean speech signals were convolutively mixed with the room impulse responses, and diffuse noise at a signal-to-noise ratio (SNR) of 30 dB was added as described in the literature (E. A. Habets and S. Gannot, "Generating sensor signals in isotropic noise fields," J. Acoust. Soc. Am., vol. 122, no. 6, pp. 3464-3470, 2007), yielding 100 different mixed signals. All algorithms operate in the time-frequency domain; the short-time Fourier transform uses a 2048-point Hann window with an overlap ratio of 3/4.
The present embodiment uses the signal-to-distortion ratio (SDR) and the signal-to-interference ratio (SIR) as objective evaluation criteria. The SDR/SIR improvement achieved by an algorithm is obtained by subtracting the SDR value (SDR_in) / SIR value (SIR_in) of the input mixed signal from the output SDR value (SDR_out) / SIR value (SIR_out) after processing, i.e., SDRimp = SDR_out - SDR_in and SIRimp = SIR_out - SIR_in.
2. Specific implementation flow of method
Referring to Fig. 1, the time-domain mixed speech signal is input and subjected to a short-time Fourier transform to obtain its time-frequency spectrum, and the separation matrix of each frequency band is initialized to an identity matrix. In the improved AuxIVA algorithm (denoted AuxIVA-imp), iterative optimization is carried out with Eqs. (4)-(7); in the improved ILRMA algorithm (denoted ILRMA-imp), iterative optimization is carried out with Eqs. (9)-(14). After the iteration converges, Eq. (15) is applied for magnitude normalization to obtain the final separation matrix W^k, which is substituted into Eq. (16) to obtain the separated speech time-frequency spectrum estimates; finally, an inverse short-time Fourier transform of the estimated spectra yields the separated time-domain speech signals.
To demonstrate the performance of the method of the present invention, this example compares the original AuxIVA algorithm (denoted AuxIVA-ori) (N. Ono, "Stable and fast update rules for independent vector analysis based on auxiliary function technique," in Proc. IEEE WASPAA, pp. 189-192, 2011) and the original ILRMA algorithm (denoted ILRMA-ori) (D. Kitamura, N. Ono, H. Sawada, H. Kameoka, and H. Saruwatari, "Determined blind source separation unifying independent vector analysis and nonnegative matrix factorization," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 24, no. 9, pp. 1626-1641, 2016) with the improved methods of the invention, AuxIVA-imp and ILRMA-imp. FIG. 3 shows the average SDRimp over 100 tests at different reverberation times; FIG. 4 shows the average SIRimp over 100 tests at different reverberation times.
It can be seen that, compared with the original algorithms, the method of the invention separates more effectively under noisy conditions when the sound sources are close, with a more obvious advantage at low and medium reverberation.

Claims (7)

1. A method of separating speech signals for closely positioned sound sources, the method comprising the steps of:
step 1, acquiring a mixed voice time-frequency domain signal to be processed;
step 2, initializing a separation matrix of each frequency band for the mixed voice time-frequency domain signal;
step 3, jointly optimizing the separation matrices of all frequency bands to resolve the permutation ambiguity; the specific steps are as follows:
step 31, selecting a source signal distribution model to obtain a cost function;
when the Laplace distribution is selected as the source signal distribution model, the cost function is:

$$J(W) = \sum_n E\big[G(\|y_n\|_2)\big] - \sum_k \log\big|\det W^k\big|$$

wherein $E[\cdot]$ represents the sample average and $G(\cdot)$ is a contrast function determined by the source signal model; $n$ is the source signal index, $n = 1, 2, \dots, N$, and $N$ is the total number of source signals; $k$ is the frequency index, $k = 1, 2, \dots, K$, and $K$ is the total number of frequency bands; $y_n^k$ represents the $n$-th estimated signal in the $k$-th frequency band, and $\det W^k$ is the determinant of the separation matrix of the $k$-th frequency band;
when multi-channel non-negative matrix factorization is selected as the source signal model, the cost function is:

$$J(W) = \sum_n \sum_k E_t\!\left[\frac{|y_{t,n}^k|^2}{\sum_l t_{kl,n} v_{lt,n}} + \log \sum_l t_{kl,n} v_{lt,n}\right] - 2 \sum_k \log\big|\det W^k\big|$$

wherein $t$ is the time-frame index, $t_{kl,n}$ and $v_{lt,n}$ are respectively the basis and activation parameters of the different sound sources, and $l$ is the basis index; $n$ is the source signal index, $n = 1, 2, \dots, N$, and $N$ is the total number of source signals; $k$ is the frequency index, $k = 1, 2, \dots, K$, and $K$ is the total number of frequency bands; $y_{t,n}^k$ represents the $t$-th frame of the $n$-th estimated signal in the $k$-th frequency band, and $\det W^k$ is the determinant of the separation matrix of the $k$-th frequency band;
step 32, applying the majorization-minimization optimization method to the cost function to obtain the update rule of the separation matrix;
step 33, iterating the separation matrix by using the updating rule until convergence to obtain a separation matrix after each frequency band is optimized;
step 4, performing magnitude normalization on the optimized separation matrix;
step 5, estimating a time-frequency domain voice signal according to the separation matrix processed in the step 4;
and 6, recovering the time domain voice signal from the time-frequency domain voice signal estimated in the step 5.
2. The method for separating speech signals according to claim 1, wherein the specific steps of step 1 are as follows: obtaining the time-domain signal of the mixed voice to be processed with a signal acquisition system, and performing a short-time Fourier transform on the time-domain signal to obtain the time-frequency domain signal of the mixed voice to be processed.
3. The method according to claim 1, wherein in the step 2, the separation matrix of each frequency band is initialized by using an identity matrix, the diagonal element of the matrix is 1, and the remaining elements are 0.
4. The method according to claim 1, wherein in step 32, when the Laplace distribution is selected as the source signal distribution model, the obtained update rule of the separation matrix is:

$$r_n = \sqrt{\sum_k \big|(w_n^k)^{\mathrm H} x^k\big|^2}, \qquad V_n^k = E\!\left[\frac{G'(r_n)}{2 r_n}\, x^k (x^k)^{\mathrm H}\right]$$
$$w_n^k \leftarrow (W^k V_n^k)^{-1} e_n, \qquad w_n^k \leftarrow w_n^k \Big/ \sqrt{(w_n^k)^{\mathrm H} V_n^k\, w_n^k}$$

wherein $(w_n^k)^{\mathrm H}$ represents the $n$-th row of the separation matrix $W^k$, the superscript H denotes the conjugate transpose, $x^k = [x_1^k, \dots, x_M^k]^{\mathrm T}$ represents the mixed signal vector in the $k$-th frequency band, and $M$ represents the total number of microphones; $G'(\cdot)$ represents the first derivative of $G(\cdot)$, with $G(r_n) = r_n$ and $G'(r_n) = 1$; $e_n$ represents the unit vector whose $n$-th element is 1 and whose remaining elements are 0;
when multi-channel non-negative matrix factorization is selected as the source signal model, the obtained update rule of the separation matrix is:

$$V_n^k = E_t\!\left[\frac{x_t^k (x_t^k)^{\mathrm H}}{\sum_l t_{kl,n} v_{lt,n}}\right], \qquad w_n^k \leftarrow (W^k V_n^k)^{-1} e_n, \qquad w_n^k \leftarrow w_n^k \Big/ \sqrt{(w_n^k)^{\mathrm H} V_n^k\, w_n^k}$$

wherein the update rules of $t_{kl,n}$ and $v_{lt,n}$ are, respectively:

$$t_{kl,n} \leftarrow t_{kl,n} \sqrt{\frac{\sum_t |y_{t,n}^k|^2 v_{lt,n} \big/ \big(\sum_{l'} t_{kl',n} v_{l't,n}\big)^2}{\sum_t v_{lt,n} \big/ \sum_{l'} t_{kl',n} v_{l't,n}}}, \qquad v_{lt,n} \leftarrow v_{lt,n} \sqrt{\frac{\sum_k |y_{t,n}^k|^2 t_{kl,n} \big/ \big(\sum_{l'} t_{kl',n} v_{l't,n}\big)^2}{\sum_k t_{kl,n} \big/ \sum_{l'} t_{kl',n} v_{l't,n}}}$$

wherein $E_t[\cdot]$ represents the sample average over frames, $e_n$ represents the unit vector whose $n$-th element is 1 and whose remaining elements are 0, and $l'$ is another basis index; $(w_n^k)^{\mathrm H}$ represents the $n$-th row of the separation matrix $W^k$, and the superscript H denotes the conjugate transpose.
5. The method for separating speech signals according to claim 1, wherein in step 4 the separation matrix is magnitude-normalized according to the minimum distortion principle, specifically:

$$W^k \leftarrow \big(W^k (W^k)^{\mathrm H}\big)^{-1/2} W^k$$

where $k$ is the frequency index, $k = 1, 2, \dots, K$, and $K$ is the total number of frequency bands; $W^k$ represents the separation matrix of the $k$-th frequency band, and the superscript H denotes the conjugate transpose.
6. The method for separating speech signals according to claim 5, wherein the specific steps of step 5 are as follows: multiplying the separation matrix $W^k$ obtained in step 4 with the mixed voice time-frequency domain signal $x^k$ to be processed to estimate the separated time-frequency domain voice signal $y^k$.
7. The method for separating a voice signal from a sound source according to claim 1, wherein the specific steps of step 6 are as follows: and (5) performing short-time Fourier inverse transformation on the time-frequency domain voice signals estimated in the step (5) to obtain separated time domain voice signals.
CN202111125927.4A 2021-09-26 2021-09-26 Voice signal separation method for sound source close to position Active CN113823316B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111125927.4A CN113823316B (en) 2021-09-26 2021-09-26 Voice signal separation method for sound source close to position


Publications (2)

Publication Number Publication Date
CN113823316A CN113823316A (en) 2021-12-21
CN113823316B true CN113823316B (en) 2023-09-12

Family

ID=78915482

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111125927.4A Active CN113823316B (en) 2021-09-26 2021-09-26 Voice signal separation method for sound source close to position

Country Status (1)

Country Link
CN (1) CN113823316B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114220453B (en) * 2022-01-12 2022-08-16 中国科学院声学研究所 Multi-channel non-negative matrix decomposition method and system based on frequency domain convolution transfer function
CN116866123B (en) * 2023-07-13 2024-04-30 中国人民解放军战略支援部队航天工程大学 Convolution blind separation method without orthogonal limitation

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104333523A (en) * 2014-10-14 2015-02-04 集美大学 NPCA-based post nonlinear blind source separation method
WO2016050725A1 (en) * 2014-09-30 2016-04-07 Thomson Licensing Method and apparatus for speech enhancement based on source separation
WO2016152511A1 (en) * 2015-03-23 2016-09-29 ソニー株式会社 Sound source separating device and method, and program
CN108597531A (en) * 2018-03-28 2018-09-28 南京大学 A method of improving binary channels Blind Signal Separation by more sound source activity detections
CN109584900A (en) * 2018-11-15 2019-04-05 昆明理工大学 A kind of blind source separation algorithm of signals and associated noises
CN110010148A (en) * 2019-03-19 2019-07-12 中国科学院声学研究所 A kind of blind separation method in frequency domain and system of low complex degree
CN111259327A (en) * 2020-01-15 2020-06-09 桂林电子科技大学 Subgraph processing-based optimization method for consistency problem of multi-agent system
CN112037813A (en) * 2020-08-28 2020-12-04 南京大学 Voice extraction method for high-power target signal
CN112185411A (en) * 2019-07-03 2021-01-05 南京人工智能高等研究院有限公司 Voice separation method, device, medium and electronic equipment
CN112820312A (en) * 2019-11-18 2021-05-18 北京声智科技有限公司 Voice separation method and device and electronic equipment


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Performance Based Cost Functions for End-to-End Speech Separation; Shrikant Venkataramani; 2018 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC); full text *

Also Published As

Publication number Publication date
CN113823316A (en) 2021-12-21

Similar Documents

Publication Publication Date Title
US9668066B1 (en) Blind source separation systems
US8874439B2 (en) Systems and methods for blind source signal separation
CN111133511B (en) sound source separation system
CN113823316B (en) Voice signal separation method for sound source close to position
US8848933B2 (en) Signal enhancement device, method thereof, program, and recording medium
CN106251877A (en) Voice Sounnd source direction method of estimation and device
CN106847301A (en) A kind of ears speech separating method based on compressed sensing and attitude information
CN109671447A (en) A kind of binary channels is deficient to determine Convolution Mixture Signals blind signals separation method
Nesta et al. Robust Automatic Speech Recognition through On-line Semi Blind Signal Extraction
Luo et al. Implicit filter-and-sum network for multi-channel speech separation
Li et al. An EM algorithm for audio source separation based on the convolutive transfer function
CN114283832B (en) Processing method and device for multichannel audio signal
CN112037813B (en) Voice extraction method for high-power target signal
CN112201276B (en) TC-ResNet network-based microphone array voice separation method
Dmour et al. A new framework for underdetermined speech extraction using mixture of beamformers
Aroudi et al. Cognitive-driven convolutional beamforming using EEG-based auditory attention decoding
JP6910609B2 (en) Signal analyzers, methods, and programs
Yoshioka et al. Dereverberation by using time-variant nature of speech production system
CN114863944B (en) Low-delay audio signal overdetermined blind source separation method and separation device
CN112820312A (en) Voice separation method and device and electronic equipment
CN114220453B (en) Multi-channel non-negative matrix decomposition method and system based on frequency domain convolution transfer function
Jafari et al. Sparse coding for convolutive blind audio source separation
Talagala et al. Binaural localization of speech sources in the median plane using cepstral HRTF extraction
CN113393850A (en) Parameterized auditory filter bank for end-to-end time domain sound source separation system
JP4714892B2 (en) High reverberation blind signal separation apparatus and method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant