CN113823316A

CN113823316A - Voice signal separation method for sound source close to position

Info

Publication number: CN113823316A
Application number: CN202111125927.4A
Authority: CN
Inventors: 廖乐乐; 卢晶; 陈锴
Original assignee: Nanjing University
Current assignee: Nanjing University
Priority date: 2021-09-26
Filing date: 2021-09-26
Publication date: 2021-12-21
Anticipated expiration: 2041-09-26
Also published as: CN113823316B

Abstract

The invention discloses a voice signal separation method aiming at a position close to a sound source. The method comprises the following steps: step 1, acquiring a mixed voice time-frequency domain signal to be processed; step 2, initializing a separation matrix of each frequency band; step 3, performing joint optimization on the separation matrixes of all frequency bands; step 4, conducting amplitude normalization on the separation matrix; step 5, estimating the separated time-frequency domain voice signal; and 6, recovering the time domain voice signal from the separated time-frequency domain voice signal. The method can help the separation algorithm to obtain better voice signal separation effect under the unfavorable condition that the sound source positions are close.

Description

Voice signal separation method for sound source close to position

Technical Field

The invention relates to the technical field of voice processing, in particular to a voice signal separation technology.

Background

The voice separation technology can separate original sound source signals from mixed signals of a plurality of sound sources, is an important task in the field of voice signal processing, and plays an important role in various application scenes such as intelligent home systems, video conference systems and voice recognition systems.

In a multi-channel speech signal processing scheme, Independent Vector Analysis (IVA) establishes associations of frequency components of source signals by a joint probability distribution model, thereby constructing an overall cost function. Auxiliary function based IVA (AuxiVA) and Independent low-rank matrix analysis (ILRMA) are considered to be the most advanced current methods of separating convolution mixed audio signals. The Auxiva algorithm utilizes the optimization skill of the majority-minimization (MM) to deduce Iterative Project (IP) iteration rules, and can quickly and stably optimize the separation matrix. The optimization of AuxIVA can also be combined with other more flexible signal models. The ILRMA is a signal model which combines an optimization strategy of AuxIVA and MNMF, and the cost after each iteration is ensured to be non-increased while the strong expression capability of the MNMF is utilized.

Ideally, the separation effect of IVA is independent of the sound source position, however, in practice, due to the presence of noise, the separation effect of the algorithm is significantly reduced when the sound source positions are close, which greatly limits the application of the separation algorithm in practice. How to improve the separation effect of the sound sources located close to each other is a technical problem of great concern.

Disclosure of Invention

In order to solve the above technical problem, the present invention provides a method for separating a speech signal located close to a sound source, which can significantly improve the separation effect of the speech signal.

The technical scheme adopted by the invention is as follows:

a method for separating a voice signal positioned close to a sound source, comprising the steps of:

step 1, acquiring a mixed voice time-frequency domain signal to be processed;

step 2, initializing a separation matrix of each frequency band for the mixed voice time-frequency domain signal;

step 3, performing joint optimization on the separation matrixes of all frequency bands to solve the sequencing uncertainty;

step 4, conducting amplitude normalization on the optimized separation matrix;

step 5, estimating a time-frequency domain voice signal according to the separation matrix processed in the step 4;

and 6, restoring the time domain voice signal from the time-frequency domain voice signal estimated in the step 5.

Further, the specific steps of step 1 are: and acquiring a time domain signal of the mixed voice to be processed by using a signal acquisition system, and performing short-time Fourier transform on the time domain signal to obtain a time-frequency domain signal of the mixed voice to be processed.

Further, in step 2, the unit matrix is used to initialize the separation matrix of each frequency band, the diagonal elements of the matrix are 1, and the remaining elements are 0.

Further, in step 3, the specific step of performing joint optimization on the separation matrices of all frequency bands is: (1) selecting a source signal distribution model to obtain a cost function; (2) selecting an optimization method for the cost function to obtain an update rule of the separation matrix; (3) and iterating the separation matrix by using the updating rule until convergence is achieved, and obtaining the separation matrix after each frequency band is optimized.

Further, in the step 4, the separation matrix is amplitude-normalized according to a minimum distortion criterion.

Further, the specific steps of step 5 are: and (4) multiplying the separation matrix obtained in the step (4) with the mixed voice time-frequency domain signal to be processed, and estimating the separated time-frequency domain voice signal.

Further, the specific steps of step 6 are: and 5, performing short-time Fourier inverse transformation on the time-frequency domain voice signal estimated in the step 5 to obtain a separated time-domain voice signal.

The invention aims at sound sources with close positions and realizes an improved voice signal separation method. The method has the advantages that the separation effect of the sound source position close to the scene is obviously improved, the block sorting problem of the IVA under certain conditions is relieved, and the separation effect of the sound source position far away from the scene is also improved.

Drawings

FIG. 1 is a flow chart of a speech signal separation method according to the present invention;

FIG. 2 is a schematic diagram of a sound source close-up scene to which the present invention is applicable;

fig. 3 is a graph comparing SDR lift values at different reverberation times for the original AuxIVA method, the improved AuxIVA method of the present invention, the original ILRMA method, and the improved ILRMA method of the present invention.

Fig. 4 is a graph comparing SIR increase values at different reverberation times for the original AuxIVA method, the improved AuxIVA method of the present invention, and the original ILRMA method and the improved ILRMA method of the present invention.

Detailed Description

The invention relates to a voice separation method aiming at a position close to a sound source, which mainly comprises the following parts:

1. signal acquisition

1) And mixing the pure source signal with the room impulse response convolution, and adding the diffusion noise to obtain a mixed signal.

2) Short-time Fourier transform of signals

If the mixed signal collected by the m microphone is x_m(t) converting the signal into a time-frequency domain by short-time Fourier transform, ignoring the time-frame index t, and expressing the signal of the k-th frequency band as

The signals collected by the M microphones form a mixed signal vector

The superscript T denotes the transpose operation.

2. Iterative algorithm

The nth source signal vector is denoted as s_nN is the source signal index and N is 1,2, …, N is the total number of source signals. The separation matrix is denoted by W, and the n-th row of the separation matrix is denoted by

Note that the superscript H denotes the conjugate transpose, the superscript K denotes the kth band, K is 1,2, …, and K is the total number of bands.

Representing the set of all band separation matrices, detW^kIs the determinant of the separation matrix in the k-th band. Source signal vector s_nThe corresponding estimated signal is denoted y_n，

A t-th frame representing an nth estimated signal in a k-th frequency band. The time-frame indicator is ignored,

for the purpose of separation, the estimated signals are made independent as much as possible, and mutual information is used as a measure of independence to construct a cost function.

1) If a Laplace source signal distribution model is selected, the mutual information cost function is properly modified to be suitable for a scene with a close sound source position, and the final cost function can be written in the following form:

wherein

Which represents the average of the sample samples,

is | | | y_n||²F represents the probability density distribution function of the source signal as a function of the argument. Adopting the optimization skill of the major-minimization (MM), constructing an auxiliary function:

wherein

Is an auxiliary variable. Order to

Optimal condition for obtaining solution

Where q is another source signal indicator. The iteration rule is then:

g' (. cndot.) represents the first derivative of G (-), e_nRepresenting a unit vector, the nth element is 1 and the remaining elements are 0. For Laplace distribution, G (| | y)_n||²)＝||y_n||²，G'(||y_n||²) 1. Initializing the separation matrix into a unit matrix, and then iterating until convergence according to the rules of the formulas (4) to (7) to obtain the optimized separation matrix.

2) If the MNMF is used as a source signal distribution model, cost functions of the IVA and the MNMF are fused, and the cost functions are properly modified to be suitable for a scene with a close sound source position, the final cost function can be written into the following form:

wherein ,t_kl,n and v_lt,nRespectively, the base and activation parameters of different sound sources, and l is an index of the base. Adopting the optimization skill of the major-minimization (MM), the following iteration rules are obtained:

wherein the model parameter t_kl,n and v_lt,nThe update rules of (1) are respectively:

wherein

Represents the mean of the samples and l' is a new indicator of the basis. Initializing the separation matrix into an identity matrix, and then iterating until convergence according to the rules of equations (9) - (14) to obtain an optimized separation matrix.

3. Amplitude normalization

In order to solve the uncertainty of the recovered signal amplitude, amplitude normalization needs to be performed on the separation matrix obtained after convergence. According to the MDP, the following processing is further carried out on the optimized separation matrix:

W^k←(W^k(W^k)^H)^-1/2W^k (15)

4. reconstructing a target signal

1) Estimating a time-frequency domain target signal

From the final separation matrix obtained by equation (15), the speech signal after each band separation can be estimated by the following equation:

y^k＝W^kx^k (16)

2) reconstructing a time-domain target signal

And finally, transforming the separated time-frequency domain voice signals to the time domain through short-time Fourier inverse transformation, and recovering the time-domain signals.

Examples

The technical scheme in the embodiment of the invention is clearly and completely described below with reference to the accompanying drawings.

1. Test sample and objective evaluation criteria

The clean speech signal in this embodiment is selected from the TIMIT data set (a speech signal of 10s length is formed by cutting and splicing) and has a sampling rate of 16 kHz. Room impulse responses were generated with an Image model (j.b. allen and d.a. berkley, "Image method for influencing small-room communications," j.acoust. soc. am., vol.65, pp.943-950,1979.), the room size was 7m × 5m × 2.75m, and the reverberation times were set to 0ms, 100ms, 300ms, 500ms, 700ms, respectively. As shown in fig. 2, in the present embodiment, 2 microphones are used to receive signals from 2 sound sources. The distance between the two microphones is 2.5cm, and the center of the two microphones is positioned at the position of [4,1,1.5] (m). The sound source and the microphone are positioned on the same horizontal plane, the two sound sources are respectively positioned at 45 degrees and 60 degrees, and the distance from the center of the array is 1 m. Clean speech signals are convolved with the room impulse response and 100 different segments of mixed signal are generated by adding a diffuse noise signal to noise ratio (SNR) of 30dB as per the method in the literature (e.a. habps and s.gannot, "Generating sensor signals in anisotropic noise fields," JASA, vol.122, No.6, pp.3464-3470,2007.). All algorithms are processed in the time-frequency domain, and the short-time Fourier transform adopts a 2048-point Hanning window and an overlap ratio of 3/4.

In the embodiment, a signal to deviation ratio (SDR) and a signal to interference ratio (SIR) are used as objective evaluation criteria, and an output SDR value (SDR _ out)/SIR value (SIR _ out) after algorithm processing is subtracted from an SDR value (SDR _ in)/SIR value (SIR _ in) of an input mixed signal to obtain an SDR boost value (sdramp)/SIR boost value (SIRimp) after algorithm processing, that is, sdramp is SDR _ out-SDR _ in and SIRimp is SIR _ out-SIR _ in.

2. Concrete implementation process of method

Referring to fig. 1, a time-domain mixed speech signal is input and subjected to short-time fourier transform to obtain a time-frequency spectrum, and a separation matrix of each frequency band is initialized to an identity matrix. In a modified AuxIVA algorithm (denoted as AuxIVA-imp), iterative optimization is performed using equations (4) - (7); in the modified ILRMA algorithm (denoted ILRMA-imp), iterative optimization is performed using equations (9) - (14). After iterative convergence, the final separation matrix W is obtained by amplitude warping with the formula (15)^kAnd substituting the estimated time-domain frequency spectrum into a formula (16) to obtain a separated voice time-domain frequency spectrum estimation, and finally performing short-time Fourier inverse transformation on the estimated voice time-domain frequency spectrum to obtain a separated time-domain voice signal.

In order to embody the performance of the method of the present invention, the present embodiment compares the original Auxiva algorithm (denoted as Auxiva-ori) (N.Ono, "Stable and fast update rules for independent vector analysis based on automatic function technology," in Proc.IEEE WASPAA, pp.189-192,2011 ") with the ILRMA algorithm (denoted as ILRMA-ori) (D.Kitamura, N.Ono, H.Sawada, H.Kameoka, and H.Sarwari," Determined blue resource partitioning analysis and non-networked function analysis, "IEEE transactions, speed, Lang.Process, vol.24, No.9, pp.1626-1641,2016) with the method of the present invention. FIG. 3 shows the results of an average SDRimp obtained from 100 tests at different reverberation times; figure 4 shows the results of the average SIRimp obtained from 100 tests at different reverberation times.

It can be found that in a scene with a close sound source position, compared with the original algorithm, the method of the present invention can perform separation more effectively under a noisy condition, and the advantage is more obvious in the case of medium-low reverberation.

Claims

1. A method for separating a speech signal located close to a sound source, the method comprising the steps of:

step 1, acquiring a mixed voice time-frequency domain signal to be processed;

step 4, conducting amplitude normalization on the optimized separation matrix;

2. The method for separating a voice signal of a location close to a sound source according to claim 1, wherein the step 1 comprises the following steps: and acquiring a time domain signal of the mixed voice to be processed by using a signal acquisition system, and performing short-time Fourier transform on the time domain signal to obtain a time-frequency domain signal of the mixed voice to be processed.

3. The method as claimed in claim 1, wherein the step 2 initializes the separation matrix of each frequency band by using an identity matrix, the diagonal element of the matrix is 1, and the remaining elements are 0.

4. The method as claimed in claim 1, wherein the step 3 of jointly optimizing the separation matrices of all frequency bands comprises the following steps:

(1) selecting a source signal distribution model to obtain a cost function;

(2) selecting an optimization method for the cost function to obtain an update rule of the separation matrix;

(3) and iterating the separation matrix by using the updating rule until convergence is achieved, and obtaining the separation matrix after each frequency band is optimized.

5. The method according to claim 4, wherein the Laplace distribution is selected as the source signal distribution model in step (1), and the cost function is:

wherein ,

representing the sample-sample average, G (-) is a scoring function determined by the source signal model; n is the source signal index and N is 1,2, …, N is the total number of source signals; k is a frequency index and K is 1,2, …, K being the total number of frequency bands;

representing the nth estimated signal, detW, in the k-th frequency band^kIs a determinant of a separation matrix within the k-th frequency band;

adopting a maj orientation-minimization optimization method for the cost function, and obtaining an update rule of the separation matrix as follows:

wherein

Represents a separation matrix W^kIn the nth row of (1), the superscript H denotes the conjugate transpose, x^kRepresenting the mixed signal vector in the k-th frequency band,

m represents the total number of microphones, G' (. cndot.) represents the first derivative of G (-), G (r)_n)＝r_n，G'(r_n)＝1；e_nRepresenting a unit vector, the nth element is 1 and the remaining elements are 0.

6. The method according to claim 4, wherein the multi-channel non-negative matrix decomposition is selected as the source signal model in step (1), and the cost function is:

where t is a time frame index, t_kl,n and v_lt,nRespectively, a base and an activation parameter of different sound sources, wherein l is an index of the base, N is a source signal index, and N is 1,2, …, N and N are the total number of source signals; k is a frequency index and K is 1,2, …, K being the total number of frequency bands;

t-th frame, detW, representing the n-th estimated signal in the k-th frequency band^kIs a determinant of a separation matrix within the k-th frequency band;

wherein t_kl,n and v_lt,nThe update rules of (1) are respectively:

wherein ,

represents the mean of the samples, e_nRepresenting a unit vector, the nth element is 1, the rest elements are 0, and l' is a new index of the base;

represents a separation matrix W^kAnd the superscript H on the nth row of (a) indicates the conjugate transpose.

7. The method according to claim 1, wherein in the step 4, the amplitude warping is performed on the separation matrix according to a minimum distortion criterion, and the specific steps are as follows:

W^k←(W^k(W^k)^H)^-1/2W^k

wherein K is a frequency index, K is 1,2, …, and K is the total number of frequency bands; w^kThe separation matrix representing the k-th band and the superscript H the conjugate transpose.

8. The method according to claim 7, wherein the step 5 comprises the following steps: separating the separation matrix W obtained in the step 4^kWith the mixed speech time-frequency domain signal x to be processed^kMultiplying to estimate the separated time-frequency domain speech signal y^k。

9. The method for separating a voice signal of a location close to a sound source according to claim 1, wherein the step 6 comprises the following steps: and 5, performing short-time Fourier inverse transformation on the time-frequency domain voice signal estimated in the step 5 to obtain a separated time-domain voice signal.