CN102074236A - Speaker clustering method for distributed microphone - Google Patents


Info

Publication number
CN102074236A
CN102074236A (application CN2010105683868A; granted as CN102074236B)
Authority
CN
China
Prior art keywords
tau, frame, point, sigma, time delay
Prior art date
Legal status
Granted
Application number
CN2010105683868A
Other languages
Chinese (zh)
Other versions
CN102074236B (en)
Inventor
杨毅 (Yang Yi)
刘加 (Liu Jia)
Current Assignee
Beijing Huacong Zhijia Technology Co Ltd
Original Assignee
Tsinghua University
Priority date
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN2010105683868A
Publication of CN102074236A
Application granted; publication of CN102074236B
Legal status: Active

Landscapes

  • Circuit For Audible Band Transducer (AREA)

Abstract

The invention relates to a speaker clustering method for distributed microphones, comprising the following steps: first, preprocessing the signals collected by the distributed microphones; then applying a time-delay estimation method to the sound-source signal segments to obtain the corresponding time-delay estimate vectors; next, eliminating erroneous data and performing speaker segmentation; and finally performing speaker clustering according to the segmentation result. The distributed microphones serve as the signal acquisition and output devices for computing the time-delay vectors of the speech signal segments; time-delay estimation accuracy is improved by eliminating erroneous data, and a clustering algorithm applied to the time-delay vectors classifies the speech segments by speaker identity. The equipment is inexpensive and convenient to use, and the method can be applied to multi-person, multi-party dialogue scenes in complex acoustic environments.

Description

A speaker clustering method for distributed microphones
Technical field
The invention belongs to the field of speech technology, and in particular relates to a speaker clustering method for distributed microphones.
Background technology
With the continuous development of network and communication technology, existing multimedia, networking, and distributed-processing technologies make it possible to hold multi-person, multi-party dialogues in complex acoustic environments. Traditional sound-acquisition and recording devices include head-mounted microphones, omnidirectional and directional single microphones, and microphone arrays. A single microphone has the advantages of small size and low cost, but it can neither suppress ambient noise nor localize sound sources. A microphone array consists of multiple microphones placed at specific geometric positions and performs joint time-space processing of the spatial signal; its capabilities include sound-source identification and separation, sound-source localization under reverberant conditions, and speech-signal enhancement.
A distributed microphone system is a sound-acquisition system composed of multiple single microphones, each controlled by a different device, with no restriction on microphone placement or spacing; the signals collected by the microphones are not fully synchronized in the time domain. Distributed microphones are simple in structure, easy to use, and cost-saving; they meet the requirements of multi-source, multi-direction complex dialogue scenes and can effectively accomplish applications such as speaker clustering, identification, and localization. Unlike a microphone-array system, a distributed microphone system places no constraint or restriction on the positions or arrangement of the microphones; moreover, the positions of both the sound sources and the microphones are unknown.
Automatic classification of acoustic information is one of the research topics of speech signal processing, of which speaker segmentation and speaker clustering are important components. The usual approach is: speaker segmentation divides the whole test utterance into a series of speech segments, each belonging to only one particular speaker; speaker clustering then groups the scattered segments belonging to the same speaker into one class.
Traditional speaker segmentation methods are mostly based on sliding-window statistics over Gaussian models, select among different distance measures, and obtain change points by merging based on the Bayesian information criterion. Speaker clustering can adopt an evolutive hidden Markov model (EHMM) method, updating the segmentation result by weighing path scores. When the number of speakers is not limited, hierarchical clustering can be used for speaker clustering.
Speaker clustering with microphone arrays mainly exploits differences in speaker spatial position. The basic principle is: use the time-delay estimate vector as the speaker's spatial feature, and integrate and classify these features in a GMM/HMM (Gaussian mixture model / hidden Markov model) framework. Time-delay estimation algorithms for microphone arrays mainly include the GCC (generalized cross-correlation) method and the LMS (least mean square) method. GCC is strongly affected by reverberation, which led to improved GCC variants such as the CEP (cepstral pre-filtering) method and pitch-weighted GCC; EVD (eigenvalue decomposition) and ATF (acoustic transfer function)-based delay estimation methods solve the problem using subspace and transfer-function techniques, respectively. However, microphone-array computation is sensitive to sampling errors between devices and therefore imposes strict requirements on audio-data synchronization. Moreover, in a typical multi-person, multi-party conference scene the number of sound sources, the microphone positions, and the room acoustics are all unknown; that is, the audio data must be processed in a scene lacking both temporal and spatial prior information.
As a traditional sound-acquisition and recording device, the single microphone is cheap and simple in structure, but it is susceptible to environmental interference and cannot localize sound sources; the conventional microphone-array system has been widely studied, but has not been commercialized mainly because the dedicated hardware is expensive and the algorithms are complex.
Summary of the invention
To overcome the shortcomings of the above prior art, the object of the present invention is to propose a speaker clustering method for distributed microphones, which uses distributed microphones as the signal acquisition and output devices, computes the time-delay vectors of speech signal segments, improves time-delay estimation accuracy by eliminating erroneous data, and applies a clustering algorithm to the time-delay vectors to classify the speech segments by speaker identity. The equipment is inexpensive and convenient to use, and the method can be applied to multi-person, multi-party dialogue scenes in complex acoustic environments.
A speaker clustering method for distributed microphones comprises the following steps:
Step 1: preprocess the signals collected by the distributed microphones
First, the multichannel sound-source signals obtained by the distributed microphones are preprocessed: the signals are divided into frames and transformed by the fast Fourier transform (FFT), and then endpoint detection is performed to divide the signals into sound-source and non-sound-source classes. The purpose of endpoint detection is to distinguish speech from non-speech in the digital audio signal. The endpoint detection can adopt a sub-band spectral entropy algorithm: first divide the spectrum of each speech frame into n sub-bands (n is an integer greater than zero) and compute the spectral entropy of each sub-band; then pass the sub-band spectral entropies of n successive frames through a group of order-statistics filters to obtain the spectral entropy of each frame, and classify the input speech according to the value of the spectral entropy. The concrete steps are: apply the FFT to each speech frame to obtain the N_FFT points Y_i (0 ≤ i ≤ N_FFT) of its power spectrum; the probability density of each point in the spectral domain is given by formula (1):

p_i = Y_i / \sum_{k=0}^{N_{FFT}-1} Y_k    (1)

where Y_k is the k-th point of the FFT power spectrum of the speech signal, Y_i is the i-th point, N_FFT is the number of spectral points, and p_i is the probability density of the i-th point in the spectral domain.
The entropy function of the corresponding signal in the spectral domain is defined by formula (2):

H = -\sum_{k=0}^{N_{FFT}-1} p_k \log(p_k)    (2)

where p_k is the probability density of the k-th point in the spectral domain, N_FFT is the number of spectral points, and H is the entropy function in the spectral domain.
Divide the N_FFT points of the frequency domain into K non-overlapping frequency ranges, called sub-bands, and compute the probability of each point in the spectral domain of frame l as in formula (3):

p_l[k, i] = (Y_i + Q) / \sum_{j=m_k}^{m_{k+1}-1} (Y_j + Q)    (3)

where Y_j is the j-th point of the FFT power spectrum of the speech signal, Y_i is a point in the k-th sub-band, m_k (0 ≤ k ≤ K-1, m_k ≤ i ≤ m_{k+1}-1) is the lower limit of the sub-band, Q is a constant, and p_l[k, i] is the probability of each point in the spectral domain of frame l.
According to the definition of information entropy, the spectral entropy of the k-th sub-band of frame l is given by formula (4):

E_s[l, k] = \sum_{i=m_k}^{m_{k+1}-1} p_l[k, i] \log(p_l[k, i])    (0 ≤ k ≤ K-1)    (4)

where p_l[k, i] is the probability of each point in the spectral domain of frame l, and E_s[l, k] is the spectral entropy of the k-th sub-band of frame l.
The spectral information entropy of frame l is computed according to formula (5):

H_l = -\frac{1}{K} \sum_{k=0}^{K-1} E_h[l, k]    (5)

where E_h[l, k] is the smoothed spectral entropy of the k-th sub-band of frame l, K is the number of sub-bands, and H_l is the spectral information entropy of frame l. The information entropy of the k-th sub-band of frame l after filtering and smoothing is defined as in formula (6):

E_h[l, k] = (1 - λ) E_{s(h)}[l, k] + λ E_{s(h+1)}[l, k]    (0 ≤ k ≤ K-1)    (6)

where E_{s(h)}[l, k] is obtained as follows: the order-statistics filter of each sub-band acts on a group of sub-band information entropies of length L, E_s[l-N, k], …, E_s[l, k], …, E_s[l+N, k]; this group is sorted in ascending order, and E_{s(h)}[l, k] is the h-th largest value among E_s[l-N, k], …, E_s[l+N, k]; λ is a constant, and E_h[l, k] is the information entropy of the k-th sub-band of frame l after the filtering and smoothing.
By formula (5), each frame has a spectral entropy H_l; when the value of H_l is greater than a preset threshold T, frame l is judged to be a speech frame, otherwise a non-speech frame. The threshold is defined as T = β·Avg + θ, where β = 0.01, θ = 0.1, E_m[k] is the median of E_s[0, k], …, E_s[N-1, k], and Avg is the noise estimate over the first N frames of the input signal.
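As an illustration, a minimal Python sketch of this sub-band spectral-entropy endpoint detector follows. The FFT size, sub-band count K, regularizer Q, smoothing constant λ, threshold parameters, and the simplified two-frame smoother standing in for the order-statistics filter are all assumptions of the sketch, not values fixed by the patent.

import numpy as np

def subband_spectral_entropy(frame, n_fft=512, K=8, Q=1e-3):
    # Sub-band spectral entropies E_s[l, k] of one frame (formulas (3)-(4)).
    Y = np.abs(np.fft.rfft(frame, n_fft)) ** 2          # power spectrum points Y_i
    edges = np.linspace(0, len(Y), K + 1, dtype=int)    # sub-band limits m_k
    ent = np.empty(K)
    for k in range(K):
        band = Y[edges[k]:edges[k + 1]] + Q             # constant Q regularizes silence
        p = band / band.sum()                           # formula (3)
        ent[k] = np.sum(p * np.log(p))                  # formula (4)
    return ent

def detect_speech(frames, lam=0.3, beta=0.01, theta=0.1, n_noise=10):
    # Speech/non-speech decision per frame via smoothed entropy H_l (formulas (5)-(6)).
    E_s = np.array([subband_spectral_entropy(f) for f in frames])
    E_h = (1 - lam) * E_s[:-1] + lam * E_s[1:]          # two-term smoother standing in
                                                        # for the order-statistics filter
    H = -E_h.mean(axis=1)                               # formula (5)
    avg = H[:n_noise].mean()                            # noise estimate from first frames
    T = beta * avg + theta                              # threshold T = beta*Avg + theta
    return H > T                                        # True = speech frame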
Step 2: apply the time-delay estimation method to the sound-source signal segments to obtain the corresponding time-delay estimate vectors
First determine the spatial coordinates. The concrete method is: number the microphones in order as M1, M2, …, Mn (n is an integer greater than 1); select the two microphones initially numbered 1 and 2, M1 and M2; take the position of M1 as the coordinate origin and the direction from M1 to M2 as the coordinate axis starting at the origin. Then treat every 50 frames of speech signal as one speech segment, and use the time-delay estimation method to estimate the delay difference between every pair of microphones for each segment, obtaining n(n-1) time-delay estimates, as in formula (7):

\tau_k = [\hat{\tau}_{12}, \hat{\tau}_{13}, \ldots, \hat{\tau}_{ij}]^T    (7)

where \hat{\tau}_{ij} is the estimated delay difference between the i-th and j-th microphones, and \tau_k is the delay-difference estimate vector.
The time-delay estimation adopts the PHAT (phase transform) weighting algorithm; its weighting coefficient is given by formula (8), and the delay estimation by formulas (9)-(10):

W(\omega) = \frac{1}{|X_1(\omega) X_2^*(\omega)|}    (8)

where X_1(ω) and X_2(ω) are the FFT outputs of the two time-domain signals, and * denotes complex conjugation,

R_{x_1 x_2}(n) = \mathrm{IFFT}\big(W(\omega) \cdot X_1(\omega) \cdot X_2^*(\omega)\big)    (9)

\hat{\tau} = \arg\max_n R_{x_1 x_2}(n)    (10)

where R_{x_1 x_2}(n) is the generalized cross-correlation function of the two signals, and \hat{\tau} is the estimated time delay between x_1 and x_2.
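A minimal GCC-PHAT sketch in Python is given below; the sampling rate fs, the small regularizer added to the denominator, and the optional delay bound max_tau are assumptions of the sketch.

import numpy as np

def gcc_phat(x1, x2, fs, max_tau=None):
    # Delay of x2 relative to x1 with PHAT weighting (formulas (8)-(10)).
    n = len(x1) + len(x2)                      # zero-pad to avoid circular wrap-around
    X1 = np.fft.rfft(x1, n)
    X2 = np.fft.rfft(x2, n)
    cross = X1 * np.conj(X2)                   # X1(w) * X2*(w)
    cross /= np.abs(cross) + 1e-12             # PHAT weight W(w), formula (8)
    r = np.fft.irfft(cross, n)                 # GCC function, formula (9)
    max_shift = n // 2 if max_tau is None else min(int(fs * max_tau), n // 2)
    r = np.concatenate((r[-max_shift:], r[:max_shift + 1]))
    shift = np.argmax(np.abs(r)) - max_shift   # argmax over lags, formula (10)
    return shift / fs                          # delay in seconds

For one segment of 50 frames, such a routine would be applied to every microphone pair (i, j) to fill one delay vector \tau_k as in formula (7).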
Step 3: eliminate erroneous data and perform speaker segmentation
First, invalid data must be removed; the delay is computed by formula (11):

\tau[n] = \begin{cases} \hat{\tau}[n-1], & \mathrm{SNR} < Thr_{SNR} \\ \hat{\tau}[n], & \mathrm{SNR} \ge Thr_{SNR} \end{cases}    (11)

where n is the index of a frame, τ[n] is the delay assigned to that frame, and \hat{\tau}[n] is the delay estimated for that frame. When the signal-to-noise ratio at some moment is below the threshold Thr_SNR, the previous moment's delay estimate is used as the current estimate. The delay is then further computed by formula (12):

\tau[n] = \begin{cases} \hat{\tau}[n-1], & \hat{\tau}[n] < Thr \\ \hat{\tau}[n], & \hat{\tau}[n] \ge Thr \end{cases}    (12)

where n is the index of a frame, τ[n] is the delay assigned to that frame, and \hat{\tau}[n] is the delay estimated for that frame; when the delay estimate at some moment is below the threshold Thr, the previous moment's estimate is used as the current estimate.
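A small sketch of this rejection rule follows, applied per microphone pair and assuming per-frame SNR values are available; thr_snr and thr_tau are assumed threshold values.

import numpy as np

def clean_delays(tau_hat, snr, thr_snr=5.0, thr_tau=0.0):
    # Formulas (11)-(12): fall back to the previous frame's delay whenever the
    # current frame's SNR or delay estimate falls below its threshold.
    tau = np.array(tau_hat, dtype=float)
    for n in range(1, len(tau)):
        if snr[n] < thr_snr or tau[n] < thr_tau:
            tau[n] = tau[n - 1]                # reuse the last reliable estimate
    return tau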
Then the speakers at different spatial positions are segmented. First compute the posterior probability β_i(τ_k) as in formula (13):

\beta_i(\tau_k) = \frac{\alpha_i\, g(\tau_k; \mu_i, \sigma_i^2)}{\alpha_1\, g(\tau_k; \mu_1, \sigma_1^2) + \alpha_2\, g(\tau_k; \mu_2, \sigma_2^2) + \cdots + \alpha_i\, g(\tau_k; \mu_i, \sigma_i^2)}    (13)

where μ_i and σ_i² are the model parameters, α_i = 1/i with i the number of GMM components, and the initial values of μ_i and σ_i² are computed with the K-means algorithm; τ_k is the time-delay estimate vector computed by formula (7), and β_i(τ_k) is the posterior probability.
Formula (14) is the parameter update algorithm:

\hat{\mu}_i = \frac{\sum_{k=1}^{n} \beta_i(\tau_k)\, \tau_k}{\sum_{k=1}^{n} \beta_i(\tau_k)}, \quad \hat{\sigma}_i^2 = \frac{1}{d} \cdot \frac{\sum_{k=1}^{n} \beta_i(\tau_k)\, (\tau_k - \mu_i)^T (\tau_k - \mu_i)}{\sum_{k=1}^{n} \beta_i(\tau_k)}, \quad \hat{\alpha}_i = \frac{1}{n} \sum_{k=1}^{n} \beta_i(\tau_k)    (14)

where \hat{\mu}_i, \hat{\sigma}_i^2, and \hat{\alpha}_i are the estimates of the GMM model parameters, and β_i(τ_k) is the posterior probability computed by formula (13). The parameter updates stop when the change in the parameters is smaller than min, where min is a constant representing the minimum tolerance.
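For illustration, a compact EM sketch of formulas (13)-(14) over the delay vectors follows; the random initialization (in place of the K-means initialization the patent specifies), the component count, and the tolerance tol standing in for min are assumptions of the sketch.

import numpy as np

def gmm_em(taus, n_components, tol=1e-4, max_iter=100, seed=0):
    # taus: (n, d) array of delay vectors tau_k; spherical Gaussian components.
    n, d = taus.shape
    rng = np.random.default_rng(seed)
    mu = taus[rng.choice(n, n_components, replace=False)].copy()
    var = np.full(n_components, taus.var() + 1e-6)
    alpha = np.full(n_components, 1.0 / n_components)        # alpha_i = 1/i
    for _ in range(max_iter):
        # E-step: posteriors beta_i(tau_k), formula (13)
        d2 = ((taus[:, None, :] - mu[None]) ** 2).sum(axis=-1)   # squared distances
        g = np.exp(-0.5 * d2 / var) / (2 * np.pi * var) ** (d / 2)
        beta = alpha * g
        beta /= beta.sum(axis=1, keepdims=True)
        # M-step: formula (14)
        w = beta.sum(axis=0)
        mu_new = (beta.T @ taus) / w[:, None]
        var = (beta * d2).sum(axis=0) / (d * w)              # sigma_i^2 update
        alpha = w / n
        done = np.abs(mu_new - mu).max() < tol               # stop when change < "min"
        mu = mu_new
        if done:
            break
    return mu, var, alpha, beta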
Step 4: perform speaker clustering according to the speaker segmentation result
A K-means-based algorithm is used to cluster the segmented speech segments. First compute the territory density of each set; the point of maximum density is taken as the first initial point, the next initial point is the point at maximum distance from the first initial point, and so on, until the required number of initial points is reached.
Next compute the distance of each sample point to the set centers and update the center values, selecting the sample point satisfying formula (15) as the new set center:

Func = \sum_{j=1}^{J} \sum_{n=1}^{M} \| \hat{\tau}[n] - \tau_j \|^2    (15)

where \| \hat{\tau}[n] - \tau_j \| is the distance between the time-delay estimate vector \hat{\tau}[n] of each speech segment and the cluster center τ_j, τ_j is the center vector, J is the number of speakers, and M is the number of microphones.
Finally, the speech segments of the speakers in different spatial positions are classified and labeled according to the distance between the set center vectors and the segment vectors.
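A minimal sketch of this density-seeded K-means in Python is shown below; the density radius r, the farthest-point rule used for the later seeds, and the plain Lloyd updates that decrease Func of formula (15) are assumptions of the sketch.

import numpy as np

def density_kmeans(taus, n_speakers, r=1.0, max_iter=50):
    # taus: (n, d) delay vectors of the segmented speech pieces.
    dist = np.linalg.norm(taus[:, None] - taus[None], axis=-1)
    density = (dist < r).sum(axis=1)              # "territory density" of each point
    centers = [taus[np.argmax(density)]]          # densest point seeds the first center
    while len(centers) < n_speakers:              # remaining seeds: farthest points
        far = np.min([np.linalg.norm(taus - c, axis=1) for c in centers], axis=0)
        centers.append(taus[np.argmax(far)])
    centers = np.array(centers)
    for _ in range(max_iter):                     # Lloyd updates decrease Func (15)
        labels = np.argmin(
            np.linalg.norm(taus[:, None] - centers[None], axis=-1), axis=1)
        for j in range(n_speakers):
            if np.any(labels == j):
                centers[j] = taus[labels == j].mean(axis=0)
    return labels, centers

Each segment finally receives the label of its nearest center, which classifies the segments by the speaker's spatial position.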
The present invention has the following advantages:
(1) The distributed asynchronous acoustic sensors proposed by the invention place no strict restriction on sensor positions and impose low requirements on signal synchronization, so the method is more flexible and widely applicable than microphone arrays.
(2) The invention makes full use of the multiple delay differences between microphones, and between sound sources and microphones, for information fusion, performing speaker segmentation from the time-delay estimate vectors; this reduces the complexity of traditional speaker segmentation algorithms while increasing robustness.
(3) The invention makes full use of the spatial-domain advantages of distributed microphones and clusters the time-delay estimate vectors of single-speaker speech segments, reducing the complexity of traditional speaker clustering algorithms.
(4) The speaker clustering method for distributed microphones can be applied to many multi-person, multi-party dialogue scenes, is robust, and adapts to a variety of acoustic environments.
The present invention can be implemented on current palmtop computers, personal digital assistants (PDAs), or mobile phones, so its range of application is very wide.
Description of drawings
Fig. 1 is a flow diagram of the present invention.
Fig. 2 is a flow diagram of the endpoint detection of the present invention.
Fig. 3 is a schematic diagram of the sound-source time-delay estimation of the present invention.
Fig. 4 is a flow diagram of the speaker segmentation and clustering of the present invention.
Embodiment
The present invention is described in detail below in conjunction with the accompanying drawings.
With reference to Fig. 1, a speaker clustering method for distributed microphones comprises the following steps:
Step 1: preprocess the signals collected by the distributed microphones
With reference to Fig. 2, the multichannel sound-source signals obtained by the distributed microphones are first preprocessed: the signals are divided into frames and transformed by the fast Fourier transform (FFT), and then endpoint detection is performed to divide the signals into sound-source and non-sound-source classes. The purpose of endpoint detection is to distinguish speech from non-speech in the digital audio signal. Early methods based on energy and zero-crossing rate can distinguish speech from noise accurately, but real speech is usually polluted by considerable ambient noise, so the endpoint detection here can adopt a sub-band spectral entropy algorithm: first divide the spectrum of each speech frame into n sub-bands (n is an integer greater than zero) and compute the spectral entropy of each sub-band; then pass the sub-band spectral entropies of n successive frames through a group of order-statistics filters to obtain the spectral entropy of each frame, and classify the input speech according to the value of the spectral entropy. The concrete steps are: apply the FFT to each speech frame to obtain the N_FFT points Y_i (0 ≤ i ≤ N_FFT) of its power spectrum; the probability density of each point in the spectral domain is given by formula (1):

p_i = Y_i / \sum_{k=0}^{N_{FFT}-1} Y_k    (1)

where Y_k is the k-th point of the FFT power spectrum of the speech signal, Y_i is the i-th point, N_FFT is the number of spectral points, and p_i is the probability density of the i-th point in the spectral domain.
The entropy function of the corresponding signal in the spectral domain is defined by formula (2):

H = -\sum_{k=0}^{N_{FFT}-1} p_k \log(p_k)    (2)

where p_k is the probability density of the k-th point in the spectral domain, N_FFT is the number of spectral points, and H is the entropy function in the spectral domain.
Divide the N_FFT points of the frequency domain into K non-overlapping frequency ranges, called sub-bands, and compute the probability of each point in the spectral domain of frame l as in formula (3):

p_l[k, i] = (Y_i + Q) / \sum_{j=m_k}^{m_{k+1}-1} (Y_j + Q)    (3)

where Y_j is the j-th point of the FFT power spectrum of the speech signal, Y_i is a point in the k-th sub-band, m_k (0 ≤ k ≤ K-1, m_k ≤ i ≤ m_{k+1}-1) is the lower limit of the sub-band, Q is a constant, and p_l[k, i] is the probability of each point in the spectral domain of frame l.
According to the definition of information entropy, the spectral entropy of the k-th sub-band of frame l is given by formula (4):

E_s[l, k] = \sum_{i=m_k}^{m_{k+1}-1} p_l[k, i] \log(p_l[k, i])    (0 ≤ k ≤ K-1)    (4)

where p_l[k, i] is the probability of each point in the spectral domain of frame l, and E_s[l, k] is the spectral entropy of the k-th sub-band of frame l.
The spectral information entropy of frame l is computed according to formula (5):

H_l = -\frac{1}{K} \sum_{k=0}^{K-1} E_h[l, k]    (5)

where E_h[l, k] is the smoothed spectral entropy of the k-th sub-band of frame l, K is the number of sub-bands, and H_l is the spectral information entropy of frame l. The information entropy of the k-th sub-band of frame l after filtering and smoothing is defined as in formula (6):

E_h[l, k] = (1 - λ) E_{s(h)}[l, k] + λ E_{s(h+1)}[l, k]    (0 ≤ k ≤ K-1)    (6)

where E_{s(h)}[l, k] is obtained as follows: the order-statistics filter of each sub-band acts on a group of sub-band information entropies of length L, E_s[l-N, k], …, E_s[l, k], …, E_s[l+N, k]; this group is sorted in ascending order, and E_{s(h)}[l, k] is the h-th largest value among E_s[l-N, k], …, E_s[l+N, k]; λ is a constant, and E_h[l, k] is the information entropy of the k-th sub-band of frame l after the filtering and smoothing.
By formula (5), each frame has a spectral entropy H_l; when the value of H_l is greater than a preset threshold T, frame l is judged to be a speech frame, otherwise a non-speech frame. The threshold is defined as T = β·Avg + θ, where β = 0.01, θ = 0.1, E_m[k] is the median of E_s[0, k], …, E_s[N-1, k], and Avg is the noise estimate over the first N frames of the input signal.
Step 2: apply the time-delay estimation method to the sound-source signal segments to obtain the corresponding time-delay estimate vectors
With reference to Fig. 3, first determine the spatial coordinates. The concrete method is: number the microphones in order as M1, M2, …, Mn (n is an integer greater than 1); select the two microphones initially numbered 1 and 2, M1 and M2; take the position of M1 as the coordinate origin and the direction from M1 to M2 as the coordinate axis starting at the origin. Then treat every 50 frames of speech signal as one speech segment, and use the time-delay estimation method to estimate the delay difference between every pair of microphones for each segment, obtaining n(n-1) time-delay estimates, as in formula (7):

\tau_k = [\hat{\tau}_{12}, \hat{\tau}_{13}, \ldots, \hat{\tau}_{ij}]^T    (7)

where \hat{\tau}_{ij} is the estimated delay difference between the i-th and j-th microphones, and \tau_k is the delay-difference estimate vector.
The time-delay estimation adopts the PHAT (phase transform) weighting algorithm; its weighting coefficient is given by formula (8), and the delay estimation by formulas (9)-(10):

W(\omega) = \frac{1}{|X_1(\omega) X_2^*(\omega)|}    (8)

where X_1(ω) and X_2(ω) are the FFT outputs of the two time-domain signals, and * denotes complex conjugation,

R_{x_1 x_2}(n) = \mathrm{IFFT}\big(W(\omega) \cdot X_1(\omega) \cdot X_2^*(\omega)\big)    (9)

where IFFT is the inverse FFT and R_{x_1 x_2}(n) is the generalized cross-correlation function of the two signals,

\hat{\tau} = \arg\max_n R_{x_1 x_2}(n)    (10)

where \hat{\tau} is the estimated time delay between x_1 and x_2.
Step 3: eliminate erroneous data and perform speaker segmentation
With reference to Fig. 4, invalid data must first be removed; the delay is computed by formula (11):

\tau[n] = \begin{cases} \hat{\tau}[n-1], & \mathrm{SNR} < Thr_{SNR} \\ \hat{\tau}[n], & \mathrm{SNR} \ge Thr_{SNR} \end{cases}    (11)

where n is the index of a frame, τ[n] is the delay assigned to that frame, and \hat{\tau}[n] is the delay estimated for that frame. When the signal-to-noise ratio at some moment is below the threshold Thr_SNR, the previous moment's delay estimate is used as the current estimate. The delay is then further computed by formula (12):

\tau[n] = \begin{cases} \hat{\tau}[n-1], & \hat{\tau}[n] < Thr \\ \hat{\tau}[n], & \hat{\tau}[n] \ge Thr \end{cases}    (12)

where n is the index of a frame, τ[n] is the delay assigned to that frame, and \hat{\tau}[n] is the delay estimated for that frame; when the delay estimate at some moment is below the threshold Thr, the previous moment's estimate is used as the current estimate.
Then the speakers at different spatial positions are segmented. First compute the posterior probability β_i(τ_k) as in formula (13):

\beta_i(\tau_k) = \frac{\alpha_i\, g(\tau_k; \mu_i, \sigma_i^2)}{\alpha_1\, g(\tau_k; \mu_1, \sigma_1^2) + \alpha_2\, g(\tau_k; \mu_2, \sigma_2^2) + \cdots + \alpha_i\, g(\tau_k; \mu_i, \sigma_i^2)}    (13)

where μ_i and σ_i² are the model parameters, α_i = 1/i with i the number of GMM components, and the initial values of μ_i and σ_i² are computed with the K-means algorithm; τ_k is the time-delay estimate vector computed by formula (7), and β_i(τ_k) is the posterior probability.
Formula (14) is the parameter update algorithm:

\hat{\mu}_i = \frac{\sum_{k=1}^{n} \beta_i(\tau_k)\, \tau_k}{\sum_{k=1}^{n} \beta_i(\tau_k)}, \quad \hat{\sigma}_i^2 = \frac{1}{d} \cdot \frac{\sum_{k=1}^{n} \beta_i(\tau_k)\, (\tau_k - \mu_i)^T (\tau_k - \mu_i)}{\sum_{k=1}^{n} \beta_i(\tau_k)}, \quad \hat{\alpha}_i = \frac{1}{n} \sum_{k=1}^{n} \beta_i(\tau_k)    (14)

where \hat{\mu}_i, \hat{\sigma}_i^2, and \hat{\alpha}_i are the estimates of the GMM model parameters, and β_i(τ_k) is the posterior probability computed by formula (13). The parameter updates stop when the change in the parameters is smaller than min, where min is a constant representing the minimum tolerance.
Step 4: perform speaker clustering according to the speaker segmentation result
A K-means-based algorithm is used to cluster the segmented speech segments; this algorithm overcomes the defect that the performance of standard K-means is strongly affected by initial values and outliers.
First compute the territory density of each set; the point of maximum density is taken as the first initial point, the next initial point is the point at maximum distance from the first initial point, and so on, until the required number of initial points is reached.
Next compute the distance of each sample point to the set centers and update the center values, selecting the sample point satisfying formula (15) as the new set center:

Func = \sum_{j=1}^{J} \sum_{n=1}^{M} \| \hat{\tau}[n] - \tau_j \|^2    (15)

where \| \hat{\tau}[n] - \tau_j \| is the distance between the time-delay estimate vector \hat{\tau}[n] of each speech segment and the cluster center τ_j, τ_j is the center vector, J is the number of speakers, and M is the number of microphones.
Finally, the speech segments of the speakers in different spatial positions are classified and labeled according to the distance between the set center vectors and the segment vectors.
In the accompanying drawings, the position vectors of one sound source and of another sound source are shown, together with the position vectors of the single microphones M_i, M_k, and M_j.

Claims (1)

1. A speaker clustering method for distributed microphones, characterized in that it comprises the following steps:
Step 1: preprocess the signals collected by the distributed microphones
First, the multichannel sound-source signals obtained by the distributed microphones are preprocessed: the signals are divided into frames and transformed by the fast Fourier transform (FFT), and then endpoint detection is performed to divide the signals into sound-source and non-sound-source classes. The purpose of endpoint detection is to distinguish speech from non-speech in the digital audio signal. The endpoint detection can adopt a sub-band spectral entropy algorithm: first divide the spectrum of each speech frame into n sub-bands (n is an integer greater than zero) and compute the spectral entropy of each sub-band; then pass the sub-band spectral entropies of n successive frames through a group of order-statistics filters to obtain the spectral entropy of each frame, and classify the input speech according to the value of the spectral entropy. The concrete steps are: apply the FFT to each speech frame to obtain the N_FFT points Y_i (0 ≤ i ≤ N_FFT) of its power spectrum; the probability density of each point in the spectral domain is given by formula (1):

p_i = Y_i / \sum_{k=0}^{N_{FFT}-1} Y_k    (1)

where Y_k is the k-th point of the FFT power spectrum of the speech signal, Y_i is the i-th point, N_FFT is the number of spectral points, and p_i is the probability density of the i-th point in the spectral domain.
The entropy function of the corresponding signal in the spectral domain is defined by formula (2):

H = -\sum_{k=0}^{N_{FFT}-1} p_k \log(p_k)    (2)

where p_k is the probability density of the k-th point in the spectral domain, N_FFT is the number of spectral points, and H is the entropy function in the spectral domain.
Divide the N_FFT points of the frequency domain into K non-overlapping frequency ranges, called sub-bands, and compute the probability of each point in the spectral domain of frame l as in formula (3):

p_l[k, i] = (Y_i + Q) / \sum_{j=m_k}^{m_{k+1}-1} (Y_j + Q)    (3)

where Y_j is the j-th point of the FFT power spectrum of the speech signal, Y_i is a point in the k-th sub-band, m_k (0 ≤ k ≤ K-1, m_k ≤ i ≤ m_{k+1}-1) is the lower limit of the sub-band, Q is a constant, and p_l[k, i] is the probability of each point in the spectral domain of frame l.
According to the definition of information entropy, the spectral entropy of the k-th sub-band of frame l is given by formula (4):

E_s[l, k] = \sum_{i=m_k}^{m_{k+1}-1} p_l[k, i] \log(p_l[k, i])    (0 ≤ k ≤ K-1)    (4)

where p_l[k, i] is the probability of each point in the spectral domain of frame l, and E_s[l, k] is the spectral entropy of the k-th sub-band of frame l.
The spectral information entropy of frame l is computed according to formula (5):

H_l = -\frac{1}{K} \sum_{k=0}^{K-1} E_h[l, k]    (5)

where E_h[l, k] is the smoothed spectral entropy of the k-th sub-band of frame l, K is the number of sub-bands, and H_l is the spectral information entropy of frame l. The information entropy of the k-th sub-band of frame l after filtering and smoothing is defined as in formula (6):

E_h[l, k] = (1 - λ) E_{s(h)}[l, k] + λ E_{s(h+1)}[l, k]    (0 ≤ k ≤ K-1)    (6)

where E_{s(h)}[l, k] is obtained as follows: the order-statistics filter of each sub-band acts on a group of sub-band information entropies of length L, E_s[l-N, k], …, E_s[l, k], …, E_s[l+N, k]; this group is sorted in ascending order, and E_{s(h)}[l, k] is the h-th largest value among E_s[l-N, k], …, E_s[l+N, k]; λ is a constant, and E_h[l, k] is the information entropy of the k-th sub-band of frame l after the filtering and smoothing.
By formula (5), each frame has a spectral entropy H_l; when the value of H_l is greater than a preset threshold T, frame l is judged to be a speech frame, otherwise a non-speech frame. The threshold is defined as T = β·Avg + θ, where β = 0.01, θ = 0.1, E_m[k] is the median of E_s[0, k], …, E_s[N-1, k], and Avg is the noise estimate over the first N frames of the input signal.
Step 2: apply the time-delay estimation method to the sound-source signal segments to obtain the corresponding time-delay estimate vectors
First determine the spatial coordinates. The concrete method is: number the microphones in order as M1, M2, …, Mn (n is an integer greater than 1); select the two microphones initially numbered 1 and 2, M1 and M2; take the position of M1 as the coordinate origin and the direction from M1 to M2 as the coordinate axis starting at the origin. Then treat every 50 frames of speech signal as one speech segment, and use the time-delay estimation method to estimate the delay difference between every pair of microphones for each segment, obtaining n(n-1) time-delay estimates, as in formula (7):

\tau_k = [\hat{\tau}_{12}, \hat{\tau}_{13}, \ldots, \hat{\tau}_{ij}]^T    (7)

where \hat{\tau}_{ij} is the estimated delay difference between the i-th and j-th microphones, and \tau_k is the delay-difference estimate vector.
The time-delay estimation adopts the PHAT (phase transform) weighting algorithm; its weighting coefficient is given by formula (8), and the delay estimation by formulas (9)-(10):

W(\omega) = \frac{1}{|X_1(\omega) X_2^*(\omega)|}    (8)

where X_1(ω) and X_2(ω) are the FFT outputs of the two time-domain signals, and * denotes complex conjugation,

R_{x_1 x_2}(n) = \mathrm{IFFT}\big(W(\omega) \cdot X_1(\omega) \cdot X_2^*(\omega)\big)    (9)

\hat{\tau} = \arg\max_n R_{x_1 x_2}(n)    (10)

where R_{x_1 x_2}(n) is the generalized cross-correlation function of the two signals, and \hat{\tau} is the estimated time delay between x_1 and x_2.
Step 3: eliminate erroneous data and perform speaker segmentation
First, invalid data must be removed; the delay is computed by formula (11):

\tau[n] = \begin{cases} \hat{\tau}[n-1], & \mathrm{SNR} < Thr_{SNR} \\ \hat{\tau}[n], & \mathrm{SNR} \ge Thr_{SNR} \end{cases}    (11)

where n is the index of a frame, τ[n] is the delay assigned to that frame, and \hat{\tau}[n] is the delay estimated for that frame. When the signal-to-noise ratio at some moment is below the threshold Thr_SNR, the previous moment's delay estimate is used as the current estimate. The delay is then further computed by formula (12):

\tau[n] = \begin{cases} \hat{\tau}[n-1], & \hat{\tau}[n] < Thr \\ \hat{\tau}[n], & \hat{\tau}[n] \ge Thr \end{cases}    (12)

where n is the index of a frame, τ[n] is the delay assigned to that frame, and \hat{\tau}[n] is the delay estimated for that frame; when the delay estimate at some moment is below the threshold Thr, the previous moment's estimate is used as the current estimate.
Then the speakers at different spatial positions are segmented. First compute the posterior probability β_i(τ_k) as in formula (13):

\beta_i(\tau_k) = \frac{\alpha_i\, g(\tau_k; \mu_i, \sigma_i^2)}{\alpha_1\, g(\tau_k; \mu_1, \sigma_1^2) + \alpha_2\, g(\tau_k; \mu_2, \sigma_2^2) + \cdots + \alpha_i\, g(\tau_k; \mu_i, \sigma_i^2)}    (13)

where μ_i and σ_i² are the model parameters, α_i = 1/i with i the number of GMM components, and the initial values of μ_i and σ_i² are computed with the K-means algorithm; τ_k is the time-delay estimate vector computed by formula (7), and β_i(τ_k) is the posterior probability.
Formula (14) is the parameter update algorithm:

\hat{\mu}_i = \frac{\sum_{k=1}^{n} \beta_i(\tau_k)\, \tau_k}{\sum_{k=1}^{n} \beta_i(\tau_k)}, \quad \hat{\sigma}_i^2 = \frac{1}{d} \cdot \frac{\sum_{k=1}^{n} \beta_i(\tau_k)\, (\tau_k - \mu_i)^T (\tau_k - \mu_i)}{\sum_{k=1}^{n} \beta_i(\tau_k)}, \quad \hat{\alpha}_i = \frac{1}{n} \sum_{k=1}^{n} \beta_i(\tau_k)    (14)

where \hat{\mu}_i, \hat{\sigma}_i^2, and \hat{\alpha}_i are the estimates of the GMM model parameters, and β_i(τ_k) is the posterior probability computed by formula (13). The parameter updates stop when the change in the parameters is smaller than min, where min is a constant representing the minimum tolerance.
Step 4: perform speaker clustering according to the speaker segmentation result
A K-means-based algorithm is used to cluster the segmented speech segments. First compute the territory density of each set; the point of maximum density is taken as the first initial point, the next initial point is the point at maximum distance from the first initial point, and so on, until the required number of initial points is reached.
Next compute the distance of each sample point to the set centers and update the center values, selecting the sample point satisfying formula (15) as the new set center:

Func = \sum_{j=1}^{J} \sum_{n=1}^{M} \| \hat{\tau}[n] - \tau_j \|^2    (15)

where \| \hat{\tau}[n] - \tau_j \| is the distance between the time-delay estimate vector \hat{\tau}[n] of each speech segment and the cluster center τ_j, τ_j is the center vector, J is the number of speakers, and M is the number of microphones.
Finally, the speech segments of the speakers in different spatial positions are classified and labeled according to the distance between the set center vectors and the segment vectors.
CN2010105683868A 2010-11-29 2010-11-29 Speaker clustering method for distributed microphone Active CN102074236B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2010105683868A CN102074236B (en) 2010-11-29 2010-11-29 Speaker clustering method for distributed microphone


Publications (2)

Publication Number Publication Date
CN102074236A 2011-05-25
CN102074236B CN102074236B (en) 2012-06-06

Family

ID=44032754

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2010105683868A Active CN102074236B (en) 2010-11-29 2010-11-29 Speaker clustering method for distributed microphone

Country Status (1)

Country Link
CN (1) CN102074236B (en)



Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH02209027A (en) * 1989-02-09 1990-08-20 Fujitsu Ltd Acoustic echo canceller
JPH1097276A (en) * 1996-09-20 1998-04-14 Canon Inc Method and device for speech recognition, and storage medium
CN101452704A (en) * 2007-11-29 2009-06-10 中国科学院声学研究所 Speaker clustering method based on information transfer

Cited By (69)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013020380A1 (en) * 2011-08-10 2013-02-14 歌尔声学股份有限公司 Communication headset speech enhancement method and device, and noise reduction communication headset
US9484042B2 (en) 2011-08-10 2016-11-01 Goertek Inc. Speech enhancing method, device for communication earphone and noise reducing communication earphone
KR101353686B1 (en) 2011-08-10 2014-01-20 고어텍 인크 Communication headset speech enhancement method and device, and noise reduction communication headset
CN102509548B (en) * 2011-10-09 2013-06-12 清华大学 Audio indexing method based on multi-distance sound sensor
CN102509548A (en) * 2011-10-09 2012-06-20 清华大学 Audio indexing method based on multi-distance sound sensor
US9685161B2 (en) 2012-07-09 2017-06-20 Huawei Device Co., Ltd. Method for updating voiceprint feature model and terminal
CN102760434A (en) * 2012-07-09 2012-10-31 华为终端有限公司 Method for updating voiceprint feature model and terminal
CN105580076A (en) * 2013-03-12 2016-05-11 谷歌技术控股有限责任公司 Delivery of medical devices
CN103175897A (en) * 2013-03-13 2013-06-26 西南交通大学 High-speed turnout damage recognition method based on vibration signal endpoint detection
CN103400580A (en) * 2013-07-23 2013-11-20 华南理工大学 Method for estimating importance degree of speaker in multiuser session voice
CN104347068A (en) * 2013-08-08 2015-02-11 索尼公司 Audio signal processing device, audio signal processing method and monitoring system
CN103439688A (en) * 2013-08-27 2013-12-11 大连理工大学 Sound source positioning system and method used for distributed microphone arrays
CN104575498A (en) * 2015-01-30 2015-04-29 深圳市云之讯网络技术有限公司 Recognition method and system of effective speeches
CN104575498B (en) * 2015-01-30 2018-08-17 深圳市云之讯网络技术有限公司 Efficient voice recognition methods and system
CN104767739B (en) * 2015-03-23 2018-01-30 电子科技大学 The method that unknown multi-protocols blended data frame is separated into single protocol data frame
CN104767739A (en) * 2015-03-23 2015-07-08 电子科技大学 Method for separating unknown multi-protocol mixed data frames into single protocol data frames
CN104766093A (en) * 2015-04-01 2015-07-08 中国科学院上海微系统与信息技术研究所 Sound target sorting method based on microphone array
CN104766093B (en) * 2015-04-01 2018-02-16 中国科学院上海微系统与信息技术研究所 A kind of acoustic target sorting technique based on microphone array
CN105161093B (en) * 2015-10-14 2019-07-09 科大讯飞股份有限公司 A kind of method and system judging speaker's number
CN105161093A (en) * 2015-10-14 2015-12-16 科大讯飞股份有限公司 Method and system for determining the number of speakers
CN105388459A (en) * 2015-11-20 2016-03-09 清华大学 Robustness sound source space positioning method of distributed microphone array network
CN105388459B (en) * 2015-11-20 2017-08-11 清华大学 The robust sound source space-location method of distributed microphone array network
CN106887231A (en) * 2015-12-16 2017-06-23 芋头科技(杭州)有限公司 A kind of identification model update method and system and intelligent terminal
CN106981289A (en) * 2016-01-14 2017-07-25 芋头科技(杭州)有限公司 A kind of identification model training method and system and intelligent terminal
CN105869645A (en) * 2016-03-25 2016-08-17 腾讯科技(深圳)有限公司 Voice data processing method and device
CN105869645B (en) * 2016-03-25 2019-04-12 腾讯科技(深圳)有限公司 Voice data processing method and device
CN109155130A (en) * 2016-05-13 2019-01-04 伯斯有限公司 Handle the voice from distributed microphone
CN109313910B (en) * 2016-05-19 2023-08-29 微软技术许可有限责任公司 Permutation invariant training for speaker independent multi-speaker speech separation
CN109313910A (en) * 2016-05-19 2019-02-05 微软技术许可有限责任公司 The constant training of displacement of the more speaker speech separation unrelated for talker
CN106405499A (en) * 2016-09-08 2017-02-15 南京阿凡达机器人科技有限公司 Method for robot to position sound source
CN107886951A (en) * 2016-09-29 2018-04-06 百度在线网络技术(北京)有限公司 A kind of speech detection method, device and equipment
CN106504773A (en) * 2016-11-08 2017-03-15 上海贝生医疗设备有限公司 A kind of wearable device and voice and activities monitoring system
CN106940997B (en) * 2017-03-20 2020-04-28 海信集团有限公司 Method and device for sending voice signal to voice recognition system
CN106940997A (en) * 2017-03-20 2017-07-11 海信集团有限公司 A kind of method and apparatus that voice signal is sent to speech recognition system
CN107202976A (en) * 2017-05-15 2017-09-26 大连理工大学 The distributed microphone array sound source localization system of low complex degree
US11950079B2 (en) 2017-06-29 2024-04-02 Huawei Technologies Co., Ltd. Delay estimation method and apparatus
US11304019B2 (en) 2017-06-29 2022-04-12 Huawei Technologies Co., Ltd. Delay estimation method and apparatus
CN107393549A (en) * 2017-07-21 2017-11-24 北京华捷艾米科技有限公司 Delay time estimation method and device
CN107885323A (en) * 2017-09-21 2018-04-06 南京邮电大学 A kind of VR scenes based on machine learning immerse control method
CN108364637A (en) * 2018-02-01 2018-08-03 福州大学 A kind of audio sentence boundary detection method
CN108364637B (en) * 2018-02-01 2021-07-13 福州大学 Audio sentence boundary detection method
CN108665894A (en) * 2018-04-06 2018-10-16 东莞市华睿电子科技有限公司 A kind of voice interactive method of household appliance
CN108872939A (en) * 2018-04-29 2018-11-23 桂林电子科技大学 Interior space geometric profile reconstructing method based on acoustics mirror image model
CN108872939B (en) * 2018-04-29 2020-09-29 桂林电子科技大学 Indoor space geometric outline reconstruction method based on acoustic mirror image model
CN109087648A (en) * 2018-08-21 2018-12-25 平安科技(深圳)有限公司 Sales counter voice monitoring method, device, computer equipment and storage medium
CN109087648B (en) * 2018-08-21 2023-10-20 平安科技(深圳)有限公司 Counter voice monitoring method and device, computer equipment and storage medium
CN109658948A (en) * 2018-12-21 2019-04-19 南京理工大学 One kind is towards the movable acoustic monitoring method of migratory bird moving
CN109618273A (en) * 2018-12-29 2019-04-12 北京声智科技有限公司 The device and method of microphone quality inspection
CN110021302A (en) * 2019-03-06 2019-07-16 厦门快商通信息咨询有限公司 A kind of Intelligent office conference system and minutes method
CN110290468A (en) * 2019-07-04 2019-09-27 英华达(上海)科技有限公司 Virtual sound insulation communication means, device, system, electronic equipment, storage medium
CN110290468B (en) * 2019-07-04 2020-09-22 英华达(上海)科技有限公司 Virtual sound insulation communication method, device, system, electronic device and storage medium
CN110428842A (en) * 2019-08-13 2019-11-08 广州国音智能科技有限公司 Speech model training method, device, equipment and computer readable storage medium
CN110501674A (en) * 2019-08-20 2019-11-26 长安大学 A kind of acoustical signal non line of sight recognition methods based on semi-supervised learning
CN111063341A (en) * 2019-12-31 2020-04-24 苏州思必驰信息科技有限公司 Method and system for segmenting and clustering multi-person voice in complex environment
CN112581941A (en) * 2020-11-17 2021-03-30 北京百度网讯科技有限公司 Audio recognition method and device, electronic equipment and storage medium
CN112735385A (en) * 2020-12-30 2021-04-30 科大讯飞股份有限公司 Voice endpoint detection method and device, computer equipment and storage medium
CN112735385B (en) * 2020-12-30 2024-05-31 中国科学技术大学 Voice endpoint detection method, device, computer equipment and storage medium
CN112684412B (en) * 2021-01-12 2022-09-13 中北大学 Sound source positioning method and system based on pattern clustering
CN112684437B (en) * 2021-01-12 2023-08-11 浙江大学 Passive ranging method based on time domain warping transformation
CN112684412A (en) * 2021-01-12 2021-04-20 中北大学 Sound source positioning method and system based on pattern clustering
CN112684437A (en) * 2021-01-12 2021-04-20 浙江大学 Passive distance measurement method based on time domain warping transformation
CN113096669B (en) * 2021-03-31 2022-05-27 重庆风云际会智慧科技有限公司 Speech recognition system based on role recognition
CN113096669A (en) * 2021-03-31 2021-07-09 重庆风云际会智慧科技有限公司 Voice recognition system based on role recognition
CN113178196B (en) * 2021-04-20 2023-02-07 平安国际融资租赁有限公司 Audio data extraction method and device, computer equipment and storage medium
CN113178196A (en) * 2021-04-20 2021-07-27 平安国际融资租赁有限公司 Audio data extraction method and device, computer equipment and storage medium
CN113573212A (en) * 2021-06-04 2021-10-29 成都千立智能科技有限公司 Sound amplification system and microphone channel data selection method
CN113380234A (en) * 2021-08-12 2021-09-10 明品云(北京)数据科技有限公司 Method, device, equipment and medium for generating form based on voice recognition
CN113808612A (en) * 2021-11-18 2021-12-17 阿里巴巴达摩院(杭州)科技有限公司 Voice processing method, device and storage medium
CN116030815A (en) * 2023-03-30 2023-04-28 北京建筑大学 Voice segmentation clustering method and device based on sound source position

Also Published As

Publication number Publication date
CN102074236B (en) 2012-06-06


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20181115

Address after: 100085 Beijing Haidian District Shangdi Information Industry Base Pioneer Road 1 B Block 2 Floor 2030

Patentee after: Beijing Huacong Zhijia Technology Co., Ltd.

Address before: 100084 Beijing 100084 box 82 box, Tsinghua University Patent Office

Patentee before: Tsinghua University