CN104700842B

CN104700842B - Method and device for estimating the time delay of sound signals

Info

Publication number: CN104700842B
Application number: CN201510083890.1A
Authority: CN
Inventors: 陈超
Original assignee: Guangzhou Baiguoyuan Information Technology Co Ltd
Current assignee: Guangzhou Baiguoyuan Information Technology Co Ltd
Priority date: 2015-02-13
Filing date: 2015-02-13
Publication date: 2018-05-08
Anticipated expiration: 2035-02-13
Also published as: CN104700842A

Abstract

The invention discloses a method and device for estimating the time delay of sound signals, belonging to the field of audio signal processing. The method includes: obtaining two sound signals; performing correlation matching on the two sound signals according to their short-time Fourier transforms to obtain a first matching result, which includes a first matching position and a first matching degree of the two sound signals; performing correlation matching on the two sound signals according to the inter-spectral correlation of their power spectra to obtain a second matching result, which includes a second matching position and a second matching degree of the two sound signals; and calculating the time delay between the two sound signals according to the first matching result and the second matching result. This solves the problem of low accuracy in existing time delay estimation methods: the sound signals are matched from two angles, frequency-domain distribution and inter-spectral correlation of the power spectrum, and the two matching results are combined to determine the final matching result, which improves matching precision and the accuracy of time delay estimation.

Description

Method and device for estimating the time delay of sound signals

Technical field

The present invention relates to the field of audio signal processing, and in particular to a method and device for estimating the time delay of sound signals.

Background

Time delay estimation algorithms for sound signals are widely used in many fields, such as sound matching, codec alignment, and acoustic ranging.

The prior art provides a variety of time delay estimation methods. One widely used algorithm is time delay estimation based on correlation analysis. Its basic idea is to estimate the time delay between two sound signals from their degree of similarity in the frequency domain.

In the course of implementing the present invention, the inventor found that the above technology has at least the following problem: the time delay estimation method based on correlation analysis considers only the similarity of the two sound signals in the frequency domain, so the matching precision of the two sound signals is low, and the time delay finally calculated is correspondingly inaccurate.

Summary of the invention

To solve the problem of low accuracy in the time delay estimation methods described above, embodiments of the present invention provide a method and device for estimating the time delay of sound signals. The technical solutions are as follows:

In a first aspect, a method for estimating the time delay of sound signals is provided. The method includes:

obtaining two sound signals;

performing correlation matching on the two sound signals according to the short-time Fourier transforms of the two sound signals to obtain a first matching result, the first matching result including a first matching position and a first matching degree of the two sound signals;

performing correlation matching on the two sound signals according to the inter-spectral correlation of the power spectra of the two sound signals to obtain a second matching result, the second matching result including a second matching position and a second matching degree of the two sound signals; and

calculating the time delay between the two sound signals according to the first matching result and the second matching result.

Optionally, calculating the time delay between the two sound signals according to the first matching result and the second matching result includes:

for each sound signal, calculating a final matching position from the first matching position and the second matching position using a weighted average algorithm, the weights of the weighted average algorithm being determined according to the first matching degree and the second matching degree; and

calculating the time delay between the two sound signals according to the final matching positions of the two sound signals.

Optionally, performing correlation matching on the two sound signals according to the short-time Fourier transforms of the two sound signals to obtain the first matching result includes:

for each sound signal, performing noise tracking on each frame of the sound signal according to the following formula to obtain the noise spectrum N(w, n) of each frame:

where X(w, n) denotes the short-time Fourier transform of the sound signal; α_u and α_d are preset coefficients with 0 < α_d < α_u < 1; w denotes the frequency bin index in the frequency domain; and n denotes the frame index in the time domain;

performing binarization on the short-time Fourier transform of each frame of the sound signal according to the following formula to obtain a binary spectrum Xb(w, n):

where T_b is a preset first threshold; and

performing pairwise correlation matching between the K_a binary spectra corresponding to one sound signal and the K_b binary spectra corresponding to the other sound signal to obtain the first matching result, the first matching result including the matching position and matching degree corresponding to the pair of binary spectra with the highest matching degree, K_a and K_b being positive integers.

Optionally, performing correlation matching on the two sound signals according to the inter-spectral correlation of the power spectra of the two sound signals to obtain the second matching result includes:

for each sound signal, calculating the power spectrum P(w, n) of each frame of the sound signal according to the following formula:

P(w, n) = α_p·P(w, n-1) + (1 - α_p)·|X(w, n)|²;

where X(w, n) denotes the short-time Fourier transform of the sound signal; α_p is a preset coefficient with 0 < α_p < 1; w denotes the frequency bin index in the frequency domain; and n denotes the frame index in the time domain;

calculating the inter-spectral correlation DP(w, n) of the power spectrum of each frame according to the following formula:

DP(w, n) = |P(w+1, n) - P(w, n)|;

performing noise tracking on the inter-spectral correlation DP(w, n) according to the following formula to obtain the inter-spectral correlation NDP(w, n) of the noise power spectrum of each frame:

where β_u and β_d are preset coefficients with 0 < β_d < β_u < 1;

performing binarization on the inter-spectral correlation DP(w, n) of each frame according to the following formula to obtain a correlation binary spectrum XDb(w, n):

where T_Db is a preset second threshold; and

performing pairwise correlation matching between the KD_a correlation binary spectra corresponding to one sound signal and the KD_b correlation binary spectra corresponding to the other sound signal to obtain the second matching result, the second matching result including the matching position and matching degree corresponding to the pair of correlation binary spectra with the highest matching degree, KD_a and KD_b being positive integers.

Optionally, before performing correlation matching on the two sound signals according to the short-time Fourier transforms of the two sound signals to obtain the first matching result, the method further includes:

for each sound signal, preprocessing the sound signal to obtain a preprocessed sound signal, the preprocessing including at least one of noise reduction, amplification, high-pass filtering, and up/down-sampling; and

performing a short-time Fourier transform on the preprocessed sound signal.

In a second aspect, a device for estimating the time delay of sound signals is provided. The device includes:

a signal acquisition module, configured to obtain two sound signals;

a first matching module, configured to perform correlation matching on the two sound signals according to the short-time Fourier transforms of the two sound signals to obtain a first matching result, the first matching result including a first matching position and a first matching degree of the two sound signals;

a second matching module, configured to perform correlation matching on the two sound signals according to the inter-spectral correlation of the power spectra of the two sound signals to obtain a second matching result, the second matching result including a second matching position and a second matching degree of the two sound signals; and

a delay calculation module, configured to calculate the time delay between the two sound signals according to the first matching result and the second matching result.

Optionally, the delay calculation module includes a position calculation unit and a delay calculation unit;

the position calculation unit is configured to, for each sound signal, calculate a final matching position from the first matching position and the second matching position using a weighted average algorithm, the weights of the weighted average algorithm being determined according to the first matching degree and the second matching degree; and

the delay calculation unit is configured to calculate the time delay between the two sound signals according to the final matching positions of the two sound signals.

Optionally, the first matching module includes a first tracking unit, a first binarization unit, and a first matching unit;

the first tracking unit is configured to, for each sound signal, perform noise tracking on each frame of the sound signal according to the following formula to obtain the noise spectrum N(w, n) of each frame:

the first binarization unit is configured to perform binarization on the short-time Fourier transform of each frame according to the following formula to obtain a binary spectrum Xb(w, n):

where T_b is a preset first threshold; and

the first matching unit is configured to perform pairwise correlation matching between the K_a binary spectra corresponding to one sound signal and the K_b binary spectra corresponding to the other sound signal to obtain the first matching result, the first matching result including the matching position and matching degree corresponding to the pair of binary spectra with the highest matching degree, K_a and K_b being positive integers.

Optionally, the second matching module includes a power spectrum calculation unit, a correlation calculation unit, a second tracking unit, a second binarization unit, and a second matching unit;

the power spectrum calculation unit is configured to, for each sound signal, calculate the power spectrum P(w, n) of each frame of the sound signal according to the following formula:

P(w, n) = α_p·P(w, n-1) + (1 - α_p)·|X(w, n)|²;

the correlation calculation unit is configured to calculate the inter-spectral correlation DP(w, n) of the power spectrum of each frame according to the following formula:

DP(w, n) = |P(w+1, n) - P(w, n)|;

the second tracking unit is configured to perform noise tracking on the inter-spectral correlation DP(w, n) according to the following formula to obtain the inter-spectral correlation NDP(w, n) of the noise power spectrum of each frame:

the second binarization unit is configured to perform binarization on the inter-spectral correlation DP(w, n) of each frame according to the following formula to obtain a correlation binary spectrum XDb(w, n):

where T_Db is a preset second threshold; and

the second matching unit is configured to perform pairwise correlation matching between the KD_a correlation binary spectra corresponding to one sound signal and the KD_b correlation binary spectra corresponding to the other sound signal to obtain the second matching result, the second matching result including the matching position and matching degree corresponding to the pair of correlation binary spectra with the highest matching degree, KD_a and KD_b being positive integers.

Optionally, the device further includes:

a signal preprocessing module, configured to, for each sound signal, preprocess the sound signal to obtain a preprocessed sound signal, the preprocessing including at least one of noise reduction, amplification, high-pass filtering, and up/down-sampling; and

a Fourier transform module, configured to perform a short-time Fourier transform on the preprocessed sound signal.

The technical solutions provided by the embodiments of the present invention bring the following beneficial effects:

A first matching result is obtained by analyzing and matching the short-time Fourier transforms of the two sound signals, a second matching result is obtained by analyzing and matching the inter-spectral correlation of the power spectra of the two sound signals, and the time delay between the two sound signals is then calculated by combining the first and second matching results. This solves the problem of low accuracy in the time delay estimation methods of the related art. The two sound signals are matched from two angles, frequency-domain distribution and inter-spectral correlation of the power spectrum, and the two matching results are combined to determine the final matching result, which improves matching precision and the accuracy of time delay estimation.

Brief description of the drawings

To describe the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flowchart of a method for estimating the time delay of sound signals provided by one embodiment of the present invention;

Fig. 2 is a flowchart of a method for estimating the time delay of sound signals provided by another embodiment of the present invention;

Fig. 3 is a block diagram of a device for estimating the time delay of sound signals provided by one embodiment of the present invention;

Fig. 4 is a block diagram of a device for estimating the time delay of sound signals provided by another embodiment of the present invention;

Fig. 5 is a structural diagram of an electronic device provided by one embodiment of the present invention.

Detailed description of the embodiments

To make the objectives, technical solutions, and advantages of the present invention clearer, the embodiments of the present invention are described in further detail below with reference to the drawings.

Referring to Fig. 1, which shows a flowchart of a method for estimating the time delay of sound signals provided by one embodiment of the present invention. This embodiment is described with the method applied in an electronic device such as a mobile phone, tablet computer, laptop computer, or desktop computer. The method may include the following steps:

Step 102: obtain two sound signals.

Step 104: perform correlation matching on the two sound signals according to their short-time Fourier transforms to obtain a first matching result, which includes a first matching position and a first matching degree of the two sound signals.

Step 106: perform correlation matching on the two sound signals according to the inter-spectral correlation of their power spectra to obtain a second matching result, which includes a second matching position and a second matching degree of the two sound signals.

Step 108: calculate the time delay between the two sound signals according to the first matching result and the second matching result.

It should be noted that step 106 may be performed after step 104, before step 104, or simultaneously with step 104. This embodiment is described with step 106 performed after step 104 only as an example, and no specific limitation is imposed.

In summary, the method provided in this embodiment obtains a first matching result by analyzing and matching the short-time Fourier transforms of the two sound signals, obtains a second matching result by analyzing and matching the inter-spectral correlation of the power spectra of the two sound signals, and then calculates the time delay between the two sound signals by combining the first and second matching results. This solves the problem of low accuracy in the time delay estimation methods of the related art. The two sound signals are matched from two angles, frequency-domain distribution and inter-spectral correlation of the power spectrum, and the two matching results are combined to determine the final matching result, which improves matching precision and the accuracy of time delay estimation.

Referring to Fig. 2, which shows a flowchart of a method for estimating the time delay of sound signals provided by another embodiment of the present invention. This embodiment is described with the method applied in an electronic device such as a mobile phone, tablet computer, laptop computer, or desktop computer. The method may include the following steps:

Step 201: obtain two sound signals.

The two sound signals are discrete signals in the time domain. In this embodiment, one sound signal A is denoted x_ra(n) and the other sound signal B is denoted x_rb(n), where n denotes the frame index in the time domain, n ∈ [0, M-1], M ≥ 2, and n and M are integers.

Step 202: for each sound signal, preprocess the sound signal to obtain a preprocessed sound signal.

The preprocessing includes but is not limited to at least one of noise reduction, amplification, high-pass filtering, and up/down-sampling. The purpose of preprocessing is to allow more accurate and reliable sound features to be extracted in the subsequent steps, so as to improve matching precision.

In this embodiment, let the preprocessing applied to each sound signal be F(*). The preprocessing result of sound signal A is then x_a(n) = F(x_ra(n)), and that of sound signal B is x_b(n) = F(x_rb(n)).

It should be noted that this embodiment is described using only the preprocessing methods listed above as examples. In practical applications, other preprocessing methods may be used as required, and this embodiment imposes no specific limitation.

Step 203: perform a short-time Fourier transform on the preprocessed sound signal.

The STFT (Short-Time Fourier Transform) consists of windowing the preprocessed sound signal with a predetermined window function and applying an FFT (Fast Fourier Transform), thereby transforming the sound signal from the time domain to the frequency domain. The choice of window function includes but is not limited to the Hamming window and the Kaiser window. The window length is a power of 2, such as 128 or 256.
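The windowing-plus-FFT procedure described above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes NumPy, a Hamming window of length 256, and a hop of half the window length. The source does not specify a hop size, and the function name `stft` and the test tone are hypothetical.

```python
import numpy as np

def stft(signal, win_len=256, hop=128):
    """Window each frame with a Hamming window and take its FFT.

    Returns an array indexed as X[w, n] (frequency bin w, frame n).
    win_len and hop are illustrative assumptions; the source only says
    the window length is a power of two such as 128 or 256.
    """
    signal = np.asarray(signal, dtype=float)
    window = np.hamming(win_len)
    num_frames = 1 + (len(signal) - win_len) // hop
    frames = np.stack([
        signal[n * hop : n * hop + win_len] * window
        for n in range(num_frames)
    ])
    # rfft keeps only the non-negative frequencies of a real signal.
    return np.fft.rfft(frames, axis=1).T

# Assumed test input: a 1 kHz tone sampled at 8 kHz for one second.
fs = 8000
t = np.arange(fs) / fs
X = stft(np.sin(2 * np.pi * 1000 * t))
peak_bin = int(np.argmax(np.abs(X[:, 0])))  # bin spacing is fs / win_len = 31.25 Hz
```

With these assumed parameters the tone falls exactly on bin 32 (1000 / 31.25), which gives a quick sanity check of the frequency axis.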

In this embodiment, denoting the STFT function as STFT(*), the STFT of sound signal A can be written X_a(w, n) = STFT(x_a(n)) and the STFT of sound signal B can be written X_b(w, n) = STFT(x_b(n)), where w denotes the frequency bin index in the frequency domain, w ∈ [0, N-1], N ≥ 2, and w and N are integers.

Step 204: perform correlation matching on the two sound signals according to their short-time Fourier transforms to obtain a first matching result, which includes a first matching position and a first matching degree of the two sound signals.

In the first matching approach, the two sound signals are matched from the angle of their frequency-domain distribution. Specifically, this step may include the following sub-steps:

First, for each sound signal, perform noise tracking on each frame of the sound signal according to the following formula to obtain the noise spectrum N(w, n) of each frame:

where X(w, n) denotes the short-time Fourier transform of the sound signal; α_u and α_d are preset coefficients with 0 < α_d < α_u < 1; w denotes the frequency bin index in the frequency domain; and n denotes the frame index in the time domain.

The purpose of noise tracking is to reduce, as far as possible, the influence of noise in the sound signal on the matching result, so as to improve matching precision. In addition, by setting the preset coefficients α_u and α_d appropriately, using the smaller coefficient α_u while the signal rises and the larger coefficient α_d while the signal falls, the noise tracking effect can be improved.

In this embodiment, let the noise spectrum of the i-th frame in sound signal A be N_a(w, i); then:

where X_a(w, i) denotes the short-time Fourier transform of the i-th frame in sound signal A, i ≥ 0, and i is an integer.

Similarly, let the noise spectrum of the j-th frame in sound signal B be N_b(w, j); then:

where X_b(w, j) denotes the short-time Fourier transform of the j-th frame in sound signal B, j ≥ 0, and j is an integer.

Second, perform binarization on the short-time Fourier transform of each frame according to the following formula to obtain a binary spectrum Xb(w, n):

where T_b is a preset first threshold.

After binarization, the short-time Fourier transform of each frame is converted into a binary sequence of equal length consisting of 0s and 1s. Binarization substantially increases data processing speed, so that the subsequent matching calculations are more efficient (robust).

In this embodiment, let the binary spectrum corresponding to the i-th frame in sound signal A be Xb_a(w, i); then:

Similarly, let the binary spectrum corresponding to the j-th frame in sound signal B be Xb_b(w, j); then:

Third, perform pairwise correlation matching between the K_a binary spectra corresponding to one sound signal and the K_b binary spectra corresponding to the other sound signal to obtain the first matching result, K_a and K_b being positive integers.

The first matching result includes the matching position and matching degree corresponding to the pair of binary spectra with the highest matching degree. When sound signal A has K_a binary spectra and sound signal B has K_b binary spectra, K_a × K_b binary spectrum matches are required. For each match, the matching degree between the two binary spectra is recorded, together with the indices n_a(i) and n_b(j) of the two binary spectra being matched. Here, n_a(i) denotes the index of the binary spectrum corresponding to the i-th frame in sound signal A, with n_a(i) ∈ [0, K_a-1], K_a ≥ 2, and n_a(i) and K_a integers; n_b(j) denotes the index of the binary spectrum corresponding to the j-th frame in sound signal B, with n_b(j) ∈ [0, K_b-1], K_b ≥ 2, and n_b(j) and K_b integers.

For example, if the binary spectrum Xb_a(w, i) corresponding to the i-th frame in sound signal A is matched with the binary spectrum Xb_b(w, j) corresponding to the j-th frame in sound signal B, their matching degree Pb_ij is:

Pb_ij = (1/N) · Σ_{w=0}^{N-1} [Xb_a(w, i) ⊙ Xb_b(w, j)]

where ⊙ denotes the XNOR (exclusive-NOR) operator. The matching degree Pb_ij equals the ratio of the number of frequency bin pairs whose binarization results at corresponding positions in the two binary spectra are equal to the total number of frequency bin pairs.

In a specific example, suppose Xb_a(w, i) = {1, 1, 0, 0, 0, 1, 1, 1} and Xb_b(w, j) = {0, 1, 1, 1, 0, 1, 1, 1}. When w = 0, since Xb_a(0, i) = 1 and Xb_b(0, j) = 0, Xb_a(0, i) ⊙ Xb_b(0, j) = 0; when w = 1, since Xb_a(1, i) = 1 and Xb_b(1, j) = 1, Xb_a(1, i) ⊙ Xb_b(1, j) = 1; and so on. The matching degree of these two binary sequences is therefore Pb_ij = 5/8 = 0.625.
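The worked example above can be reproduced with a few lines of Python. This is a minimal sketch; the function name `matching_degree` is hypothetical, and the XNOR is expressed as an equality test on each pair of bits.

```python
def matching_degree(spec_a, spec_b):
    """Fraction of positions at which two binary spectra agree (XNOR = 1)."""
    if len(spec_a) != len(spec_b):
        raise ValueError("binary spectra must have equal length")
    # a == b is True exactly when the XNOR of the two bits is 1.
    agree = sum(1 for a, b in zip(spec_a, spec_b) if a == b)
    return agree / len(spec_a)

# The two binary sequences from the example in the text.
xb_a = [1, 1, 0, 0, 0, 1, 1, 1]
xb_b = [0, 1, 1, 1, 0, 1, 1, 1]
pb_ij = matching_degree(xb_a, xb_b)
print(pb_ij)  # 0.625, i.e. 5 agreeing positions out of 8
```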

After the matching degree between every pair of binary spectra has been calculated, the pair with the highest matching degree is selected, and the indices and matching degree of that pair are recorded.
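The exhaustive K_a × K_b search for the best-matching pair might look like the sketch below. It assumes NumPy and that the binary spectra of each signal are stacked into a 2-D array with one spectrum per row; the function name `best_match` and the sample data are hypothetical. On ties, `np.argmax` returns the first maximum, a choice the source does not address.

```python
import numpy as np

def best_match(spectra_a, spectra_b):
    """Return (index_a, index_b, matching_degree) of the best-matching pair.

    spectra_a has shape (K_a, N) and spectra_b has shape (K_b, N);
    every entry is 0 or 1.
    """
    a = np.asarray(spectra_a, dtype=bool)
    b = np.asarray(spectra_b, dtype=bool)
    # agree[i, j, w] is True where spectrum i of A and spectrum j of B
    # hold the same bit at bin w (the XNOR of the two bits).
    agree = a[:, None, :] == b[None, :, :]
    degree = agree.mean(axis=2)  # shape (K_a, K_b)
    i, j = np.unravel_index(np.argmax(degree), degree.shape)
    return int(i), int(j), float(degree[i, j])

# Hypothetical data: spectrum 2 of A is identical to spectrum 1 of B.
spectra_a = [[1, 1, 0, 0], [0, 1, 1, 0], [1, 0, 1, 1]]
spectra_b = [[0, 0, 0, 1], [1, 0, 1, 1], [0, 1, 0, 0]]
result = best_match(spectra_a, spectra_b)
```

Broadcasting evaluates all K_a × K_b comparisons at once; for long signals the intermediate array of size K_a × K_b × N may need to be processed in blocks.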

In this embodiment, suppose the n_a(i)-th binary spectrum, corresponding to the i-th frame in sound signal A, and the n_b(j)-th binary spectrum, corresponding to the j-th frame in sound signal B, have the highest matching degree, denoted P₁. For example, i = 1 and j = 2.

Step 205: perform correlation matching on the two sound signals according to the inter-spectral correlation of their power spectra to obtain a second matching result, which includes a second matching position and a second matching degree of the two sound signals.

In the second matching approach, the two sound signals are matched from the angle of the inter-spectral correlation of the power spectrum. Specifically, this step may include the following sub-steps:

First, for each sound signal, calculate the power spectrum P(w, n) of each frame of the sound signal according to the following formula:

P(w, n) = α_p·P(w, n-1) + (1 - α_p)·|X(w, n)|²;

where X(w, n) denotes the short-time Fourier transform of the sound signal; α_p is a preset coefficient with 0 < α_p < 1; w denotes the frequency bin index in the frequency domain; and n denotes the frame index in the time domain.

In this embodiment, let the power spectrum of the i-th frame in sound signal A be P_a(w, i); then:

P_a(w, i) = α_p·P_a(w, i-1) + (1 - α_p)·|X_a(w, i)|²;

Similarly, let the power spectrum of the j-th frame in sound signal B be P_b(w, j); then:

P_b(w, j) = α_p·P_b(w, j-1) + (1 - α_p)·|X_b(w, j)|²;
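The recursion above is a first-order smoothing of |X(w, n)|² across frames. A minimal NumPy sketch follows; the initial state P(w, -1) = 0, the default α_p = 0.9, and the function name `smoothed_power_spectrum` are assumptions, since the source does not state them.

```python
import numpy as np

def smoothed_power_spectrum(X, alpha_p=0.9):
    """P(w, n) = alpha_p * P(w, n-1) + (1 - alpha_p) * |X(w, n)|^2.

    X is indexed as X[w, n]. The recursion is started from
    P(w, -1) = 0, which is an assumption of this sketch.
    """
    power = np.abs(np.asarray(X)).astype(float) ** 2
    P = np.empty_like(power)
    prev = np.zeros(power.shape[0])
    for n in range(power.shape[1]):
        prev = alpha_p * prev + (1.0 - alpha_p) * power[:, n]
        P[:, n] = prev
    return P

# One frequency bin with constant magnitude 2 over three frames.
P = smoothed_power_spectrum(np.array([[2.0, 2.0, 2.0]]), alpha_p=0.5)
# With alpha_p = 0.5 the estimate climbs toward |X|^2 = 4: 2.0, 3.0, 3.5
```

A value of α_p closer to 1 gives a smoother but slower-reacting estimate.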

Second, calculate the inter-spectral correlation DP(w, n) of the power spectrum of each frame according to the following formula:

DP(w, n) = |P(w+1, n) - P(w, n)|.

After the power spectrum of the sound signal has been calculated, the energy of the lower frequency bin is subtracted from the energy of the adjacent higher frequency bin to obtain the inter-spectral correlation of the power spectrum.

In this embodiment, let the inter-spectral correlation of the power spectrum of the i-th frame in sound signal A be DP_a(w, i); then:

DP_a(w, i) = |P_a(w+1, i) - P_a(w, i)|;

Similarly, let the inter-spectral correlation of the power spectrum of the j-th frame in sound signal B be DP_b(w, j); then:

DP_b(w, j) = |P_b(w+1, j) - P_b(w, j)|.
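The adjacent-bin difference can be computed in one call, as in the sketch below. It assumes NumPy and a power spectrum indexed as P[w, n]; the function name `inter_spectral_difference` is hypothetical. Because the last bin has no upper neighbour, the result has one row fewer than P. How the source handles that boundary is not stated, so dropping it is an assumption.

```python
import numpy as np

def inter_spectral_difference(P):
    """DP(w, n) = |P(w+1, n) - P(w, n)| for every frame n.

    The output has one row fewer than P because the top bin has no
    upper neighbour (boundary handling is an assumption of this sketch).
    """
    P = np.asarray(P, dtype=float)
    # np.diff along axis 0 yields P[w+1, n] - P[w, n].
    return np.abs(np.diff(P, axis=0))

# Hypothetical power spectrum with 3 bins and 2 frames.
P = [[1.0, 4.0],
     [3.0, 4.0],
     [0.0, 9.0]]
DP = inter_spectral_difference(P)
```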

Third, perform noise tracking on the inter-spectral correlation DP(w, n) according to the following formula to obtain the inter-spectral correlation NDP(w, n) of the noise power spectrum of each frame:

where β_u and β_d are preset coefficients with 0 < β_d < β_u < 1.

The purpose of noise tracking on the inter-spectral correlation is to prevent fluctuating noise from causing misjudgments in subsequent matching and to reduce, as far as possible, the influence of noise on the matching result, so as to improve matching precision. In addition, setting the preset coefficients β_u and β_d appropriately can improve the noise tracking effect.

In this embodiment, let the inter-spectral correlation of the noise power spectrum of the i-th frame in sound signal A be NDP_a(w, i); then:

Similarly, let the inter-spectral correlation of the noise power spectrum of the j-th frame in sound signal B be NDP_b(w, j); then:

Fourth, perform binarization on the inter-spectral correlation DP(w, n) of each frame according to the following formula to obtain a correlation binary spectrum XDb(w, n):

where T_Db is a preset second threshold.

After binarization, the inter-spectral correlation of each frame is converted into a binary sequence of equal length consisting of 0s and 1s. Binarization substantially increases data processing speed, so that the subsequent matching calculations are more efficient (robust).

In this embodiment, let the correlation binary spectrum corresponding to the i-th frame in sound signal A be XDb_a(w, i); then:

Similarly, let the correlation binary spectrum corresponding to the j-th frame in sound signal B be XDb_b(w, j); then:

5th, will the wherein all the way corresponding KD of voice signal_aA correlation two-value spectrum is corresponding with another way voice signal KD_bCoherence between a correlation two-value spectrum carries out two-by-two matches to obtain the second matching result, KD_a、KD_bIt is positive integer.

Wherein, the second matching result includes the corresponding matched position of the highest one group of correlation two-value spectrum of matching degree and matching Degree.When the corresponding correlation two-value spectrum quantity of wherein all the way voice signal A is KD_aAnd the corresponding correlations of another way voice signal B Property two-value spectrum quantity be KD_bWhen, it is necessary to carry out KD_a×KD_bSecondary correlation two-value spectrum matching.For matching each time, record two Matching degree between a correlation two-value spectrum, and carry out the sequence number nd of matched two correlation two-values spectrum_a(i) and nd_b(j).Its In, nd_a(i) sequence number of the corresponding correlation two-value spectrum of the i-th frame voice signal in wherein all the way voice signal A, nd are represented_a(i) ∈[0,KD_a- 1], KD_a>=2 and nd_a(i)、KD_aIt is integer；nd_b(j) the jth frame sound letter in another way voice signal B is represented The sequence number of number corresponding correlation two-value spectrum, nd_b(j)∈[0,KD_b- 1], KD_b>=2 and nd_b(j)、KD_bIt is integer.In addition, Under normal conditions, KD_aWith the K in above-mentioned steps 204_aIt is equal, KD_bWith the K in above-mentioned steps 204_bIt is equal.

For example, if the correlation two-value spectrum XDb_a(w, i) corresponding to the i-th frame of voice signal A is matched against the correlation two-value spectrum XDb_b(w, j) corresponding to the j-th frame of voice signal B, the matching degree PDb_ij of the two is:

where ⊙ denotes the XNOR operator. The matching degree PDb_ij equals the ratio of the number of data pairs whose binarized values at corresponding positions of the two correlation two-value spectra are equal to the total number of data pairs.
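The verbal definition of the matching degree can be written compactly as shown below. This is a reconstruction from the description (XNOR of corresponding positions, divided by the number of data pairs), not the formula as originally printed; the symbol M for the number of frequency points in each correlation two-value spectrum is an assumption introduced here.

```latex
PDb_{ij} \;=\; \frac{1}{M}\sum_{w=0}^{M-1} XDb_a(w,i)\,\odot\,XDb_b(w,j)
```

Under this form, PDb_ij is 1 when the two spectra are identical at every frequency point and 0 when they differ at every frequency point.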

After the matching degree between every pair of correlation two-value spectra has been calculated, the pair with the highest matching degree is selected, and the sequence numbers and the matching degree of that pair are recorded.
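The pairwise matching and selection described above may be sketched as follows. This is a minimal illustration, assuming each correlation two-value spectrum is an equal-length sequence of 0s and 1s; the function names `xnor_matching_degree` and `best_match` and the tie-breaking rule are hypothetical and do not come from the embodiment.

```python
import numpy as np


def xnor_matching_degree(spec_a, spec_b):
    """Fraction of positions at which two equal-length 0/1 spectra agree (XNOR)."""
    a = np.asarray(spec_a, dtype=bool)
    b = np.asarray(spec_b, dtype=bool)
    if a.shape != b.shape:
        raise ValueError("two-value spectra must have equal length")
    # a == b is the element-wise XNOR; its mean is (equal pairs) / (all pairs).
    return float(np.mean(a == b))


def best_match(spectra_a, spectra_b):
    """Match every spectrum of signal A against every spectrum of signal B.

    Returns (nd_a, nd_b, degree) for the pair with the highest matching
    degree, using 0-based sequence numbers. On ties the first pair found is
    kept, which is an arbitrary choice made for this sketch.
    """
    best = (-1, -1, -1.0)
    for i, sa in enumerate(spectra_a):
        for j, sb in enumerate(spectra_b):  # KD_a x KD_b comparisons in total
            degree = xnor_matching_degree(sa, sb)
            if degree > best[2]:
                best = (i, j, degree)
    return best
```

Because the comparison reduces to an element-wise equality test on binary data, each of the KD_a × KD_b matches is inexpensive, which is consistent with the efficiency argument made for binarization.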

In this embodiment, assume that the nd_a(i)-th correlation two-value spectrum, corresponding to the i-th frame of voice signal A, and the nd_b(j)-th correlation two-value spectrum, corresponding to the j-th frame of voice signal B, have the highest matching degree, which is denoted P₂. For example, when i = 1 and j = 3,

Step 206: calculate the time delay between the two voice signals according to the first matching result and the second matching result.

After the first matching result and the second matching result have been calculated, the two results are combined into a final matching result, and the time delay between the two voice signals is then calculated from the final matching result. Specifically, this step may include the following two sub-steps:

First, for each voice signal, a final matched position is calculated from the first matched position and the second matched position using a weighted average algorithm, the weights of which are determined according to the first matching degree and the second matching degree.

In one possible embodiment, assume that the first matching degree, calculated from the frequency-domain distribution of the voice signals, is P₁, and that the second matching degree, calculated from the spectral correlation of the power spectrum, is P₂. The weight of the first matched position and the weight of the second matched position are then each determined from P₁ and P₂.

The final matched position nl_a of voice signal A is:

where n_a denotes the first matched position of voice signal A and nd_a denotes the second matched position of voice signal A.

Similarly, the final matched position nl_b of voice signal B is:

where n_b denotes the first matched position of voice signal B and nd_b denotes the second matched position of voice signal B.
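The weighted-average sub-step may be illustrated as follows. The weight formulas are not reproduced in this text, so the sketch assumes the simplest choice consistent with the description: weights equal to the two matching degrees normalised to sum to 1, that is, P₁/(P₁+P₂) and P₂/(P₁+P₂). That choice, and the function name `final_matched_position`, are assumptions and may differ from the embodiment.

```python
def final_matched_position(n_first, n_second, p1, p2):
    """Weighted average of the first and second matched positions.

    n_first, n_second: first and second matched positions (frame numbers).
    p1, p2: first and second matching degrees.
    ASSUMPTION: the weights are p1/(p1+p2) and p2/(p1+p2); the source only
    states that the weights are determined from the two matching degrees.
    """
    total = p1 + p2
    if total <= 0:
        raise ValueError("at least one matching degree must be positive")
    return (p1 * n_first + p2 * n_second) / total
```

The same function would be applied once to (n_a, nd_a) to obtain nl_a and once to (n_b, nd_b) to obtain nl_b. With this weighting, the matched position that comes with the higher matching degree contributes more to the final position.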

Second, the time delay between the two voice signals is calculated from the final matched positions of the two voice signals.

The time delay t between the two voice signals is calculated according to the following formula:

t = k(nl_a - nl_b);

where k is the time coefficient.

The time coefficient k can be calculated from the sampling frequency f, the number of sampling points Num of the STFT, and the overlap coefficient η.

In a specific example, the signal is sampled at 16 kHz and a 256-point FFT with 50% overlap is used, and the time coefficient k is calculated from these values.
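The delay calculation may be sketched as follows. The expression for k is not reproduced in this text; the sketch assumes that k is the duration of one frame step, k = Num·(1 - η)/f, which is the usual relationship between STFT frame numbers and time but is an assumption here. The function names are hypothetical.

```python
def time_coefficient(sample_rate_hz, fft_points, overlap):
    """Seconds per frame step, ASSUMING k = Num * (1 - eta) / f.

    sample_rate_hz: sampling frequency f.
    fft_points: number of sampling points Num per STFT frame.
    overlap: overlap coefficient eta, as a fraction in [0, 1).
    """
    if not 0 <= overlap < 1:
        raise ValueError("overlap must be in [0, 1)")
    return fft_points * (1.0 - overlap) / sample_rate_hz


def delay_seconds(nl_a, nl_b, k):
    """Time delay t = k * (nl_a - nl_b) between the two voice signals."""
    return k * (nl_a - nl_b)
```

Under this assumption, the 16 kHz, 256-point, 50%-overlap example corresponds to a step of 128 samples, so each frame of difference between nl_a and nl_b would represent 8 ms of delay. The sign of t indicates which of the two signals lags the other.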

It should be noted that step 205 may be performed after step 204, before step 204, or simultaneously with step 204. This embodiment describes step 205 as following step 204 only by way of example, and no specific limitation is imposed in this respect.

In addition, the time delay estimation method provided in this embodiment binarizes the data before the matching calculation, which greatly improves matching efficiency and provides a robust time delay estimation method.

The following are device embodiments of the present invention, which can be used to perform the method embodiments of the present invention. For details not disclosed in the device embodiments, refer to the method embodiments.

Referring to Fig. 3, which shows a structural block diagram of a time delay estimation device for voice signals provided by one embodiment of the present invention, the time delay estimation device can be implemented as part or all of an electronic device by software, hardware, or a combination of both. The time delay estimation device may include: a signal acquisition module 310, a first matching module 320, a second matching module 330, and a time delay calculation module 340.

The signal acquisition module 310 is configured to obtain two voice signals.

The first matching module 320 is configured to perform coherence matching on the two voice signals according to their short-time Fourier transforms to obtain a first matching result, the first matching result including a first matched position and a first matching degree of the two voice signals.

The second matching module 330 is configured to perform coherence matching on the two voice signals according to the spectral correlation of their power spectra to obtain a second matching result, the second matching result including a second matched position and a second matching degree of the two voice signals.

The time delay calculation module 340 is configured to calculate the time delay between the two voice signals according to the first matching result and the second matching result.

In summary, the time delay estimation device provided in this embodiment analyzes and matches the short-time Fourier transforms of the two voice signals to obtain the first matching result, analyzes and matches the spectral correlation of their power spectra to obtain the second matching result, and then combines the two matching results to calculate the time delay between the two voice signals. This solves the problem of low accuracy in the time delay estimation methods of the related art: the two voice signals are matched from two angles, namely frequency-domain distribution and spectral correlation of the power spectrum, and the two matching results are combined into a final matching result, which improves matching precision and the accuracy of time delay estimation.

Referring to Fig. 4, which shows a structural block diagram of a time delay estimation device for voice signals provided by another embodiment of the present invention, the time delay estimation device can be implemented as part or all of an electronic device by software, hardware, or a combination of both. The time delay estimation device may include: a signal acquisition module 310, a first matching module 320, a second matching module 330, and a time delay calculation module 340.

The signal acquisition module 310 is configured to obtain two voice signals.

The first matching module 320 may include: a first tracking unit 320a, a first binarization unit 320b, and a first matching unit 320c.

The first tracking unit 320a is configured to, for each voice signal, perform noise tracking on each frame of the voice signal according to the following formula to obtain the noise spectrum N(w, n) of each frame:

$$N(w,n)=\begin{cases}(1-\alpha_u)\,|X(w,n)|+\alpha_u\,N(w,n-1), & |X(w,n)|\ge N(w,n-1)\\(1-\alpha_d)\,|X(w,n)|+\alpha_d\,N(w,n-1), & |X(w,n)|<N(w,n-1)\end{cases}$$

where X(w, n) denotes the short-time Fourier transform of the voice signal; α_u and α_d are preset coefficients with 0 < α_d < α_u < 1; w denotes the frequency point number in the frequency domain; and n denotes the frame number in the time domain.
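The noise tracking performed by the first tracking unit may be sketched as follows, using the update given in claim 2: N(w, n) = (1 - α)|X(w, n)| + α·N(w, n-1), with α = α_u when |X(w, n)| ≥ N(w, n-1) and α = α_d otherwise. The function name `track_noise`, the default coefficient values, and the initialisation of the first frame are assumptions made for illustration; the source does not specify them.

```python
import numpy as np


def track_noise(magnitudes, alpha_u=0.98, alpha_d=0.6):
    """Asymmetric recursive noise tracking over STFT magnitudes.

    magnitudes: array of shape (frames, frequency points) holding |X(w, n)|.
    alpha_u, alpha_d: preset coefficients with 0 < alpha_d < alpha_u < 1.
        The defaults are illustrative values, not taken from the source.
    ASSUMPTION: the estimate is initialised with the first frame, since the
    source does not state a value for N(w, -1).
    """
    if not 0 < alpha_d < alpha_u < 1:
        raise ValueError("require 0 < alpha_d < alpha_u < 1")
    mags = np.asarray(magnitudes, dtype=float)
    noise = np.empty_like(mags)
    noise[0] = mags[0]
    for n in range(1, mags.shape[0]):
        prev = noise[n - 1]
        # Rising magnitude: smooth heavily (alpha_u); falling: follow faster (alpha_d).
        alpha = np.where(mags[n] >= prev, alpha_u, alpha_d)
        noise[n] = (1.0 - alpha) * mags[n] + alpha * prev
    return noise
```

Because α_u is larger than α_d, the estimate rises slowly when the magnitude jumps and falls more quickly when the magnitude drops, so it tends to follow the noise floor rather than short bursts of signal energy.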

The first binarization unit 320b is configured to binarize the spectrum of each frame of the voice signal according to the following formula to obtain a two-value spectrum Xb(w, n):

where T_b is a preset first threshold.

The first matching unit 320c is configured to perform pairwise coherence matching between the K_a two-value spectra corresponding to one of the voice signals and the K_b two-value spectra corresponding to the other voice signal to obtain the first matching result, the first matching result including the matched position and the matching degree of the pair of two-value spectra with the highest matching degree, where K_a and K_b are positive integers.

The second matching module 330 may include: a power spectrum calculation unit 330a, a correlation calculation unit 330b, a second tracking unit 330c, a second binarization unit 330d, and a second matching unit 330e.

The power spectrum calculation unit 330a is configured to, for each voice signal, calculate the power spectrum P(w, n) of each frame of the voice signal according to the following formula:

P(w, n) = α_p·P(w, n-1) + (1 - α_p)·|X(w, n)|²;

where X(w, n) denotes the short-time Fourier transform of the voice signal; α_p is a preset coefficient with 0 < α_p < 1; w denotes the frequency point number in the frequency domain; and n denotes the frame number in the time domain.

The correlation calculation unit 330b is configured to calculate the spectral correlation DP(w, n) of the power spectrum of each frame of the voice signal according to the following formula:

DP(w, n) = |P(w+1, n) - P(w, n)|.
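The two formulas used by the power spectrum calculation unit and the correlation calculation unit, P(w, n) = α_p·P(w, n-1) + (1 - α_p)·|X(w, n)|² and DP(w, n) = |P(w+1, n) - P(w, n)|, may be sketched as follows. The function names, the default value of α_p, and the choice P(w, -1) = 0 are assumptions made for illustration.

```python
import numpy as np


def smoothed_power_spectrum(stft, alpha_p=0.9):
    """P(w, n) = alpha_p * P(w, n-1) + (1 - alpha_p) * |X(w, n)|**2.

    stft: complex array of shape (frames, frequency points) holding X(w, n).
    alpha_p: preset coefficient with 0 < alpha_p < 1 (default is illustrative).
    ASSUMPTION: P(w, -1) is taken as 0; the source does not give a start value.
    """
    if not 0 < alpha_p < 1:
        raise ValueError("require 0 < alpha_p < 1")
    power = np.abs(np.asarray(stft, dtype=complex)) ** 2
    smoothed = np.empty_like(power)
    prev = np.zeros(power.shape[1])
    for n in range(power.shape[0]):
        prev = alpha_p * prev + (1.0 - alpha_p) * power[n]
        smoothed[n] = prev
    return smoothed


def spectral_correlation(power):
    """DP(w, n) = |P(w+1, n) - P(w, n)| between adjacent frequency points."""
    return np.abs(np.diff(np.asarray(power, dtype=float), axis=1))
```

Since DP(w, n) needs the value at frequency point w+1, the result has one fewer frequency point per frame than the power spectrum it is computed from.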

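The second tracking unit 330c is configured to perform noise tracking on the spectral correlation DP(w, n) according to the following formula to obtain the spectral correlation NDP(w, n) of the noise power spectrum of each frame of the voice signal:

$$NDP(w,n)=\begin{cases}(1-\beta_u)\,DP(w,n)+\beta_u\,NDP(w,n-1), & DP(w,n)\ge NDP(w,n-1)\\(1-\beta_d)\,DP(w,n)+\beta_d\,NDP(w,n-1), & DP(w,n)<NDP(w,n-1)\end{cases}$$

where β_u and β_d are preset coefficients with 0 < β_d < β_u < 1.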

The second binarization unit 330d is configured to binarize the spectral correlation DP(w, n) of each frame of the voice signal according to the following formula to obtain a correlation two-value spectrum XDb(w, n):

where T_Db is a preset second threshold.

The second matching unit 330e is configured to perform pairwise coherence matching between the KD_a correlation two-value spectra corresponding to one of the voice signals and the KD_b correlation two-value spectra corresponding to the other voice signal to obtain the second matching result, the second matching result including the matched position and the matching degree of the pair of correlation two-value spectra with the highest matching degree, where KD_a and KD_b are positive integers.

The time delay calculation module 340 may include: a position calculation unit 340a and a time delay calculation unit 340b.

The position calculation unit 340a is configured to, for each voice signal, calculate a final matched position from the first matched position and the second matched position using a weighted average algorithm, the weights of which are determined according to the first matching degree and the second matching degree.

The time delay calculation unit 340b is configured to calculate the time delay between the two voice signals according to the final matched positions of the two voice signals.

Optionally, the device may further include: a signal preprocessing module 312 and a Fourier transform module 314.

The signal preprocessing module 312 is configured to, for each voice signal, preprocess the voice signal to obtain a preprocessed voice signal, the preprocessing including at least one of noise reduction, amplification, high-pass filtering, and up-sampling or down-sampling.

The Fourier transform module 314 is configured to perform a short-time Fourier transform on the preprocessed voice signal.

In addition, the time delay estimation device provided in this embodiment binarizes the data before the matching calculation, which greatly improves matching efficiency.

It should be noted that, when the time delay estimation device provided in the above embodiment calculates the time delay between two voice signals, the division into the functional modules described above is used only as an example. In practical applications, the functions may be assigned to different functional modules as needed; that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the time delay estimation device provided in the above embodiment and the method embodiments of the time delay estimation method belong to the same concept; for the specific implementation process, refer to the method embodiments, which are not repeated here.

Referring to Fig. 5, which shows a structural diagram of an electronic device provided by one embodiment of the present invention. The electronic device is used to implement the time delay estimation method for voice signals provided in the above embodiments. Specifically:

The electronic device 500 may include an RF (Radio Frequency) circuit 510, a memory 520 including one or more computer-readable storage media, an input unit 530, a display unit 540, a sensor 550, an audio circuit 560, a WiFi (wireless fidelity) module 570, a processor 580 including one or more processing cores, a power supply 590, and other components. Those skilled in the art will understand that the structure shown in Fig. 5 does not limit the electronic device, which may include more or fewer components than illustrated, combine certain components, or use a different arrangement of components. In particular:

The RF circuit 510 can be used to receive and send signals while information is being transmitted or received or during a call. In particular, after receiving downlink information from a base station, it passes the information to one or more processors 580 for processing; it also sends uplink data to the base station. In general, the RF circuit 510 includes, but is not limited to, an antenna, at least one amplifier, a tuner, one or more oscillators, a subscriber identity module (SIM) card, a transceiver, a coupler, an LNA (Low Noise Amplifier), a duplexer, and the like. In addition, the RF circuit 510 can communicate with networks and other devices by wireless communication. The wireless communication may use any communication standard or protocol, including but not limited to GSM (Global System for Mobile Communications), GPRS (General Packet Radio Service), CDMA (Code Division Multiple Access), WCDMA (Wideband Code Division Multiple Access), LTE (Long Term Evolution), e-mail, SMS (Short Message Service), and the like.

The memory 520 can be used to store software programs and modules; the processor 580 performs various functional applications and data processing by running the software programs and modules stored in the memory 520. The memory 520 may mainly include a program storage area and a data storage area, where the program storage area can store an operating system, application programs required by at least one function (such as a sound playback function or an image playback function), and the like, and the data storage area can store data created through use of the electronic device 500 (such as audio data or a phone book). In addition, the memory 520 may include high-speed random access memory and may also include non-volatile memory, for example at least one magnetic disk storage device, a flash memory device, or another volatile solid-state memory device. Accordingly, the memory 520 may also include a memory controller to provide the processor 580 and the input unit 530 with access to the memory 520.

The input unit 530 can be used to receive input numeric or character information and to generate keyboard, mouse, joystick, optical, or trackball signal inputs related to user settings and function control. Specifically, the input unit 530 may include an image input device 531 and other input devices 532. The image input device 531 may be a camera or a photoelectric scanning device. In addition to the image input device 531, the input unit 530 may include other input devices 532, which may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys or a power switch key), a trackball, a mouse, a joystick, and the like.

The display unit 540 can be used to display information input by the user or information provided to the user, as well as the various graphical user interfaces of the electronic device 500; these graphical user interfaces may be composed of graphics, text, icons, video, and any combination thereof. The display unit 540 may include a display panel 541, which may optionally be configured in the form of an LCD (Liquid Crystal Display), an OLED (Organic Light-Emitting Diode), or the like.

The electronic device 500 may also include at least one sensor 550, such as an optical sensor, a motion sensor, or another sensor. Specifically, the optical sensor may include an ambient light sensor and a proximity sensor: the ambient light sensor can adjust the brightness of the display panel 541 according to the ambient light level, and the proximity sensor can turn off the display panel 541 and/or the backlight when the electronic device 500 is moved to the ear. As one kind of motion sensor, a gravity acceleration sensor can detect the magnitude of acceleration in all directions (generally three axes) and can detect the magnitude and direction of gravity when stationary; it can be used for applications that recognize the attitude of the mobile phone (such as switching between landscape and portrait, related games, and magnetometer attitude calibration) and for vibration-recognition functions (such as a pedometer or tap detection). Other sensors that may be configured on the electronic device 500, such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, are not described here.

The audio circuit 560, a loudspeaker 561, and a microphone 562 can provide an audio interface between the user and the electronic device 500. The audio circuit 560 can convert received audio data into an electrical signal and transmit it to the loudspeaker 561, which converts it into a sound signal for output. Conversely, the microphone 562 converts a collected sound signal into an electrical signal, which the audio circuit 560 receives and converts into audio data; the audio data is then output to the processor 580 for processing and afterwards sent, for example, to another electronic device through the RF circuit 510, or output to the memory 520 for further processing. The audio circuit 560 may also include an earphone jack to provide communication between a peripheral earphone and the electronic device 500.

WiFi is a short-range wireless transmission technology. Through the WiFi module 570, the electronic device 500 can help the user send and receive e-mail, browse web pages, access streaming media, and so on, providing the user with wireless broadband Internet access. Although Fig. 5 shows the WiFi module 570, it will be understood that the module is not an essential part of the electronic device 500 and may be omitted as needed without changing the essence of the invention.

The processor 580 is the control center of the electronic device 500. It connects all parts of the entire mobile phone through various interfaces and lines, and performs the various functions of the electronic device 500 and processes data by running or executing the software programs and/or modules stored in the memory 520 and calling the data stored in the memory 520, thereby monitoring the mobile phone as a whole. Optionally, the processor 580 may include one or more processing cores; preferably, the processor 580 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, the user interface, application programs, and the like, and the modem processor mainly handles wireless communication. It will be understood that the modem processor need not be integrated into the processor 580.

The electronic device 500 further includes a power supply 590 (such as a battery) that supplies power to the components. Preferably, the power supply is logically connected to the processor 580 through a power management system, so that charging, discharging, and power consumption are managed through the power management system. The power supply 590 may also include any of one or more DC or AC power sources, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and similar components.

Although not shown, the electronic device 500 may also include a Bluetooth module and the like, which are not described here.

Specifically, in this embodiment the electronic device 500 further includes a memory and one or more programs, where the one or more programs are stored in the memory and are configured to be executed by one or more processors to perform the time delay estimation method of the embodiment shown in Fig. 1 or Fig. 2.

The sequence numbers of the above embodiments of the present invention are for description only and do not represent the relative merits of the embodiments.

Those of ordinary skill in the art will understand that all or some of the steps of the above embodiments may be completed by hardware, or by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, and the storage medium may be a read-only memory, a magnetic disk, an optical disc, or the like.

The foregoing describes only preferred embodiments of the present invention and is not intended to limit the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims

1. A time delay estimation method for voice signals, characterized in that the method comprises:

obtaining two voice signals;

performing coherence matching on the two voice signals according to the short-time Fourier transforms of the two voice signals to obtain a first matching result, the first matching result including a first matched position and a first matching degree of the two voice signals, wherein the first matching result is obtained by performing coherence matching on the two-value spectra of the two voice signals, and the two-value spectrum of each voice signal is obtained by binarizing the short-time Fourier transform of that voice signal;

performing coherence matching on the two voice signals according to the spectral correlation of the power spectra of the two voice signals to obtain a second matching result, the second matching result including a second matched position and a second matching degree of the two voice signals, wherein the second matching result is obtained by performing coherence matching on the correlation two-value spectra of the two voice signals, and the correlation two-value spectrum of each voice signal is obtained by binarizing the spectral correlation of the power spectrum of that voice signal;

for each voice signal, calculating a final matched position from the first matched position and the second matched position using a weighted average algorithm, the weights of the weighted average algorithm being determined according to the first matching degree and the second matching degree; and

calculating the time delay between the two voice signals according to the final matched positions of the two voice signals.
2. The method according to claim 1, characterized in that performing coherence matching on the two voice signals according to the short-time Fourier transforms of the two voice signals to obtain the first matching result comprises:

for each voice signal, performing noise tracking on each frame of the voice signal according to the following formula to obtain the noise spectrum N(w, n) of each frame:

$$N(w,n)=\begin{cases}(1-\alpha_u)\,|X(w,n)|+\alpha_u\,N(w,n-1), & |X(w,n)|\ge N(w,n-1)\\(1-\alpha_d)\,|X(w,n)|+\alpha_d\,N(w,n-1), & |X(w,n)|<N(w,n-1)\end{cases};$$

where X(w, n) denotes the short-time Fourier transform of the voice signal; α_u and α_d are preset coefficients with 0 < α_d < α_u < 1; w denotes the frequency point number in the frequency domain; and n denotes the frame number in the time domain;

binarizing the short-time Fourier transform of each frame of the voice signal according to the following formula to obtain a two-value spectrum Xb(w, n):

where T_b is a preset first threshold;

performing pairwise coherence matching between the K_a two-value spectra corresponding to one of the voice signals and the K_b two-value spectra corresponding to the other voice signal to obtain the first matching result, the first matching result including the matched position and the matching degree of the pair of two-value spectra with the highest matching degree, where K_a and K_b are positive integers.
3. The method according to claim 1, characterized in that performing coherence matching on the two voice signals according to the spectral correlation of the power spectra of the two voice signals to obtain the second matching result comprises:

for each voice signal, calculating the power spectrum P(w, n) of each frame of the voice signal according to the following formula:

P(w, n) = α_p·P(w, n-1) + (1 - α_p)·|X(w, n)|²;

where X(w, n) denotes the short-time Fourier transform of the voice signal; α_p is a preset coefficient with 0 < α_p < 1; w denotes the frequency point number in the frequency domain; and n denotes the frame number in the time domain;

calculating the spectral correlation DP(w, n) of the power spectrum of each frame of the voice signal according to the following formula:

DP(w, n) = |P(w+1, n) - P(w, n)|;

performing noise tracking on the spectral correlation DP(w, n) according to the following formula to obtain the spectral correlation NDP(w, n) of the noise power spectrum of each frame of the voice signal:

$$NDP(w,n)=\begin{cases}(1-\beta_u)\,DP(w,n)+\beta_u\,NDP(w,n-1), & DP(w,n)\ge NDP(w,n-1)\\(1-\beta_d)\,DP(w,n)+\beta_d\,NDP(w,n-1), & DP(w,n)<NDP(w,n-1)\end{cases};$$

where β_u and β_d are preset coefficients with 0 < β_d < β_u < 1;

binarizing the spectral correlation DP(w, n) of each frame of the voice signal according to the following formula to obtain a correlation two-value spectrum XDb(w, n):

where T_Db is a preset second threshold;

performing pairwise coherence matching between the KD_a correlation two-value spectra corresponding to one of the voice signals and the KD_b correlation two-value spectra corresponding to the other voice signal to obtain the second matching result, the second matching result including the matched position and the matching degree of the pair of correlation two-value spectra with the highest matching degree, where KD_a and KD_b are positive integers.
4. The method according to any one of claims 1 to 3, characterized in that, before performing coherence matching on the two voice signals according to the short-time Fourier transforms of the two voice signals to obtain the first matching result, the method further comprises:

for each voice signal, preprocessing the voice signal to obtain a preprocessed voice signal, the preprocessing including at least one of noise reduction, amplification, high-pass filtering, and up-sampling or down-sampling;

performing a short-time Fourier transform on the preprocessed voice signal.
5. A delay estimation device for voice signals, characterized in that the device comprises:

a signal acquisition module, configured to obtain two voice signals;

a first matching module, configured to perform coherence matching on the two voice signals according to the short-time Fourier transform of the two voice signals to obtain a first matching result, the first matching result including a first matched position and a first matching degree of the two voice signals, wherein the first matching result is obtained by performing coherence matching on the binary spectra of the two voice signals, and the binary spectrum of each voice signal is obtained by binarizing the short-time Fourier transform of that voice signal;

a second matching module, configured to perform coherence matching on the two voice signals according to the spectral correlation of the power spectra of the two voice signals to obtain a second matching result, the second matching result including a second matched position and a second matching degree of the two voice signals, wherein the second matching result is obtained by performing coherence matching on the correlation binary spectra of the two voice signals, and the correlation binary spectrum of each voice signal is obtained by binarizing the spectral correlation of the power spectrum of that voice signal;

a delay calculation module, configured to, for each voice signal, calculate a final matched position from the first matched position and the second matched position using a weighted average algorithm, the weights of the weighted average algorithm being determined according to the first matching degree and the second matching degree; and to calculate the time delay between the two voice signals according to the final matched positions of the two voice signals.
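The delay calculation module of claim 5 can be sketched as follows. The claim only states that the weights are determined from the two matching degrees; using weights proportional to the matching degrees is an assumption made here, as are the interpretation of matched positions as frame indices, the `hop` and `sample_rate` defaults, and the function names.

```python
def fuse_positions(pos1, deg1, pos2, deg2):
    """Weighted average of two matched positions.

    Assumption: each weight is proportional to its matching degree.
    """
    total = deg1 + deg2
    if total <= 0:
        raise ValueError("at least one matching degree must be positive")
    return (deg1 * pos1 + deg2 * pos2) / total


def estimate_delay(final_pos_a, final_pos_b, hop=160, sample_rate=16000):
    """Delay in seconds between the two signals.

    Assumption: positions are frame indices spaced `hop` samples apart.
    """
    return (final_pos_b - final_pos_a) * hop / sample_rate


# Usage: fuse the two matching results per signal, then take the difference.
final_a = fuse_positions(pos1=10, deg1=0.9, pos2=12, deg2=0.3)
final_b = fuse_positions(pos1=20, deg1=0.9, pos2=22, deg2=0.3)
delay_seconds = estimate_delay(final_a, final_b)
```

With proportional weights, a matching result with a higher matching degree pulls the final matched position toward its own estimate, which is one plausible reading of "determined according to the first matching degree and the second matching degree".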
6. The device according to claim 5, characterized in that the first matching module comprises: a first tracking unit, a first binarization unit and a first matching unit;

the first tracking unit, configured to, for each voice signal, perform noise tracking on each frame of the voice signal according to the following formula to obtain the noise spectrum N_b(w, n) of each frame:

$$N_b(w,n)=\begin{cases}(1-\alpha_u)\,|X(w,n)|+\alpha_u\,N(w,n-1), & |X(w,n)|\ge N(w,n-1)\\(1-\alpha_d)\,|X(w,n)|+\alpha_d\,N(w,n-1), & |X(w,n)|<N(w,n-1)\end{cases};$$

wherein X(w, n) represents the short-time Fourier transform of the voice signal; α_u and α_d are preset coefficients with 0 < α_d < α_u < 1; w represents the frequency bin index in the frequency domain; n represents the frame index in the time domain;

the first binarization unit, configured to binarize the short-time Fourier transform of each frame of the voice signal according to the following formula to obtain the binary spectrum X_b(w, n):

where T_b is a preset first threshold;

the first matching unit, configured to perform pairwise coherence matching between the K_a binary spectra corresponding to one of the voice signals and the K_b binary spectra corresponding to the other voice signal to obtain the first matching result, the first matching result including the matched position and the matching degree corresponding to the pair of binary spectra with the highest matching degree, K_a and K_b being positive integers.
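The three units of claim 6 can be sketched as below. The noise-tracking recursion follows the piecewise formula in the claim, reading N(w, n-1) as the previous frame's tracked noise. Everything else is an assumption for illustration: the initialization from the first frame, the binarization rule (a bin is set to 1 when |X(w, n)| exceeds T_b times the tracked noise), the matching degree (fraction of equal bits), and the function names.

```python
def track_noise(mags, alpha_u=0.98, alpha_d=0.6):
    """mags[n][w] = |X(w, n)|; returns the tracked noise spectrum per frame.

    Uses alpha_u when the magnitude is at or above the previous noise
    estimate and alpha_d otherwise, with 0 < alpha_d < alpha_u < 1.
    """
    prev = list(mags[0])  # assumption: start from the first frame
    noise = []
    for frame in mags:
        cur = []
        for w, m in enumerate(frame):
            a = alpha_u if m >= prev[w] else alpha_d
            cur.append((1 - a) * m + a * prev[w])
        noise.append(cur)
        prev = cur
    return noise


def binarize(mags, noise, t_b=2.0):
    """Assumed rule: 1 where |X(w, n)| > t_b * noise, else 0."""
    return [[1 if m > t_b * nz else 0 for m, nz in zip(fm, fn)]
            for fm, fn in zip(mags, noise)]


def best_match(spec_a, spec_b):
    """Compare every binary spectrum of A with every one of B.

    Returns (index_a, index_b, degree) for the best pair, where degree is
    the fraction of bins on which the two binary spectra agree (assumed).
    """
    best = (0, 0, -1.0)
    for i, fa in enumerate(spec_a):
        for j, fb in enumerate(spec_b):
            degree = sum(1 for x, y in zip(fa, fb) if x == y) / len(fa)
            if degree > best[2]:
                best = (i, j, degree)
    return best


# Usage: a burst in bin 0 of frame 3 rises above the tracked noise.
mags = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0], [10.0, 1.0], [1.0, 1.0]]
noise = track_noise(mags, alpha_u=0.9, alpha_d=0.5)
bits = binarize(mags, noise, t_b=2.0)
```

Because α_d < α_u, the estimate follows falling magnitudes quickly and rising magnitudes slowly, so short speech bursts stand out against it.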
7. The device according to claim 5, characterized in that the second matching module comprises: a power spectrum calculation unit, a correlation calculation unit, a second tracking unit, a second binarization unit and a second matching unit;

the power spectrum calculation unit, configured to, for each voice signal, calculate the power spectrum P(w, n) of each frame of the voice signal according to the following formula:

$$P(w,n)=\alpha_p\,P(w,n-1)+(1-\alpha_p)\,|X(w,n)|^2;$$

wherein X(w, n) represents the short-time Fourier transform of the voice signal; α_p is a preset coefficient with 0 < α_p < 1; w represents the frequency bin index in the frequency domain; n represents the frame index in the time domain;

the correlation calculation unit, configured to calculate the spectral correlation DP(w, n) of the power spectrum of each frame of the voice signal according to the following formula:

$$DP(w,n)=|P(w+1,n)-P(w,n)|;$$

the second tracking unit, configured to perform noise tracking on the spectral correlation DP(w, n) according to the following formula to obtain the spectral correlation NDP(w, n) of the noise power spectrum of each frame of the voice signal:

$$NDP(w,n)=\begin{cases}(1-\beta_u)\,DP(w,n)+\beta_u\,NDP(w,n-1), & DP(w,n)\ge NDP(w,n-1)\\(1-\beta_d)\,DP(w,n)+\beta_d\,NDP(w,n-1), & DP(w,n)<NDP(w,n-1)\end{cases};$$

wherein β_u and β_d are preset coefficients with 0 < β_d < β_u < 1;

the second binarization unit, configured to binarize the spectral correlation DP(w, n) of each frame of the voice signal according to the following formula to obtain the correlation binary spectrum X_Db(w, n):

where T_Db is a preset second threshold;

the second matching unit, configured to perform pairwise coherence matching between the KD_a correlation binary spectra corresponding to one of the voice signals and the KD_b correlation binary spectra corresponding to the other voice signal to obtain the second matching result, the second matching result including the matched position and the matching degree corresponding to the pair of correlation binary spectra with the highest matching degree, KD_a and KD_b being positive integers.
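The processing chain of claim 7 up to the correlation binary spectrum can be sketched as follows. The smoothing, difference and tracking steps follow the formulas for P(w, n), DP(w, n) and NDP(w, n) in the claim. The zero initial power spectrum, the initialization of the tracker from the first frame, the binarization rule (1 where DP(w, n) exceeds T_Db times NDP(w, n)) and the function names are assumptions for illustration.

```python
def smooth_power(mags, alpha_p=0.9):
    """P(w,n) = alpha_p * P(w,n-1) + (1 - alpha_p) * |X(w,n)|^2.

    mags[n][w] = |X(w, n)|. Assumption: P(w, -1) = 0.
    """
    prev = [0.0] * len(mags[0])
    power = []
    for frame in mags:
        cur = [alpha_p * p + (1 - alpha_p) * m * m
               for p, m in zip(prev, frame)]
        power.append(cur)
        prev = cur
    return power


def spectral_difference(power):
    """DP(w,n) = |P(w+1,n) - P(w,n)|; yields one bin fewer than P."""
    return [[abs(f[w + 1] - f[w]) for w in range(len(f) - 1)]
            for f in power]


def track_ndp(dp, beta_u=0.98, beta_d=0.6):
    """Piecewise recursion for NDP(w,n), with 0 < beta_d < beta_u < 1."""
    prev = list(dp[0])  # assumption: start from the first frame
    ndp = []
    for frame in dp:
        cur = []
        for w, d in enumerate(frame):
            b = beta_u if d >= prev[w] else beta_d
            cur.append((1 - b) * d + b * prev[w])
        ndp.append(cur)
        prev = cur
    return ndp


def binarize_dp(dp, ndp, t_db=2.0):
    """Assumed rule: 1 where DP(w,n) > t_db * NDP(w,n), else 0."""
    return [[1 if d > t_db * nz else 0 for d, nz in zip(fd, fn)]
            for fd, fn in zip(dp, ndp)]


# Usage on two identical frames with three frequency bins.
mags = [[1.0, 2.0, 2.0], [1.0, 2.0, 2.0]]
power = smooth_power(mags, alpha_p=0.5)
dp = spectral_difference(power)
ndp = track_ndp(dp, beta_u=0.9, beta_d=0.5)
xdb = binarize_dp(dp, ndp, t_db=2.0)
```

The resulting correlation binary spectra can then be compared pairwise in the same way as the binary spectra of the first matching module.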
8. The device according to any one of claims 5 to 7, characterized in that the device further comprises:

a signal preprocessing module, configured to, for each voice signal, preprocess the voice signal to obtain a preprocessed voice signal, the preprocessing including at least one of noise reduction, amplification, high-pass filtering, and up-sampling or down-sampling;

a Fourier transform module, configured to perform a short-time Fourier transform on the preprocessed voice signal.
9. A computer-readable storage medium, characterized in that a program is stored in the computer-readable storage medium, the program being used to implement the method according to any one of claims 1 to 4.