CN104700842A

CN104700842A - Sound signal time delay estimation method and device

Info

Publication number: CN104700842A
Application number: CN201510083890.1A
Authority: CN
Inventors: 陈超
Original assignee: Guangzhou Baiguoyuan Network Technology Co Ltd
Current assignee: Guangzhou Baiguoyuan Information Technology Co Ltd
Priority date: 2015-02-13
Filing date: 2015-02-13
Publication date: 2015-06-10
Anticipated expiration: 2035-02-13
Also published as: CN104700842B

Abstract

The invention discloses a sound signal time delay estimation method and device and belongs to the technical field of audio processing. The sound signal time delay estimation method comprises the steps of obtaining two sound signals, conducting coherence matching on the two sound signals through short-time Fourier transformation of the two sound signals to obtain a first matching result, wherein the first matching result includes first matching positions and first matching degrees of the two sound signals; conducting coherence matching on the two sound signals according to spectral correlation of power spectrums of the two sound signals to obtain a second matching result, wherein the second matching result includes second matching positions and second matching degrees of the two sound signals; calculating the time delay of the two sound signals according to the first matching result and the second matching result. The problem that a relevant time delay estimation method is low in accuracy is solved. The sound signals are matched from two aspects of frequency domain distribution and power spectrums, the two matching results are integrated to determine a final matching result, and matching accuracy and time delay estimation accuracy are improved.

Description

Sound signal time delay estimation method and device

Technical field

The present invention relates to the field of audio signal processing, and in particular to a method and device for estimating the time delay of sound signals.

Background technology

Time delay estimation algorithms for sound signals are widely used in many fields, such as sound matching, codec alignment, and acoustic ranging.

The prior art provides several time delay estimation methods. One widely used method is based on correlation analysis. Its basic idea is to estimate the time delay between two sound signals from their degree of similarity in the frequency domain.

In the course of making the present invention, the inventor found that the above technique has at least the following problem: because the correlation-based method considers only the similarity of the two sound signals in the frequency domain, the matching precision of the two signals is low, and the accuracy of the resulting delay estimate is therefore low.

Summary of the invention

To solve the problem of low accuracy in the time delay estimation methods of the related art, embodiments of the present invention provide a method and device for estimating the time delay of sound signals. The technical solution is as follows:

In a first aspect, a method for estimating the time delay of sound signals is provided, the method comprising:

obtaining two sound signals;

performing coherence matching on the two sound signals according to their short-time Fourier transforms to obtain a first matching result, the first matching result comprising a first matching position and a first matching degree of the two sound signals;

performing coherence matching on the two sound signals according to the spectral correlation of their power spectra to obtain a second matching result, the second matching result comprising a second matching position and a second matching degree of the two sound signals;

calculating the time delay between the two sound signals according to the first matching result and the second matching result.

Optionally, calculating the time delay between the two sound signals according to the first matching result and the second matching result comprises:

for each sound signal, computing a final matching position as a weighted average of the first matching position and the second matching position, the weights of the weighted average being determined according to the first matching degree and the second matching degree;

calculating the time delay between the two sound signals according to the final matching positions of the two sound signals.
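The weighted-average step above can be sketched as follows. This is only an illustration under stated assumptions: the text says the weights are determined from the two matching degrees but does not give the formula at this point, so the sketch simply normalizes the two degrees. The function names, the frame hop, and the sampling rate used to turn a frame offset into seconds are all hypothetical.

```python
def fuse_position(pos1, degree1, pos2, degree2):
    """Weighted average of two matching positions (frame indices).

    Assumption: each weight is the matching degree divided by the sum
    of both degrees; the source only says the weights depend on them.
    """
    total = degree1 + degree2
    if total <= 0:
        raise ValueError("at least one matching degree must be positive")
    return (degree1 * pos1 + degree2 * pos2) / total


def estimate_delay(final_pos_a, final_pos_b, hop, sample_rate):
    """Convert the frame offset between the two signals into seconds.

    Assumptions: positions are frame indices, `hop` is the frame advance
    in samples, and a positive result means signal B lags signal A.
    """
    return (final_pos_b - final_pos_a) * hop / sample_rate


# Signal A: first match at frame 10 (degree 0.75), second at frame 14 (degree 0.25).
final_a = fuse_position(10, 0.75, 14, 0.25)
# Signal B: both matches agree on frame 15.
final_b = fuse_position(15, 0.75, 15, 0.25)
delay_seconds = estimate_delay(final_a, final_b, hop=128, sample_rate=16000)
```

With normalized weights, a match with a higher matching degree pulls the final position toward itself, which is one plausible reading of "determined according to the matching degrees".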

Optionally, performing coherence matching on the two sound signals according to their short-time Fourier transforms to obtain the first matching result comprises:

for each sound signal, performing noise tracking on each frame of the sound signal according to the following formula to obtain the noise spectrum N(w, n) of each frame:

N(w, n) = \begin{cases} (1 - \alpha_{u}) |X(w, n)| + \alpha_{u} N(w, n - 1), & |X(w, n)| \geq N(w, n - 1) \\ (1 - \alpha_{d}) |X(w, n)| + \alpha_{d} N(w, n - 1), & |X(w, n)| < N(w, n - 1) \end{cases};

where X(w, n) denotes the short-time Fourier transform of the sound signal; α_u and α_d are preset coefficients with 0 < α_d < α_u < 1; w denotes the frequency bin index in the frequency domain; and n denotes the frame index in the time domain;

binarizing the short-time Fourier transform of each frame according to the following formula to obtain the binary spectrum Xb(w, n):

Xb(w, n) = \begin{cases} 1, & |X(w, n)| - N(w, n) > T_{b} \\ 0, & |X(w, n)| - N(w, n) \leq T_{b} \end{cases},

where T_b is a preset first threshold;

performing pairwise coherence matching between the K_a binary spectra corresponding to one sound signal and the K_b binary spectra corresponding to the other sound signal to obtain the first matching result, the first matching result comprising the matching positions and the matching degree of the pair of binary spectra with the highest matching degree, K_a and K_b being positive integers.

Optionally, performing coherence matching on the two sound signals according to the spectral correlation of their power spectra to obtain the second matching result comprises:

for each sound signal, calculating the power spectrum P(w, n) of each frame of the sound signal according to the following formula:

P(w, n) = \alpha_{p} P(w, n - 1) + (1 - \alpha_{p}) |X(w, n)|^{2};

where X(w, n) denotes the short-time Fourier transform of the sound signal; α_p is a preset coefficient with 0 < α_p < 1; w denotes the frequency bin index in the frequency domain; and n denotes the frame index in the time domain;

calculating the spectral correlation DP(w, n) of the power spectrum of each frame according to the following formula:

DP(w, n) = |P(w + 1, n) - P(w, n)|;

performing noise tracking on the spectral correlation DP(w, n) according to the following formula to obtain the spectral correlation NDP(w, n) of the noise power spectrum of each frame:

NDP(w, n) = \begin{cases} (1 - \beta_{u}) DP(w, n) + \beta_{u} NDP(w, n - 1), & DP(w, n) \geq NDP(w, n - 1) \\ (1 - \beta_{d}) DP(w, n) + \beta_{d} NDP(w, n - 1), & DP(w, n) < NDP(w, n - 1) \end{cases};

where β_u and β_d are preset coefficients with 0 < β_d < β_u < 1;

binarizing the spectral correlation DP(w, n) of each frame according to the following formula to obtain the correlation binary spectrum XDb(w, n):

XDb(w, n) = \begin{cases} 1, & DP(w, n) - NDP(w, n) > T_{Db} \\ 0, & DP(w, n) - NDP(w, n) \leq T_{Db} \end{cases},

where T_Db is a preset second threshold;

performing pairwise coherence matching between the KD_a correlation binary spectra corresponding to one sound signal and the KD_b correlation binary spectra corresponding to the other sound signal to obtain the second matching result, the second matching result comprising the matching positions and the matching degree of the pair of correlation binary spectra with the highest matching degree, KD_a and KD_b being positive integers.

Optionally, before performing coherence matching on the two sound signals according to their short-time Fourier transforms to obtain the first matching result, the method further comprises:

for each sound signal, preprocessing the sound signal to obtain a preprocessed sound signal, the preprocessing comprising at least one of noise reduction, amplification, high-pass filtering, and up/down-sampling;

performing a short-time Fourier transform on the preprocessed sound signal.

In a second aspect, a device for estimating the time delay of sound signals is provided, the device comprising:

a signal acquisition module, configured to obtain two sound signals;

a first matching module, configured to perform coherence matching on the two sound signals according to their short-time Fourier transforms to obtain a first matching result, the first matching result comprising a first matching position and a first matching degree of the two sound signals;

a second matching module, configured to perform coherence matching on the two sound signals according to the spectral correlation of their power spectra to obtain a second matching result, the second matching result comprising a second matching position and a second matching degree of the two sound signals;

a delay calculation module, configured to calculate the time delay between the two sound signals according to the first matching result and the second matching result.

Optionally, the delay calculation module comprises a position calculation unit and a delay calculation unit;

the position calculation unit is configured to, for each sound signal, compute a final matching position as a weighted average of the first matching position and the second matching position, the weights of the weighted average being determined according to the first matching degree and the second matching degree;

the delay calculation unit is configured to calculate the time delay between the two sound signals according to the final matching positions of the two sound signals.

Optionally, the first matching module comprises a first tracking unit, a first binarization unit, and a first matching unit;

the first tracking unit is configured to, for each sound signal, perform noise tracking on each frame of the sound signal according to the following formula to obtain the noise spectrum N(w, n) of each frame:

N(w, n) = \begin{cases} (1 - \alpha_{u}) |X(w, n)| + \alpha_{u} N(w, n - 1), & |X(w, n)| \geq N(w, n - 1) \\ (1 - \alpha_{d}) |X(w, n)| + \alpha_{d} N(w, n - 1), & |X(w, n)| < N(w, n - 1) \end{cases};

the first binarization unit is configured to binarize the short-time Fourier transform of each frame according to the following formula to obtain the binary spectrum Xb(w, n):

Xb(w, n) = \begin{cases} 1, & |X(w, n)| - N(w, n) > T_{b} \\ 0, & |X(w, n)| - N(w, n) \leq T_{b} \end{cases},

where T_b is a preset first threshold;

the first matching unit is configured to perform pairwise coherence matching between the K_a binary spectra corresponding to one sound signal and the K_b binary spectra corresponding to the other sound signal to obtain the first matching result, the first matching result comprising the matching positions and the matching degree of the pair of binary spectra with the highest matching degree, K_a and K_b being positive integers.

Optionally, the second matching module comprises a power spectrum calculation unit, a correlation calculation unit, a second tracking unit, a second binarization unit, and a second matching unit;

the power spectrum calculation unit is configured to, for each sound signal, calculate the power spectrum P(w, n) of each frame of the sound signal according to the following formula:

P(w, n) = \alpha_{p} P(w, n - 1) + (1 - \alpha_{p}) |X(w, n)|^{2};

the correlation calculation unit is configured to calculate the spectral correlation DP(w, n) of the power spectrum of each frame according to the following formula:

DP(w, n) = |P(w + 1, n) - P(w, n)|;

the second tracking unit is configured to perform noise tracking on the spectral correlation DP(w, n) according to the following formula to obtain the spectral correlation NDP(w, n) of the noise power spectrum of each frame:

NDP(w, n) = \begin{cases} (1 - \beta_{u}) DP(w, n) + \beta_{u} NDP(w, n - 1), & DP(w, n) \geq NDP(w, n - 1) \\ (1 - \beta_{d}) DP(w, n) + \beta_{d} NDP(w, n - 1), & DP(w, n) < NDP(w, n - 1) \end{cases};

where β_u and β_d are preset coefficients with 0 < β_d < β_u < 1;

the second binarization unit is configured to binarize the spectral correlation DP(w, n) of each frame according to the following formula to obtain the correlation binary spectrum XDb(w, n):

XDb(w, n) = \begin{cases} 1, & DP(w, n) - NDP(w, n) > T_{Db} \\ 0, & DP(w, n) - NDP(w, n) \leq T_{Db} \end{cases},

where T_Db is a preset second threshold;

the second matching unit is configured to perform pairwise coherence matching between the KD_a correlation binary spectra corresponding to one sound signal and the KD_b correlation binary spectra corresponding to the other sound signal to obtain the second matching result, the second matching result comprising the matching positions and the matching degree of the pair of correlation binary spectra with the highest matching degree, KD_a and KD_b being positive integers.

Optionally, the device further comprises:

a signal preprocessing module, configured to, for each sound signal, preprocess the sound signal to obtain a preprocessed sound signal, the preprocessing comprising at least one of noise reduction, amplification, high-pass filtering, and up/down-sampling;

a Fourier transform module, configured to perform a short-time Fourier transform on the preprocessed sound signal.

The technical solution provided by the embodiments of the present invention has the following beneficial effects:

The first matching result is obtained by analyzing and matching the short-time Fourier transforms of the two sound signals, the second matching result is obtained by analyzing and matching the spectral correlation of their power spectra, and the time delay between the two sound signals is then calculated from both results. This solves the problem of low accuracy in the time delay estimation methods of the related art. Because the two sound signals are matched from two angles, namely frequency-domain distribution and the spectral correlation of the power spectrum, and the two matching results are combined to determine the final matching result, both matching precision and delay estimation accuracy are improved.

Brief description of the drawings

To illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.

Fig. 1 is a flowchart of the sound signal time delay estimation method provided by one embodiment of the present invention;

Fig. 2 is a flowchart of the sound signal time delay estimation method provided by another embodiment of the present invention;

Fig. 3 is a block diagram of the sound signal time delay estimation device provided by one embodiment of the present invention;

Fig. 4 is a block diagram of the sound signal time delay estimation device provided by another embodiment of the present invention;

Fig. 5 is a schematic structural diagram of the electronic device provided by one embodiment of the present invention.

Detailed description

To make the objectives, technical solutions, and advantages of the present invention clearer, embodiments of the present invention are described in further detail below with reference to the drawings.

Refer to Fig. 1, which shows a flowchart of the sound signal time delay estimation method provided by one embodiment of the present invention. This embodiment is described using the example of the method applied in an electronic device such as a mobile phone, tablet computer, laptop, or desktop computer. The method may comprise the following steps:

Step 102: obtain two sound signals.

Step 104: perform coherence matching on the two sound signals according to their short-time Fourier transforms to obtain a first matching result, which comprises a first matching position and a first matching degree of the two sound signals.

Step 106: perform coherence matching on the two sound signals according to the spectral correlation of their power spectra to obtain a second matching result, which comprises a second matching position and a second matching degree of the two sound signals.

Step 108: calculate the time delay between the two sound signals according to the first matching result and the second matching result.

Note that step 106 may be performed after step 104, before step 104, or at the same time as step 104. This embodiment merely takes the case where step 106 follows step 104 as an example and imposes no specific limitation on the order.

In summary, in the sound signal time delay estimation method provided by this embodiment, the first matching result is obtained by analyzing and matching the short-time Fourier transforms of the two sound signals, the second matching result is obtained by analyzing and matching the spectral correlation of their power spectra, and the time delay between the two sound signals is then calculated from both results. This solves the problem of low accuracy in the time delay estimation methods of the related art. Because the two sound signals are matched from two angles, namely frequency-domain distribution and the spectral correlation of the power spectrum, and the two matching results are combined to determine the final matching result, both matching precision and delay estimation accuracy are improved.

Refer to Fig. 2, which shows a flowchart of the sound signal time delay estimation method provided by another embodiment of the present invention. This embodiment is described using the example of the method applied in an electronic device such as a mobile phone, tablet computer, laptop, or desktop computer. The method may comprise the following steps:

Step 201: obtain two sound signals.

Each of the two sound signals is a discrete signal in the time domain. In this embodiment, one sound signal A is denoted x_ra(n) and the other sound signal B is denoted x_rb(n), where n denotes the sequence index in the time domain, n ∈ [0, M-1], M ≥ 2, and n and M are integers.

Step 202: for each sound signal, preprocess the sound signal to obtain a preprocessed sound signal.

The preprocessing includes, but is not limited to, at least one of noise reduction, amplification, high-pass filtering, and up/down-sampling. The purpose of preprocessing is to allow more accurate and reliable sound features to be extracted in subsequent steps, which improves matching precision.

In this embodiment, the preprocessing applied to each sound signal is denoted F(*). The preprocessed result of sound signal A is then x_a(n) = F(x_ra(n)), and that of sound signal B is x_b(n) = F(x_rb(n)).

Note that this embodiment uses the preprocessing modes listed above only as examples. In practice, other preprocessing modes may be adopted as required, and this embodiment imposes no specific limitation on them.

Step 203: perform a short-time Fourier transform on the preprocessed sound signal.

The STFT (short-time Fourier transform) process applies a predetermined window function to the preprocessed sound signal and then performs an FFT (fast Fourier transform) on each windowed frame, thereby transforming the sound signal from the time domain to the frequency domain. The window function may be, but is not limited to, a Hamming window or a Kaiser window. The window length is a power of 2, such as 128 or 256.

In this embodiment, the STFT function is denoted STFT(*). The STFT of sound signal A is then written X_a(w, n) = STFT(x_a(n)) and the STFT of sound signal B is written X_b(w, n) = STFT(x_b(n)), where w denotes the frequency bin index in the frequency domain, w ∈ [0, N-1], N ≥ 2, and w and N are integers.
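A windowed transform of this kind can be sketched in a few lines. This is a minimal, dependency-free illustration and not the embodiment's implementation: it uses a naive DFT in place of an FFT, a Hamming window, and a hop size of half a frame, and the names `hamming`, `dft`, `stft`, `frame_len`, and `hop` are assumptions (the text does not state a frame overlap).

```python
import cmath
import math


def hamming(length):
    """Hamming window of the given length."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * k / (length - 1))
            for k in range(length)]


def dft(frame):
    """Naive O(N^2) DFT, standing in for an FFT to keep the sketch short."""
    size = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * math.pi * w * t / size)
                for t in range(size))
            for w in range(size)]


def stft(signal, frame_len=8, hop=4):
    """Return X[n][w]: one list of complex frequency bins per frame n."""
    window = hamming(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        chunk = [signal[start + k] * window[k] for k in range(frame_len)]
        frames.append(dft(chunk))
    return frames


# A cosine that completes two cycles per 8-sample frame lands in bin 2.
signal = [math.cos(math.pi * t / 2) for t in range(16)]
spectrogram = stft(signal, frame_len=8, hop=4)
```

A real implementation would use an FFT routine and the window lengths mentioned in the text (128, 256, and so on); the tiny frame length here only keeps the example easy to check by hand.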

Step 204: perform coherence matching on the two sound signals according to their short-time Fourier transforms to obtain a first matching result, which comprises a first matching position and a first matching degree of the two sound signals.

In the first matching mode, the two sound signals are matched from the angle of their frequency-domain distribution. Specifically, this step may comprise the following sub-steps:

First, for each sound signal, noise tracking is performed on each frame of the sound signal according to the following formula to obtain the noise spectrum N(w, n) of each frame:

N(w, n) = \begin{cases} (1 - \alpha_{u}) |X(w, n)| + \alpha_{u} N(w, n - 1), & |X(w, n)| \geq N(w, n - 1) \\ (1 - \alpha_{d}) |X(w, n)| + \alpha_{d} N(w, n - 1), & |X(w, n)| < N(w, n - 1) \end{cases};

where X(w, n) denotes the short-time Fourier transform of the sound signal; α_u and α_d are preset coefficients with 0 < α_d < α_u < 1; w denotes the frequency bin index in the frequency domain; and n denotes the frame index in the time domain.

The purpose of noise tracking is to reduce, as far as possible, the influence of noise in the sound signal on the matching result and thereby improve matching precision. In addition, noise tracking can be improved by choosing the preset coefficients α_u and α_d sensibly. Since 0 < α_d < α_u < 1, the weight (1 - α_u) given to the current frame while the signal is rising is smaller than the weight (1 - α_d) given to it while the signal is falling, so the noise estimate rises slowly and falls quickly.

In this embodiment, let the noise spectrum of the i-th frame of sound signal A be N_a(w, i). Then:

N_{a}(w, i) = \begin{cases} (1 - \alpha_{u}) |X_{a}(w, i)| + \alpha_{u} N_{a}(w, i - 1), & |X_{a}(w, i)| \geq N_{a}(w, i - 1) \\ (1 - \alpha_{d}) |X_{a}(w, i)| + \alpha_{d} N_{a}(w, i - 1), & |X_{a}(w, i)| < N_{a}(w, i - 1) \end{cases};

where X_a(w, i) denotes the short-time Fourier transform of the i-th frame of sound signal A, i ≥ 0, and i is an integer.

Similarly, let the noise spectrum of the j-th frame of sound signal B be N_b(w, j). Then:

N_{b}(w, j) = \begin{cases} (1 - \alpha_{u}) |X_{b}(w, j)| + \alpha_{u} N_{b}(w, j - 1), & |X_{b}(w, j)| \geq N_{b}(w, j - 1) \\ (1 - \alpha_{d}) |X_{b}(w, j)| + \alpha_{d} N_{b}(w, j - 1), & |X_{b}(w, j)| < N_{b}(w, j - 1) \end{cases};

where X_b(w, j) denotes the short-time Fourier transform of the j-th frame of sound signal B, j ≥ 0, and j is an integer.
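The noise-tracking recursion can be sketched as below. The formula is taken from the text; the function name `track_noise`, the default coefficient values, and the starting value N(w, -1) = 0 are assumptions, since the text does not say how the recursion is initialized.

```python
def track_noise(magnitudes, alpha_u=0.9, alpha_d=0.5):
    """Asymmetric recursive noise tracking.

    magnitudes[n][w] holds |X(w, n)|; the result holds N(w, n) in the
    same layout. Assumption: the recursion starts from N(w, -1) = 0.
    """
    if not magnitudes:
        return []
    previous = [0.0] * len(magnitudes[0])
    noise = []
    for frame in magnitudes:
        current = []
        for mag, prev in zip(frame, previous):
            # Rising magnitude uses alpha_u, falling magnitude uses alpha_d.
            alpha = alpha_u if mag >= prev else alpha_d
            current.append((1 - alpha) * mag + alpha * prev)
        noise.append(current)
        previous = current
    return noise


# One frequency bin over three frames, with easy-to-check coefficients.
noise_spectrum = track_noise([[1.0], [1.0], [0.0]], alpha_u=0.5, alpha_d=0.25)
```

In the usage line the estimate climbs 0.5, then 0.75 while the magnitude stays at 1.0, and drops to 0.1875 once the magnitude falls to 0, which shows the slower rise and faster fall implied by α_d < α_u.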

Second, the short-time Fourier transform of each frame is binarized according to the following formula to obtain the binary spectrum Xb(w, n):

Xb(w, n) = \begin{cases} 1, & |X(w, n)| - N(w, n) > T_{b} \\ 0, & |X(w, n)| - N(w, n) \leq T_{b} \end{cases},

where T_b is a preset first threshold.

After the spectrum of the sound signal has been binarized, the short-time Fourier transform of every frame is converted into a binary sequence of 0s and 1s, and all these sequences have the same length. Binarization greatly increases data processing speed and allows the subsequent matching calculation to be more efficient and robust.

In this embodiment, let the binary spectrum corresponding to the i-th frame of sound signal A be Xb_a(w, i). Then:

Xb_{a}(w, i) = \begin{cases} 1, & |X_{a}(w, i)| - N_{a}(w, i) > T_{b} \\ 0, & |X_{a}(w, i)| - N_{a}(w, i) \leq T_{b} \end{cases};

Similarly, let the binary spectrum corresponding to the j-th frame of sound signal B be Xb_b(w, j). Then:

Xb_{b}(w, j) = \begin{cases} 1, & |X_{b}(w, j)| - N_{b}(w, j) > T_{b} \\ 0, & |X_{b}(w, j)| - N_{b}(w, j) \leq T_{b} \end{cases}.
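The binarization rule can be sketched for a single frame as follows. The function name `binarize_frame` and the sample values are hypothetical; only the comparison against the threshold T_b comes from the text.

```python
def binarize_frame(magnitudes, noise, threshold):
    """Return the binary spectrum of one frame.

    magnitudes[w] is |X(w, n)|, noise[w] is N(w, n), and threshold is T_b.
    A bin becomes 1 only when it exceeds the noise estimate by more than T_b.
    """
    if len(magnitudes) != len(noise):
        raise ValueError("magnitudes and noise must have the same length")
    return [1 if mag - floor > threshold else 0
            for mag, floor in zip(magnitudes, noise)]


binary_spectrum = binarize_frame([1.0, 0.3, 0.9], [0.2, 0.2, 0.5], threshold=0.5)
```

Only the first bin stands more than 0.5 above its noise estimate, so the frame reduces to a short sequence of bits that is cheap to compare.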

Third, pairwise coherence matching is performed between the K_a binary spectra corresponding to one sound signal and the K_b binary spectra corresponding to the other sound signal to obtain the first matching result, where K_a and K_b are positive integers.

The first matching result comprises the matching positions and the matching degree of the pair of binary spectra with the highest matching degree. When sound signal A has K_a binary spectra and sound signal B has K_b binary spectra, K_a × K_b binary-spectrum matches are performed. For each match, the matching degree between the two binary spectra is recorded, together with the sequence numbers n_a(i) and n_b(j) of the two spectra being matched. Here n_a(i) denotes the sequence number of the binary spectrum corresponding to the i-th frame of sound signal A, with n_a(i) ∈ [0, K_a-1], K_a ≥ 2, and n_a(i) and K_a integers; n_b(j) denotes the sequence number of the binary spectrum corresponding to the j-th frame of sound signal B, with n_b(j) ∈ [0, K_b-1], K_b ≥ 2, and n_b(j) and K_b integers.

For example, if the binary spectrum Xb_a(w, i) corresponding to the i-th frame of sound signal A is matched against the binary spectrum Xb_b(w, j) corresponding to the j-th frame of sound signal B, their matching degree Pb_ij is:

Pb_{ij} = \frac{1}{N} \sum_{w=0}^{N-1} Xb_{a}(w, i) \odot Xb_{b}(w, j);

where ⊙ denotes the XNOR operation and N is the number of frequency bins. The matching degree Pb_ij equals the ratio of the number of frequency bins at which the two binary spectra have equal binarization results to the total number of frequency bins.

In a concrete example, suppose Xb_a(w, i) = {1, 1, 0, 0, 0, 1, 1, 1} and Xb_b(w, j) = {0, 1, 1, 1, 0, 1, 1, 1}. When w = 0, Xb_a(0, i) = 1 and Xb_b(0, j) = 0, so Xb_a(0, i) ⊙ Xb_b(0, j) = 0; when w = 1, Xb_a(1, i) = 1 and Xb_b(1, j) = 1, so Xb_a(1, i) ⊙ Xb_b(1, j) = 1; and so on. The two sequences agree at 5 of the 8 positions, so their matching degree is Pb_ij = 5/8 = 0.625.
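The matching degree of two binary spectra can be checked with a short sketch that reproduces the example above. The function name `matching_degree` is hypothetical; the computation is the XNOR-and-count rule stated in the text.

```python
def matching_degree(spec_a, spec_b):
    """Fraction of frequency bins where two binary spectra agree (XNOR)."""
    if len(spec_a) != len(spec_b) or not spec_a:
        raise ValueError("spectra must be non-empty and of equal length")
    agreeing = sum(1 for a, b in zip(spec_a, spec_b) if a == b)
    return agreeing / len(spec_a)


xb_a = [1, 1, 0, 0, 0, 1, 1, 1]
xb_b = [0, 1, 1, 1, 0, 1, 1, 1]
degree = matching_degree(xb_a, xb_b)  # 5 agreeing bins out of 8 -> 0.625
```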

After the matching degree between every pair of binary spectra has been calculated, the pair with the highest matching degree is selected, and the sequence numbers and matching degree of that pair are recorded.
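The exhaustive K_a × K_b search can be sketched as below. The function names are hypothetical, and keeping the first pair found when several pairs tie for the highest degree is an assumption; the text does not say how ties are resolved.

```python
def matching_degree(spec_a, spec_b):
    """Fraction of frequency bins where two binary spectra agree (XNOR)."""
    return sum(1 for a, b in zip(spec_a, spec_b) if a == b) / len(spec_a)


def best_match(spectra_a, spectra_b):
    """Compare every spectrum of A with every spectrum of B.

    Returns (degree, index_a, index_b) for the best pair. Assumption:
    on a tie, the pair encountered first is kept.
    """
    best = (-1.0, None, None)
    for index_a, spec_a in enumerate(spectra_a):
        for index_b, spec_b in enumerate(spectra_b):
            degree = matching_degree(spec_a, spec_b)
            if degree > best[0]:
                best = (degree, index_a, index_b)
    return best


spectra_a = [[1, 0, 1, 0], [1, 1, 0, 0]]
spectra_b = [[0, 1, 0, 1], [1, 1, 0, 1], [1, 1, 0, 0]]
result = best_match(spectra_a, spectra_b)
```

The search costs K_a × K_b comparisons, each linear in the number of frequency bins, which is why reducing each frame to bits beforehand matters for speed.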

In this embodiment, suppose that the n_a(i)-th binary spectrum, corresponding to the i-th frame of sound signal A, and the n_b(j)-th binary spectrum, corresponding to the j-th frame of sound signal B, have the highest matching degree, denoted P₁. For example, when i = 1 and j = 2,

Step 205: perform coherence matching on the two sound signals according to the spectral correlation of their power spectra to obtain a second matching result, which comprises a second matching position and a second matching degree of the two sound signals.

In the second matching mode, the two sound signals are matched from the angle of the spectral correlation of their power spectra. Specifically, this step may comprise the following sub-steps:

First, for each sound signal, the power spectrum P(w, n) of each frame of the sound signal is calculated according to the following formula:

P(w, n) = \alpha_{p} P(w, n - 1) + (1 - \alpha_{p}) |X(w, n)|^{2};

where X(w, n) denotes the short-time Fourier transform of the sound signal; α_p is a preset coefficient with 0 < α_p < 1; w denotes the frequency bin index in the frequency domain; and n denotes the frame index in the time domain.

In this embodiment, let the power spectrum of the i-th frame of sound signal A be P_a(w, i). Then:

P_{a}(w, i) = \alpha_{p} P_{a}(w, i - 1) + (1 - \alpha_{p}) |X_{a}(w, i)|^{2};

Similarly, let the power spectrum of the j-th frame of sound signal B be P_b(w, j). Then:

P_{b}(w, j) = \alpha_{p} P_{b}(w, j - 1) + (1 - \alpha_{p}) |X_{b}(w, j)|^{2}.
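The recursively smoothed power spectrum can be sketched as follows. The recursion is the one given in the text; the function name `smooth_power`, the default α_p, and the starting value P(w, -1) = 0 are assumptions.

```python
def smooth_power(stft_frames, alpha_p=0.8):
    """Recursively smoothed power spectrum.

    stft_frames[n][w] holds the complex value X(w, n); the result holds
    P(w, n) in the same layout. Assumption: P(w, -1) = 0.
    """
    if not stft_frames:
        return []
    previous = [0.0] * len(stft_frames[0])
    power = []
    for frame in stft_frames:
        current = [alpha_p * prev + (1 - alpha_p) * abs(value) ** 2
                   for value, prev in zip(frame, previous)]
        power.append(current)
        previous = current
    return power


# Two bins over two frames; |3 + 4j|^2 = 25 and |2|^2 = 4.
power_spectrum = smooth_power([[3 + 4j, 2.0], [0j, 0.0]], alpha_p=0.5)
```

A value of α_p closer to 1 gives a smoother but slower-reacting power estimate.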

Second, the spectral correlation DP(w, n) of the power spectrum of each frame is calculated according to the following formula:

DP(w, n) = |P(w + 1, n) - P(w, n)|.

That is, after the power spectrum of the sound signal has been calculated, the energy at each frequency bin is subtracted from the energy at the next higher frequency bin, and the absolute value of the difference is taken as the spectral correlation of the power spectrum.

In this embodiment, let the spectral correlation of the power spectrum of the i-th frame of sound signal A be DP_a(w, i). Then:

DP_{a}(w, i) = |P_{a}(w + 1, i) - P_{a}(w, i)|;

Similarly, let the spectral correlation of the power spectrum of the j-th frame of sound signal B be DP_b(w, j). Then:

DP_{b}(w, j) = |P_{b}(w + 1, j) - P_{b}(w, j)|.
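The adjacent-bin difference can be sketched for one frame as below. The function name `spectral_difference` is hypothetical, and returning one value fewer than the number of bins is an assumption: the formula needs P(w + 1, n), and the text does not say how the last bin is handled.

```python
def spectral_difference(power_frame):
    """DP(w, n) = |P(w + 1, n) - P(w, n)| for one frame.

    Assumption: the last bin has no upper neighbour, so the result has
    one entry fewer than the input.
    """
    return [abs(power_frame[w + 1] - power_frame[w])
            for w in range(len(power_frame) - 1)]


dp_frame = spectral_difference([1.0, 4.0, 2.0, 2.0])
```

Flat regions of the power spectrum give values near zero, while spectral edges and peaks give large values, so the resulting sequence describes the shape of the spectrum rather than its absolute level.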

Third, noise tracking is performed on the spectral correlation DP(w, n) according to the following formula to obtain the spectral correlation NDP(w, n) of the noise power spectrum of each frame:

NDP(w, n) = \begin{cases} (1 - \beta_{u}) DP(w, n) + \beta_{u} NDP(w, n - 1), & DP(w, n) \geq NDP(w, n - 1) \\ (1 - \beta_{d}) DP(w, n) + \beta_{d} NDP(w, n - 1), & DP(w, n) < NDP(w, n - 1) \end{cases};

where β_u and β_d are preset coefficients with 0 < β_d < β_u < 1.

The object of Spectral correlation being carried out to noise tracking is to prevent the noise fluctuated from bringing erroneous judgement to subsequent match, reduces noise as much as possible to the impact of matching result, to improve matching precision.In addition, by reasonable set predetermined coefficient β _u, β _dsize, noise tracking effect can be improved.

In the present embodiment, the Spectral correlation supposing the noise power spectrum of the i-th frame voice signal in a wherein road voice signal A is NDP _a(w, i), then:

ND P_{a} (w, i) = \{\begin{matrix} (1 - β_{u}) D P_{a} (w, i) + β_{u} ND P_{a} (w, i - 1), & D P_{a} (w, i) &GreaterEqual; ND P_{a} (w, i - 1) \\ (1 - β_{d}) D P_{a} (w, i) + β_{d} ND P_{a} (w, i - 1), & D P_{a} (w, i) < ND P_{a} (w, i - 1) \end{matrix};

Similar, suppose that the Spectral correlation of the noise power spectrum of the jth frame voice signal in the voice signal B of another road is NDP _b(w, j), then:

ND P_{b} (w, j) = \{\begin{matrix} (1 - β_{u}) D P_{b} (w, j) + β_{u} ND P_{b} (w, j - 1), & D P_{b} (w, j) &GreaterEqual; ND P_{b} (w, j - 1) \\ (1 - β_{d}) D P_{b} (w, j) + β_{d} ND P_{b} (w, j - 1), & D P_{b} (w, j) < ND P_{b} (w, j - 1) \end{matrix} .
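The piecewise recursion above might be implemented along the following lines. This is a sketch under stated assumptions: the function name `track_noise`, the default coefficient values, and the initial condition NDP(w, -1) = 0 are not given in the text and are chosen here only for illustration.

```python
def track_noise(dp_frames, beta_u=0.95, beta_d=0.6):
    """Return NDP(w, n) for a sequence of DP(w, n) frames.

    beta_u, beta_d: preset coefficients with 0 < beta_d < beta_u < 1.
    The defaults are illustrative assumptions, not values from the source.
    """
    if not 0.0 < beta_d < beta_u < 1.0:
        raise ValueError("require 0 < beta_d < beta_u < 1")
    tracked = []
    previous = None
    for frame in dp_frames:
        if previous is None:
            # Assumption: NDP(w, -1) = 0; the source gives no initial value.
            previous = [0.0] * len(frame)
        current = []
        for dp, ndp_prev in zip(frame, previous):
            # Upper branch when DP >= previous estimate, lower branch otherwise.
            beta = beta_u if dp >= ndp_prev else beta_d
            current.append((1.0 - beta) * dp + beta * ndp_prev)
        tracked.append(current)
        previous = current
    return tracked
```

Because β_u > β_d, the estimate rises slowly when DP exceeds it and falls more quickly when DP drops below it, which follows directly from the weights (1 - β_u) and (1 - β_d) applied to the new value.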

Fourth, binarize the spectral correlation DP(w, n) of each frame of the sound signal according to the following formula to obtain a correlation binary spectrum XDb(w, n):

XDb(w, n) = \begin{cases} 1, & DP(w, n) - NDP(w, n) > T_{Db} \\ 0, & DP(w, n) - NDP(w, n) \leq T_{Db} \end{cases},

where T_{Db} is a preset second threshold.

After binarization, the spectral correlation of each frame of the sound signal is converted into a binary sequence of equal length consisting of 0s and 1s. Binarization substantially increases data processing speed, so that the subsequent matching calculation is more efficient and robust.

In the present embodiment, suppose the correlation binary spectrum corresponding to the i-th frame of one sound signal A is XDb_a(w, i); then:

XDb_a(w, i) = \begin{cases} 1, & DP_a(w, i) - NDP_a(w, i) > T_{Db} \\ 0, & DP_a(w, i) - NDP_a(w, i) \leq T_{Db} \end{cases};

Similarly, suppose the correlation binary spectrum corresponding to the j-th frame of the other sound signal B is XDb_b(w, j); then:

XDb_b(w, j) = \begin{cases} 1, & DP_b(w, j) - NDP_b(w, j) > T_{Db} \\ 0, & DP_b(w, j) - NDP_b(w, j) \leq T_{Db} \end{cases}.
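The thresholding rule is simple enough to show directly. In the sketch below the function name `binarize` and the argument names are hypothetical; the comparison itself follows the formula above, with a bin set to 1 only when the difference strictly exceeds the threshold.

```python
def binarize(dp_frame, ndp_frame, threshold):
    """Correlation binary spectrum XDb(w, n) for one frame.

    A bin is 1 when DP(w, n) - NDP(w, n) > threshold, otherwise 0.
    """
    if len(dp_frame) != len(ndp_frame):
        raise ValueError("DP and NDP frames must have the same length")
    return [1 if dp - ndp > threshold else 0
            for dp, ndp in zip(dp_frame, ndp_frame)]
```

With a threshold of 2.0, the DP values 5.0, 1.0, 3.0 against a noise estimate of 1.0 in every bin give the bits 1, 0, 0: the last difference equals the threshold and therefore maps to 0.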

Fifth, perform pairwise coherence matching between the KD_a correlation binary spectra corresponding to one sound signal and the KD_b correlation binary spectra corresponding to the other sound signal to obtain the second matching result, where KD_a and KD_b are positive integers.

The second matching result comprises the matching positions and the matching degree corresponding to the pair of correlation binary spectra with the highest matching degree. When the number of correlation binary spectra corresponding to sound signal A is KD_a and the number corresponding to the other sound signal B is KD_b, KD_a × KD_b correlation binary spectrum matches need to be performed. For each match, the matching degree between the two correlation binary spectra is recorded, together with the sequence numbers nd_a(i) and nd_b(j) of the two spectra being matched. Here nd_a(i) denotes the sequence number of the correlation binary spectrum corresponding to the i-th frame of sound signal A, where nd_a(i) ∈ [0, KD_a - 1], KD_a ≥ 2, and nd_a(i) and KD_a are both integers; nd_b(j) denotes the sequence number of the correlation binary spectrum corresponding to the j-th frame of the other sound signal B, where nd_b(j) ∈ [0, KD_b - 1], KD_b ≥ 2, and nd_b(j) and KD_b are both integers. In addition, KD_a is normally equal to K_a in step 204 above, and KD_b is normally equal to K_b in step 204 above.

For example, if the correlation binary spectrum XDb_a(w, i) corresponding to the i-th frame of sound signal A is matched against the correlation binary spectrum XDb_b(w, j) corresponding to the j-th frame of the other sound signal B, the matching degree PDb_ij of the two is computed with the XNOR operation, denoted ⊙: PDb_ij equals the ratio of the number of data pairs whose binarization results at corresponding positions are equal to the total number of data pairs in the two correlation binary spectra.
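A formula consistent with this description can be written as follows, under the assumption that each correlation binary spectrum has W frequency bins indexed from 0. The symbol W and the summation limits are introduced here for illustration and do not appear in the text.

```latex
PDb_{ij} = \frac{1}{W} \sum_{w=0}^{W-1} XDb_a(w, i) \odot XDb_b(w, j)
```

Each XNOR term equals 1 when the two bits agree and 0 otherwise, so the sum counts the agreeing positions and the division by W turns that count into a ratio between 0 and 1.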

After the matching degree between every pair of correlation binary spectra has been calculated, the pair with the highest matching degree is selected, and the sequence numbers and the matching degree corresponding to that pair are recorded.
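The exhaustive pairwise search described above can be sketched as follows. The function names `matching_degree` and `best_match` are hypothetical, list indices stand in for the sequence numbers nd_a(i) and nd_b(j), and the tie-breaking rule (the first pair encountered wins) is an assumption, since the text does not say how ties are resolved.

```python
def matching_degree(spec_a, spec_b):
    """Fraction of positions at which two binary spectra agree (XNOR ratio)."""
    if not spec_a or len(spec_a) != len(spec_b):
        raise ValueError("spectra must be non-empty and of equal length")
    agreeing = sum(1 for a, b in zip(spec_a, spec_b) if a == b)
    return agreeing / len(spec_a)


def best_match(spectra_a, spectra_b):
    """Compare every spectrum of A with every spectrum of B.

    Returns (degree, index_in_a, index_in_b) for the best pair.
    Assumption: on ties, the first pair encountered is kept.
    """
    best = (-1.0, None, None)
    for i, spec_a in enumerate(spectra_a):
        for j, spec_b in enumerate(spectra_b):
            degree = matching_degree(spec_a, spec_b)
            if degree > best[0]:
                best = (degree, i, j)
    return best
```

The search performs len(spectra_a) × len(spectra_b) comparisons, matching the KD_a × KD_b count stated in the text. Since the inputs are bit sequences, each comparison reduces to counting equal bits.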

In the present embodiment, it is assumed that the matching degree between the nd_a(i)-th correlation binary spectrum, corresponding to the i-th frame of sound signal A, and the nd_b(j)-th correlation binary spectrum, corresponding to the j-th frame of the other sound signal B, is the highest; this matching degree is denoted P₂. For example, when i = 1 and j = 3,

Step 206: calculate the time delay between the two sound signals according to the first matching result and the second matching result.

After the first matching result and the second matching result are calculated, the two matching results are combined to obtain a final matching result, and the time delay between the two sound signals is then calculated from the final matching result. Specifically, this step may comprise the following two sub-steps:

First, for each sound signal, a weighted average algorithm is applied to the first matching position and the second matching position to calculate a final matching position, where the weights of the weighted average algorithm are determined according to the first matching degree and the second matching degree.

In one possible implementation, suppose the first matching degree, calculated from the perspective of the frequency-domain distribution of the sound signals, is P₁, and the second matching degree, calculated from the perspective of the spectral correlation of the power spectra, is P₂. Then the weight corresponding to the first matching position is P₁/(P₁ + P₂), and the weight corresponding to the second matching position is P₂/(P₁ + P₂).

The final matching position nl_a of sound signal A is:

nl_a = \frac{P_1}{P_1 + P_2} \times n_a + \frac{P_2}{P_1 + P_2} \times nd_a;

where n_a denotes the first matching position of sound signal A and nd_a denotes the second matching position of sound signal A.

Similarly, the final matching position nl_b of the other sound signal B is:

nl_b = \frac{P_1}{P_1 + P_2} \times n_b + \frac{P_2}{P_1 + P_2} \times nd_b;

where n_b denotes the first matching position of sound signal B and nd_b denotes the second matching position of sound signal B.
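The weighted average used for both signals can be illustrated with a short function. The name `final_position` and the numbers in the usage lines are hypothetical; the arithmetic follows the two formulas above.

```python
def final_position(first_pos, second_pos, p1, p2):
    """Weighted average of the two matching positions of one sound signal.

    The weights are P1 / (P1 + P2) and P2 / (P1 + P2), where P1 and P2 are
    the first and second matching degrees.
    """
    total = p1 + p2
    if total <= 0:
        raise ValueError("the matching degrees must sum to a positive value")
    return (p1 * first_pos + p2 * second_pos) / total


# Illustrative values only: positions 10 and 14, matching degrees 0.75 and 0.25.
nl_a = final_position(10, 14, 0.75, 0.25)  # 11.0
```

The result lies closer to whichever matching position has the higher matching degree, so the more reliable of the two matching approaches dominates the final position.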

Second, calculate the time delay between the two sound signals according to the final matching positions of the two sound signals.

The time delay t between the two sound signals is calculated according to the following formula:

t = k(nl_a - nl_b);

where k is a time coefficient.

The time coefficient k can be calculated from the sampling frequency f, the number of sampling points Num of the STFT, and the overlap coefficient η: k = Num × η / f.

In a specific example, if a signal sampled at 16 kHz is processed with a 256-point FFT and 50% overlap, then the time coefficient is

k = \frac{256 \times 50\%}{16000\ \text{Hz}} = 0.008\ \text{s} = 8\ \text{ms}.
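The worked example can be checked numerically. The sketch below follows the example as written (FFT size × overlap coefficient ÷ sampling frequency) and then applies t = k(nl_a - nl_b); the function names and the frame positions in the usage lines are illustrative assumptions.

```python
def time_coefficient(sample_rate_hz, fft_size, overlap):
    """Time coefficient k in seconds, following the worked example:
    k = fft_size * overlap / sample_rate_hz."""
    if sample_rate_hz <= 0:
        raise ValueError("sample rate must be positive")
    return fft_size * overlap / sample_rate_hz


def estimate_delay(nl_a, nl_b, k):
    """Time delay t = k * (nl_a - nl_b), in the same unit as k."""
    return k * (nl_a - nl_b)


k = time_coefficient(16000, 256, 0.5)   # 0.008 s, i.e. 8 ms
delay = estimate_delay(11.0, 8.0, k)    # about 0.024 s for these made-up positions
```

With 50% overlap, k coincides with the spacing between the starts of consecutive frames (128 samples at 16 kHz), so a difference of one frame position corresponds to 8 ms. The sign of the result indicates which of the two final matching positions is later.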

It should be noted that step 205 may be performed after step 204, before step 204, or simultaneously with step 204. In the present embodiment, step 205 is described as being performed after step 204 only by way of example, and no specific limitation is imposed in this regard.

In addition, the time delay estimation method provided by the present embodiment binarizes the data before the matching calculation, which greatly increases matching efficiency and provides a robust time delay estimation method.

The following are device embodiments of the present invention, which can be used to carry out the method embodiments of the present invention. For details not disclosed in the device embodiments, please refer to the method embodiments.

Please refer to Fig. 3, which shows a block diagram of a sound signal time delay estimation device provided by one embodiment of the present invention. The time delay estimation device may be implemented as part or all of an electronic device by software, hardware, or a combination of both. The time delay estimation device may comprise: a signal acquisition module 310, a first matching module 320, a second matching module 330, and a time delay calculation module 340.

Signal acquisition module 310 is configured to obtain two sound signals.

First matching module 320 is configured to perform coherence matching on the two sound signals according to the short-time Fourier transforms of the two sound signals to obtain a first matching result, the first matching result comprising the first matching positions and the first matching degree of the two sound signals.

Second matching module 330 is configured to perform coherence matching on the two sound signals according to the spectral correlation of the power spectra of the two sound signals to obtain a second matching result, the second matching result comprising the second matching positions and the second matching degree of the two sound signals.

Time delay calculation module 340 is configured to calculate the time delay between the two sound signals according to the first matching result and the second matching result.
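The division into modules can be pictured as a small composition of callables. The sketch below is only one possible software arrangement: the class name `DelayEstimator`, the `MatchResult` tuple layout, and the idea of passing the two matching modules in as functions are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

Signal = Sequence[float]
# Hypothetical layout: (matching position in A, matching position in B, matching degree)
MatchResult = Tuple[float, float, float]


@dataclass
class DelayEstimator:
    first_matcher: Callable[[Signal, Signal], MatchResult]   # role of module 320
    second_matcher: Callable[[Signal, Signal], MatchResult]  # role of module 330
    time_coefficient: float                                   # k, in seconds

    def estimate(self, signal_a: Signal, signal_b: Signal) -> float:
        """Role of module 340: combine both matching results into a delay."""
        n_a, n_b, p1 = self.first_matcher(signal_a, signal_b)
        nd_a, nd_b, p2 = self.second_matcher(signal_a, signal_b)
        total = p1 + p2
        if total <= 0:
            raise ValueError("the matching degrees must sum to a positive value")
        nl_a = (p1 * n_a + p2 * nd_a) / total
        nl_b = (p1 * n_b + p2 * nd_b) / total
        return self.time_coefficient * (nl_a - nl_b)
```

Keeping the two matchers as separate arguments mirrors the statement that the functions may be redistributed among modules as needed; either matcher can be replaced without touching the delay calculation.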

In summary, the sound signal time delay estimation device provided by the present embodiment obtains a first matching result by analyzing and matching the short-time Fourier transforms of the two sound signals, obtains a second matching result by analyzing and matching the spectral correlation of the power spectra of the two sound signals, and then calculates the time delay between the two sound signals by combining the first matching result and the second matching result. This solves the problem of low accuracy in the time delay estimation methods of the related art. The two sound signals are matched from two perspectives, frequency-domain distribution and spectral correlation of the power spectra, and the two matching results are combined to determine the final matching result, which improves both matching precision and the accuracy of time delay estimation.

Please refer to Fig. 4, which shows a block diagram of a sound signal time delay estimation device provided by another embodiment of the present invention. The time delay estimation device may be implemented as part or all of an electronic device by software, hardware, or a combination of both. The time delay estimation device may comprise: a signal acquisition module 310, a first matching module 320, a second matching module 330, and a time delay calculation module 340.

Signal acquisition module 310 is configured to obtain two sound signals.

The first matching module 320 may comprise: a first tracking unit 320a, a first binarization unit 320b, and a first matching unit 320c.

The first tracking unit 320a is configured to, for each sound signal, perform noise tracking on each frame of the sound signal according to the following formula to obtain the noise spectrum N(w, n) of each frame of the sound signal:

N(w, n) = \begin{cases} (1 - \alpha_u) |X(w, n)| + \alpha_u N(w, n-1), & |X(w, n)| \geq N(w, n-1) \\ (1 - \alpha_d) |X(w, n)| + \alpha_d N(w, n-1), & |X(w, n)| < N(w, n-1) \end{cases};

where X(w, n) denotes the short-time Fourier transform of the sound signal; α_u and α_d are preset coefficients with 0 < α_d < α_u < 1; w denotes the frequency index in the frequency domain; and n denotes the frame index in the time domain.

The first binarization unit 320b is configured to binarize the spectrum of each frame of the sound signal according to the following formula to obtain a binary spectrum Xb(w, n):

Xb(w, n) = \begin{cases} 1, & |X(w, n)| - N(w, n) > T_b \\ 0, & |X(w, n)| - N(w, n) \leq T_b \end{cases},

where T_b is a preset first threshold.

The first matching unit 320c is configured to perform pairwise coherence matching between the K_a binary spectra corresponding to one sound signal and the K_b binary spectra corresponding to the other sound signal to obtain the first matching result, the first matching result comprising the matching positions and the matching degree corresponding to the pair of binary spectra with the highest matching degree, where K_a and K_b are positive integers.

The second matching module 330 may comprise: a power spectrum calculation unit 330a, a correlation calculation unit 330b, a second tracking unit 330c, a second binarization unit 330d, and a second matching unit 330e.

The power spectrum calculation unit 330a is configured to, for each sound signal, calculate the power spectrum P(w, n) of each frame of the sound signal according to the following formula:

P(w, n) = \alpha_p P(w, n-1) + (1 - \alpha_p) |X(w, n)|^2;

where X(w, n) denotes the short-time Fourier transform of the sound signal; α_p is a preset coefficient with 0 < α_p < 1; w denotes the frequency index in the frequency domain; and n denotes the frame index in the time domain.

The correlation calculation unit 330b is configured to calculate the spectral correlation DP(w, n) of the power spectrum of each frame of the sound signal according to the following formula:

DP(w, n) = |P(w+1, n) - P(w, n)|.

The second tracking unit 330c is configured to perform noise tracking on the spectral correlation DP(w, n) according to the following formula to obtain the spectral correlation NDP(w, n) of the noise power spectrum of each frame of the sound signal:

NDP(w, n) = \begin{cases} (1 - \beta_u) DP(w, n) + \beta_u NDP(w, n-1), & DP(w, n) \geq NDP(w, n-1) \\ (1 - \beta_d) DP(w, n) + \beta_d NDP(w, n-1), & DP(w, n) < NDP(w, n-1) \end{cases};

where β_u and β_d are preset coefficients with 0 < β_d < β_u < 1.

The second binarization unit 330d is configured to binarize the spectral correlation DP(w, n) of each frame of the sound signal according to the following formula to obtain a correlation binary spectrum XDb(w, n):

XDb(w, n) = \begin{cases} 1, & DP(w, n) - NDP(w, n) > T_{Db} \\ 0, & DP(w, n) - NDP(w, n) \leq T_{Db} \end{cases},

where T_{Db} is a preset second threshold.

The second matching unit 330e is configured to perform pairwise coherence matching between the KD_a correlation binary spectra corresponding to one sound signal and the KD_b correlation binary spectra corresponding to the other sound signal to obtain the second matching result, the second matching result comprising the matching positions and the matching degree corresponding to the pair of correlation binary spectra with the highest matching degree, where KD_a and KD_b are positive integers.

The time delay calculation module 340 may comprise: a position calculation unit 340a and a time delay calculation unit 340b.

The position calculation unit 340a is configured to, for each sound signal, apply a weighted average algorithm to the first matching position and the second matching position to calculate a final matching position, where the weights of the weighted average algorithm are determined according to the first matching degree and the second matching degree.

The time delay calculation unit 340b is configured to calculate the time delay between the two sound signals according to the final matching positions of the two sound signals.

Optionally, the device may further comprise: a signal preprocessing module 312 and a Fourier transform module 314.

Signal preprocessing module 312 is configured to, for each sound signal, preprocess the sound signal to obtain a preprocessed sound signal, the preprocessing comprising at least one of noise reduction, amplification, high-pass filtering, and up/down-sampling.

Fourier transform module 314 is configured to perform a short-time Fourier transform on the preprocessed sound signal.

In addition, the time delay estimation device provided by the present embodiment binarizes the data before the matching calculation, which greatly increases matching efficiency.

It should be noted that, when the sound signal time delay estimation device provided by the above embodiment calculates the time delay between two sound signals, the division into the functional modules described above is used only as an example. In practical applications, the above functions may be assigned to different functional modules as needed; that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the time delay estimation device provided by the above embodiment and the method embodiments of the time delay estimation method belong to the same concept; for the specific implementation process, refer to the method embodiments, which are not repeated here.

Please refer to Fig. 5, which shows a schematic structural diagram of an electronic device provided by one embodiment of the present invention. The electronic device is used to implement the sound signal time delay estimation method provided in the above embodiments. Specifically:

Electronic device 500 may include an RF (Radio Frequency) circuit 510, a memory 520 including one or more computer-readable storage media, an input unit 530, a display unit 540, a sensor 550, an audio circuit 560, a WiFi (wireless fidelity) module 570, a processor 580 including one or more processing cores, a power supply 590, and other components. Those skilled in the art will understand that the electronic device structure shown in Fig. 5 does not limit the electronic device, which may include more or fewer components than shown, combine certain components, or use a different arrangement of components. Specifically:

RF circuit 510 may be used to receive and send signals while information is being sent and received or during a call. In particular, after receiving downlink information from a base station, it passes the information to one or more processors 580 for processing; in addition, it sends uplink data to the base station. Typically, RF circuit 510 includes, but is not limited to, an antenna, at least one amplifier, a tuner, one or more oscillators, a subscriber identity module (SIM) card, a transceiver, a coupler, an LNA (Low Noise Amplifier), a duplexer, and the like. In addition, RF circuit 510 may communicate with networks and other devices by wireless communication. The wireless communication may use any communication standard or protocol, including but not limited to GSM (Global System for Mobile Communications), GPRS (General Packet Radio Service), CDMA (Code Division Multiple Access), WCDMA (Wideband Code Division Multiple Access), LTE (Long Term Evolution), e-mail, SMS (Short Messaging Service), and the like.

Memory 520 may be used to store software programs and modules; processor 580 performs various functional applications and data processing by running the software programs and modules stored in memory 520. Memory 520 may mainly comprise a program storage area and a data storage area, where the program storage area may store an operating system, application programs required by at least one function (such as a sound playback function, an image playback function, etc.), and the like; the data storage area may store data created through use of electronic device 500 (such as audio data, a phone book, etc.), and the like. In addition, memory 520 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another volatile solid-state storage device. Accordingly, memory 520 may also include a memory controller to provide processor 580 and input unit 530 with access to memory 520.

Input unit 530 may be used to receive input numeric or character information and to generate keyboard, mouse, joystick, optical, or trackball signal inputs related to user settings and function control. Specifically, input unit 530 may include an image input device 531 and other input devices 532. Image input device 531 may be a camera or a photoelectric scanning device. Other input devices 532 may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys, a power key, etc.), a trackball, a mouse, a joystick, and the like.

Display unit 540 may be used to display information input by the user or information provided to the user, as well as the various graphical user interfaces of electronic device 500; these graphical user interfaces may be composed of graphics, text, icons, video, and any combination thereof. Display unit 540 may include a display panel 541; optionally, display panel 541 may be configured in the form of an LCD (Liquid Crystal Display), an OLED (Organic Light-Emitting Diode) display, or the like.

Electronic device 500 may also include at least one sensor 550, such as an optical sensor, a motion sensor, and other sensors. Specifically, the optical sensor may include an ambient light sensor and a proximity sensor, where the ambient light sensor can adjust the brightness of display panel 541 according to the brightness of the ambient light, and the proximity sensor can turn off display panel 541 and/or the backlight when electronic device 500 is moved to the ear. As one kind of motion sensor, a gravity acceleration sensor can detect the magnitude of acceleration in each direction (generally three axes), can detect the magnitude and direction of gravity when stationary, and can be used in applications that recognize the attitude of the mobile phone (such as switching between landscape and portrait, related games, and magnetometer attitude calibration) and in vibration-recognition functions (such as a pedometer or tap detection). Other sensors that may also be configured in electronic device 500, such as a gyroscope, barometer, hygrometer, thermometer, and infrared sensor, are not described here.

Audio circuit 560, loudspeaker 561, and microphone 562 can provide an audio interface between the user and electronic device 500. Audio circuit 560 can convert received audio data into an electrical signal and transmit it to loudspeaker 561, which converts it into a sound signal for output. Conversely, microphone 562 converts a collected sound signal into an electrical signal, which audio circuit 560 receives and converts into audio data; after the audio data is output to processor 580 for processing, it is sent through RF circuit 510 to, for example, another electronic device, or output to memory 520 for further processing. Audio circuit 560 may also include an earphone jack to provide communication between a peripheral earphone and electronic device 500.

WiFi is a short-range wireless transmission technology. Through WiFi module 570, electronic device 500 can help the user send and receive e-mail, browse web pages, access streaming media, and so on; it provides the user with wireless broadband Internet access. Although Fig. 5 shows WiFi module 570, it is understood that the module is not an essential part of electronic device 500 and may be omitted as needed without changing the essence of the invention.

Processor 580 is the control center of electronic device 500. It connects the various parts of the whole mobile phone using various interfaces and lines, and performs the various functions of electronic device 500 and processes data by running or executing the software programs and/or modules stored in memory 520 and calling the data stored in memory 520, thereby monitoring the mobile phone as a whole. Optionally, processor 580 may include one or more processing cores. Preferably, processor 580 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, user interface, application programs, and the like, and the modem processor mainly handles wireless communication. It is understood that the modem processor may also not be integrated into processor 580.

Electronic device 500 also includes a power supply 590 (such as a battery) that supplies power to the components. Preferably, the power supply may be logically connected to processor 580 through a power management system, so that functions such as charging management, discharging management, and power consumption management are implemented through the power management system. Power supply 590 may also include any components such as one or more DC or AC power sources, a recharging system, a power failure detection circuit, a power converter or inverter, and a power status indicator.

Although not shown, electronic device 500 may also include a Bluetooth module and the like, which are not described here.

Specifically, in the present embodiment, electronic device 500 further includes a memory and one or more programs, where the one or more programs are stored in the memory and are configured to be executed by one or more processors to perform the time delay estimation method of the embodiment shown in Fig. 1 or Fig. 2 above.

The sequence numbers of the above embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.

Those of ordinary skill in the art will understand that all or part of the steps of the above embodiments may be completed by hardware, or by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, and the storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like.

The foregoing describes only preferred embodiments of the present invention and is not intended to limit the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims

1. A sound signal time delay estimation method, characterized in that the method comprises:

obtaining two sound signals;

2. The method according to claim 1, characterized in that calculating the time delay between the two sound signals according to the first matching result and the second matching result comprises:

3. The method according to claim 1, characterized in that performing coherence matching on the two sound signals according to the short-time Fourier transforms of the two sound signals to obtain the first matching result comprises:

N(w, n) = \begin{cases} (1 - \alpha_u) |X(w, n)| + \alpha_u N(w, n-1), & |X(w, n)| \geq N(w, n-1) \\ (1 - \alpha_d) |X(w, n)| + \alpha_d N(w, n-1), & |X(w, n)| < N(w, n-1) \end{cases};

Xb(w, n) = \begin{cases} 1, & |X(w, n)| - N(w, n) > T_b \\ 0, & |X(w, n)| - N(w, n) \leq T_b \end{cases},

where T_b is a preset first threshold;

4. The method according to claim 1, characterized in that performing coherence matching on the two sound signals according to the spectral correlation of the power spectra of the two sound signals to obtain the second matching result comprises:

P(w, n) = \alpha_p P(w, n-1) + (1 - \alpha_p) |X(w, n)|^2;

DP(w, n) = |P(w+1, n) - P(w, n)|;

NDP(w, n) = \begin{cases} (1 - \beta_u) DP(w, n) + \beta_u NDP(w, n-1), & DP(w, n) \geq NDP(w, n-1) \\ (1 - \beta_d) DP(w, n) + \beta_d NDP(w, n-1), & DP(w, n) < NDP(w, n-1) \end{cases};

where β_u and β_d are preset coefficients with 0 < β_d < β_u < 1;

XDb(w, n) = \begin{cases} 1, & DP(w, n) - NDP(w, n) > T_{Db} \\ 0, & DP(w, n) - NDP(w, n) \leq T_{Db} \end{cases},

where T_{Db} is a preset second threshold;

5. The method according to any one of claims 1 to 4, characterized in that, before performing coherence matching on the two sound signals according to the short-time Fourier transforms of the two sound signals to obtain the first matching result, the method further comprises:

6. A sound signal time delay estimation device, characterized in that the device comprises:

a signal acquisition module, configured to obtain two sound signals;

7. The device according to claim 6, characterized in that the time delay calculation module comprises: a position calculation unit and a time delay calculation unit;

8. The device according to claim 6, characterized in that the first matching module comprises: a first tracking unit, a first binarization unit, and a first matching unit;

the first tracking unit is configured to, for each sound signal, perform noise tracking on each frame of the sound signal according to the following formula to obtain the noise spectrum N(w, n) of each frame of the sound signal:

N(w, n) = \begin{cases} (1 - \alpha_u) |X(w, n)| + \alpha_u N(w, n-1), & |X(w, n)| \geq N(w, n-1) \\ (1 - \alpha_d) |X(w, n)| + \alpha_d N(w, n-1), & |X(w, n)| < N(w, n-1) \end{cases};

Xb(w, n) = \begin{cases} 1, & |X(w, n)| - N(w, n) > T_b \\ 0, & |X(w, n)| - N(w, n) \leq T_b \end{cases},

where T_b is a preset first threshold;

9. The device according to claim 6, characterized in that the second matching module comprises: a power spectrum calculation unit, a correlation calculation unit, a second tracking unit, a second binarization unit, and a second matching unit;

P(w, n) = \alpha_p P(w, n-1) + (1 - \alpha_p) |X(w, n)|^2;

DP(w, n) = |P(w+1, n) - P(w, n)|;

NDP(w, n) = \begin{cases} (1 - \beta_u) DP(w, n) + \beta_u NDP(w, n-1), & DP(w, n) \geq NDP(w, n-1) \\ (1 - \beta_d) DP(w, n) + \beta_d NDP(w, n-1), & DP(w, n) < NDP(w, n-1) \end{cases};

where β_u and β_d are preset coefficients with 0 < β_d < β_u < 1;

XDb(w, n) = \begin{cases} 1, & DP(w, n) - NDP(w, n) > T_{Db} \\ 0, & DP(w, n) - NDP(w, n) \leq T_{Db} \end{cases},

where T_{Db} is a preset second threshold;

10. The device according to any one of claims 6 to 9, characterized in that the device further comprises: