CN101807397B

CN101807397B - Voice detection method of noise robustness based on hidden semi-Markov model

Info

Publication number: CN101807397B
Application number: CN2010101175378A
Authority: CN
Inventors: 刘祥龙; 梁苑; 单宝松; 楼奕华; 李未
Original assignee: Beihang University
Current assignee: Beihang University
Priority date: 2010-03-03
Filing date: 2010-03-03
Publication date: 2011-11-16
Anticipated expiration: 2030-03-03
Also published as: CN101807397A

Abstract

The invention discloses a noise-robust voice detection method based on a hidden semi-Markov model, comprising the following steps: (1) building the hidden semi-Markov model λ = (A, B, π, τ); (2) initializing the parameters π and τ of the model λ; (3) applying the DCT to each non-empty input signal; (4) estimating the parameters of B from the first several frames of the input signal, estimating the likelihood ratio test threshold from their likelihood ratios, and then performing the likelihood ratio test to complete the voice detection; and (5) dynamically adjusting the parameters of B and the likelihood ratio test threshold. The method adjusts the model parameters and the test threshold dynamically according to the duration (time-delay) characteristics of voice and noise, and uses the likelihood ratio test to achieve real-time, noise-robust voice detection.

Description

A noise-robust speech detection method based on a hidden semi-Markov model

Field of the invention

The present invention belongs to the field of speech signal processing in noisy environments and relates to a noise-robust speech detection method based on a hidden semi-Markov model.

Background of the invention

Speech detection distinguishes the speech portions of a signal from the noise portions and is widely used in speech coding, transmission, speech enhancement, and speech recognition. Methods based on statistical models currently achieve fairly good detection performance, but their performance fluctuates considerably across noise types and signal-to-noise ratios (SNRs). In practical applications, noise is varied and unavoidable, so noise robustness has become a focus of current speech detection work. A robust speech detection algorithm that adapts to different noise environments is therefore of significant value for speech coding, enhancement, recognition, and related applications.

Summary of the invention

The technical problem to be solved by the present invention: conventional speech detection lacks robustness in noisy environments. The invention provides a noise-robust speech detection method, based on a hidden semi-Markov model, that works across different SNRs and different noise environments.

The technical solution adopted by the present invention: a noise-robust speech detection method based on a hidden semi-Markov model, characterized by the following steps:

(1) Build a hidden semi-Markov model λ = (A, B, π, τ) with two states Q = {q_{0}, q_{1}}, non-speech and speech, wherein:

q_{0} is non-speech and q_{1} is speech;

A = {a_{ij}}, i, j = 0, 1, is the transition probability from state q_{i} to state q_{j};

B = {b_{i}(O_{t})}, i = 0, 1; t > 0, is the conditional probability b_{i}(O_{t}) = P(O_{t} | q_{i}) of the DCT coefficients O_{t} = {o_{1}, o_{2}, ..., o_{K}}, K > 0, of the input signal given state q_{i}, where o_{1}, o_{2}, ..., o_{K} are mutually independent;

π = {π_{i}}, i = 0, 1; π_{i} > 0, is the prior probability of state q_{i};

τ = {P(d | q_{i})}, i = 0, 1; d > 0, is the probability that state q_{i} lasts for a duration d;

(2) From training-set statistics, initialize the prior state probabilities π = {π_{i}} of the hidden semi-Markov model and the parameters (k_{i}, ω_{i}) of the Weibull state-duration distribution; set the signal frame index t = 0;

(3) If the input speech signal S is empty, finish; otherwise, apply the DCT to S and set t = t + 1;

(4) If t < P, judge the current signal to be noise (VAD = 0) and go to (3). If t = P, estimate the Gaussian parameters (μ_{i}^{G}, σ_{i}) and the Laplace parameters (μ_{i}^{L}, l_{i}) of the distribution of the DCT coefficients O_{t} under each state, compute the likelihood ratios LRT_{t} of the first P frames, initialize the likelihood ratio test threshold η, judge the current signal to be noise (VAD = 0), and go to (3). If t > P, compute the likelihood ratio LRT_{t}; if LRT_{t} ≥ η, judge the current signal to be speech (VAD = 1); if LRT_{t} < η, judge it to be noise (VAD = 0); then go to (5);

(5) Adjust the Gaussian parameters (μ_{i}^{G}, σ_{i}) and the Laplace parameters (μ_{i}^{L}, l_{i}) of the distribution of the DCT coefficients O_{t} under each state, and update the likelihood ratio test threshold η; go to (3).
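The control flow of steps (3) to (5) can be sketched as a loop over frames. This is only an illustrative skeleton: the function name `vad_decisions` and the three hooks `init_model`, `score`, and `adapt` are hypothetical placeholders for the estimation, likelihood-ratio, and update procedures, not names taken from the patent.

```python
from typing import Callable, Iterable, Iterator, List, Sequence


def vad_decisions(
    frames: Iterable[Sequence[float]],
    P: int,
    init_model: Callable[[List[Sequence[float]]], float],
    score: Callable[[Sequence[float]], float],
    adapt: Callable[[Sequence[float], int, float, float], float],
) -> Iterator[int]:
    """Yield VAD = 0 (noise) or VAD = 1 (speech) for each frame.

    frames     -- DCT coefficient vectors O_t, one per frame (step 3 done upstream)
    init_model -- hypothetical hook: fits B on the first P frames, returns initial eta
    score      -- hypothetical hook: returns LRT_t for the current frame
    adapt      -- hypothetical hook: updates B and returns the new eta
    """
    history: List[Sequence[float]] = []
    eta = 0.0
    t = 0
    for O_t in frames:
        t += 1
        if t < P:
            # Step (4), t < P: assume noise and keep collecting frames.
            history.append(O_t)
            yield 0
        elif t == P:
            # Step (4), t = P: estimate B and the threshold, still report noise.
            history.append(O_t)
            eta = init_model(history)
            yield 0
        else:
            # Step (4), t > P: likelihood ratio test, then step (5).
            lrt = score(O_t)
            vad = 1 if lrt >= eta else 0
            eta = adapt(O_t, vad, lrt, eta)
            yield vad
```

Because the decision for frame t depends only on frames already seen, a generator of this shape can run in real time on a stream.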

According to a further aspect of the invention, step (1) further comprises:

Based on training-set statistics, determine:

(1) a_{00} = a_{11} = 0, a_{10} = a_{01} = 1;

(2) for q_{0}, b_{0}(o_{i}^{t}) is the Gaussian distribution

N (o_{i}^{t}; μ_{i}^{G}, σ_{i}) = \frac{1}{σ_{i} \sqrt{2 π}} e^{- \frac{{(o_{i}^{t} - μ_{i}^{G})}^{2}}{2 {σ_{i}}^{2}}};

(3) for q_{1}, b_{1}(o_{i}^{t}) is the distribution with Laplace parameters (μ_{i}^{L}, l_{i})

L (o_{i}^{t}; μ_{i}^{L}, l_{i}) = \frac{1}{4 l_{i}} e^{- \frac{σ_{i}}{2 {l_{i}}^{2}}} [e^{\frac{o_{i}^{'}}{l_{i}}} erfc (\frac{l_{i} o_{i}^{'} + {σ_{i}}^{2}}{\sqrt{2} l_{i} σ_{i}}) + e^{- \frac{o_{i}^{'}}{l_{i}}} erfc (\frac{- l_{i} o_{i}^{'} + {σ_{i}}^{2}}{\sqrt{2} l_{i} σ_{i}})],

where

o_{i}^{'} = o_{i}^{t} - μ_{i}^{G} - μ_{i}^{L};

(4) for q_{0} and q_{1}, P(d | q_{i}) is the Weibull distribution

W (d; k_{i}, ω_{i}) = \frac{k_{i}}{ω_{i}} {(\frac{d}{ω_{i}})}^{k_{i} - 1} e^{- {(\frac{d}{ω_{i}})}^{k_{i}}} .
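As an illustration of the duration model, the Weibull density W(d; k, ω) can be evaluated at integer frame counts to obtain a table of P(d | q_i). The following is a minimal sketch; the function names, the truncation at a maximum duration D, and the renormalization of the table are assumptions made for the example and are not stated in the text.

```python
import math
from typing import List


def weibull_pdf(d: float, k: float, omega: float) -> float:
    """W(d; k, omega) for d > 0, with shape k > 0 and scale omega > 0."""
    z = d / omega
    return (k / omega) * z ** (k - 1) * math.exp(-(z ** k))


def duration_table(k: float, omega: float, D: int) -> List[float]:
    """P(d | q_i) for d = 1..D frames.

    Assumption: the continuous density is sampled at integer durations and
    renormalised so that the D entries sum to 1.
    """
    raw = [weibull_pdf(float(d), k, omega) for d in range(1, D + 1)]
    total = sum(raw)
    return [p / total for p in raw]
```

With k > 1 the table has a single interior peak, so very short and very long segments are both penalised, which is the behaviour a life-cycle model of speech and noise segments calls for.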

According to a further aspect of the invention, step (2) further comprises:

(a) from the labels in the training set, tally the noise duration frequencies F_{0} and the speech duration frequencies F_{1};

(b) from F_{i}, obtain the maximum likelihood estimate of the parameters (k_{i}, ω_{i}) of W(d; k_{i}, ω_{i});

(c) determine the prior probability of each state in the hidden semi-Markov model.
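The text does not say how the maximum likelihood estimate in (b) is computed. One common approach, sketched here under that caveat, solves the Weibull likelihood equation for the shape k by bisection and then obtains the scale ω in closed form. The function name `fit_weibull`, the bracket [0.001, 50] for k, and the use of raw segment durations rather than a frequency histogram are all assumptions.

```python
import math
from typing import Sequence, Tuple


def fit_weibull(durations: Sequence[float]) -> Tuple[float, float]:
    """Maximum likelihood estimate (k, omega) from positive segment durations."""
    xmax = max(durations)
    # Rescale so that x**k <= 1; the equation for k is scale invariant.
    x = [d / xmax for d in durations]
    logs = [math.log(v) for v in x]
    mean_log = sum(logs) / len(x)

    def g(k: float) -> float:
        # Derivative of the profile log-likelihood with respect to k, up to a factor.
        w = [v ** k for v in x]
        return sum(wi * li for wi, li in zip(w, logs)) / sum(w) - 1.0 / k - mean_log

    lo, hi = 1e-3, 50.0  # assumed bracket for the shape parameter
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    k = 0.5 * (lo + hi)
    # Closed-form scale for the fitted shape, mapped back to the original units.
    omega = xmax * (sum(v ** k for v in x) / len(x)) ** (1.0 / k)
    return k, omega
```

The function would be called once with the noise segment durations to get (k_0, ω_0) and once with the speech segment durations to get (k_1, ω_1).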

According to a further aspect of the invention, step (4) further comprises:

(a) compute the forward variables α_{i}^{t}, i = 0, 1:

If t = 1,

α_{i}^{t *} = π_{i} P (d = 1 | q_{i}) b_{i} (O_{t});

If t > 1,

α_{i}^{t *} = Σ_{d = 1}^{D} Σ_{j \neq i} α_{i}^{(t - d) *} a_{ji} P (d | q_{i}) Π_{s = t - d + 1}^{t} b_{i} (O_{s}),

α_{i}^{t} = Σ_{d = 1}^{D} Σ_{d^{'} = 0}^{d} Σ_{j \neq i} α_{i}^{(t - d^{'}) *} a_{ji} P (d | q_{i}) Π_{s = t - d^{'} + 1}^{t} b_{i} (O_{s});

(b) compute the likelihood ratio

{LRT}_{t} = \ln (π_{0} α_{1}^{t}) - \ln (π_{1} α_{0}^{t});

(c) when t = P, from the DCT coefficients O_{t} of the first P frames of the input signal, where P > 0 and 1 ≤ t ≤ P, estimate the parameters (μ_{i}^{G}, σ_{i}) and (μ_{i}^{L}, l_{i}) of the distributions in B as:

μ_{i}^{L} = μ_{i}^{G} = \frac{1}{P} Σ_{t = 1}^{P} o_{i}^{t};

σ_{i} = \sqrt{\frac{1}{P - 1} Σ_{t = 1}^{P} {(o_{i}^{t} - μ_{i}^{G})}^{2}};

l_{i} = \sqrt{R {σ_{i}}^{2} / 2};

where P and R are constants;

(d) when t = P, from the DCT coefficients O_{t} of the first P frames of the input signal, where P > 0 and 1 ≤ t ≤ P, estimate the likelihood ratio test threshold η.
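The forward recursion and the likelihood ratio can be sketched for the two-state case as follows. This is an interpretation rather than the patent's exact procedure, and it rests on several assumptions: the recursion follows the standard hidden semi-Markov form in which a segment of q_i is entered from the other state's α*; the inner sum over d' starts at 1 and is folded into a survival probability Σ_{d ≥ d'} P(d | q_i); the first segment is seeded with π_i; and everything is computed in the log domain to avoid underflow. All function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

NEG_INF = float("-inf")


def logsumexp(values: Sequence[float]) -> float:
    """Numerically stable ln(sum(exp(v)))."""
    m = max(values)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(v - m) for v in values))


def hsmm_forward(
    log_b: Sequence[Sequence[float]],
    log_pi: Sequence[float],
    log_dur: Sequence[Sequence[float]],
) -> Tuple[List[List[float]], List[List[float]]]:
    """Return (ln alpha*, ln alpha), each a list of [state 0, state 1] per frame.

    log_b[t][i]     -- ln b_i(O_{t+1})
    log_pi[i]       -- ln pi_i
    log_dur[i][d-1] -- ln P(d | q_i) for d = 1..D
    """
    T, D = len(log_b), len(log_dur[0])
    # cum[t][i] = sum of ln b_i(O_s) for s = 1..t, so a segment is a difference.
    cum = [[0.0, 0.0]]
    for row in log_b:
        cum.append([cum[-1][i] + row[i] for i in (0, 1)])
    # log_surv[i][d-1] = ln sum_{d'' >= d} P(d'' | q_i)
    log_surv = [[logsumexp(log_dur[i][d:]) for d in range(D)] for i in (0, 1)]
    # log_entry[t][i]: ln-probability that a q_i segment starts right after frame t.
    log_entry = [list(log_pi)]
    log_alpha_star, log_alpha = [], []
    for t in range(1, T + 1):
        star, cur = [], []
        for i in (0, 1):
            ended, inside = [], []
            for d in range(1, min(D, t) + 1):
                base = log_entry[t - d][i] + cum[t][i] - cum[t - d][i]
                ended.append(base + log_dur[i][d - 1])    # segment ends at t
                inside.append(base + log_surv[i][d - 1])  # segment covers t
            star.append(logsumexp(ended))
            cur.append(logsumexp(inside))
        log_alpha_star.append(star)
        log_alpha.append(cur)
        # a_01 = a_10 = 1: a q_j segment ending at t hands over to q_{1-j}.
        log_entry.append([star[1], star[0]])
    return log_alpha_star, log_alpha


def likelihood_ratio(log_alpha_t: Sequence[float], log_pi: Sequence[float]) -> float:
    """LRT_t = ln(pi_0 alpha_1^t) - ln(pi_1 alpha_0^t)."""
    return (log_pi[0] + log_alpha_t[1]) - (log_pi[1] + log_alpha_t[0])
```

The per-frame values `log_b[t][i]` would be the sum over the K independent coefficients of the log of the corresponding b_i density.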

According to a further aspect of the invention, step (5) further comprises:

(a) if the current frame is judged to be noise, adjust the parameters (μ_{i}^{G}, σ_{i}) and the threshold η:

μ_{i}^{G} = ρ_{0} μ_{i}^{G} + (1 - ρ_{0}) o_{i}

σ_{i} = ρ_{0} σ_{i} + (1 - ρ_{0}) {(o_{i} - μ_{i}^{G})}^{2}

η = ρ_{0} η + (1 - ρ_{0}) {LRT}_{t}

otherwise, adjust the parameters (μ_{i}^{L}, l_{i}) and the threshold η:

μ_{i}^{L} = ρ_{1} μ_{i}^{L} + (1 - ρ_{1}) o_{i}

l_{i} = ρ_{1} l_{i} + (1 - ρ_{1}) | o_{i} - μ_{i}^{G} |

η = ρ_{1} η + (1 - ρ_{1}) {LRT}_{t}

where 0 < ρ_{0}, ρ_{1} < 1 are update constants.
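The update rules of step (5) are exponential smoothing steps and can be sketched for a single DCT coefficient as below. The class and function names are hypothetical. The formulas are applied in the order written, so the deviation in the σ update uses the freshly updated mean; that ordering is an assumption. The default constants are the values reported for the NOIZEUS experiment.

```python
from dataclasses import dataclass


@dataclass
class CoefficientModel:
    """Parameters of one DCT coefficient (hypothetical container)."""
    mu_g: float   # Gaussian mean
    sigma: float  # Gaussian spread parameter
    mu_l: float   # Laplace mean
    l: float      # Laplace scale


def adapt(model: CoefficientModel, o: float, vad: int, lrt: float,
          eta: float, rho0: float = 0.99, rho1: float = 0.79) -> float:
    """Update `model` in place from coefficient `o` and return the new threshold."""
    if vad == 0:
        # Frame judged noise: update the Gaussian parameters and eta with rho0.
        model.mu_g = rho0 * model.mu_g + (1 - rho0) * o
        model.sigma = rho0 * model.sigma + (1 - rho0) * (o - model.mu_g) ** 2
        return rho0 * eta + (1 - rho0) * lrt
    # Frame judged speech: update the Laplace parameters and eta with rho1.
    model.mu_l = rho1 * model.mu_l + (1 - rho1) * o
    model.l = rho1 * model.l + (1 - rho1) * abs(o - model.mu_g)
    return rho1 * eta + (1 - rho1) * lrt
```

A value of ρ_0 close to 1 makes the noise model and threshold drift slowly, while the smaller ρ_1 lets the speech model follow changes more quickly.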

Brief description of the drawings

Fig. 1 is the basic flowchart of the method of the invention.

Detailed description of the embodiments

Embodiments of the invention are described in detail below with reference to the accompanying drawing.

First, the principle of the invention is explained.

Human speech is produced when the vocal cords vibrate under an external force and the sound is then shaped by a series of coordinated resonating organs. The whole phonation process can therefore be regarded as a life cycle that is constrained by the characteristics of the human organs themselves, and this life cycle can be assumed to follow certain statistical regularities. These regularities are generally robust to noise; that is, human phonation can be regarded as unaffected by noise in the environment. Describing them accurately therefore makes the modeling of speech activity in noisy environments more realistic and improves the noise robustness of speech detection. In engineering, the Birnbaum-Saunders and Weibull distributions are commonly used to describe life cycles.

Specifically, the basic flow of the proposed method is shown in Fig. 1.

The core idea of the invention mainly comprises: building a hidden semi-Markov model for the input speech signal; determining the types of the distributions in the model from a training dataset, and estimating the model parameters from that dataset and from the first several frames of the input speech signal; performing speech detection by a likelihood ratio test; and thereafter dynamically updating the model parameters and the likelihood ratio test threshold.

The algorithm of the invention is described as follows:

1. Build a hidden semi-Markov model λ = (A, B, π, τ) with two states Q = {q_{0}, q_{1}}, non-speech and speech, wherein: q_{0} is non-speech and q_{1} is speech;

A = {a_{ij}}, i, j = 0, 1, is the transition probability from state q_{i} to state q_{j};

B = {b_{i}(O_{t})}, i = 0, 1; t > 0, is the conditional probability b_{i}(O_{t}) = P(O_{t} | q_{i}) of the DCT coefficients O_{t} = {o_{1}, o_{2}, ..., o_{K}} of the input signal given state q_{i}, where o_{1}, o_{2}, ..., o_{K} are mutually independent;

π = {π_{i}}, i = 0, 1; π_{i} > 0, is the prior probability of state q_{i};

τ = {P(d | q_{i})}, i = 0, 1; d > 0, is the probability that state q_{i} lasts for a duration d;

Statistics on the TIMIT training dataset show that the distributions in the model are of the following types:

(1) a_{00} = a_{11} = 0, a_{10} = a_{01} = 1;

(2) for q_{0}, b_{0}(o_{i}^{t}) is the Gaussian distribution

N (o_{i}^{t}; μ_{i}^{G}, σ_{i}) = \frac{1}{σ_{i} \sqrt{2 π}} e^{- \frac{{(o_{i}^{t} - μ_{i}^{G})}^{2}}{2 {σ_{i}}^{2}}};

(3) for q_{1}, b_{1}(o_{i}^{t}) is the distribution with Laplace parameters (μ_{i}^{L}, l_{i})

L (o_{i}^{t}; μ_{i}^{L}, l_{i}) = \frac{1}{4 l_{i}} e^{- \frac{σ_{i}}{2 {l_{i}}^{2}}} [e^{\frac{o_{i}^{'}}{l_{i}}} erfc (\frac{l_{i} o_{i}^{'} + {σ_{i}}^{2}}{\sqrt{2} l_{i} σ_{i}}) + e^{- \frac{o_{i}^{'}}{l_{i}}} erfc (\frac{- l_{i} o_{i}^{'} + {σ_{i}}^{2}}{\sqrt{2} l_{i} σ_{i}})],

where

o_{i}^{'} = o_{i}^{t} - μ_{i}^{G} - μ_{i}^{L};

(4) for q_{0} and q_{1}, P(d | q_{i}) is the Weibull distribution

W (d; k_{i}, ω_{i}) = \frac{k_{i}}{ω_{i}} {(\frac{d}{ω_{i}})}^{k_{i} - 1} e^{- {(\frac{d}{ω_{i}})}^{k_{i}}} .

2. From training-set statistics, initialize the prior state probabilities π = {π_{i}} of the hidden semi-Markov model and the parameters (k_{i}, ω_{i}) of the state-duration distribution; set the signal frame index t = 0. The method is as follows:

(c) determine the prior probability of each state in the hidden semi-Markov model.

3. If the input speech signal S is empty, finish; otherwise, apply the DCT to S and set t = t + 1;
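The text does not specify which DCT variant or frame length is used in step 3. Assuming the common orthonormal DCT-II, the transform of one frame into the coefficient vector O_t can be sketched directly from its definition; the function name `dct_ii` is hypothetical, and the O(K²) loop is written for clarity rather than speed.

```python
import math
from typing import List, Sequence


def dct_ii(frame: Sequence[float]) -> List[float]:
    """Orthonormal DCT-II of one frame; returns K coefficients for K samples."""
    K = len(frame)
    out = []
    for n in range(K):
        # Project the frame onto the n-th cosine basis function.
        acc = sum(x * math.cos(math.pi * (m + 0.5) * n / K)
                  for m, x in enumerate(frame))
        # Scaling that makes the transform orthonormal (energy preserving).
        scale = math.sqrt(1.0 / K) if n == 0 else math.sqrt(2.0 / K)
        out.append(scale * acc)
    return out
```

A production implementation would normally use a fast transform from a signal-processing library instead of this direct sum.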

4. If t < P, judge the current signal to be noise (VAD = 0) and go to 3. If t = P, estimate the parameters (μ_{i}^{G}, σ_{i}) and (μ_{i}^{L}, l_{i}) of the distribution of the DCT coefficients O_{t} under each state, compute the likelihood ratios LRT_{t} of the first P frames, initialize the likelihood ratio test threshold η, judge the current signal to be noise (VAD = 0), and go to 3. If t > P, compute the likelihood ratio LRT_{t}; if LRT_{t} ≥ η, judge the current signal to be speech (VAD = 1); if LRT_{t} < η, judge it to be noise (VAD = 0); then go to 5. The method is as follows:

(a) compute the forward variables α_{i}^{t}, i = 0, 1:

If t = 1,

α_{i}^{t *} = π_{i} P (d = 1 | q_{i}) b_{i} (O_{t});

If t > 1,

α_{i}^{t *} = Σ_{d = 1}^{D} Σ_{j \neq i} α_{i}^{(t - d) *} a_{ji} P (d | q_{i}) Π_{s = t - d + 1}^{t} b_{i} (O_{s}),

α_{i}^{t} = Σ_{d = 1}^{D} Σ_{d^{'} = 0}^{d} Σ_{j \neq i} α_{i}^{(t - d^{'}) *} a_{ji} P (d | q_{i}) Π_{s = t - d^{'} + 1}^{t} b_{i} (O_{s});

(b) compute the likelihood ratio

{LRT}_{t} = \ln (π_{0} α_{1}^{t}) - \ln (π_{1} α_{0}^{t});

(c) when t = P, estimate the parameters (μ_{i}^{G}, σ_{i}) and (μ_{i}^{L}, l_{i}) from the DCT coefficients O_{t} of the first P frames of the input signal, 1 ≤ t ≤ P, as:

μ_{i}^{L} = μ_{i}^{G} = \frac{1}{P} Σ_{t = 1}^{P} o_{i}^{t};

σ_{i} = \sqrt{\frac{1}{P - 1} Σ_{t = 1}^{P} {(o_{i}^{t} - μ_{i}^{G})}^{2}};

l_{i} = \sqrt{R {σ_{i}}^{2} / 2};

where P and R are constants;

5. Adjust the parameters (μ_{i}^{G}, σ_{i}) and (μ_{i}^{L}, l_{i}) of the distribution of the DCT coefficients O_{t} under each state, and update the likelihood ratio test threshold η; go to 3. The method is as follows:

If the current frame is judged to be noise, adjust the parameters (μ_{i}^{G}, σ_{i}) and the threshold η:

μ_{i}^{G} = ρ_{0} μ_{i}^{G} + (1 - ρ_{0}) o_{i}

σ_{i} = ρ_{0} σ_{i} + (1 - ρ_{0}) {(o_{i} - μ_{i}^{G})}^{2}

η = ρ_{0} η + (1 - ρ_{0}) {LRT}_{t}

otherwise, adjust the parameters (μ_{i}^{L}, l_{i}) and the threshold η:

μ_{i}^{L} = ρ_{1} μ_{i}^{L} + (1 - ρ_{1}) o_{i}

l_{i} = ρ_{1} l_{i} + (1 - ρ_{1}) | o_{i} - μ_{i}^{G} |

η = ρ_{1} η + (1 - ρ_{1}) {LRT}_{t}

where ρ_{0} and ρ_{1} are constants.

In the speech detection experiment on the NOIZEUS dataset, the constants were P = 15, R = 20, ρ_{0} = 0.99, and ρ_{1} = 0.79.
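The initial parameter estimate from the first P frames (sample mean, sample standard deviation with the 1/(P - 1) factor, and l = sqrt(R σ² / 2)) can be sketched as follows, with one estimate per DCT coefficient. The function name `init_parameters` and the list-of-lists input layout are assumptions; the default R = 20 is the experimental value above, and the caller would pass P = 15 frames.

```python
import math
from typing import List, Sequence, Tuple


def init_parameters(
    first_frames: Sequence[Sequence[float]], R: float = 20.0
) -> Tuple[List[float], List[float], List[float]]:
    """Estimate (mu, sigma, l) per coefficient from the first P DCT frames.

    first_frames[t][i] is coefficient i of frame t + 1; P = len(first_frames) >= 2.
    mu is shared by the Gaussian and Laplace means, as in the estimation formula.
    """
    P = len(first_frames)
    K = len(first_frames[0])
    mu, sigma, l = [], [], []
    for i in range(K):
        column = [frame[i] for frame in first_frames]
        m = sum(column) / P
        var = sum((o - m) ** 2 for o in column) / (P - 1)  # unbiased variance
        mu.append(m)
        sigma.append(math.sqrt(var))
        l.append(math.sqrt(R * var / 2.0))
    return mu, sigma, l
```

Because the first P frames are treated as noise, these estimates describe the noise floor; the constant R then sets the initial speech scale l relative to it.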

The experimental data are shown in the following table:

It can be seen that the present invention achieves nearly consistent performance across multiple noise environments and in most cases outperforms the international standards G.729B and AMR2.

In summary, the above method detects the speech frames and noise frames of an input signal in a noisy environment.

Other advantages and modifications will readily occur to those of ordinary skill in the art. Therefore, the invention in its broader aspects is not limited to the specific details and exemplary embodiments shown and described herein. Accordingly, various modifications may be made without departing from the spirit and scope of the general inventive concept as defined by the appended claims and their equivalents.

Claims

1. A noise-robust speech detection method based on a hidden semi-Markov model, characterized by the following steps:

q_{0} is non-speech and q_{1} is speech;

A = {a_{ij}}, where a_{ij} is the transition probability from state q_{i} to state q_{j}; i = 0, 1; j = 0, 1;

B = {b_{i}(O_{t})}; i = 0, 1; t > 0, is the conditional probability b_{i}(O_{t}) = P(O_{t} | q_{i}) of the DCT coefficients O_{t} = {o_{1}, o_{2}, ..., o_{K}}, K > 0, of the input signal given state q_{i}, where o_{1}, o_{2}, ..., o_{K} are mutually independent;

π = {π_{i}}, i = 0, 1; π_{i} > 0, is the prior probability of state q_{i};

τ = {P(d | q_{i})}, i = 0, 1; d > 0, is the probability that state q_{i} lasts for a duration d;

(3) if the input speech signal S is empty, finish; otherwise, apply the DCT to S and set t = t + 1;

(4) if t < P, judge the current signal to be noise (VAD = 0) and go to (3); if t = P, estimate the Gaussian parameters (μ_{i}^{G}, σ_{i}) and the Laplace parameters (μ_{i}^{L}, l_{i}) of the distribution of the DCT coefficients O_{t} under each state, compute the likelihood ratios LRT_{t} of the first P frames, initialize the likelihood ratio test threshold η, judge the current signal to be noise (VAD = 0), and go to (3); if t > P, compute the likelihood ratio LRT_{t}; if LRT_{t} ≥ η, judge the current signal to be speech (VAD = 1); if LRT_{t} < η, judge it to be noise (VAD = 0); then go to (5);

(5) adjust the Gaussian parameters (μ_{i}^{G}, σ_{i}) and the Laplace parameters (μ_{i}^{L}, l_{i}) of the distribution of the DCT coefficients O_{t} under each state, and update the likelihood ratio test threshold η; go to (3).

2. The noise-robust speech detection method based on a hidden semi-Markov model according to claim 1, characterized in that step (1) further comprises:

based on training-set statistics, determining:

(1.1) a_{00} = a_{11} = 0, a_{10} = a_{01} = 1;

(1.2) for q_{0}, b_{0}(o_{i}^{t}) is the Gaussian distribution

N (o_{i}^{t}; μ_{i}^{G}, σ_{i}) = \frac{1}{σ_{i} \sqrt{2 π}} e^{- \frac{{(o_{i}^{t} - μ_{i}^{G})}^{2}}{2 {σ_{i}}^{2}}};

(1.3) for q_{1}, b_{1}(o_{i}^{t}) is the distribution with Laplace parameters (μ_{i}^{L}, l_{i})

L (o_{i}^{t}; μ_{i}^{L}, l_{i}) = \frac{1}{4 l_{i}} e^{- \frac{σ_{i}}{2 {l_{i}}^{2}}} [e^{\frac{o_{i}^{'}}{l_{i}}} erfc (\frac{l_{i} o_{i}^{'} + {σ_{i}}^{2}}{\sqrt{2} l_{i} σ_{i}}) + e^{- \frac{o_{i}^{'}}{l_{i}}} erfc (\frac{- l_{i} o_{i}^{'} + {σ_{i}}^{2}}{\sqrt{2} l_{i} σ_{i}})],

where

o_{i}^{'} = o_{i}^{t} - μ_{i}^{G} - μ_{i}^{L};

(1.4) for q_{0} and q_{1}, P(d | q_{i}) is the Weibull distribution

W (d; k_{i}, ω_{i}) = \frac{k_{i}}{ω_{i}} {(\frac{d}{ω_{i}})}^{k_{i} - 1} e^{- {(\frac{d}{ω_{i}})}^{k_{i}}} .

3. The noise-robust speech detection method based on a hidden semi-Markov model according to claim 1, characterized in that step (2) further comprises:

(2.1) from the labels in the training dataset, tallying the noise duration frequencies F_{0} and the speech duration frequencies F_{1};

(2.2) from F_{i}, obtaining the maximum likelihood estimate of the parameters (k_{i}, ω_{i}) of W(d; k_{i}, ω_{i});

(2.3) determining the prior probability of each state in the hidden semi-Markov model.

4. The noise-robust speech detection method based on a hidden semi-Markov model according to claim 1, characterized in that step (4) further comprises:

(4.1) computing the forward variables α_{i}^{t}, i = 0, 1:

If t = 1,

α_{i}^{t *} = π_{i} P (d = 1 | q_{i}) b_{i} (O_{t});

If t > 1,

α_{i}^{t *} = Σ_{d = 1}^{D} Σ_{j \neq i} α_{i}^{(t - d) *} a_{ji} P (d | q_{i}) Π_{s = t - d + 1}^{t} b_{i} (O_{s}),

α_{i}^{t} = Σ_{d = 1}^{D} Σ_{d^{'} = 0}^{d} Σ_{j \neq i} α_{i}^{(t - d^{'}) *} a_{ji} P (d | q_{i}) Π_{s = t - d^{'} + 1}^{t} b_{i} (O_{s});

(4.2) computing the likelihood ratio

{LRT}_{t} = \ln (π_{0} α_{1}^{t}) - \ln (π_{1} α_{0}^{t});

(4.3) when t = P, from the DCT coefficients O_{t} of the first P frames of the input signal, where P > 0 and 1 ≤ t ≤ P, estimating the parameters (μ_{i}^{G}, σ_{i}) and (μ_{i}^{L}, l_{i}) of the distributions in B as:

μ_{i}^{L} = μ_{i}^{G} = \frac{1}{P} Σ_{t = 1}^{P} o_{i}^{t};

σ_{i} = \sqrt{\frac{1}{P - 1} Σ_{t = 1}^{P} {(o_{i}^{t} - μ_{i}^{G})}^{2}};

l_{i} = \sqrt{R {σ_{i}}^{2} / 2};

where P and R are constants;

(4.4) when t = P, from the DCT coefficients O_{t} of the first P frames of the input signal, where P > 0 and 1 ≤ t ≤ P, estimating the likelihood ratio test threshold η.

5. The noise-robust speech detection method based on a hidden semi-Markov model according to claim 1, characterized in that step (5) further comprises:

(5.1) if the current frame is judged to be noise, adjusting the parameters (μ_{i}^{G}, σ_{i}) and the threshold η:

μ_{i}^{G} = ρ_{0} μ_{i}^{G} + (1 - ρ_{0}) o_{i}

σ_{i} = ρ_{0} σ_{i} + (1 - ρ_{0}) {(o_{i} - μ_{i}^{G})}^{2}

η = ρ_{0} η + (1 - ρ_{0}) {LRT}_{t}

otherwise, adjusting the parameters (μ_{i}^{L}, l_{i}) and the threshold η:

μ_{i}^{L} = ρ_{1} μ_{i}^{L} + (1 - ρ_{1}) o_{i}

l_{i} = ρ_{1} l_{i} + (1 - ρ_{1}) | o_{i} - μ_{i}^{G} |

η = ρ_{1} η + (1 - ρ_{1}) {LRT}_{t}

where 0 < ρ_{0}, ρ_{1} < 1 are update constants.