WO2004111998A1

WO2004111998A1 - Out-of-vocabulary word rejection algorithms using phoneme segmentation in variable vocabulary word recognition

Info

Publication number: WO2004111998A1
Application number: PCT/KR2003/001200
Authority: WO
Inventors: Chul Ho Kang; In Hak Lee
Original assignee: Kwangwoon Foundation
Current assignee: Kwangwoon Foundation
Priority date: 2003-06-18
Filing date: 2003-06-18
Publication date: 2004-12-23
Anticipated expiration: 2005-12-18
Also published as: AU2003243024A1

Abstract

Disclosed is a phoneme segmentation method designed to improve the rejection of out-of-vocabulary words by performing utterance authentication, using phoneme segmentation, on the result recognized in variable vocabulary word recognition. In an out-of-vocabulary word rejection apparatus using phoneme segmentation in variable vocabulary word recognition, an initial sound detector detects an initial sound of sampled input speech data. A middle sound detector detects a starting point of a middle sound of a voiced sound, which exists in every phoneme, after the initial sound is detected by the initial sound detector. An unvoiced sound initial sound detector detects a starting point of an initial sound of an unvoiced sound before the middle sound is detected by the middle sound detector. A detector detects a starting point of a phoneme which is not detected by the middle sound detector and the unvoiced sound initial sound detector. An out-of-vocabulary word rejecting section performs utterance authentication using the values derived from the initial sound detector, the middle sound detector, the unvoiced sound initial sound detector, and the detector.

Description

OUT-OF-VOCABULARY WORD REJECTION ALGORITHMS USING PHONEME SEGMENTATION IN VARIABLE VOCABULARY WORD RECOGNITION

Technical Field

The present invention relates to an out-of-vocabulary word rejection method using phoneme segmentation in variable vocabulary word recognition. More specifically, the present invention concerns an apparatus and a method for improving out-of-vocabulary word rejection in variable vocabulary word recognition by segmenting input speech into phonemes and calculating reliability based on the uttered speech and the recognized result.

Background Art

Utterance verification is a function which reports that an out-of-vocabulary word cannot be recognized when an unregistered word is input. Building utterance verification into the design of a speech recognizer makes speech recognition a more user-friendly technique.

FIG. 1a is a flow chart which illustrates a conventional variable vocabulary word recognition method.

FIG. 2 is a flow chart which illustrates a conventional variable vocabulary word recognition method with an utterance verification function.

In step S106, when speech of a word to be recognized is input, the start and end points of the speech are detected (step S107). Then, a feature vector is extracted from the speech (step S108). Separately, when a word to be recognized is input (step S101), the input word is added to a recognition target word list (step S102). A pronunciation dictionary generator generates a copy of the recognition target word list (step S103). A total model for the words is generated based on phoneme models which are collected in advance (step S104).

Thereafter, in step S109, N (N is an integer) similarity degrees between the feature vector extracted in step S108 and the total word models generated in step S104 are calculated. In step S110, the maximal similarity degree is selected from the N similarity degrees. It is then judged, based on the maximal similarity degree, whether the speaker producing the speech of the word is an identified registered speaker. The judgment result is obtained (step S111) and all operations are finished.

In detail, the recognition method of the speech recognition device is different from that of a conventional speech recognition device. Namely, when the recognition target word list changes whenever speech is input, the speech recognition device does not retrain on speech for the vocabulary to be recognized but changes the pronunciation dictionary to reconstruct the word models. Thus, the speech recognition device can recognize an unrestricted set of words through the use of the recognition target word list. In order to embody such variable vocabulary word recognition, all phonemes must be modeled exactly, and all the phonemes of the Korean language must be covered in a manner suitable for the environment in which the variable vocabulary word recognition will be used. In order to satisfy these requirements, the present invention uses the Phonetically Balanced Words (PBW) 445 DB of ETRI as training speech data and 49 context isolated phoneme models.

FIG. 1b is a flow chart which illustrates a conventional variable vocabulary word recognition method having an utterance verification function. In the method shown in FIG. 1b, the utterance verification function is simply added without significantly changing the conventional variable vocabulary word recognition method, so the time required to implement it is reduced. The utterance verification function corresponds to steps S211 through S213.

In an embodiment of the present invention, the reliability of the word unit used in the variable vocabulary word recognition method is formed from the reliability of the phoneme unit. The reliability of the word unit is a scale indicating how reliable the speech recognition result is. It is different from the Viterbi search result value of the HMM model: the Viterbi search result value indicates the similarity degree between the input and a given word or phoneme, whereas the reliability of the word unit is a value relative to the probability that the speech was uttered from a phoneme or word other than the recognized phoneme or word.

The anti-phoneme model covers the pseudo phonemes rather than the self phoneme. When 40 trained phoneme models exist, the anti-phoneme model is formed without performing any special training. $f$ is a constant and has a negative value. A registered word $i$ is the recognized result obtained from variable vocabulary word recognition, and is formed by $N(i)$ phonemes. $i(q)$ is the $q$-th phoneme model of the registered word $i$. The reliability of the word unit is expressed by the following equation 1. When the reliability of the word unit is less than or equal to the threshold value $\tau_s$, the word is rejected.

[Equation 1]

$$ s_i(O;\Theta) = \log\left[\frac{1}{N(i)}\sum_{q=1}^{N(i)}\exp\{f\,L_\gamma(O_q;\Theta)\}\right]^{1/f} $$

The similarity distance between each phoneme model and its anti-phoneme model is expressed by the following equation 2.

[Equation 2]

$$ L_\gamma(O_q;\Theta) = g_{i(q)}(O_q) - G_{i(q)}(O_q) $$

Equation 2 is the reliability of the phoneme unit normalized by the log similarity degree $g_{i(q)}(O_q)$, and indicates the verification performance of a general phoneme unit. $g_{i(q)}(O_q)$ is the observed probability value for the self phoneme, and $G_{i(q)}(O_q)$ is the observed probability value for the anti-phoneme, where $\theta_{i(q)}$ and $\bar{\theta}_{i(q)}$ denote the self phoneme model and the anti-phoneme model, respectively.

[Equation 3]

$$ g_{i(q)}(O_q) = \log p(O_q \mid \theta_{i(q)}) $$

[Equation 4]

$$ G_{i(q)}(O_q) = \log p(O_q \mid \bar{\theta}_{i(q)}) $$
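By way of illustration only, the word-level rejection decision of equations 1 and 2 may be sketched as follows. The sketch assumes that equation 2 is read as the plain difference between the self-phoneme and anti-phoneme log-likelihoods (any further normalization is left out) and that the exponent in equation 1 amounts to dividing the logarithm by the negative constant. The function names and the default value of the constant are hypothetical and do not come from the disclosure.

```python
import math


def phoneme_reliability(self_loglik, anti_loglik):
    """Assumed reading of Equation 2: self minus anti-phoneme log-likelihood."""
    return self_loglik - anti_loglik


def word_reliability(phoneme_scores, f=-1.0):
    """Assumed reading of Equation 1: (1/f) * log(mean(exp(f * L_q))).

    With a negative f this behaves like a soft minimum, so one poorly
    matching phoneme pulls the word score down. f = -1.0 is a placeholder.
    """
    if f >= 0:
        raise ValueError("f must be negative")
    if not phoneme_scores:
        raise ValueError("at least one phoneme score is required")
    mean_exp = sum(math.exp(f * s) for s in phoneme_scores) / len(phoneme_scores)
    return math.log(mean_exp) / f


def accept_word(self_logliks, anti_logliks, threshold, f=-1.0):
    """Reject (return False) when the reliability is <= the threshold."""
    scores = [phoneme_reliability(g, a) for g, a in zip(self_logliks, anti_logliks)]
    return word_reliability(scores, f) > threshold
```

Because the constant is negative, a word whose phonemes all match well scores close to their common value, while a single badly matching phoneme dominates the average and drives the word toward rejection.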

An observed probability is used to calculate the equations 3 and 4 and is expressed by the following equation 5.

[Equation 5]

$$ b_j(o) = \max_{k}\,\{c_{jk}\,N(o;\mu_{jk},U_{jk})\}, \qquad 1 \le k \le (\text{8orU8}) $$

where $c_{jk}$ is the weight value of a branch, $j$ represents a state of each phoneme, $k$ is a branch of each state, $N$ is the Gaussian distribution of each branch, $\mu_{jk}$ is the average vector, and $U_{jk}$ is the covariance matrix.
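As a minimal sketch of equation 5, the observed probability of a state may be computed as the largest weighted branch density. The sketch assumes diagonal covariance matrices, which the disclosure does not state, and all function and parameter names are hypothetical.

```python
import math


def diag_gaussian(o, mean, var):
    """Density of a Gaussian with diagonal covariance (an assumption here)."""
    log_p = 0.0
    for x, m, v in zip(o, mean, var):
        log_p += -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
    return math.exp(log_p)


def observation_probability(o, weights, means, variances):
    """b_j(o): maximum over branches k of c_jk * N(o; mu_jk, U_jk)."""
    return max(
        c * diag_gaussian(o, m, v)
        for c, m, v in zip(weights, means, variances)
    )
```

Taking the maximum instead of the sum over branches keeps only the best-matching branch of each state.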

However, in the conventional method, when a boundary between phonemes is not exactly defined, the reliability of the Viterbi search between a phoneme model and the input speech decreases, which results in a significantly higher misrecognition rate. Conversely, when the input speech is exactly segmented into phoneme units, the out-of-vocabulary word is accurately rejected by the Viterbi value.

Disclosure of Invention

Therefore, it is an objective of the present invention to provide a phoneme segmentation method which improves the rejection of out-of-vocabulary words by performing utterance authentication, using phoneme segmentation, on the result recognized in variable vocabulary word recognition.

According to the present invention, there is provided an out-of-vocabulary word rejection apparatus using phoneme segmentation in a variable vocabulary word recognition, the apparatus comprising: an initial sound detector for detecting an initial sound of sampled input speech data; a middle sound detector for detecting a starting point of a middle sound of a voiced sound existing in every phoneme after the initial sound is detected by the initial sound detector; an unvoiced sound initial sound detector for detecting a starting point of an initial sound of an unvoiced sound before the middle sound is detected by the middle sound detector; a detector for detecting a starting point of a phoneme which is not detected by the middle sound detector and the unvoiced sound initial sound detector; and an out-of-vocabulary word rejecting section for performing an utterance authentication using the values detected by the initial sound detector, the middle sound detector, the unvoiced sound initial sound detector, and the detector.

There is also provided an out-of-vocabulary word rejection method using phoneme segmentation in a variable vocabulary word recognition, the method comprising the steps of:

(i) detecting an initial sound of sampled input speech data;

(ii) detecting a starting point of a middle sound of a voiced sound existing in every phoneme after the initial sound is detected in step (i);

(iii) detecting a starting point of an initial sound of an unvoiced sound before the middle sound is detected in step (ii);

(iv) detecting a starting point of a phoneme which is not detected by steps (ii) and (iii); and

(v) performing an utterance authentication using the values detected by steps (i), (ii), (iii), and (iv).

Brief Description of Drawings

The foregoing and other objects, features and advantages of the present invention will become more apparent from the following detailed description when taken in conjunction with the accompanying drawings in which:

FIG. 1a is a flow chart which illustrates a conventional variable vocabulary word recognition method;

FIG. 1b is a flow chart which illustrates a conventional variable vocabulary word recognition method having an utterance verification function;

FIG. 2 is a flow chart which illustrates a conventional variable vocabulary word recognition method having an utterance verification function according to an embodiment of the present invention;

FIG. 3 is a flow chart which shows a phoneme segmentation method according to an embodiment of the present invention;

FIG. 4 is a graph showing an initial sound detecting result which detects an initial sound of sampled input speech data; and

FIG. 5 is a graph showing a middle sound detecting result and an unvoiced sound initial sound detecting result which detect a starting point of a middle sound of a voiced sound and a start point of an initial sound of an unvoiced sound, respectively.

Best Mode for Carrying Out the Invention

Hereinafter, an out-of-vocabulary word rejection method using phoneme segmentation in a variable vocabulary word recognition based on a preferred embodiment of the present invention will be described. The out-of-vocabulary word rejection method of the present invention rejects an out-of-vocabulary word through utterance verification using phoneme segmentation in a variable vocabulary word recognition.

In the conventional variable vocabulary word recognition, a rejecting function using an anti-phoneme model requires a process for segmenting the input speech into phoneme units. To do so, an automatic phoneme segmenting device is needed which automatically segments the input speech into the corresponding phoneme units using HMM parameters. However, the automatic phoneme segmenting device cannot exactly detect a boundary between phonemes. Therefore, the method of the present invention minimizes the error that arises when input speech is automatically segmented into phoneme units using HMM parameters, by rejecting an out-of-vocabulary word through utterance verification using phoneme segmentation in a variable vocabulary word recognition.

FIG. 3 is a flow chart which shows a phoneme segmentation method based on an embodiment of the present invention.

(1) Step S401: obtains an average value of the speech samples of the unvoiced section. In order to avoid concentrating on one particular part, arbitrary samples before and after the speech are extracted and the average of the samples is taken. The average value of the samples is expressed by the following equation 6.

[Equation 6]

$$ \mathrm{average} = \frac{1}{2l}\sum_{j=1}^{2}\left(\sum_{t=0}^{l/2} x[t + \mathrm{temp}\cdot j\cdot l] + \sum_{t=-l/2}^{0} x[s + t - \mathrm{temp}\cdot j\cdot l]\right) $$

where $l$ is the number of samples in a frame, $s$ is the total number of samples, and temp = rand()%4.

(2) Step S402: multiplies the average of the samples by a weight value and adds the weighted average to every sample. The result is expressed by the following equation 7.

[Equation 7]

$$ \tilde{x}(t) = \mathrm{weighting} \cdot \mathrm{average} + x(t) $$

where $\tilde{x}(t)$ is a biased sample value. The weighting is obtained by an experiment. The weighting varies with the environment but is not sensitive to it, and ranges from 10 to 20.
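Steps S401 and S402 may be sketched roughly as follows. This is a loose illustration rather than the exact summation of equation 6: it draws half a frame of samples from the head and from the tail of the signal at two random frame offsets, averages the raw sample values as the text states, and skips indices that fall outside the signal. The helper names, the seeding of the random generator, and the default weighting of 15 (a value inside the stated range of 10 to 20) are assumptions.

```python
import random


def noise_average(x, frame_len, rng=None):
    """Average of samples taken before and after the speech (Step S401)."""
    rng = rng or random.Random()
    temp = rng.randrange(4)  # counterpart of rand() % 4 in the text
    total = len(x)
    picked = []
    for j in (1, 2):
        offset = temp * j * frame_len
        for t in range(frame_len // 2):
            head = t + offset              # sample near the beginning
            tail = total - 1 - t - offset  # sample near the end
            if 0 <= head < total:
                picked.append(x[head])
            if 0 <= tail < total:
                picked.append(x[tail])
    if not picked:
        raise ValueError("signal too short for the chosen frame length")
    return sum(picked) / len(picked)


def add_bias(x, average, weighting=15.0):
    """Biased samples of Equation 7: weighting * average + x(t) (Step S402)."""
    return [weighting * average + v for v in x]
```

A typical call would compute `average = noise_average(samples, frame_len)` once per utterance and then pass `add_bias(samples, average)` on to the zero cross rate computation.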

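(3) Steps S403 and S404: A biased Zero Cross Rate (ZCR) is obtained frame by frame, using overlapping frames of the speech. The ZCR is expressed by the following equation 8.

[Equation 8]

$$ Z_n = \sum_{m=-\infty}^{\infty}\left|\,\mathrm{sgn}[\tilde{x}(m)] - \mathrm{sgn}[\tilde{x}(m-1)]\,\right|\,w(n-m) $$

where $\tilde{x}(m)$ is the biased sample value and

$$ \mathrm{sgn}[\tilde{x}(m)] = \begin{cases} 1, & \tilde{x}(m) \ge 0 \\ -1, & \tilde{x}(m) < 0 \end{cases} \qquad w(n) = \begin{cases} \dfrac{1}{2N}, & 0 \le n \le N-1 \\ 0, & \text{otherwise.} \end{cases} $$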
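A minimal sketch of the frame-wise zero cross rate of equation 8 is given below. It assumes that the input is the already biased sample sequence, counts sign changes inside each frame only (the sign change at the frame boundary is ignored for simplicity), and takes the frame length and hop size as hypothetical parameters, with a hop smaller than the frame length giving overlapping frames.

```python
def sgn(value):
    """Sign function of Equation 8: +1 for value >= 0, otherwise -1."""
    return 1 if value >= 0 else -1


def zero_cross_rate(x, frame_len, hop):
    """Z_n per frame: sum of |sgn x(m) - sgn x(m-1)| scaled by 1 / (2N)."""
    rates = []
    for start in range(0, len(x) - frame_len + 1, hop):
        frame = x[start:start + frame_len]
        changes = sum(
            abs(sgn(frame[m]) - sgn(frame[m - 1]))
            for m in range(1, frame_len)
        )
        rates.append(changes / (2 * frame_len))
    return rates
```

With a positive bias added beforehand, low-level background noise no longer crosses zero, so its rate drops toward 0 while larger-amplitude speech still produces crossings.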

(4) Step S405: detects an initial sound of the sampled input speech data using the obtained ZCR. FIG. 4 shows an example which detects the initial sound of sampled input speech data according to an embodiment of the present invention.

(5) Steps S406 and S407: segment the speech into frames of 10 msec without overlapping, apply a Hamming window to each frame, and then perform a 512-point Fast Fourier Transform (FFT).

(6) Steps S408 and S409: obtain the power density of the low frequency band of 0 to 400 Hz for each frame, and obtain the average power density of two consecutive frames.
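Steps S406 through S409 may be sketched as follows, under stated assumptions: the sampling rate is a caller-supplied parameter that the disclosure does not fix, each 10 msec frame is zero-padded to 512 points, the power density is taken as the sum of squared magnitudes of the bins up to 400 Hz divided by the FFT size, and a small recursive FFT stands in for whatever FFT routine an implementation would use. All names are hypothetical.

```python
import cmath
import math


def fft(a):
    """Recursive radix-2 FFT; len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even, odd = fft(a[0::2]), fft(a[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddled = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddled
        out[k + n // 2] = even[k] - twiddled
    return out


def low_band_power(frame, sample_rate, n_fft=512, f_max=400.0):
    """Power of one Hamming-windowed frame in the 0 to f_max Hz band."""
    n = len(frame)
    if not 2 <= n <= n_fft:
        raise ValueError("frame must hold between 2 and n_fft samples")
    window = [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]
    padded = [v * w for v, w in zip(frame, window)] + [0.0] * (n_fft - n)
    spectrum = fft(padded)
    top_bin = int(f_max * n_fft / sample_rate)  # highest bin at or below f_max
    return sum(abs(spectrum[k]) ** 2 for k in range(top_bin + 1)) / n_fft


def averaged_low_band_power(x, sample_rate):
    """Low-band power of 10 msec frames, averaged over two consecutive frames."""
    frame_len = round(0.010 * sample_rate)
    powers = [
        low_band_power(x[i:i + frame_len], sample_rate)
        for i in range(0, len(x) - frame_len + 1, frame_len)
    ]
    return [(a + b) / 2.0 for a, b in zip(powers, powers[1:])]
```

Voiced sounds carry most of their energy in this low band, which is why the averaged low-band power rises at the start of a voiced middle sound.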

(7) Step S410: obtains the difference of the average power density across frames, using a difference interval of 70 ms (7 frames), in order to indicate changes of opening and closing of sentence. The difference of the average power density is expressed by the following equation 9.

[Equation 9]

$$ \mathrm{delta\_power}_i = \frac{\sum_{j=1}^{[M/2]} j\,(\mathrm{power}_{i+j} - \mathrm{power}_{i-j})}{2\sum_{j=1}^{[M/2]} j^2} $$

where $M$ is the differential interval, for which 7 frames are used, and $[M/2]$ is the largest integer less than or equal to $M/2$. $\mathrm{power}_i$ and $\mathrm{delta\_power}_i$ are the average power density of the $i$-th frame and the difference of the average power density, respectively.
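The difference of equation 9 may be sketched as below. The clamping of indices at the first and last frames is an assumption, since the disclosure does not say how the edges are handled, and the function name is hypothetical.

```python
def delta_power(power, interval=7):
    """Regression-style difference of the average power density (Step S410)."""
    half = interval // 2  # [M/2], which is 3 for M = 7
    denom = 2 * sum(j * j for j in range(1, half + 1))  # 2 * (1 + 4 + 9) = 28
    last = len(power) - 1
    deltas = []
    for i in range(len(power)):
        num = 0.0
        for j in range(1, half + 1):
            ahead = power[min(i + j, last)]  # clamp at the last frame
            behind = power[max(i - j, 0)]    # clamp at the first frame
            num += j * (ahead - behind)
        deltas.append(num / denom)
    return deltas
```

For a power sequence that rises by one unit per frame, every interior frame yields a difference of exactly 1, which shows that the formula estimates the local slope of the power contour.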

(8) Step S411: detects the starting point of a middle sound of a voiced sound existing in every phoneme after the initial sound, and the starting point of an initial sound of an unvoiced sound before the middle sound, by selecting locally maximal and minimal values from the difference of the average power density. A peak may also be detected at a place which is not the starting point of a voiced sound. In this case, when the slope of the energy change has a negative value, the place is judged not to be the starting point of the voiced sound.

FIG. 5 is a graph showing a middle sound detecting result and an unvoiced sound initial sound detecting result which detect a starting point of a middle sound of a voiced sound and a starting point of an initial sound of an unvoiced sound, respectively.
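The peak selection of step S411 may be illustrated as follows. Two points are assumptions made only for this sketch: local maxima of the difference are treated as candidates for voiced middle-sound starting points and local minima as candidates for unvoiced initial-sound starting points, and the "slope of the energy change" is taken to be the value of the difference itself at the peak. The function names are hypothetical.

```python
def local_extrema(delta):
    """Indices of local maxima and local minima of the power difference."""
    maxima, minima = [], []
    for i in range(1, len(delta) - 1):
        if delta[i] > delta[i - 1] and delta[i] >= delta[i + 1]:
            maxima.append(i)
        elif delta[i] < delta[i - 1] and delta[i] <= delta[i + 1]:
            minima.append(i)
    return maxima, minima


def voiced_start_candidates(delta):
    """Keep only the maxima where the energy slope is not negative."""
    maxima, _ = local_extrema(delta)
    return [i for i in maxima if delta[i] >= 0]
```

The filter discards peaks that occur while the energy is falling, which matches the stated rule that such places are not starting points of a voiced sound.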

(9) Step S412: reads out utterance information of a word most similar to the input speech through the variable vocabulary word recognition.

(10) Step S413: compares a word recognized for a continuous vowel with a final sound which is not segmented, and segments a formant or a feature vector according to the compared result.

(11) Step S312 of FIG. 2: calculates the above reliability using the input speech which is segmented into phoneme units, and rejects the out-of-vocabulary word.

The experimental result is expressed by the following table 1.

• Registered word

(1) CA: Correctly Accepted for Keyword, namely, the probability of correctly accepting a registered recognition target word.

(2) FAI: False Accepted In-Grammar Word (=Keyword), namely, the probability of accepting a registered recognition target word but recognizing it wrongly.

(3) FR: False Rejected for Keyword, namely, the probability of rejecting a registered recognition target word although the registered recognition target word is uttered.

(4) Accordingly, CA + FAI + FR = 100%.

• Out-of-vocabulary word

(1) CR: Correctly Rejected for OOV, namely, the probability of rejecting an out-of-vocabulary word.

(2) FAO: False Accepted Out-of-Grammar Word (=OOV), namely, the probability of accepting an out-of-vocabulary word.

(3) Accordingly, CR + FAO = 100%.
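The five rates defined above may be computed from a list of trial outcomes as sketched below. The trial representation (a tuple of three booleans) and the function name are hypothetical, and the sketch uses no data from the experiment of the disclosure.

```python
def verification_rates(trials):
    """Compute CA, FAI, FR, CR and FAO in percent.

    Each trial is (is_registered, accepted, recognized_correctly).
    """
    def pct(count, total):
        return 100.0 * count / total if total else 0.0

    registered = [t for t in trials if t[0]]
    oov = [t for t in trials if not t[0]]

    ca = sum(1 for _, accepted, correct in registered if accepted and correct)
    fai = sum(1 for _, accepted, correct in registered if accepted and not correct)
    fr = sum(1 for _, accepted, correct in registered if not accepted)
    cr = sum(1 for _, accepted, correct in oov if not accepted)
    fao = sum(1 for _, accepted, correct in oov if accepted)

    return {
        "CA": pct(ca, len(registered)),
        "FAI": pct(fai, len(registered)),
        "FR": pct(fr, len(registered)),
        "CR": pct(cr, len(oov)),
        "FAO": pct(fao, len(oov)),
    }
```

By construction, the three registered-word rates sum to 100% and the two out-of-vocabulary rates sum to 100%, in agreement with the identities stated above.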

As shown in table 1, an experimental result of the present invention which performs utterance verification after segmenting phonemes is better than that of the conventional method.

Industrial Applicability

As mentioned above, utterance verification is performed in order to reject an out-of-vocabulary word in a variable vocabulary word recognition. At this time, the out-of-vocabulary word is exactly rejected by detecting an accurate boundary between phonemes and calculating reliability based on the detected result.

Although a preferred embodiment of the present invention has been described for illustrative purposes, those skilled in the art will appreciate that various modifications, additions and substitutions are possible, without departing from the scope and spirit of the invention as disclosed in the accompanying claims.

Claims

1. An out-of-vocabulary word rejection apparatus using phoneme segmentation in a variable vocabulary word recognition, the apparatus comprising: an initial sound detector for detecting an initial sound of sampled input speech data; a middle sound detector for detecting a starting point of a middle sound of a voiced sound existing in every phoneme after the initial sound is detected by the initial sound detector; an unvoiced sound initial sound detector for detecting a starting point of an initial sound of an unvoiced sound before the middle sound is detected by the middle sound detector; a detector for detecting a starting point of a phoneme which is not detected by the middle sound detector and the unvoiced sound initial sound detector; and an out-of-vocabulary word rejecting section for performing an utterance authentication using the values detected by the initial sound detector, the middle sound detector, the unvoiced sound initial sound detector, and the detector.

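2. The apparatus according to claim 1, wherein the initial sound detector detects an initial sound of sampled input speech data by the following equation 1:

$$ \mathrm{average} = \frac{1}{2l}\sum_{j=1}^{2}\left(\sum_{t=0}^{l/2} x[t + \mathrm{temp}\cdot j\cdot l] + \sum_{t=-l/2}^{0} x[s + t - \mathrm{temp}\cdot j\cdot l]\right) $$

$$ \tilde{x}(t) = \mathrm{weighting} \cdot \mathrm{average} + x(t) $$

$$ Z_n = \sum_{m=-\infty}^{\infty}\left|\,\mathrm{sgn}[\tilde{x}(m)] - \mathrm{sgn}[\tilde{x}(m-1)]\,\right|\,w(n-m) $$

$$ \mathrm{sgn}[\tilde{x}(m)] = \begin{cases} 1, & \tilde{x}(m) \ge 0 \\ -1, & \tilde{x}(m) < 0 \end{cases} \qquad w(n) = \begin{cases} \dfrac{1}{2N}, & 0 \le n \le N-1 \\ 0, & \text{otherwise} \end{cases} $$

where $l$ is the number of samples in a frame, $s$ is the total number of samples, and temp = rand()%4, $\tilde{x}(t)$ is a biased sample value, and the weighting is obtained by an experiment, varies according to an environment but is not sensitive to the environment, and ranges from 10 to 20.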

3. The apparatus according to claim 1, wherein the middle sound detector and the unvoiced sound initial sound detector detect the starting point of a middle sound of a voiced sound existing in every phoneme and the starting point of an initial sound of an unvoiced sound by the following equation 2, respectively:

$$ \mathrm{delta\_power}_i = \frac{\sum_{j=1}^{[M/2]} j\,(\mathrm{power}_{i+j} - \mathrm{power}_{i-j})}{2\sum_{j=1}^{[M/2]} j^2} $$

where $M$ is a differential interval and 7 frames are used as $M$, $[M/2]$ is the largest integer less than or equal to $M/2$, $\mathrm{power}_i$ and $\mathrm{delta\_power}_i$ are the average power density of an $i$-th frame and the difference of the average power density, respectively, a starting point of a middle sound of a voiced sound existing in every phoneme is detected after the initial sound and a starting point of an initial sound of an unvoiced sound is detected before the middle sound by selecting locally maximal and minimal values from the difference of the average power density, a peak is detected at a place which is not the starting point of the voiced sound, and when a slope of an energy change has a negative value, the place is detected not to be the starting point of the voiced sound.

4. An out-of-vocabulary word rejection method using phoneme segmentation in a variable vocabulary word recognition, the method comprising the steps of:

(i) detecting an initial sound of sampled input speech data;

(ii) detecting a starting point of a middle sound of a voiced sound existing in every phoneme after the initial sound is detected in step (i);

(iii) detecting a starting point of an initial sound of an unvoiced sound before the middle sound is detected in step (ii);

(iv) detecting a starting point of a phoneme which is not detected by steps (ii) and (iii); and

(v) performing an utterance authentication using the values detected by steps (i), (ii), (iii), and (iv).

5. The method according to claim 4, wherein step (i) detects an initial sound of sampled input speech data by the following equation 3:

$$ \mathrm{average} = \frac{1}{2l}\sum_{j=1}^{2}\left(\sum_{t=0}^{l/2} x[t + \mathrm{temp}\cdot j\cdot l] + \sum_{t=-l/2}^{0} x[s + t - \mathrm{temp}\cdot j\cdot l]\right) $$

$$ \tilde{x}(t) = \mathrm{weighting} \cdot \mathrm{average} + x(t) $$

$$ Z_n = \sum_{m=-\infty}^{\infty}\left|\,\mathrm{sgn}[\tilde{x}(m)] - \mathrm{sgn}[\tilde{x}(m-1)]\,\right|\,w(n-m) $$

$$ \mathrm{sgn}[\tilde{x}(m)] = \begin{cases} 1, & \tilde{x}(m) \ge 0 \\ -1, & \tilde{x}(m) < 0 \end{cases} \qquad w(n) = \begin{cases} \dfrac{1}{2N}, & 0 \le n \le N-1 \\ 0, & \text{otherwise} \end{cases} $$

where $l$ is the number of samples in a frame, $s$ is the total number of samples, and temp = rand()%4, $\tilde{x}(t)$ is a biased sample value, and the weighting is obtained by an experiment, varies according to an environment but is not sensitive to the environment, and ranges from 10 to 20; and steps (ii) and (iii) detect the starting point of a middle sound of a voiced sound existing in every phoneme and the starting point of an initial sound of an unvoiced sound by the following equation 4, respectively:

$$ \mathrm{delta\_power}_i = \frac{\sum_{j=1}^{[M/2]} j\,(\mathrm{power}_{i+j} - \mathrm{power}_{i-j})}{2\sum_{j=1}^{[M/2]} j^2} $$

where $M$ is a differential interval and 7 frames are used as $M$, $[M/2]$ is the largest integer less than or equal to $M/2$, $\mathrm{power}_i$ and $\mathrm{delta\_power}_i$ are the average power density of an $i$-th frame and the difference of the average power density, respectively, a starting point of a middle sound of a voiced sound existing in every phoneme is detected after the initial sound and a starting point of an initial sound of an unvoiced sound is detected before the middle sound by selecting locally maximal and minimal values from the difference of the average power density, a peak is detected at a place which is not the starting point of the voiced sound, and when a slope of an energy change has a negative value, the place is detected not to be the starting point of the voiced sound.

6. A recording medium readable by a computer which executes an out-of-vocabulary word rejection program using phoneme segmentation in a variable vocabulary word recognition, the program comprising the steps of:

(i) detecting an initial sound of sampled input speech data;

(ii) detecting a starting point of a middle sound of a voiced sound existing in every phoneme after the initial sound is detected in step (i);

(iii) detecting a starting point of an initial sound of an unvoiced sound before the middle sound is detected in step (ii);

(iv) detecting a starting point of a phoneme which is not detected by steps (ii) and (iii); and

(v) performing an utterance authentication using the values detected by steps (i), (ii), (iii), and (iv),

wherein step (i) detects an initial sound of sampled input speech data by the following equation 3:

$$ \mathrm{average} = \frac{1}{2l}\sum_{j=1}^{2}\left(\sum_{t=0}^{l/2} x[t + \mathrm{temp}\cdot j\cdot l] + \sum_{t=-l/2}^{0} x[s + t - \mathrm{temp}\cdot j\cdot l]\right) $$

$$ \tilde{x}(t) = \mathrm{weighting} \cdot \mathrm{average} + x(t) $$

$$ Z_n = \sum_{m=-\infty}^{\infty}\left|\,\mathrm{sgn}[\tilde{x}(m)] - \mathrm{sgn}[\tilde{x}(m-1)]\,\right|\,w(n-m) $$

$$ \mathrm{sgn}[\tilde{x}(m)] = \begin{cases} 1, & \tilde{x}(m) \ge 0 \\ -1, & \tilde{x}(m) < 0 \end{cases} \qquad w(n) = \begin{cases} \dfrac{1}{2N}, & 0 \le n \le N-1 \\ 0, & \text{otherwise} \end{cases} $$

where $l$ is the number of samples in a frame, $s$ is the total number of samples, and temp = rand()%4, $\tilde{x}(t)$ is a biased sample value, and the weighting is obtained by an experiment, varies according to an environment but is not sensitive to the environment, and ranges from 10 to 20; and steps (ii) and (iii) detect the starting point of a middle sound of a voiced sound existing in every phoneme and the starting point of an initial sound of an unvoiced sound by the following equation 4, respectively:

$$ \mathrm{delta\_power}_i = \frac{\sum_{j=1}^{[M/2]} j\,(\mathrm{power}_{i+j} - \mathrm{power}_{i-j})}{2\sum_{j=1}^{[M/2]} j^2} $$

where $M$ is a differential interval and 7 frames are used as $M$, $[M/2]$ is the largest integer less than or equal to $M/2$, $\mathrm{power}_i$ and $\mathrm{delta\_power}_i$ are the average power density of an $i$-th frame and the difference of the average power density, respectively, a starting point of a middle sound of a voiced sound existing in every phoneme is detected after the initial sound and a starting point of an initial sound of an unvoiced sound is detected before the middle sound by selecting locally maximal and minimal values from the difference of the average power density, a peak point is detected at a place which is not the starting point of the voiced sound, and when a slope of an energy change has a negative value, the place is detected not to be the starting point of the voiced sound.