WO2004111998A1 - Out-of-vocabulary word rejection algorithms using phoneme segmentation in variable vocabulary word recognition - Google Patents

Info

Publication number
WO2004111998A1
Authority
WO
WIPO (PCT)
Prior art keywords
sound
starting point
detected
initial
phoneme
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/KR2003/001200
Other languages
French (fr)
Inventor
Chul Ho Kang
In Hak Lee
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kwangwoon Foundation
Original Assignee
Kwangwoon Foundation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kwangwoon Foundation filed Critical Kwangwoon Foundation
Priority to PCT/KR2003/001200 priority Critical patent/WO2004111998A1/en
Priority to AU2003243024A priority patent/AU2003243024A1/en
Publication of WO2004111998A1 publication Critical patent/WO2004111998A1/en
Anticipated expiration legal-status Critical
Ceased legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

Disclosed is a phoneme segmentation method designed to improve the rejection of out-of-vocabulary words by performing utterance authentication using phoneme segmentation based on the result recognized in variable vocabulary word recognition. In an out-of-vocabulary word rejection apparatus using phoneme segmentation in variable vocabulary word recognition, an initial sound detector detects an initial sound of sampled input speech data. A middle sound detector detects a starting point of a middle sound of a voiced sound existing in every phoneme after the initial sound is detected by the initial sound detector. An unvoiced sound initial sound detector detects a starting point of an initial sound of an unvoiced sound before the middle sound is detected by the middle sound detector. A detector detects a starting point of a phoneme which is not detected by the middle sound detector and the unvoiced sound initial sound detector. An out-of-vocabulary word rejecting section performs utterance authentication using the values derived from the initial sound detector, the middle sound detector, the unvoiced sound initial sound detector, and the detector.

Description

OUT-OF-VOCABULARY WORD REJECTION ALGORITHMS USING PHONEME SEGMENTATION IN VARIABLE VOCABULARY WORD RECOGNITION
Technical Field
The present invention relates to an out-of-vocabulary word rejection method using phoneme segmentation in variable vocabulary word recognition. More specifically, the present invention concerns an apparatus and a method that improve out-of-vocabulary word rejection in variable vocabulary word recognition by segmenting input speech into phonemes and calculating reliability based on the uttered speech and the recognized result.
Background Art
Utterance verification is a function that reports that out-of-vocabulary words cannot be recognized when unregistered words are input. Utterance verification is used in the design of speech recognition systems and helps make speech recognition a user-friendly technique.
FIG. 1 is a flow chart which illustrates a conventional variable vocabulary word recognition method.
FIG. 2 is a flow chart which illustrates a conventional variable vocabulary word recognition method with an utterance verification function.
In step S106, when the speech of a word to be recognized is input, the starting and end points of the speech are detected (step S107). Then, a feature vector is extracted from the speech (step S108). Separately, when a word to be recognized is input (step S101), the word is added to a recognition target word list (step S102). A pronunciation dictionary generator generates a copy of the recognition target word list (step S103). A total model for the words is generated based on previously collected phoneme models (step S104).
Thereafter, in step S109, N (N is an integer) similarity degrees between the feature vector extracted in step S108 and the total word models generated in step S104 are calculated. In step S110, the maximal similarity degree is selected from the N similarity degrees. It is then judged, based on the maximal similarity degree, whether the speaker producing the speech of the word is an identified registered speaker. The judgment result is obtained (step S111) and all operations are finished.
In detail, the recognition method of this speech recognition device differs from that of a conventional speech recognition device. Namely, when the recognition target word list changes whenever speech is input, the speech recognition device does not train anew on the vocabulary to be recognized but changes the pronunciation dictionary to reconstruct the word models. Thus, the speech recognition device can recognize an unrestricted set of words through the use of the recognition target word list.
In order to embody such variable vocabulary word recognition, all phonemes should be modeled exactly. The models should cover all the phonemes that exist in the Korean language and suit the environment in which the variable vocabulary word recognition will be used. In order to satisfy such requirements, the present invention uses the Phonetically Balanced Words (PBW) 445DB of ETRI as training speech data and 49 context isolated phoneme models.
FIG. 1b is a flow chart which illustrates a conventional variable vocabulary word recognition method having an utterance verification function. In the method shown in FIG. 1b, the utterance verification function is simply added without significantly changing the conventional variable vocabulary word recognition method, so the time required to implement it is reduced. The utterance verification function corresponds to steps S211 through S213.
In an embodiment of the present invention, the reliability of a word unit used in the variable vocabulary word recognition method is formed from the reliability of phoneme units. The reliability of a word unit is a scale indicating how reliable the speech recognition result is. It differs from the Viterbi search result value of the HMM model: the Viterbi search result value indicates the similarity degree of a predetermined word and phoneme, whereas the reliability of a word unit is a relative value for the probability that the speech for the recognized phoneme or word was uttered from another phoneme or word.
An anti-phoneme model includes the pseudo phonemes rather than the self phoneme. When 40 trained phoneme models exist, the anti-phoneme model is formed without performing any special training. f is a constant and has a negative value. A registered word i is a recognized result obtained from variable vocabulary word recognition, and is formed by N(i) phonemes. i(q) is the q-th phoneme model of the registered word i. The reliability of a word unit is expressed by the following equation 1. When the reliability of a word unit is less than or equal to the threshold value τs, the word is rejected.
[Equation 1]
$$s_i(O;\Theta) = \log\left[\frac{1}{N(i)}\sum_{q=1}^{N(i)}\exp\left\{f\,L_{i(q)}(O_q;\Theta)\right\}\right]^{1/f}$$
The similarity distance between each phoneme model and its anti-phoneme model is expressed by the following equation 2.
[Equation 2]
$$L_{i(q)}(O_q;\Theta) = g_{i(q)}(O_q) - G_{i(q)}(O_q)$$
Equation 2 is the reliability of a phoneme unit normalized by the log similarity degree $g_{i(q)}(O_q)$, and indicates the verification performance of a general phoneme unit. $g_{i(q)}(O_q)$ is the observed probability value for the self phoneme, and $G_{i(q)}(O_q)$ is the observed probability value for the anti-phoneme.
[Equation 3]
$$g_{i(q)}(O_q) = \log p\left(O_q \mid \Theta_{i(q)}\right)$$
[Equation 4]
$$G_{i(q)}(O_q) = \log p\left(O_q \mid \Theta^{a}_{i(q)}\right)$$
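As a rough illustration of how equations 1 and 2 combine, the sketch below computes a word-level reliability from per-phoneme log scores and applies the rejection threshold. The function names, the default value of `f`, and the idea that the self and anti-phoneme log scores arrive as plain numbers are assumptions made for this example; the source does not specify an implementation.

```python
import math

def phoneme_reliability(log_self, log_anti):
    """Equation 2: self-phoneme log score minus anti-phoneme log score."""
    return log_self - log_anti

def word_reliability(phoneme_scores, f=-1.0):
    """Equation 1, using log[x]^(1/f) = (1/f) * log(x).

    f must be negative, so the phoneme with the lowest reliability
    dominates the word-level value. The default f is illustrative only.
    """
    if f >= 0:
        raise ValueError("f must be a negative constant")
    n = len(phoneme_scores)
    mean_exp = sum(math.exp(f * score) for score in phoneme_scores) / n
    return math.log(mean_exp) / f

def accept_word(phoneme_scores, threshold, f=-1.0):
    """The word is rejected when its reliability is <= threshold."""
    return word_reliability(phoneme_scores, f) > threshold
```

Because `f` is negative, one badly matching phoneme pulls the word reliability down much more than a plain average would, which is the behavior a rejection score needs.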
The observed probability used to calculate equations 3 and 4 is expressed by the following equation 5.
[Equation 5]
$$b_j = \max_{k}\left\{c_{jk}\,N(o, \mu_{jk}, U_{jk})\right\}, \quad 1 \le k \le (\text{8orU8})$$
where $c_{jk}$ is the weight value of a branch, j represents a state of each phoneme, k is a branch of each state, N is the Gaussian distribution of each branch, $\mu_{jk}$ represents the average vector, and $U_{jk}$ is the covariance matrix.
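The observation probability of equation 5 takes the largest weighted branch density instead of summing over the branches. The sketch below shows that computation for one HMM state. It assumes diagonal covariance matrices and a `(weight, mean, variance)` tuple layout for each branch; neither detail comes from the source.

```python
import math

def diag_gaussian_pdf(o, mean, var):
    """Gaussian density with a diagonal covariance (assumption of this sketch)."""
    log_p = 0.0
    for x, m, v in zip(o, mean, var):
        log_p += -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
    return math.exp(log_p)

def observation_probability(o, branches):
    """Equation 5: the maximum over branches k of c_jk * N(o, mu_jk, U_jk).

    `branches` is a list of (weight, mean_vector, variance_vector) tuples
    for a single state j.
    """
    return max(c * diag_gaussian_pdf(o, mean, var) for c, mean, var in branches)
```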
However, in the conventional method, when the boundary between phonemes is not exactly defined, the reliability of the Viterbi search between a phoneme model and the input speech decreases. This results in a significantly higher misrecognition rate. That is, the out-of-vocabulary word can be accurately rejected by the Viterbi value only when the input speech is segmented exactly into phoneme units.
Disclosure of Invention
Therefore, it is an objective of the present invention to provide a phoneme segmentation method which improves the function of rejecting an out-of-vocabulary word by performing utterance authentication using phoneme segmentation based on a result recognized in variable vocabulary word recognition.
According to the present invention, there is provided an out-of-vocabulary word rejection apparatus using phoneme segmentation in a variable vocabulary word recognition, the apparatus comprising: an initial sound detector for detecting an initial sound of sampled input speech data; a middle sound detector for detecting a starting point of a middle sound of a voiced sound existing in every phoneme after the initial sound is detected by the initial sound detector; an unvoiced sound initial sound detector for detecting a starting point of an initial sound of an unvoiced sound before the middle sound is detected by the middle sound detector; a detector for detecting a starting point of a phoneme which is not detected by the middle sound detector and the unvoiced sound initial sound detector; and an out-of-vocabulary word rejecting section for performing an utterance authentication using the values detected by the initial sound detector, the middle sound detector, the unvoiced sound initial sound detector, and the detector.
There is also provided an out-of-vocabulary word rejection method using phoneme segmentation in a variable vocabulary word recognition, the method comprising the steps of:
(i) detecting an initial sound of sampled input speech data;
(ii) detecting a starting point of a middle sound of a voiced sound existing in every phoneme after the initial sound is detected in step (i);
(iii) detecting a starting point of an initial sound of an unvoiced sound before the middle sound is detected in step (ii);
(iv) detecting a starting point of a phoneme which is not detected by steps (ii) and (iii); and
(v) performing an utterance authentication using the values detected by steps (i), (ii), (iii), and (iv).
Brief Description of Drawings
The foregoing and other objects, features and advantages of the present invention will become more apparent from the following detailed description when taken in conjunction with the accompanying drawings in which:
FIG. 1a is a flow chart which illustrates a conventional variable vocabulary word recognition method;
FIG. 1b is a flow chart which illustrates a conventional variable vocabulary word recognition method having an utterance verification function;
FIG. 2 is a flow chart which illustrates a conventional variable vocabulary word recognition method having an utterance verification function according to an embodiment of the present invention;
FIG. 3 is a flow chart which shows a phoneme segmentation method according to an embodiment of the present invention;
FIG. 4 is a graph showing an initial sound detecting result which detects an initial sound of sampled input speech data; and
FIG. 5 is a graph showing a middle sound detecting result and an unvoiced sound initial sound detecting result which detect a starting point of a middle sound of a voiced sound and a start point of an initial sound of an unvoiced sound, respectively.
Best Mode for Carrying Out the Invention
Hereinafter, an out-of-vocabulary word rejection method using phoneme segmentation in variable vocabulary word recognition based on a preferred embodiment of the present invention will be described. The method rejects an out-of-vocabulary word through utterance verification using phoneme segmentation in variable vocabulary word recognition.
In conventional variable vocabulary word recognition, a rejecting function using an anti-phoneme model requires a process for segmenting the input speech into phoneme units. For this purpose, an automatic phoneme segmenting device is needed to segment the input speech into the corresponding phoneme units using HMM parameters. However, the automatic phoneme segmenting device cannot exactly detect the boundary between phonemes. Therefore, the method of the present invention minimizes the error that arises when the input speech is automatically segmented into phoneme units using HMM parameters, by rejecting an out-of-vocabulary word through utterance verification using phoneme segmentation in variable vocabulary word recognition.
FIG. 3 is a flow chart which shows a phoneme segmentation method based on an embodiment of the present invention.
(1) Step S401: obtains the average value of the speech samples in the unvoiced section. To avoid concentrating on one particular part, arbitrary samples before and after the speech are extracted and their average is taken. The average value of the samples is expressed by the following equation 6.
[Equation 6]
[Figure imgf000010_0001: equation image, not reproduced in the text]
where f is the sample number of frames, s is the total number of samples, and temp = rand() % 4.
(2) Step S402: multiplies the sample average by a weight value and adds the weighted average to every sample. The result is expressed by the following equation 7.
[Equation 7]
$$\tilde{x}(t) = \text{weighting} \times \text{average} + x(t)$$
where $\tilde{x}(t)$ is the biased sample value. The weighting is obtained by experiment. It varies with the environment but is not sensitive to it, and ranges from 10 to 20.
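Steps S401 and S402 can be sketched as follows. Because equation 6 survives only as an image, the averaging rule here is an assumption: the sketch takes the mean absolute amplitude of randomly picked samples from the leading and trailing edges of the recording, which are presumed to contain no speech. The names `noise_average`, `bias_samples`, `edge_len`, and `picks` are hypothetical.

```python
import random

def noise_average(samples, edge_len, picks, rng):
    """Mean absolute amplitude of randomly chosen samples taken from the
    first and last `edge_len` samples (assumed to be non-speech).

    This stands in for equation 6, whose exact form is not recoverable.
    """
    edges = samples[:edge_len] + samples[-edge_len:]
    chosen = [rng.choice(edges) for _ in range(picks)]
    return sum(abs(v) for v in chosen) / picks

def bias_samples(samples, average, weighting=15.0):
    """Equation 7: biased sample = weighting * average + original sample.

    The text gives 10 to 20 as the experimental range for the weighting.
    """
    return [weighting * average + v for v in samples]
```

Shifting every sample by a multiple of the background level means that low-amplitude noise no longer crosses zero, so only stronger signal activity contributes to the zero-crossing rate computed afterwards.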
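(3) Steps S403 and S404: the zero-crossing rate (ZCR) of the biased signal $\tilde{x}$ is obtained frame by frame, using overlapping frames of the speech. The ZCR is expressed by the following equation 8.
[Equation 8]
$$Z_n = \sum_{m=-\infty}^{\infty}\left|\operatorname{sgn}[\tilde{x}(m)] - \operatorname{sgn}[\tilde{x}(m-1)]\right|\,w(n-m)$$
where
$$\operatorname{sgn}[\tilde{x}(m)] = \begin{cases} 1, & \tilde{x}(m) \ge 0 \\ -1, & \tilde{x}(m) < 0 \end{cases} \qquad w(n) = \begin{cases} \dfrac{1}{2N}, & 0 \le n \le N-1 \\ 0, & \text{otherwise.} \end{cases}$$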
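A minimal sketch of the frame-wise zero-crossing rate of equation 8 follows. It uses the rectangular window w(n) = 1/(2N) and the sign convention given with the equation; the frame length and hop size are parameters chosen by the caller, since the source does not state them.

```python
def sgn(v):
    """Sign convention of equation 8: +1 for v >= 0, otherwise -1."""
    return 1 if v >= 0 else -1

def frame_zcr(x, start, frame_len):
    """ZCR of the frame x[start:start + frame_len] with w(n) = 1 / (2N)."""
    total = 0
    for m in range(max(start, 1), min(start + frame_len, len(x))):
        total += abs(sgn(x[m]) - sgn(x[m - 1]))
    return total / (2.0 * frame_len)

def zcr_track(x, frame_len, hop):
    """ZCR for successive frames; hop < frame_len gives overlapping frames."""
    return [frame_zcr(x, s, frame_len)
            for s in range(0, len(x) - frame_len + 1, hop)]
```

Applied to a signal that alternates in sign at every sample, the rate is close to 1; after a sufficiently large positive bias is added to the same signal, no sign changes remain and the rate drops to 0.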
(4) Step S405: detects the initial sound of the sampled input speech data using the obtained ZCR. FIG. 4 shows an example in which the initial sound of sampled input speech data is detected according to an embodiment of the present invention.
(5) Steps S406 and S407: segment the speech into 10 msec frames without overlapping, apply a Hamming window to each frame, and perform a 512-point Fast Fourier Transform (FFT) on it.
(6) Steps S408 and S409: obtain the power density of the low-frequency band from 0 to 400 Hz for each frame, and obtain the average power density of two consecutive frames.
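Steps S406 through S409 might be sketched as below. The 16 kHz sampling rate, the zero padding of each 10 msec frame up to 512 points, and the division by the FFT size as the "power density" normalization are assumptions of this example; the source does not state them. A small recursive radix-2 FFT is included so that the block needs only the standard library.

```python
import cmath
import math

def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out

def low_band_power(frame, sample_rate, n_fft=512, f_max=400.0):
    """Power of one Hamming-windowed frame in the band 0..f_max Hz.

    The frame must not be longer than n_fft samples.
    """
    n = len(frame)
    windowed = [frame[i] * (0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)))
                for i in range(n)]
    spectrum = fft(windowed + [0.0] * (n_fft - n))
    top_bin = int(f_max * n_fft / sample_rate)
    return sum(abs(spectrum[k]) ** 2 for k in range(top_bin + 1)) / n_fft

def average_power_track(samples, sample_rate=16000, frame_ms=10):
    """Low-band power per non-overlapping frame, averaged over frame pairs."""
    frame_len = sample_rate * frame_ms // 1000
    powers = [low_band_power(samples[s:s + frame_len], sample_rate)
              for s in range(0, len(samples) - frame_len + 1, frame_len)]
    return [(powers[i] + powers[i + 1]) / 2.0 for i in range(len(powers) - 1)]
```

Restricting the measurement to the lowest few hundred hertz emphasizes voicing energy, so the track rises where a voiced sound begins and stays low during unvoiced consonants.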
(7) Step S410: obtains the difference of the average power density between frames, using a difference interval of 70 ms (7 frames), in order to indicate the opening and closing changes of the sentence. The difference of the average power density is expressed by the following equation 9.
[Equation 9]
$$\mathrm{delta\_power}_i = \frac{\sum_{j=1}^{[M/2]} j\,\left(\mathrm{power}_{i+j} - \mathrm{power}_{i-j}\right)}{2\sum_{j=1}^{[M/2]} j^{2}}$$
where M is the differential interval and 7 frames are used as M. [M/2] is the largest integer less than or equal to M/2. $\mathrm{power}_i$ and $\mathrm{delta\_power}_i$ are the average power density of the i-th frame and the difference of the average power density, respectively.
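The difference of equation 9 can be computed as in the sketch below. It assumes the usual regression form of the delta, with squared j in the denominator, and simply skips the first and last [M/2] frames where the full interval is not available; that boundary handling is a choice of this example.

```python
def delta_power(power, i, m=7):
    """Equation 9 at frame i, with differential interval m (7 frames = 70 ms)."""
    half = m // 2
    num = sum(j * (power[i + j] - power[i - j]) for j in range(1, half + 1))
    den = 2.0 * sum(j * j for j in range(1, half + 1))
    return num / den

def delta_power_track(power, m=7):
    """delta_power for every frame that has m // 2 neighbours on both sides."""
    half = m // 2
    return [delta_power(power, i, m) for i in range(half, len(power) - half)]
```

With this form, a power track that rises linearly by a fixed amount per frame yields exactly that amount as its delta, so the value can be read as the local slope of the energy.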
(8) Step S411: detects the starting point of the middle sound of a voiced sound existing in every phoneme after the initial sound, and the starting point of the initial sound of an unvoiced sound before the middle sound, by selecting locally maximal and minimal values from the difference of the average power density. A peak may also be detected at a place which is not the starting point of a voiced sound. In this case, when the slope of the energy change has a negative value, the place is judged not to be the starting point of the voiced sound.
FIG. 5 is a graph showing a middle sound detecting result and an unvoiced sound initial sound detecting result which detect a starting point of a middle sound of a voiced sound and a starting point of an initial sound of an unvoiced sound, respectively.
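The peak picking of step S411 could look like the sketch below. Treating local maxima as voiced-onset candidates and local minima as unvoiced-onset candidates is this example's reading of the step, not something the source states explicitly, and the function names are hypothetical.

```python
def local_extrema(delta):
    """Return (maxima, minima) as lists of frame indices into `delta`."""
    maxima, minima = [], []
    for i in range(1, len(delta) - 1):
        if delta[i - 1] < delta[i] >= delta[i + 1]:
            maxima.append(i)
        elif delta[i - 1] > delta[i] <= delta[i + 1]:
            minima.append(i)
    return maxima, minima

def voiced_onsets(delta, maxima):
    """Discard peaks where the energy slope is not positive.

    A local maximum of the delta track with a negative value means the
    energy is still falling there, so it is not taken as a voiced onset.
    """
    return [i for i in maxima if delta[i] > 0]
```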
(9) Step S412: reads out utterance information of a word most similar to the input speech through the variable vocabulary word recognition.
(10) Step S413: compares a word recognized for a continuous vowel with a final sound which is not segmented, and segments a formant or a feature vector according to the compared result.
(11) Step S312 of FIG. 2: calculates the above reliability using the input speech segmented into phoneme units, and rejects the out-of-vocabulary word.
The experimental result is shown in the following table 1.
[Table 1: Figure imgf000013_0001]
• Registered word
(1) CA: Correctly Accepted for Keyword, namely, the probability of correctly accepting a registered recognition target word.
(2) FAI: False Accepted In-Grammar Word (=Keyword), namely, the probability of accepting a registered recognition target word but recognizing it wrongly.
(3) FR: False Rejected for Keyword, namely, the probability of rejecting a registered recognition target word although the registered recognition target word is uttered.
(4) Accordingly, CA + FAI + FR = 100%.
• Out-of-vocabulary word
(1) CR: Correctly Rejected for OOV, namely, the probability of rejecting an out-of-vocabulary word.
(2) FAO: False Accepted Out-of-Grammar Word (=OOV), namely, the probability of accepting an out-of-vocabulary word.
(3) Accordingly, CR + FAO = 100%.
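The five rates defined above can be derived from raw outcome counts as in the following sketch. The function name and the idea of passing counts of test utterances are assumptions; the source reports only percentages.

```python
def verification_rates(ca, fai, fr, cr, fao):
    """Convert outcome counts into percentages.

    ca, fai, fr: counts for utterances of registered words
                 (correctly accepted, accepted but misrecognized, rejected).
    cr, fao:     counts for out-of-vocabulary utterances
                 (correctly rejected, falsely accepted).
    """
    in_vocab = ca + fai + fr
    oov = cr + fao
    return {
        "CA": 100.0 * ca / in_vocab,
        "FAI": 100.0 * fai / in_vocab,
        "FR": 100.0 * fr / in_vocab,
        "CR": 100.0 * cr / oov,
        "FAO": 100.0 * fao / oov,
    }
```

Because the registered-word rates and the out-of-vocabulary rates are normalized separately, each group sums to 100%, matching the two identities given in the list.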
As shown in table 1, an experimental result of the present invention which performs utterance verification after segmenting phonemes is better than that of the conventional method.
Industrial Applicability
As mentioned above, utterance verification is performed in order to reject an out-of-vocabulary word in variable vocabulary word recognition. The out-of-vocabulary word is accurately rejected by detecting an accurate boundary between phonemes and calculating the reliability based on the detected result.
Although a preferred embodiment of the present invention has been described for illustrative purposes, those skilled in the art will appreciate that various modifications, additions and substitutions are possible, without departing from the scope and spirit of the invention as disclosed in the accompanying claims.

Claims

1. An out-of-vocabulary word rejection apparatus using phoneme segmentation in a variable vocabulary word recognition, the apparatus comprising: an initial sound detector for detecting an initial sound of sampled input speech data; a middle sound detector for detecting a starting point of a middle sound of a voiced sound existing in every phoneme after the initial sound is detected by the initial sound detector; an unvoiced sound initial sound detector for detecting a starting point of an initial sound of an unvoiced sound before the middle sound is detected by the middle sound detector; a detector for detecting a starting point of a phoneme which is not detected by the middle sound detector and the unvoiced sound initial sound detector; and an out-of-vocabulary word rejecting section for performing an utterance authentication using the values detected by the initial sound detector, the middle sound detector, the unvoiced sound initial sound detector, and the detector.
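2. The apparatus according to claim 1, wherein the initial sound detector detects an initial sound of sampled input speech data by the following equation 1:
$$\mathrm{average} = \frac{1}{2l} \sum_{j=1}^{2} \left( \sum_{t=0}^{l/2} x[t + \mathrm{temp} \cdot j \cdot l] + \sum_{t=l/2}^{0} x[s + t - \mathrm{temp} \cdot j \cdot l] \right)$$
$$\tilde{x}(t) = \mathrm{weighting} \cdot \mathrm{average} + x(t)$$
$$Z_n = \sum_{m=-\infty}^{\infty} \left| \operatorname{sgn}[\tilde{x}(m)] - \operatorname{sgn}[\tilde{x}(m-1)] \right| \, w(n-m)$$
where
$$\operatorname{sgn}[\tilde{x}(m)] = \begin{cases} 1, & \tilde{x}(m) \ge 0 \\ -1, & \tilde{x}(m) < 0 \end{cases} \qquad w(n) = \begin{cases} \dfrac{1}{2N}, & 0 \le n \le N-1 \\ 0, & \text{otherwise} \end{cases}$$
and where $l$ is the sample number of frames, $s$ is the total number of samples, $\mathrm{temp} = \mathrm{rand}() \,\%\, 4$, $\tilde{x}(t)$ is a biased sample value, and the weighting is obtained by experiment, varies according to the environment but is not sensitive to the environment, and ranges from 10 to 20.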
3. The apparatus according to claim 1, wherein the middle sound detector and the unvoiced sound initial sound detector detect the starting point of a middle sound of a voiced sound existing in every phoneme and the starting point of an initial sound of an unvoiced sound, respectively, by the following equation 2:
$$\mathrm{delta\_power}_i = \frac{\sum_{j=1}^{\lfloor M/2 \rfloor} j \left( \mathrm{power}_{i+j} - \mathrm{power}_{i-j} \right)}{2 \sum_{j=1}^{\lfloor M/2 \rfloor} j^2}$$
where $M$ is a differential interval and 7 frames are used as $M$, $\lfloor M/2 \rfloor$ is the largest integer less than or equal to $M/2$, and $\mathrm{power}_i$ and $\mathrm{delta\_power}_i$ are the average power density of the i-th frame and the difference of the average power density, respectively; a starting point of a middle sound of a voiced sound existing in every phoneme is detected after the initial sound, and a starting point of an initial sound of an unvoiced sound before the middle sound is detected, by selecting locally maximal and minimal values from the difference of the average power density; a peak may be detected at a place which is not the starting point of the voiced sound, and when the slope of the energy change at the place has a negative value, the place is determined not to be the starting point of the voiced sound.
4. An out-of-vocabulary word rejection method using phoneme segmentation in variable vocabulary word recognition, the method comprising the steps of:
(i) detecting an initial sound of sampled input speech data;
(ii) detecting a starting point of a middle sound of a voiced sound existing in every phoneme after the initial sound is detected in step (i);
(iii) detecting a starting point of an initial sound of an unvoiced sound before the middle sound is detected in step (ii);
(iv) detecting a starting point of a phoneme which is not detected by steps (ii) and (iii); and
(v) performing an utterance authentication using the values detected by steps (i), (ii), (iii), and (iv).
5. The method according to claim 4, wherein step (i) detects an initial sound of sampled input speech data by the following equation 3:
$$\mathrm{average} = \frac{1}{2l} \sum_{j=1}^{2} \left( \sum_{t=0}^{l/2} x[t + \mathrm{temp} \cdot j \cdot l] + \sum_{t=l/2}^{0} x[s + t - \mathrm{temp} \cdot j \cdot l] \right)$$
$$\tilde{x}(t) = \mathrm{weighting} \cdot \mathrm{average} + x(t)$$
$$Z_n = \sum_{m=-\infty}^{\infty} \left| \operatorname{sgn}[\tilde{x}(m)] - \operatorname{sgn}[\tilde{x}(m-1)] \right| \, w(n-m)$$
where
$$\operatorname{sgn}[\tilde{x}(m)] = \begin{cases} 1, & \tilde{x}(m) \ge 0 \\ -1, & \tilde{x}(m) < 0 \end{cases} \qquad w(n) = \begin{cases} \dfrac{1}{2N}, & 0 \le n \le N-1 \\ 0, & \text{otherwise} \end{cases}$$
and where $l$ is the sample number of frames, $s$ is the total number of samples, $\mathrm{temp} = \mathrm{rand}() \,\%\, 4$, $\tilde{x}(t)$ is a biased sample value, and the weighting is obtained by experiment, varies according to the environment but is not sensitive to the environment, and ranges from 10 to 20; and steps (ii) and (iii) detect the starting point of a middle sound of a voiced sound existing in every phoneme and the starting point of an initial sound of an unvoiced sound by the following equation 4, respectively:
$$\mathrm{delta\_power}_i = \frac{\sum_{j=1}^{\lfloor M/2 \rfloor} j \left( \mathrm{power}_{i+j} - \mathrm{power}_{i-j} \right)}{2 \sum_{j=1}^{\lfloor M/2 \rfloor} j^2}$$
where $M$ is a differential interval and 7 frames are used as $M$, $\lfloor M/2 \rfloor$ is the largest integer less than or equal to $M/2$, and $\mathrm{power}_i$ and $\mathrm{delta\_power}_i$ are the average power density of the i-th frame and the difference of the average power density, respectively; a starting point of a middle sound of a voiced sound existing in every phoneme is detected after the initial sound, and a starting point of an initial sound of an unvoiced sound before the middle sound is detected, by selecting locally maximal and minimal values from the difference of the average power density; a peak may be detected at a place which is not the starting point of the voiced sound, and when the slope of the energy change at the place has a negative value, the place is determined not to be the starting point of the voiced sound.
6. A computer-readable recording medium on which is recorded an out-of-vocabulary word rejection program using phoneme segmentation in variable vocabulary word recognition, the program executing a method comprising the steps of:
(i) detecting an initial sound of sampled input speech data;
(ii) detecting a starting point of a middle sound of a voiced sound existing in every phoneme after the initial sound is detected in step (i);
(iii) detecting a starting point of an initial sound of an unvoiced sound before the middle sound is detected in step (ii);
(iv) detecting a starting point of a phoneme which is not detected by steps (ii) and (iii); and
(v) performing an utterance authentication using the values detected by steps (i), (ii), (iii), and (iv), wherein step (i) detects an initial sound of sampled input speech data by the following equation 3:
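$$\mathrm{average} = \frac{1}{2l} \sum_{j=1}^{2} \left( \sum_{t=0}^{l/2} x[t + \mathrm{temp} \cdot j \cdot l] + \sum_{t=l/2}^{0} x[s + t - \mathrm{temp} \cdot j \cdot l] \right)$$
$$\tilde{x}(t) = \mathrm{weighting} \cdot \mathrm{average} + x(t)$$
$$Z_n = \sum_{m=-\infty}^{\infty} \left| \operatorname{sgn}[\tilde{x}(m)] - \operatorname{sgn}[\tilde{x}(m-1)] \right| \, w(n-m)$$
where
$$\operatorname{sgn}[\tilde{x}(m)] = \begin{cases} 1, & \tilde{x}(m) \ge 0 \\ -1, & \tilde{x}(m) < 0 \end{cases} \qquad w(n) = \begin{cases} \dfrac{1}{2N}, & 0 \le n \le N-1 \\ 0, & \text{otherwise} \end{cases}$$
and where $l$ is the sample number of frames, $s$ is the total number of samples, $\mathrm{temp} = \mathrm{rand}() \,\%\, 4$, $\tilde{x}(t)$ is a biased sample value, and the weighting is obtained by experiment, varies according to the environment but is not sensitive to the environment, and ranges from 10 to 20; and steps (ii) and (iii) detect the starting point of a middle sound of a voiced sound existing in every phoneme and the starting point of an initial sound of an unvoiced sound by the following equation 4, respectively: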
$$\mathrm{delta\_power}_i = \frac{\sum_{j=1}^{\lfloor M/2 \rfloor} j \left( \mathrm{power}_{i+j} - \mathrm{power}_{i-j} \right)}{2 \sum_{j=1}^{\lfloor M/2 \rfloor} j^2}$$
where $M$ is a differential interval and 7 frames are used as $M$, $\lfloor M/2 \rfloor$ is the largest integer less than or equal to $M/2$, and $\mathrm{power}_i$ and $\mathrm{delta\_power}_i$ are the average power density of the i-th frame and the difference of the average power density, respectively; a starting point of a middle sound of a voiced sound existing in every phoneme is detected after the initial sound, and a starting point of an initial sound of an unvoiced sound before the middle sound is detected, by selecting locally maximal and minimal values from the difference of the average power density; a peak point may be detected at a place which is not the starting point of the voiced sound, and when the slope of the energy change at the place has a negative value, the place is determined not to be the starting point of the voiced sound.
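By way of illustration only, the biased zero-crossing rate and the difference of the average power density recited in the above claims may be sketched in Python as follows. The sketch is not the claimed implementation. The function names (`edge_average`, `biased_zcr`, `delta_power`, `local_extrema`), the window length N being set equal to the frame length l, the clamping of frame indices at the ends of the utterance, the default weighting of 15, and the sample power values are all assumptions made for the example. The random offset temp is omitted, so the average is simply taken over the first and last half frames.

```python
def sgn(v):
    """Sign function of the claims: 1 for v >= 0, otherwise -1."""
    return 1 if v >= 0 else -1


def edge_average(x, l):
    """Mean of the first and last half frames of x (a list of samples).

    Assumed simplification: the random offset temp = rand() % 4 is left out.
    """
    half = l // 2
    if half < 1:
        raise ValueError("frame length l must be at least 2")
    edges = x[:half] + x[-half:]
    return sum(edges) / len(edges)


def biased_zcr(x, l, weighting=15.0):
    """Zero-crossing rate Z_n of the biased samples, one value per frame.

    Assumes the window length N equals the frame length l, so w = 1/(2l).
    """
    bias = weighting * edge_average(x, l)
    xb = [bias + v for v in x]            # biased sample values
    rates = []
    for n in range(l - 1, len(xb), l):    # n is the last sample of a frame
        total = 0
        for m in range(max(n - l + 1, 1), n + 1):
            total += abs(sgn(xb[m]) - sgn(xb[m - 1]))
        rates.append(total / (2 * l))
    return rates


def delta_power(power, M=7):
    """Difference of the per-frame average power density over M frames.

    Frame indices outside the utterance are clamped (an assumption).
    """
    half = M // 2
    denom = 2 * sum(j * j for j in range(1, half + 1))
    last = len(power) - 1
    deltas = []
    for i in range(len(power)):
        num = 0.0
        for j in range(1, half + 1):
            num += j * (power[min(i + j, last)] - power[max(i - j, 0)])
        deltas.append(num / denom)
    return deltas


def local_extrema(values):
    """Indices of strict local maxima and strict local minima."""
    maxima, minima = [], []
    for i in range(1, len(values) - 1):
        if values[i - 1] < values[i] > values[i + 1]:
            maxima.append(i)
        elif values[i - 1] > values[i] < values[i + 1]:
            minima.append(i)
    return maxima, minima


# Hypothetical per-frame power values: energy rises, holds, then falls.
power = [0, 0, 0, 1, 4, 9, 9, 9, 4, 1, 0, 0]
deltas = delta_power(power)
maxima, minima = local_extrema(deltas)
print("local maxima at frames", maxima, "local minima at frames", minima)
```

In this sketch a local maximum of the difference marks a frame where the energy is rising fastest and a local minimum marks a frame where it is falling fastest. A candidate at which the difference is negative would not be taken as the starting point of a voiced sound, in line with the negative-slope condition recited above.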
PCT/KR2003/001200 2003-06-18 2003-06-18 Out-of-vocabulary word rejection algorithms using phoneme segmentation in variable vocabulary word recognition Ceased WO2004111998A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/KR2003/001200 WO2004111998A1 (en) 2003-06-18 2003-06-18 Out-of-vocabulary word rejection algorithms using phoneme segmentation in variable vocabulary word recognition
AU2003243024A AU2003243024A1 (en) 2003-06-18 2003-06-18 Out-of-vocabulary word rejection algorithms using phoneme segmentation in variable vocabulary word recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/KR2003/001200 WO2004111998A1 (en) 2003-06-18 2003-06-18 Out-of-vocabulary word rejection algorithms using phoneme segmentation in variable vocabulary word recognition

Publications (1)

Publication Number Publication Date
WO2004111998A1 (en) 2004-12-23

Family

ID=33550102

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2003/001200 Ceased WO2004111998A1 (en) 2003-06-18 2003-06-18 Out-of-vocabulary word rejection algorithms using phoneme segmentation in variable vocabulary word recognition

Country Status (2)

Country Link
AU (1) AU2003243024A1 (en)
WO (1) WO2004111998A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2608196A1 (en) 2011-12-21 2013-06-26 Institut Telecom - Telecom Paristech Combinatorial method for massive word generation

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2002045960A (en) * 2000-08-07 2002-02-12 Tanaka Kikinzoku Kogyo Kk Amorphous alloy casting method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
KI-TAE KIM ET AL.: "Performance comparison of out-of-vocabulary word rejection algorithms in variable vocabulary word recognition", A COLLECTION OF LEARNED PAPERS PUBLISHED BY ACOUSTICAL SOCIETY OF KOREA, vol. 20, no. 2, 2001, KOREA, pages 27 - 34, XP008041929 *
KWANG-SIK MOON ET AL.: "Out-of-vocabulary word rejection algorithm in korean variable vocabulary word recognition", PROCEEDINGS. ISCAS 2000 GENEVA, THE 2000 IEEE INTERNATIONAL SYMPOSIUM ON CIRCUITS AND SYSTEMS, vol. 5, 28 May 2000 (2000-05-28) - 31 May 2000 (2000-05-31), pages 53 - 56, XP002904784 *

Also Published As

Publication number Publication date
AU2003243024A1 (en) 2005-01-04

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NI NO NZ OM PH PL PT RO RU SC SD SE SG SK SL TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LU MC NL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: JP