WO2009078665A1 - Method and apparatus for lexical decoding - Google Patents
- Publication number
- WO2009078665A1 (PCT/KR2008/007481)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- phonemic
- words
- decoding
- edit distance
- similarity
- Prior art date
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/10—Speech classification or search using distance or distortion measures between unknown speech and reference templates
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G10L2015/025—Phonemes, fenemes or fenones being the recognition units
Definitions
- the edit distance between two phonemic strings is obtained using the dynamic program as shown in Equation 4.
- a global constraint is applied.
- optimal edit distances for the entire search space consisting of the two given phonemic strings are obtained through the dynamic program, as in Fig. 5.
- Fig. 5 describes whether a search space is restricted upon edit distance measurement based on a phonemic length in accordance with an embodiment of the present invention.
- the search may be performed only in a search space corresponding to a slant-line area 500 indicated by a solid line in Fig. 5.
- the search space must increase from a diagonal line in proportion to the difference.
- an area 502 indicated by dotted lines corresponds to the difference between the two phonemic strings being 1.
- an area 504 indicated by thick dotted lines corresponds to the difference between the two phonemic strings being 2.
- the search space is constrained, i.e., determined in proportion to the difference in length between the two phonemic strings whose edit distance is to be obtained, such that a search time can be greatly reduced, as compared with a conventional scheme that performs the search over all search areas.
- Another scheme for improving the lexical decoding speed includes performing a lexical decoding process in two steps.
- FIG. 3 illustrates a block diagram showing a 2-step lexical decoding structure using a word clustering scheme in accordance with an embodiment of the present invention.
- FIG. 6 shows a flowchart representing a word clustering algorithm based on an edit distance between words depending on a phonemic length in accordance with an embodiment of the present invention.
- In step 600, all recognition object words are inputted, and in step 602, an edit distance between all recognition object words is measured.
- In step 604, words having the smallest edit distance among the measured words are incorporated into each cluster.
- In step 606, a total edit distance is measured and a determination is made as to whether the total edit distance is greater than a threshold value. If the total edit distance is greater than the threshold value, the process proceeds to step 608.
- In step 608, a cluster having the greatest total edit distance is divided and the process returns to step 602, where the edit distance from the cluster is measured. The words are again incorporated into the cluster. This procedure is repeatedly performed. Clustering ends only if the measured total edit distance is smaller than the threshold value.
- the process of obtaining N clusters representative of all the words is referred to as a quantization process of recognition object word.
- the similarity measurer 300 measures a similarity between the word representative of the cluster and the inputted phonemic string to select N clusters having the smallest similarity value.
- the similarity measurer 302 for the selected clusters again performs the lexical decoding on all words forming the clusters to output a result of final lexical decoding.
- This 2-step lexical decoding process finds the cluster which is a set of words in the first step and then finds words belonging to the cluster, based on the cluster. This can greatly increase the decoding speed, as compared with a conventional scheme of searching for all words at a time.
- the lexical decoding speed can be improved by using the word clustering scheme and the information on the length of an inputted phonemic string.
- the search words are constrained upon lexical decoding based on the length of the phonemic string outputted via phonemic decoding, and a constraint is imposed to search for only a difference in phonemic number upon the edit distance measurement.
- all recognition object words are divided into clusters representative of the words using clustering scheme based on the edit distance, a similarity between the inputted phonemic string and the word representative of the cluster is then measured to select the N best clusters, and lexical decoding is performed on only the words forming the cluster again.
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention relates to a lexical decoding method including detecting a length of a phonemic string outputted through phonemic decoding based on an inputted speech signal, selecting recognition object words having a length of phonemic string similar to the length of phonemic string based on the detected length of phonemic string, measuring an edit distance from the phonemic string based on the selected recognition object words, and outputting at least one of the recognition object word having the smallest edit distance from the phonemic string through the measuring the edit distance. The phonemic decoding obtains a phonemic string of phonemic model by outputting a maximum similarity for the inputted speech signal.
Description
METHOD AND APPARATUS FOR LEXICAL DECODING
Technical Field
[1] The present invention relates to a human speech recognition (hereinafter, 'HSR') technique and, more particularly, to a lexical decoding method and apparatus suitable for improving lexical decoding speed using a word clustering scheme and length information for an input phoneme.
[2] This work was supported by the IT R&D program of MIC/IITA [2006-S-036-02, Development of large vocabulary/interactive distributed/embedded VUI for new growth engine industries].
[3]
Background Art
[4] In general, HSR-based speech recognition is performed by phonemic decoding and lexical decoding.
[5] The HSR-based speech recognition will be described in detail with reference to Fig. 1.
[6] Fig. 1 depicts a block diagram showing a typical human speech recognition system.
[7] Referring to Fig. 1, a human speech recognition system 100 includes a phonemic decoding unit 102, a lexical decoding unit 104, and an acoustic rescoring unit 106.
[8] The phonemic decoding unit 102 obtains the sequence of phonetic symbols having maximum similarity to the inputted speech signals. In the phonemic decoding process, ambient noise, inaccurate acoustic models, and so on may introduce errors into the decoded phonemic string. Accordingly, the lexical decoding unit 104 outputs the N best words, that is, the N recognition object words having the smallest edit distance from the decoded phonemic string produced by the phonemic decoding unit 102. Further, the N best words outputted from the lexical decoding unit 104 are sent to the acoustic rescoring unit 106, which rescores them using a sophisticated acoustic model to re-adjust the recognition priority and output the recognition result.
[9] In the conventional human speech recognition system operating as described above, the most similar word or word sequence corresponding to the speech inputted by the user is outputted through phonemic decoding, lexical decoding, and acoustic rescoring. Even though such multi-stage processing performs speech recognition faster than a conventional speech recognition system based on the continuous density hidden Markov model (CDHMM), it still takes too much time for practical use when recognizing more than hundreds of thousands of words. The present invention therefore proposes a method for reducing decoding time in recognition tasks involving a very large number of words.
[10]
Disclosure of Invention
Technical Problem
[11] It is, therefore, a primary object of the present invention to provide a lexical decoding method and apparatus capable of improving lexical decoding speed using a word clustering scheme and length information for an input phoneme string that is a result of phonemic decoding.
[12] It is another object of the present invention to provide a lexical decoding method and apparatus capable of constraining search words upon lexical decoding based on the length of a phonemic string outputted via phonemic decoding, and of imposing a path constraint in measuring edit distance so that the search is limited to words whose phoneme lengths differ by only a small number.
[13] It is still another object of the present invention to provide a lexical decoding method and apparatus for dividing all recognition object words into clusters using an edit distance-based clustering scheme, measuring a similarity between an inputted phonemic string and the word representative of each cluster to select the N best clusters, and then performing the lexical decoding again on only the words forming the selected clusters.
[14]
Technical Solution
[15] In accordance with one aspect of the invention, there is provided a lexical decoding method including detecting a length of a phonemic string outputted through phonemic decoding of an inputted speech signal; selecting, based on the detected length, recognition object words whose phonemic strings have a length similar to the detected length; measuring an edit distance between the phonemic string and each of the selected recognition object words; and outputting at least one of the recognition object words having the smallest edit distance from the phonemic string, as determined by the edit distance measurement.
[16] It is desirable that the phonemic decoding generates as accurate a phoneme sequence as possible for the inputted speech signal.
[17] It is also desirable that the method further includes determining a space to be searched depending on the number of phonemes when measuring the edit distance.
[18] In accordance with another aspect of the invention, there is provided a lexical decoding method including creating clusters by clustering together, among all recognition object words, those words having a high similarity as classified by an edit distance; selecting words which are representative of the created clusters; measuring a first similarity between the selected words and a phonemic string outputted through phonemic decoding of a speech signal, to select the one of the clusters which corresponds to the selected word having an optimal similarity; measuring a second similarity between only the words forming the selected cluster and the phonemic string; and outputting optimal words through the similarity measurement.
[19] It is preferable that the phonemic decoding generates as accurate a phoneme sequence as possible for the inputted speech signal.
[20] In accordance with still another aspect of the invention, there is provided a lexical decoding apparatus including a detector for detecting a length of a phonemic string outputted through phonemic decoding of an inputted speech signal; a word selector for selecting recognition object words whose phonemic strings have a length similar to the detected length; and an edit distance measurer for measuring an edit distance between the phonemic string and each of the selected recognition object words, to output at least one of the recognition object words having the smallest edit distance from the phonemic string, as determined by the edit distance measurement.
[21] It is preferable that the phonemic decoding generates as accurate a phoneme sequence as possible for the inputted speech signal.
[22] It is also preferred that the edit distance measurer determines the space to be searched in accordance with the number of phonemes upon the edit distance measurement.
[23] In accordance with still another aspect of the invention, there is provided a lexical decoding apparatus including a clustering unit for creating clusters by clustering together, among all recognition object words, those words having a high similarity as classified by an edit distance; a representative word selector for selecting words representative of the created clusters; a first similarity measurer for measuring a similarity between the selected words and a phonemic string outputted through phonemic decoding of a speech signal, and selecting the cluster corresponding to the word having an optimal similarity; and a second similarity measurer for measuring a similarity between only the words forming the selected cluster and the phonemic string, to output optimal words through the similarity measurement.
[24] It is preferable that the phonemic decoding generates as accurate a phoneme sequence as possible for the inputted speech signal.
[25]
Advantageous Effects
[26] The present invention has the following advantages: in the human speech recognition system, the lexical decoding process is performed by using the word clustering scheme or the length information for the input phoneme, whereby the speed of the lexical decoding process can be improved and the total speech recognition time can be reduced without a degradation of speech recognition performance.
Brief Description of the Drawings
[27] The above and other objects and features of the present invention will become apparent from the following description of embodiments given in conjunction with the accompanying drawings, in which:
[28] Fig. 1 shows a block diagram representing a typical human speech recognition system;
[29] Fig. 2 depicts a block diagram representing a lexical decoding structure that employs a scheme of reducing a search space using length information for a phonemic string in accordance with a preferred embodiment of the present invention;
[30] Fig. 3 illustrates a block diagram showing a 2-step lexical decoding structure using a word clustering scheme in accordance with a preferred embodiment of the present invention;
[31] Fig. 4 illustrates a lexical decoding process based on a phonemic length in accordance with a preferred embodiment of the present invention;
[32] Fig. 5 illustrates how a search space is restricted upon edit distance measurement based on a phonemic length in accordance with a preferred embodiment of the present invention; and
[33] Fig. 6 shows a flowchart representing a word clustering algorithm based on an edit distance between words depending on a phonemic length in accordance with a preferred embodiment of the present invention.
[34]
Best Mode for Carrying Out the Invention
[35] Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings so that they can be readily implemented by one of ordinary skill in the art.
[36] The present invention is directed to improving lexical decoding speed using a word clustering scheme and length information for an input phoneme string. The words to be searched during lexical decoding are restricted based on the length of the phonemic string outputted through phonemic decoding, and, during edit distance measurement, the search is further restricted to a space corresponding to the difference in phonemic number. Furthermore, all recognition object words are divided into clusters using an edit distance-based clustering scheme, a similarity between the inputted phonemic string and the words representative of the clusters is measured to select the N best clusters, and the lexical decoding is performed on only the words forming the selected clusters.
[37]
[38] (Embodiments)
[39] In the human speech recognition system 100 as described in Fig. 1, human speech recognition is performed by the phonemic decoding unit 102, the lexical decoding unit 104, and the acoustic rescoring unit 106.
[40] The phonemic decoding unit 102 converts an inputted speech signal into a phonemic string through a phonemic decoding process, and the lexical decoding unit 104 recognizes recognition object words from the phonemic string converted through a phonemic decoding process.
[41] In the phonemic decoding process, a phonemic string P = p1, p2, ..., pN having maximum posterior probability is obtained from a feature vector string X = x1, x2, ..., xT of a given speech signal, as represented by Equation 1.
[42] MathFigure 1
[Math.1]
\[
\hat{P} = \arg\max_{P} \Pr(P \mid X)
        = \arg\max_{P} \frac{\Pr(X \mid P)\,\Pr(P)}{\Pr(X)}
        = \arg\max_{P} \Pr(X \mid P)\,\Pr(P)
\]
[45] To solve Equation 1, two probability models, Pr(X|P) and Pr(P), are necessary. Pr(X|P) denotes the conditional probability of an acoustic feature vector observed from the phonemes and is typically modeled using a Hidden Markov Model (HMM). Pr(P) is defined as a probability model representing the connection relationship between the phonemes forming a word and is called a language model. Pr(P) is represented by Equation 2.
[46] MathFigure 2
[Math.2]
\[
\Pr(p_1, \ldots, p_t)
  = \Pr(p_t \mid p_{t-1}, p_{t-2}, \ldots, p_1)\,\Pr(p_{t-1} \mid p_{t-2}, \ldots, p_1) \cdots \Pr(p_1)
  = \prod_{i=1}^{t} \Pr(p_i \mid p_{i-1}, \ldots, p_1)
\]
[48] However, most actual speech recognition systems use an n-gram model, as represented by Equation 3, which approximates Equation 2 on the assumption that the current phoneme is affected by only the previous (n-1) phonemes. A 2-gram or 3-gram is typically used.
[49] MathFigure 3
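The body of Equation 3 is not reproduced in this text. As a hedged reading only, the standard n-gram approximation that matches the assumption stated in paragraph [48] (each phoneme depends on only the previous n-1 phonemes) can be written with the symbols of Equation 2 as follows. The exact form used in the original publication may differ.

```latex
\[
\Pr(p_1, \ldots, p_t) \approx \prod_{i=1}^{t} \Pr(p_i \mid p_{i-1}, \ldots, p_{i-n+1})
\]
```

For n = 2 each factor reduces to \Pr(p_i \mid p_{i-1}), and for n = 3 to \Pr(p_i \mid p_{i-1}, p_{i-2}); conditioning terms with an index below 1 are simply omitted.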
[50] To decode a word string from the recognized phonemic string, the lexical decoding process takes the decoded phoneme sequence, which may include misrecognized phonemes, as shown in Fig. 2, and applies a dynamic programming algorithm, as shown in Equation 4, to find, among the recognition object words, the word whose correct phonemic string has the smallest edit distance from the decoded sequence.
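[51] MathFigure 4
\[
Q(x, y) = \min \left\{
  \begin{array}{l}
    Q(x-1,\, y-1) + C(c_x, t_y) \\
    Q(x-1,\, y) + C(c_x, \varepsilon) \\
    Q(x,\, y-1) + C(\varepsilon, t_y)
  \end{array}
\right.
\]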
[52] In Equation 4, Q(x,y) denotes an accumulated distance, C(c_x,t_y) denotes the cost of a substitution error from a reference phoneme c_x to t_y, C(c_x,ε) denotes the cost of a deletion error for the reference phoneme c_x, and C(ε,t_y) denotes the cost of an insertion error of t_y. The costs are represented by the negative logarithm of the phoneme confusion probability, as in Equation 5.
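[53] MathFigure 5
[Math.5]
\[
\begin{aligned}
C(c_x, t_y) &= -\log \Pr(t_y \mid c_x) \\
C(c_x, \varepsilon) &= -\log \Pr(\varepsilon \mid c_x) \\
C(\varepsilon, t_y) &= -\log \Pr(t_y \mid \varepsilon)
\end{aligned}
\]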
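The recursion described for Equation 4, with the costs of Equation 5, can be sketched in Python as below. This is a minimal illustration, not the implementation of the publication: the `EPS` marker, the `make_cost` helper, the dictionary layout of the confusion probabilities, and the probability floor for unseen phoneme pairs are all assumptions introduced here.

```python
import math

EPS = ""  # hypothetical marker for the empty phoneme (epsilon)


def make_cost(confusion, floor=1e-6):
    """Build C(c, t) = -log Pr(t | c) from a confusion table.

    `confusion` maps (reference, recognized) pairs to probabilities.
    Unseen pairs fall back to `floor` (an assumption, to avoid log(0)).
    """
    def cost(c, t):
        return -math.log(max(confusion.get((c, t), floor), floor))
    return cost


def edit_distance(ref, hyp, cost):
    """Accumulated distance Q(len(ref), len(hyp)) by dynamic programming."""
    n, m = len(ref), len(hyp)
    q = [[0.0] * (m + 1) for _ in range(n + 1)]
    for x in range(1, n + 1):          # first column: deletions only
        q[x][0] = q[x - 1][0] + cost(ref[x - 1], EPS)
    for y in range(1, m + 1):          # first row: insertions only
        q[0][y] = q[0][y - 1] + cost(EPS, hyp[y - 1])
    for x in range(1, n + 1):
        for y in range(1, m + 1):
            q[x][y] = min(
                q[x - 1][y - 1] + cost(ref[x - 1], hyp[y - 1]),  # substitution or match
                q[x - 1][y] + cost(ref[x - 1], EPS),             # deletion
                q[x][y - 1] + cost(EPS, hyp[y - 1]),             # insertion
            )
    return q[n][m]
```

Any callable taking a reference phoneme and a recognized phoneme can be passed as `cost`, so a unit-cost function can stand in for the confusion-based costs when no confusion statistics are available.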
[54] A scheme using length information for an input phoneme string and a word clustering scheme are used to improve lexical decoding speed. The method for improving lexical decoding speed through the scheme using phonemic length information will be described first.
[55] Fig. 2 depicts a block diagram representing a lexical decoding structure that employs a scheme of reducing a search space using length information of a phonemic string in accordance with an embodiment of the present invention.
[56] Referring to Fig. 2, if a phonemic string of N phonemes is outputted as a result of decoding phonemes, there is a high possibility that the recognition object word having the smallest edit distance from it also consists of N phonemes. Accordingly, a detector 200 of the phonemic number of the phonemic string in the lexical decoding unit 104 detects the number of phonemes in the inputted phonemic string. The detected phonemic number is sent to the selector 204, which operates based on the phonemic number. The selector 204 selects, from among the recognition object words 202, those whose phonemic strings have a phonemic number similar to that of the received phonemic string, and sends the selected recognition object words to the edit distance measurer 206, which measures edit distance based on the length of the phonemic string under the restriction on the number of phonemes.
[57] The edit distance measurer 206 measures an edit distance between the inputted phonemic string and the pronunciation string of each selected recognition object word 202, and outputs the N words having the smallest edit distance as a result of lexical decoding. Without any restriction, the dynamic programming of Equation 4 must be applied to all the words in order to obtain the edit distance between the result of phonemic recognition and each word. However, the total search time can be reduced in two ways: the selector 204 constrains the search words to recognition object words whose phonemic number is similar to that of the decoded phonemic string, and a constraint is imposed on the search space even when Equation 4 is applied to obtain the edit distance for those words.
[58] Fig. 4 illustrates a lexical decoding process based on a phonemic length in accordance with an embodiment of the present invention.
[59] Referring to Fig. 4, if "g o r jv g E b a xl" 400, which consists of nine phonemes, is outputted as a result of phoneme decoding for speech pronounced by a user, respective edit distances between this phonemic string and the two words "G O R YEO G EU R EU T M U L R YU S E N T EO" 404 and "G O R YEO G EO N EO P" 402 are measured. In this case, there is a high possibility that "G O R YEO G EO N EO P" 402 has the shorter edit distance. This is because its phonemic number of 9 is closer to the input phonemic number than the phonemic number of 19 of "G O R YEO G EU R EU T M U L R YU S E N T EO" 404, and accordingly "G O R YEO G EO N EO P" 402 incurs a relatively shorter edit distance for insertion. Given the phonemic number obtained as a result of phoneme decoding, only the words whose phonemic number differs from the inputted phonemic number by no more than a predetermined number (delta) become search objects among all recognition object words. Thus, the N words having the shortest edit distance can be obtained without obtaining edit distances for all the words.
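The length-based selection described for Fig. 4 can be sketched as a simple filter. This is a minimal illustration, not the claimed implementation: the function name `select_by_length`, the dictionary layout of the lexicon, the keys `word_402` and `word_404` (borrowed from the reference numerals), and the value `delta=2` are all assumptions made for the example.

```python
def select_by_length(lexicon, n_input, delta):
    """Keep only the words whose phonemic number is within `delta`
    of the phonemic number of the decoded input string."""
    return [
        word
        for word, phonemes in lexicon.items()
        if abs(len(phonemes) - n_input) <= delta
    ]


# The two recognition object words of Fig. 4, keyed by hypothetical labels.
lexicon = {
    "word_402": "G O R YEO G EO N EO P".split(),
    "word_404": "G O R YEO G EU R EU T M U L R YU S E N T EO".split(),
}

decoded = "g o r jv g E b a xl".split()  # nine decoded phonemes (400)
selected = select_by_length(lexicon, len(decoded), delta=2)
```

With these inputs only the nine-phoneme word remains a search object, so the edit distance of Equation 4 never has to be evaluated for the nineteen-phoneme word.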
[60] As described above, for the words selected by the phonemic length, the edit distance between two phonemic strings is obtained using the dynamic programming of Equation 4. In this process, a global constraint is applied. When the edit distance is obtained for two phonemic strings having the same length, optimal edit distances for the entire search space consisting of the two given phonemic strings are obtained through the dynamic programming, as in Fig. 5.
[61] Fig. 5 illustrates how a search space is restricted upon edit distance measurement based on a phonemic length in accordance with an embodiment of the present invention.
[62] Referring to Fig. 5, when it is assumed that two phonemic strings each consist of four phonemes, a search would ordinarily be performed on all nodes in the entire 4x4 search space. However, the same performance is achieved even when the search space is constrained according to the difference in length between the two phonemic strings.
[63] For example, when two phonemic strings have the same length, the search may be performed only in a search space corresponding to a slant-line area 500 indicated by a solid line in Fig. 5. However, when there is a difference in length between the two phonemic strings, the search space must increase from a diagonal line in proportion to the difference. An area 502 indicated by dotted lines corresponds to the difference between the two phonemic strings being 1, and an area 504 indicated by thick dotted lines corresponds to the difference between the two phonemic strings being 2. Thus, the search space is constrained, i.e., determined in proportion to the difference in length between the two phonemic strings whose edit distance is to be obtained, such that a search time can be greatly reduced, as compared with a conventional scheme that performs the search over all search areas.
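One way to picture the constrained search space of Fig. 5 is a banded dynamic program that only visits nodes near the diagonal. The sketch below is an assumption-laden illustration: it uses unit costs instead of the confusion costs of Equation 5, and the band rule `abs(X - Y) + base_width` with the hypothetical parameter `base_width` is only one possible reading of a search space that grows in proportion to the length difference.

```python
def banded_edit_distance(ref, hyp, base_width=1):
    """Unit-cost edit distance restricted to a diagonal band.

    Returns (distance, number_of_visited_nodes). Nodes outside the band
    are never evaluated, so the result can overestimate the unrestricted
    distance when the best path would leave the band.
    """
    X, Y = len(ref), len(hyp)
    band = abs(X - Y) + base_width  # band widens with the length difference
    INF = float("inf")
    Q = [[INF] * (Y + 1) for _ in range(X + 1)]
    Q[0][0] = 0
    visited = 0
    for x in range(X + 1):
        lo, hi = max(0, x - band), min(Y, x + band)
        for y in range(lo, hi + 1):
            visited += 1
            if x == 0 and y == 0:
                continue
            best = INF
            if x > 0 and y > 0:  # substitution or match
                best = min(best, Q[x - 1][y - 1] + (ref[x - 1] != hyp[y - 1]))
            if x > 0:            # deletion
                best = min(best, Q[x - 1][y] + 1)
            if y > 0:            # insertion
                best = min(best, Q[x][y - 1] + 1)
            Q[x][y] = best
    return Q[X][Y], visited


# Two hypothetical four-phoneme strings, one symbol per phoneme.
same_len = banded_edit_distance(list("abcd"), list("abed"))
diff_len = banded_edit_distance(list("abcd"), list("abcdef"))
```

For the two four-symbol strings the band covers 13 of the 25 nodes in the full table, which is the kind of saving the restricted areas 500, 502, and 504 are meant to provide.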
[64] Meanwhile, another scheme for improving the lexical decoding speed includes performing a lexical decoding process in two steps.
[65] Fig. 3 illustrates a block diagram showing a 2-step lexical decoding structure using a word clustering scheme in accordance with an embodiment of the present invention.
[66] Referring to Fig. 3, in the lexical decoding unit 104, a unit 306 of clustering among words having high confusion groups the recognition object words 304 that have a high similarity, based on an edit distance, into one cluster. A selector 308 of words representative of clustering selects a word representative of each cluster. A similarity measurer 300 between words representative of clustering measures a similarity between the word representative of each cluster and the inputted phonemic string to select the N best clusters. A similarity measurer 302 among all words within the N best clusters then performs the second step of lexical decoding on all words included in the N best clusters, thereby improving the total speed of lexical decoding.
[67] In this scheme, creation of N clusters representative of all recognition object words must be followed by selection of words representative of the N clusters. The N representative clusters are obtained from all the words through a process of dividing clusters, as shown in Fig. 6.
[68] Fig. 6 shows a flowchart representing a word clustering algorithm based on an edit distance between words depending on a phonemic length in accordance with an embodiment of the present invention.
[69] In Fig. 6, the operations of the unit of clustering 306 and the selector of words representative of clustering 308 are illustrated. In step 600, all recognition object words are inputted, and in step 602, an edit distance between all recognition object words is measured. In step 604, words having the smallest edit distance among the measured words are incorporated into each cluster. In step 606, a total edit distance is measured and a determination is made as to whether the total edit distance is greater than a threshold value. If the total edit distance is greater than the threshold value, the process proceeds to step 608. In step 608, a cluster having the greatest total edit distance is divided and the process returns to step 602, where the edit distance from the cluster is measured. The words are again incorporated into the cluster. This procedure is repeatedly performed. Clustering ends only if the measured total edit distance is smaller than the threshold value.
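The loop of Fig. 6 can be sketched as a divisive clustering procedure. The following is only a rough illustration under several assumptions that the specification does not state: unit-cost edit distance is used, the first representative is the medoid of all words, a cluster is divided by promoting its farthest member to a new representative, words are distinct tuples of phonemes, and the threshold is non-negative. The names `edit_distance` and `cluster_words` and the sample words are hypothetical.

```python
def edit_distance(a, b):
    """Unit-cost edit distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, 1):
        cur = [i]
        for j, pb in enumerate(b, 1):
            cur.append(min(prev[j - 1] + (pa != pb), prev[j] + 1, cur[j - 1] + 1))
        prev = cur
    return prev[-1]


def cluster_words(words, threshold):
    """Divide `words` into clusters until the total edit distance between
    words and their cluster representatives no longer exceeds `threshold`.
    Returns a dict mapping each representative to its member words."""
    # Steps 600/602: start with a single cluster represented by the medoid.
    reps = [min(words, key=lambda w: sum(edit_distance(w, v) for v in words))]
    while True:
        # Step 604: incorporate each word into the nearest cluster.
        clusters = {r: [] for r in reps}
        for w in words:
            nearest = min(reps, key=lambda r: edit_distance(r, w))
            clusters[nearest].append(w)
        # Step 606: compare the total edit distance with the threshold.
        totals = {r: sum(edit_distance(r, w) for w in ms) for r, ms in clusters.items()}
        if sum(totals.values()) <= threshold:
            return clusters
        # Step 608: divide the cluster having the greatest total distance.
        worst = max(totals, key=totals.get)
        reps.append(max(clusters[worst], key=lambda w: edit_distance(worst, w)))


words = [("k", "a", "t"), ("k", "a", "p"), ("d", "o", "g"), ("d", "o", "t")]
clusters = cluster_words(words, threshold=2)
```

Each division adds one representative, so under the stated assumptions the loop ends at the latest when every word represents its own cluster and the total distance is zero.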
[70] The process of obtaining N clusters representative of all the words in this way is referred to as a quantization process of the recognition object words. In the lexical decoding process for the quantized words, the similarity measurer 300 measures a similarity between the word representative of each cluster and the inputted phonemic string to select N clusters having the smallest similarity value. The similarity measurer 302 then performs the lexical decoding again on all words forming the selected clusters to output the final result of lexical decoding.
[71] This 2-step lexical decoding process finds a cluster, which is a set of words, in the first step and then finds the words within that cluster in the second step. This can greatly increase the decoding speed, as compared with a conventional scheme of searching all words at once.
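The 2-step search can be sketched as two rankings, first over cluster representatives and then over the members of the best clusters. This is a minimal illustration assuming that similarity is measured by a unit-cost edit distance (smaller is more similar); the function names, the dictionary layout of a cluster, and the sample words are hypothetical.

```python
def edit_distance(a, b):
    """Unit-cost edit distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, 1):
        cur = [i]
        for j, pb in enumerate(b, 1):
            cur.append(min(prev[j - 1] + (pa != pb), prev[j] + 1, cur[j - 1] + 1))
        prev = cur
    return prev[-1]


def two_step_decode(phonemes, clusters, n_clusters=1, n_words=1):
    """Step 1: rank clusters by the distance of their representative word.
    Step 2: rank only the words inside the best clusters."""
    ranked = sorted(
        clusters, key=lambda c: edit_distance(c["representative"], phonemes)
    )
    candidates = [w for c in ranked[:n_clusters] for w in c["members"]]
    candidates.sort(key=lambda w: edit_distance(w, phonemes))
    return candidates[:n_words]


# Hypothetical pre-computed clusters of recognition object words.
clusters = [
    {
        "representative": ("k", "a", "t"),
        "members": [("k", "a", "t"), ("k", "a", "p"), ("b", "a", "t")],
    },
    {
        "representative": ("d", "o", "g", "z"),
        "members": [("d", "o", "g", "z"), ("d", "o", "g")],
    },
]

result = two_step_decode(("k", "a", "p"), clusters)
```

In this example the first step compares the input with two representatives and the second step with three members, instead of comparing it with all five words.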
[72] As described above, according to the present invention, the lexical decoding speed can be improved by using the word clustering scheme and the information on the length of an inputted phonemic string. The search words are constrained upon lexical decoding based on the length of the phonemic string outputted via phonemic decoding, and the search space for the edit distance measurement is constrained according to the difference in phonemic number. Furthermore, all recognition object words are divided into clusters representative of the words using a clustering scheme based on the edit distance, a similarity between the inputted phonemic string and the word representative of each cluster is then measured to select the N best clusters, and lexical decoding is performed again on only the words forming the selected clusters.
[73] While the invention has been shown and described with respect to the embodiments, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the scope of the invention as defined in the following claims.
Claims
[1] A lexical decoding method comprising: detecting a length of a phonemic string outputted through phonemic decoding to an inputted speech signal; selecting recognition object words having a length of phonemic string similar to the length of the phonemic string based on the detected length of the phonemic string; measuring an edit distance from the phonemic string based on the selected recognition object words; and outputting at least one of the recognition object word having a smallest edit distance from the phonemic string through the measuring the edit distance.
[2] The method of claim 1, wherein the phonemic decoding generates as accurate phoneme sequence for the inputted speech signal as possible.
[3] The method of claim 1, further comprising determining a space to be searched depending on the number of phonemes upon the measuring edit distance.
[4] A lexical decoding method comprising: creating clusters by clustering among recognition object words having a high similarity classified by an edit distance for all recognition object words; selecting words which is representative of the created clusters; measuring a first similarity of a phonemic string outputted through phonemic decoding based on the selected words and a speech signal to select one of the clusters, which corresponds to the selected words having an optimal similarity; measuring a second similarity between only the words forming the selected cluster and the phonemic string; and outputting an optimal word through the similarity measurement.
[5] The method of claim 4, wherein the phonemic decoding generates as accurate phoneme sequence for the inputted speech signal as possible.
[6] A lexical decoding apparatus comprising: a detector for detecting a length of a phonemic string outputted through phonemic decoding to an inputted speech signal; a word selector for selecting recognition object words having a length of the phonemic string similar to the detected length of the phonemic string; and an edit distance measurer for measuring an edit distance from the phonemic string based on the selected recognition object words, to output at least one of the recognition object words having a smallest edit distance from the phonemic string through the edit distance measurement.
[7] The apparatus of claim 6, wherein the phonemic decoding generates as accurate phoneme sequence for the inputted speech signal as possible.
[8] The apparatus of claim 6, wherein the edit distance measurer determines a space being searched in accordance with the number of the phonemes upon the edit distance measurement.
[9] A lexical decoding apparatus comprising: a clustering unit for creating clusters by performing clustering among recognition object words having a high similarity classified by a edit distance for all recognition object words; a representative word selector for selecting words representative of the created clusters; a first similarity measurer for measuring a similarity of a phonemic string outputted through phonemic decoding based on the selected words and a speech signal, and selecting the corresponding cluster of the words having an optimal similarity; and a second similarity measurer for measuring a similarity between only the words forming the selected cluster and the phonemic string to output an optimal word through the similarity measurement.
[10] The apparatus of claim 9, wherein the phonemic decoding generates as accurate phoneme sequence for the inputted speech signal as possible.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR1020070132546A KR20090065102A (en) | 2007-12-17 | 2007-12-17 | Method and apparatus for lexical decoding |
KR10-2007-0132546 | 2007-12-17 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2009078665A1 true WO2009078665A1 (en) | 2009-06-25 |
Family
ID=40795704
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/KR2008/007481 WO2009078665A1 (en) | 2007-12-17 | 2008-12-17 | Method and apparatus for lexical decoding |
Country Status (2)
Country | Link |
---|---|
KR (1) | KR20090065102A (en) |
WO (1) | WO2009078665A1 (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR101579544B1 (en) * | 2014-09-04 | 2015-12-23 | 에스케이 텔레콤주식회사 | Apparatus and Method for Calculating Similarity of Natural Language |
KR20210016767A (en) | 2019-08-05 | 2021-02-17 | 삼성전자주식회사 | Voice recognizing method and voice recognizing appratus |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP0703566A1 (en) * | 1994-09-23 | 1996-03-27 | Aurelio Oskian | Device for recognizing speech |
US6101470A (en) * | 1998-05-26 | 2000-08-08 | International Business Machines Corporation | Methods for generating pitch and duration contours in a text to speech system |
US6501833B2 (en) * | 1995-05-26 | 2002-12-31 | Speechworks International, Inc. | Method and apparatus for dynamic adaptation of a large vocabulary speech recognition system and for use of constraints from a database in a large vocabulary speech recognition system |
- 2007-12-17: KR application KR1020070132546A (KR20090065102A), not active, Application Discontinuation
- 2008-12-17: WO application PCT/KR2008/007481 (WO2009078665A1), active, Application Filing
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP3113176B1 (en) * | 2015-06-30 | 2019-04-03 | Samsung Electronics Co., Ltd. | Speech recognition |
CN107742516A (en) * | 2017-09-29 | 2018-02-27 | 上海与德通讯技术有限公司 | Intelligent identification Method, robot and computer-readable recording medium |
CN107742516B (en) * | 2017-09-29 | 2020-11-17 | 上海望潮数据科技有限公司 | Intelligent recognition method, robot and computer readable storage medium |
Also Published As
Publication number | Publication date |
---|---|
KR20090065102A (en) | 2009-06-22 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 08861989 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 08861989 Country of ref document: EP Kind code of ref document: A1 |