WO2009078665A1 - Method and apparatus for lexical decoding - Google Patents

Method and apparatus for lexical decoding Download PDF

Info

Publication number
WO2009078665A1
Authority
WO
WIPO (PCT)
Prior art keywords
phonemic
words
decoding
edit distance
similarity
Prior art date
Application number
PCT/KR2008/007481
Other languages
French (fr)
Inventor
Hoon Chung
Yunkeun Lee
Jeon Gue Park
Ho-Young Jung
Hyung-Bae Jeon
Original Assignee
Electronics And Telecommunications Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Electronics And Telecommunications Research Institute filed Critical Electronics And Telecommunications Research Institute
Publication of WO2009078665A1 publication Critical patent/WO2009078665A1/en

Links

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/08 - Speech classification or search
    • G10L 15/10 - Speech classification or search using distance or distortion measures between unknown speech and reference templates
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G10L 2015/025 - Phonemes, fenemes or fenones being the recognition units

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to a lexical decoding method including detecting the length of a phonemic string outputted through phonemic decoding of an inputted speech signal, selecting recognition object words whose phonemic-string lengths are similar to the detected length, measuring an edit distance from the phonemic string for the selected recognition object words, and outputting at least one recognition object word having the smallest edit distance from the phonemic string as a result of the edit distance measurement. The phonemic decoding obtains the phonemic string of the phonemic model having the maximum similarity to the inputted speech signal.

Description

METHOD AND APPARATUS FOR LEXICAL DECODING
Technical Field
[1] The present invention relates to a human speech recognition (hereinafter, 'HSR') technique; and more particularly, to a lexical decoding method and apparatus suitable for improving lexical decoding speed using a word clustering scheme and length information for an input phoneme string.
[2] This work was supported by the IT R&D program of MIC/IITA [2006-S-036-02, Development of large vocabulary/interactive distributed/embedded VUI for new growth engine industries].
[3]
Background Art
[4] In general, HSR-based speech recognition is performed by phonemic decoding and lexical decoding.
[5] The HSR-based speech recognition will be described in detail with reference to Fig. 1.
[6] Fig. 1 depicts a block diagram showing a typical human speech recognition system.
[7] Referring to Fig. 1, a human speech recognition system 100 includes a phonemic decoding unit 102, a lexical decoding unit 104, and an acoustic rescoring unit 106.
[8] The phonemic decoding unit 102 performs a process of obtaining the sequence of phonetic symbols that has the maximum similarity to the inputted speech signal. In the phonemic decoding process, ambient noise, an inaccurate acoustic model, and so on may cause errors to be included in the decoded phonemic string. Accordingly, the lexical decoding unit 104 outputs the N best words, i.e., the N recognition object words having the smallest edit distance to the decoded phonemic string produced by the phonemic decoding unit 102. Further, the N best words outputted from the lexical decoding unit 104 are sent to the acoustic rescoring unit 106, which rescores the N words using a sophisticated acoustic model to re-adjust the recognition priority and output the recognition result.
[9] In the conventional human speech recognition system operating as described above, the word or word sequence with the maximum similarity to the speech inputted by the user is outputted through phonemic decoding, lexical decoding, and acoustic rescoring. Even though such multi-stage processing performs speech recognition faster than a conventional speech recognition system based on a continuous-density hidden Markov model (CDHMM), it still takes too much time to be practical when recognizing more than hundreds of thousands of words. Therefore, the present invention proposes a method to speed up the decoding time for recognition tasks with a huge number of words. [10]
Disclosure of Invention
Technical Problem
[11] It is, therefore, a primary object of the present invention to provide a lexical decoding method and apparatus capable of improving lexical decoding speed using a word clustering scheme and length information for an input phoneme string which is a result of phonemic decoding.
[12] It is, therefore, another object of the present invention to provide a lexical decoding method and apparatus capable of constraining the search words upon lexical decoding based on the length of a phonemic string outputted via phonemic decoding, and of imposing a path constraint so that, in measuring the edit distance, only paths whose phoneme-length difference is small are searched.
[13] It is, therefore, still another object of the present invention to provide a lexical decoding method and apparatus for dividing all recognition object words into clusters representative of the words using an edit distance-based clustering scheme, measuring the similarity between an inputted phonemic string and the words representative of the clusters to select the N best clusters, and then performing the lexical decoding on only the words forming those clusters.
[14]
Technical Solution
[15] In accordance with one aspect of the invention, there is provided a lexical decoding method including detecting a length of a phonemic string outputted through phonemic decoding of an inputted speech signal; selecting recognition object words whose phonemic-string lengths are similar to the detected length; measuring an edit distance from the phonemic string for the selected recognition object words; and outputting at least one recognition object word having the smallest edit distance from the phonemic string as a result of the edit distance measurement.
[16] It is desirable that the phonemic decoding generates a phoneme sequence that is as accurate as possible for the inputted speech signal.
[17] It is also desirable that the method further includes determining a space to be searched depending on the number of phonemes in measuring the edit distance.
[18] In accordance with another aspect of the invention, there is provided a lexical decoding method including creating clusters by clustering recognition object words having a high similarity, classified by the edit distances among all recognition object words; selecting words representative of the created clusters; measuring a first similarity between the selected words and a phonemic string outputted through phonemic decoding of a speech signal, to select the cluster corresponding to the selected word having an optimal similarity; measuring a second similarity between only the words forming the selected cluster and the phonemic string; and outputting the optimal words through the similarity measurement.
[19] It is preferable that the phonemic decoding generates a phoneme sequence that is as accurate as possible for the inputted speech signal.
[20] In accordance with still another aspect of the invention, there is provided a lexical decoding apparatus including a detector for detecting a length of a phonemic string outputted through phonemic decoding of an inputted speech signal; a word selector for selecting recognition object words whose phonemic-string lengths are similar to the detected length; and an edit distance measurer for measuring an edit distance from the phonemic string for the selected recognition object words, and outputting at least one recognition object word having the smallest edit distance from the phonemic string as a result of the edit distance measurement.
[21] It is preferable that the phonemic decoding generates a phoneme sequence that is as accurate as possible for the inputted speech signal.
[22] It is also preferred that the edit distance measurer determines the space to be searched in accordance with the number of phonemes in the edit distance measurement.
[23] In accordance with still another aspect of the invention, there is provided a lexical decoding apparatus including a clustering unit for creating clusters by clustering recognition object words having a high similarity, classified by the edit distances among all recognition object words; a representative word selector for selecting words representative of the created clusters; a first similarity measurer for measuring a similarity between the selected words and a phonemic string outputted through phonemic decoding of a speech signal, and selecting the cluster corresponding to the word having an optimal similarity; and a second similarity measurer for measuring a similarity between only the words forming the selected cluster and the phonemic string, to output the optimal words through the similarity measurement.
[24] It is preferable that the phonemic decoding generates a phoneme sequence that is as accurate as possible for the inputted speech signal.
[25]
Advantageous Effects
[26] The present invention has the following advantages: in the human speech recognition system, the lexical decoding process is performed by using the word clustering scheme or the length information for the input phoneme string, whereby the speed of the lexical decoding process can be improved and the total speech recognition time can be reduced without degrading speech recognition performance.
Brief Description of the Drawings
[27] The above and other objects and features of the present invention will become apparent from the following description of embodiments given in conjunction with the accompanying drawings, in which:
[28] Fig. 1 shows a block diagram representing a typical human speech recognition system;
[29] Fig. 2 depicts a block diagram representing a lexical decoding structure that employs a scheme of reducing a search space using length information for a phonemic string in accordance with a preferred embodiment of the present invention;
[30] Fig. 3 illustrates a block diagram showing a 2-step lexical decoding structure using a word clustering scheme in accordance with a preferred embodiment of the present invention;
[31] Fig. 4 illustrates a lexical decoding process based on phonemic length in accordance with a preferred embodiment of the present invention;
[32] Fig. 5 describes how a search space is restricted upon edit distance measurement based on phonemic length in accordance with a preferred embodiment of the present invention; and
[33] Fig. 6 shows a flowchart representing a word clustering algorithm based on an edit distance between words depending on a phonemic length in accordance with a preferred embodiment of the present invention.
[34]
Best Mode for Carrying Out the Invention
[35] Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings so that they can be readily implemented by one of ordinary skill in the art.
[36] The present invention is directed to improvement of lexical decoding speed using a word clustering scheme and length information for an input phoneme string. The search words are restricted upon lexical decoding based on the length of the phonemic string outputted through phonemic decoding, and a constraint is imposed so that, upon edit distance measurement, the search covers only a band proportional to the difference in phoneme number. Furthermore, all recognition object words are divided into clusters that are representative of all the recognition object words using an edit distance-based clustering scheme, the similarity between the inputted phonemic string and the words representative of the clusters is measured to select the N best clusters, and the lexical decoding is performed on only the words forming those clusters.
[37] [38] (Embodiments)
[39] In the human speech recognition system 100 described in Fig. 1, human speech recognition is performed by the phonemic decoding unit 102, the lexical decoding unit 104, and the acoustic rescoring unit 106.
[40] The phonemic decoding unit 102 converts an inputted speech signal into a phonemic string through a phonemic decoding process, and the lexical decoding unit 104 recognizes recognition object words from the phonemic string converted through the phonemic decoding process.
[41] In the phonemic decoding process, a phonemic string P = p_1, p_2, ..., p_N having the maximum posterior probability is obtained from a feature vector string X = x_1, x_2, ..., x_T of a given speech signal, as represented by Equation 1.
[42] MathFigure 1
[Math.1]
\hat{P} = \arg\max_P \Pr(P \mid X)
[43]
= \arg\max_P \frac{\Pr(X \mid P)\,\Pr(P)}{\Pr(X)}
[44]
= \arg\max_P \Pr(X \mid P)\,\Pr(P)
[45] To solve Equation 1, two probability models, Pr(X|P) and Pr(P), are necessary. Pr(X|P) denotes the conditional probability of the acoustic feature vectors observed from the phonemes and is typically modeled using a Hidden Markov Model (HMM). Pr(P) is defined as a probability model representing the connection relationship between the phonemes forming a word and is called a language model. Pr(P) is represented by Equation 2.
[46] MathFigure 2
[Math.2]
\Pr(p_1, \ldots, p_t) = \Pr(p_t \mid p_{t-1}, p_{t-2}, \ldots, p_1)\,\Pr(p_{t-1} \mid p_{t-2}, \ldots, p_1) \cdots \Pr(p_1)
[47]
= \prod_{i=1}^{t} \Pr(p_i \mid p_{i-1}, \ldots, p_1)
[48] However, most actual speech recognition systems use an n-gram language model as represented by Equation 3, which approximates Equation 2 on the assumption that the current phoneme is affected by only the previous (n-1) phonemes. A 2-gram or 3-gram is typically used.
[49] MathFigure 3
[Math.3]
\Pr(p_1, \ldots, p_t) \approx \prod_{i=1}^{t} \Pr(p_i \mid p_{i-1}, \ldots, p_{i-n+1})
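For illustration, a toy bigram (n = 2) version of Equation 3 might look like the following minimal sketch; the probability table, the "<s>" start symbol, and the floor value are invented for the example and are not part of the patent.

```python
import math

# Hypothetical bigram probabilities Pr(p_i | p_{i-1}); "<s>" marks the string start.
bigram = {("<s>", "g"): 0.4, ("g", "o"): 0.5, ("o", "r"): 0.3}

def log_prob(phonemes, lm, floor=1e-6):
    """log Pr(p_1, ..., p_t) under the 2-gram approximation of Equation 3."""
    prev, total = "<s>", 0.0
    for p in phonemes:
        total += math.log(lm.get((prev, p), floor))  # unseen pairs fall back to a small floor
        prev = p
    return total

print(log_prob(["g", "o", "r"], bigram))
```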
[50] To decode a word string from the recognized phonemic string, the lexical decoding process takes the decoded phoneme sequence, which may include mis-recognized phonemes, as shown in Fig. 2, and applies the dynamic programming algorithm of Equation 4, which obtains the word whose correct phonemic string has the smallest edit distance to the decoded string among the recognition object words.
[51] MathFigure 4
[Math.4]
Q(x, y) = \min \begin{cases} Q(x-1, y-1) + C(c_x, t_y) \\ Q(x-1, y) + C(c_x, \varepsilon) \\ Q(x, y-1) + C(\varepsilon, t_y) \end{cases}
[52] In Equation 4, Q(x,y) denotes the accumulated distance, C(c_x, t_y) denotes the cost when a substitution error from a reference phoneme c_x to a test phoneme t_y occurs, C(c_x, ε) denotes the cost when a deletion error of the reference phoneme c_x occurs, and C(ε, t_y) denotes the cost when an insertion error of t_y occurs. The costs are represented by the negative logarithm of the phoneme confusion probability, as in Equation 5.
[53] MathFigure 5
[Math.5]
C(c_x, t_y) = -\log \Pr(t_y \mid c_x)
C(c_x, \varepsilon) = -\log \Pr(\varepsilon \mid c_x)
C(\varepsilon, t_y) = -\log \Pr(t_y \mid \varepsilon)
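A minimal sketch of the dynamic program of Equations 4 and 5 follows. The function name, the confusion-probability callable, and the toy model at the bottom are illustrative assumptions, not the patent's implementation; the empty string stands in for the epsilon symbol.

```python
import math

def edit_distance(ref, hyp, confusion):
    """Accumulated distance Q of Equation 4 between a reference phonemic string
    `ref` and a decoded (test) phonemic string `hyp`. `confusion(a, b)` returns
    Pr(b | a); "" stands for epsilon (deletion/insertion)."""
    cost = lambda a, b: -math.log(max(confusion(a, b), 1e-10))  # Equation 5
    n, m = len(ref), len(hyp)
    Q = [[math.inf] * (m + 1) for _ in range(n + 1)]
    Q[0][0] = 0.0
    for x in range(n + 1):
        for y in range(m + 1):
            if x > 0 and y > 0:   # substitution (or exact match)
                Q[x][y] = min(Q[x][y], Q[x - 1][y - 1] + cost(ref[x - 1], hyp[y - 1]))
            if x > 0:             # deletion of the reference phoneme
                Q[x][y] = min(Q[x][y], Q[x - 1][y] + cost(ref[x - 1], ""))
            if y > 0:             # insertion of the test phoneme
                Q[x][y] = min(Q[x][y], Q[x][y - 1] + cost("", hyp[y - 1]))
    return Q[n][m]

# Toy confusion model: high probability for exact matches, small otherwise.
toy = lambda a, b: 0.9 if a == b and a != "" else 0.01
print(edit_distance(["g", "o", "r", "jv"], ["g", "o", "r", "E"], toy))
```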
[54] A scheme using length information for the input phoneme string and a word clustering scheme are used to improve lexical decoding speed. A method for improving lexical decoding speed through the scheme using phonemic length information will be described first.
[55] Fig. 2 depicts a block diagram representing a lexical decoding structure that employs a scheme of reducing a search space using length information of a phonemic string in accordance with an embodiment of the present invention.
[56] Referring to Fig. 2, if a phonemic string of N phonemes is outputted as a result of decoding the phonemes, there is a high possibility that the recognition object word having the smallest edit distance consists of about N phonemes. Accordingly, a detector 200 of the phonemic number of the phonemic string in the lexical decoding unit 104 detects the number of phonemes within the inputted phonemic string. The detected phonemic number is sent to the selector 204 based on the phonemic number. The selector 204 selects, from the recognition object words 202, those words whose phonemic strings have a phonemic number similar to that of the received phonemic string, and sends the selected recognition object words 202 to an edit distance measurer 206 based on the length of the phonemic string, which operates under the restriction of the number of phonemes.
[57] The edit distance measurer 206 measures the edit distance between the phonemic string and the pronunciation string of each selected recognition object word 202, and outputs the N words having the smallest edit distances as the result of lexical decoding. In principle, the dynamic program shown in Equation 4 must be applied to all the words in order to obtain the edit distance between the phonemic recognition result and each word. However, the total search time can be reduced by constraining the search words to those whose phonemic number is similar to the number of decoded phonemes, through the selector 204, and by imposing a constraint on the search space even when Equation 4 is applied to obtain the edit distance for the selected words.
[58] Fig. 4 illustrates a lexical decoding process based on phonemic length in accordance with an embodiment of the present invention.
[59] Referring to Fig. 4, if "g o r jv g E b a xl" 400, which consists of nine phonemes, is outputted as a result of phoneme decoding for speech pronounced by a user, the respective edit distances between this string and the two words "G O R YEO G EU R EU T M U L R YU S E N T EO" 404 and "G O R YEO G EO N EO P" 402 are measured. In this case, there is a high possibility that "G O R YEO G EO N EO P" 402 has the shorter edit distance. This is because its phonemic number of 9 is closer to the input phonemic number than the phonemic number of 19 of "G O R YEO G EU R EU T M U L R YU S E N T EO" 404, and accordingly "G O R YEO G EO N EO P" 402 incurs a relatively smaller insertion cost. Given the phonemic number obtained from phoneme decoding, only the words whose phonemic number differs from the inputted phonemic number by at most a predetermined number (delta) become search objects among all recognition object words. Thus, the N words having the shortest edit distances can be obtained without computing edit distances for all the words.
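As a rough sketch of this length-based selection, the snippet below filters a lexicon by phoneme count; the delta value, the lexicon entries, and the romanized word keys are illustrative assumptions rather than values taken from the patent.

```python
def select_by_length(lexicon, n_input_phonemes, delta=2):
    """Keep only recognition object words whose pronunciation length is
    within `delta` phonemes of the decoded phonemic-string length."""
    return [w for w, phones in lexicon.items()
            if abs(len(phones) - n_input_phonemes) <= delta]

lexicon = {
    "GORYEOGEONEOP": ["G", "O", "R", "YEO", "G", "EO", "N", "EO", "P"],  # 9 phonemes
    "GORYEOGEUREUTMULRYUSENTEO": ["G", "O", "R", "YEO", "G", "EU", "R", "EU", "T",
                                  "M", "U", "L", "R", "YU", "S", "E", "N", "T", "EO"],  # 19 phonemes
}
decoded = ["g", "o", "r", "jv", "g", "E", "b", "a", "xl"]  # 9 phonemes, as in Fig. 4
print(select_by_length(lexicon, len(decoded)))             # only the 9-phoneme word survives
```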
[60] As described above, for the words selected by phonemic length, the edit distance between two phonemic strings is obtained using the dynamic program shown in Equation 4. In this process, a global constraint is applied. In obtaining the edit distance for two phonemic strings having the same length, the optimal edit distance over the entire search space formed by the two given phonemic strings is obtained through the dynamic program, as in Fig. 5.
[61] Fig. 5 describes how the search space is restricted upon edit distance measurement based on phonemic length in accordance with an embodiment of the present invention.
[62] Referring to Fig. 5, when it is assumed that the two phonemic strings each consist of four phonemes, the search would normally be performed on all nodes in the entire 4x4 search space. However, the same performance is achieved even when the search space is constrained by the difference in length between the two phonemic strings.
[63] For example, when the two phonemic strings have the same length, the search may be performed only in the search space corresponding to the slant-line area 500 indicated by a solid line in Fig. 5. However, when there is a difference in length between the two phonemic strings, the search space must widen from the diagonal line in proportion to that difference. An area 502 indicated by dotted lines corresponds to a length difference of 1 between the two phonemic strings, and an area 504 indicated by thick dotted lines corresponds to a length difference of 2. Thus, the search space is constrained, i.e., determined in proportion to the difference in length between the two phonemic strings whose edit distance is to be obtained, so that the search time can be greatly reduced compared with a conventional scheme that searches the entire area.
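A sketch of this global path constraint, under the assumption that the diagonal band grows with the length difference of the two strings (plus an optional extra margin); it reuses the same cost convention as the earlier edit_distance sketch and is illustrative only.

```python
import math

def banded_edit_distance(ref, hyp, confusion, extra_band=0):
    """Equation 4 restricted to a diagonal band, as in Fig. 5: cells are
    evaluated only where |x - y| <= |len(ref) - len(hyp)| + extra_band."""
    cost = lambda a, b: -math.log(max(confusion(a, b), 1e-10))
    n, m = len(ref), len(hyp)
    band = abs(n - m) + extra_band          # search space widens with the length gap
    Q = [[math.inf] * (m + 1) for _ in range(n + 1)]
    Q[0][0] = 0.0
    for x in range(n + 1):
        for y in range(max(0, x - band), min(m, x + band) + 1):
            if x > 0 and y > 0:
                Q[x][y] = min(Q[x][y], Q[x - 1][y - 1] + cost(ref[x - 1], hyp[y - 1]))
            if x > 0:
                Q[x][y] = min(Q[x][y], Q[x - 1][y] + cost(ref[x - 1], ""))
            if y > 0:
                Q[x][y] = min(Q[x][y], Q[x][y - 1] + cost("", hyp[y - 1]))
    return Q[n][m]
```

With equal-length strings and extra_band = 0, only the diagonal (area 500) is searched; a length difference of 1 or 2 widens the band to areas 502 and 504, respectively.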
[64] Meanwhile, another scheme for improving the lexical decoding speed includes performing a lexical decoding process in two steps.
[65] Fig. 3 illustrates a block diagram showing a 2-step lexical decoding structure using a word clustering scheme in accordance with an embodiment of the present invention.
[66] Referring to Fig. 3, in the lexical decoding unit 104, a unit 306 for clustering words having high confusion based on the edit distance groups recognition object words 304 having a high similarity into clusters. A selector 308 of words representative of the clusters selects a word representative of each cluster. A similarity measurer 300 between the representative words and the inputted phonemic string measures their similarity to select the N best clusters. A similarity measurer 302 among all words within the N best clusters performs the second step of the lexical decoding on all words included in the N best clusters, thereby improving the total speed of lexical decoding.
[67] In this scheme, creation of N clusters representative of all recognition object words must be followed by selection of words representative of the N clusters. This is a process of obtaining N representative clusters from all the words through a process of dividing clusters as shown in Fig. 6.
[68] Fig. 6 shows a flowchart representing a word clustering algorithm based on an edit distance between words depending on a phonemic length in accordance with an embodiment of the present invention.
[69] In Fig. 6, the operations of the clustering unit 306 and the selector 308 of representative words are illustrated. In step 600, all recognition object words are inputted, and in step 602, the edit distances between all recognition object words are measured. In step 604, the words having the smallest edit distances are incorporated into their respective clusters. In step 606, the total edit distance is measured and a determination is made as to whether it is greater than a threshold value. If the total edit distance is greater than the threshold value, the process proceeds to step 608. In step 608, the cluster having the greatest total edit distance is divided and the process returns to step 602, where the edit distances to the clusters are measured and the words are again incorporated into clusters. This procedure is repeated, and clustering ends only when the measured total edit distance is smaller than the threshold value.
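A rough sketch of the divisive clustering loop of Fig. 6, under two assumptions the patent does not fix: a cluster's representative is taken to be its medoid (the member with the smallest summed edit distance to the others), and `edit_distance` is any phoneme-level edit distance such as the one sketched earlier. Names and the splitting heuristic are illustrative.

```python
def cluster_words(lexicon, edit_distance, threshold):
    """Split clusters until the total within-cluster edit distance drops below
    `threshold` (steps 600-608 of Fig. 6). `lexicon` maps each word to its
    phoneme list. Returns the clusters and one representative word per cluster."""
    def dist(a, b):
        return edit_distance(lexicon[a], lexicon[b])

    def medoid(cluster):                       # representative = word closest to the others
        return min(cluster, key=lambda w: sum(dist(w, v) for v in cluster))

    clusters = [list(lexicon)]                 # step 600: all recognition object words
    while True:
        reps = [medoid(c) for c in clusters]
        # steps 602-604: assign every word to the nearest representative
        assigned = [[] for _ in reps]
        for w in lexicon:
            assigned[min(range(len(reps)), key=lambda i: dist(reps[i], w))].append(w)
        clusters = [c for c in assigned if c]
        reps = [medoid(c) for c in clusters]
        totals = [sum(dist(r, v) for v in c) for c, r in zip(clusters, reps)]
        # step 606: stop when the total edit distance is small enough
        if sum(totals) <= threshold or max(len(c) for c in clusters) < 2:
            return clusters, reps
        worst = max(range(len(clusters)), key=lambda i: totals[i])
        half = max(1, len(clusters[worst]) // 2)   # step 608: split the worst cluster in two
        clusters = (clusters[:worst]
                    + [clusters[worst][:half], clusters[worst][half:]]
                    + clusters[worst + 1:])
```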
[70] Thus, the process of obtaining the N clusters representative of all the words is referred to as a quantization process of the recognition object words. In the lexical decoding process for the quantized words, the similarity measurer 300 measures the similarity between the words representative of the clusters and the inputted phonemic string to select the N clusters having the smallest distance value, i.e., the highest similarity. The similarity measurer 302 for the selected clusters then performs the lexical decoding again on all words forming those clusters, to output the result of the final lexical decoding.
[71] This 2-step lexical decoding process first finds the clusters, which are sets of words, and then finds the words belonging to those clusters. This can greatly increase the decoding speed compared with a conventional scheme that searches all words at once.
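A sketch of the two-step search of Fig. 3, reusing the hypothetical helpers above (`cluster_words` output and an edit-distance-based similarity); the N-best sizes and function names are assumptions made for illustration.

```python
def two_step_decode(phoneme_string, lexicon, clusters, reps, edit_distance,
                    n_best_clusters=3, n_best_words=5):
    """Step 1: compare the decoded phonemic string only with the cluster
    representatives and keep the N best clusters. Step 2: run lexical decoding
    only over the words inside those clusters."""
    sim = lambda w: edit_distance(lexicon[w], phoneme_string)   # smaller = more similar
    best = sorted(range(len(reps)), key=lambda i: sim(reps[i]))[:n_best_clusters]
    candidates = [w for i in best for w in clusters[i]]
    return sorted(candidates, key=sim)[:n_best_words]
```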
[72] As described above, according to the present invention, the lexical decoding speed can be improved by using the word clustering scheme and the information on the length of the inputted phonemic string. The search words are constrained upon lexical decoding based on the length of the phonemic string outputted via phonemic decoding, and a constraint is imposed so that, upon edit distance measurement, the search covers only a band proportional to the difference in phonemic number. Furthermore, all recognition object words are divided into clusters representative of the words using the clustering scheme based on the edit distance, the similarity between the inputted phonemic string and the words representative of the clusters is then measured to select the N best clusters, and the lexical decoding is performed again on only the words forming those clusters.
[73] While the invention has been shown and described with respect to the embodiments, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the scope of the invention as defined in the following claims.

Claims

[1] A lexical decoding method comprising: detecting a length of a phonemic string outputted through phonemic decoding of an inputted speech signal; selecting recognition object words whose phonemic-string lengths are similar to the detected length of the phonemic string; measuring an edit distance from the phonemic string for the selected recognition object words; and outputting at least one recognition object word having a smallest edit distance from the phonemic string through the edit distance measurement.
[2] The method of claim 1, wherein the phonemic decoding generates a phoneme sequence that is as accurate as possible for the inputted speech signal.
[3] The method of claim 1, further comprising determining a space to be searched depending on the number of phonemes in measuring the edit distance.
[4] A lexical decoding method comprising: creating clusters by clustering recognition object words having a high similarity, classified by the edit distances among all recognition object words; selecting words representative of the created clusters; measuring a first similarity between the selected words and a phonemic string outputted through phonemic decoding of a speech signal, to select the cluster corresponding to the selected word having an optimal similarity; measuring a second similarity between only the words forming the selected cluster and the phonemic string; and outputting an optimal word through the similarity measurement.
[5] The method of claim 4, wherein the phonemic decoding generates a phoneme sequence that is as accurate as possible for the inputted speech signal.
[6] A lexical decoding apparatus comprising: a detector for detecting a length of a phonemic string outputted through phonemic decoding of an inputted speech signal; a word selector for selecting recognition object words whose phonemic-string lengths are similar to the detected length of the phonemic string; and an edit distance measurer for measuring an edit distance from the phonemic string for the selected recognition object words, to output at least one recognition object word having a smallest edit distance from the phonemic string through the edit distance measurement.
[7] The apparatus of claim 6, wherein the phonemic decoding generates a phoneme sequence that is as accurate as possible for the inputted speech signal.
[8] The apparatus of claim 6, wherein the edit distance measurer determines a space to be searched in accordance with the number of phonemes in the edit distance measurement.
[9] A lexical decoding apparatus comprising: a clustering unit for creating clusters by clustering recognition object words having a high similarity, classified by the edit distances among all recognition object words; a representative word selector for selecting words representative of the created clusters; a first similarity measurer for measuring a similarity between the selected words and a phonemic string outputted through phonemic decoding of a speech signal, and selecting the cluster corresponding to the word having an optimal similarity; and a second similarity measurer for measuring a similarity between only the words forming the selected cluster and the phonemic string, to output an optimal word through the similarity measurement.
[10] The apparatus of claim 9, wherein the phonemic decoding generates a phoneme sequence that is as accurate as possible for the inputted speech signal.
PCT/KR2008/007481 2007-12-17 2008-12-17 Method and apparatus for lexical decoding WO2009078665A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
KR1020070132546A KR20090065102A (en) 2007-12-17 2007-12-17 Method and apparatus for lexical decoding
KR10-2007-0132546 2007-12-17

Publications (1)

Publication Number Publication Date
WO2009078665A1 true WO2009078665A1 (en) 2009-06-25

Family

ID=40795704

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2008/007481 WO2009078665A1 (en) 2007-12-17 2008-12-17 Method and apparatus for lexical decoding

Country Status (2)

Country Link
KR (1) KR20090065102A (en)
WO (1) WO2009078665A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107742516A (en) * 2017-09-29 2018-02-27 上海与德通讯技术有限公司 Intelligent identification Method, robot and computer-readable recording medium
EP3113176B1 (en) * 2015-06-30 2019-04-03 Samsung Electronics Co., Ltd. Speech recognition

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101579544B1 (en) * 2014-09-04 2015-12-23 에스케이 텔레콤주식회사 Apparatus and Method for Calculating Similarity of Natural Language
KR20210016767A (en) 2019-08-05 2021-02-17 삼성전자주식회사 Voice recognizing method and voice recognizing appratus

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0703566A1 (en) * 1994-09-23 1996-03-27 Aurelio Oskian Device for recognizing speech
US6101470A (en) * 1998-05-26 2000-08-08 International Business Machines Corporation Methods for generating pitch and duration contours in a text to speech system
US6501833B2 (en) * 1995-05-26 2002-12-31 Speechworks International, Inc. Method and apparatus for dynamic adaptation of a large vocabulary speech recognition system and for use of constraints from a database in a large vocabulary speech recognition system

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0703566A1 (en) * 1994-09-23 1996-03-27 Aurelio Oskian Device for recognizing speech
US6501833B2 (en) * 1995-05-26 2002-12-31 Speechworks International, Inc. Method and apparatus for dynamic adaptation of a large vocabulary speech recognition system and for use of constraints from a database in a large vocabulary speech recognition system
US6101470A (en) * 1998-05-26 2000-08-08 International Business Machines Corporation Methods for generating pitch and duration contours in a text to speech system

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3113176B1 (en) * 2015-06-30 2019-04-03 Samsung Electronics Co., Ltd. Speech recognition
CN107742516A (en) * 2017-09-29 2018-02-27 上海与德通讯技术有限公司 Intelligent identification Method, robot and computer-readable recording medium
CN107742516B (en) * 2017-09-29 2020-11-17 上海望潮数据科技有限公司 Intelligent recognition method, robot and computer readable storage medium

Also Published As

Publication number Publication date
KR20090065102A (en) 2009-06-22

Similar Documents

Publication Publication Date Title
US20240161732A1 (en) Multi-dialect and multilingual speech recognition
CN110534095B (en) Speech recognition method, apparatus, device and computer readable storage medium
US9934777B1 (en) Customized speech processing language models
US10210862B1 (en) Lattice decoding and result confirmation using recurrent neural networks
US7725319B2 (en) Phoneme lattice construction and its application to speech recognition and keyword spotting
JP4215418B2 (en) Word prediction method, speech recognition method, speech recognition apparatus and program using the method
KR100845428B1 (en) Speech recognition system of mobile terminal
US20110077943A1 (en) System for generating language model, method of generating language model, and program for language model generation
EP2685452A1 (en) Method of recognizing speech and electronic device thereof
JP2007047818A (en) Method and apparatus for speech recognition using optimized partial mixture tying of probability
Alon et al. Contextual speech recognition with difficult negative training examples
JP2004362584A (en) Discrimination training of language model for classifying text and sound
KR20040073398A (en) Method and apparatus for predicting word error rates from text
WO2012001458A1 (en) Voice-tag method and apparatus based on confidence score
JP2001092496A (en) Continuous voice recognition device and recording medium
Karita et al. Sequence training of encoder-decoder model using policy gradient for end-to-end speech recognition
Hu et al. Phoneme-based contextualization for cross-lingual speech recognition in end-to-end models
WO2012004955A1 (en) Text correction method and recognition method
JP5180800B2 (en) Recording medium for storing statistical pronunciation variation model, automatic speech recognition system, and computer program
CN117099157A (en) Multitasking learning for end-to-end automatic speech recognition confidence and erasure estimation
JP5688761B2 (en) Acoustic model learning apparatus and acoustic model learning method
Shaik et al. Hierarchical hybrid language models for open vocabulary continuous speech recognition using WFST.
WO2009078665A1 (en) Method and apparatus for lexical decoding
KR20230156125A (en) Lookup table recursive language model
KR101483947B1 (en) Apparatus for discriminative training acoustic model considering error of phonemes in keyword and computer recordable medium storing the method thereof

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 08861989

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 08861989

Country of ref document: EP

Kind code of ref document: A1