WO2009078665A1 - Method and apparatus for lexical decoding - Google Patents

Method and apparatus for lexical decoding Download PDF

Info

Publication number
WO2009078665A1
Authority
WO
WIPO (PCT)
Prior art keywords
phonemic
words
decoding
edit distance
similarity
Prior art date
Application number
PCT/KR2008/007481
Other languages
French (fr)
Inventor
Hoon Chung
Yunkeun Lee
Jeon Gue Park
Ho-Young Jung
Hyung-Bae Jeon
Original Assignee
Electronics And Telecommunications Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Electronics And Telecommunications Research Institute filed Critical Electronics And Telecommunications Research Institute
Publication of WO2009078665A1 publication Critical patent/WO2009078665A1/en

Links

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/08 - Speech classification or search
    • G10L 15/10 - Speech classification or search using distance or distortion measures between unknown speech and reference templates
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G10L 2015/025 - Phonemes, fenemes or fenones being the recognition units

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to a lexical decoding method including detecting the length of a phonemic string outputted through phonemic decoding of an inputted speech signal, selecting recognition object words whose phonemic-string lengths are similar to the detected length, measuring an edit distance from the phonemic string for the selected recognition object words, and outputting at least one recognition object word having the smallest edit distance from the phonemic string as a result of the edit distance measurement. The phonemic decoding obtains the phonemic string of the phonemic model having the maximum similarity to the inputted speech signal.

Description

METHOD AND APPARATUS FOR LEXICAL DECODING
Technical Field
[1] The present invention relates to a human speech recognition (hereinafter, 'HSR') technique; and more particularly, to a lexical decoding method and apparatus suitable for improving lexical decoding speed using a word clustering scheme and length information for an input phoneme string.
[2] This work was supported by the IT R&D program of MIC/IITA [2006-S-036-02, Development of large vocabulary/interactive distributed/embedded VUI for new growth engine industries].
[3]
Background Art
[4] In general, HSR-based speech recognition is performed by phonemic decoding and lexical decoding.
[5] The HSR-based speech recognition will be described in detail with reference to Fig. 1.
[6] Fig. 1 depicts a block diagram showing a typical human speech recognition system.
[7] Referring to Fig. 1, a human speech recognition system 100 includes a phonemic decoding unit 102, a lexical decoding unit 104, and an acoustic rescoring unit 106.
[8] The phonemic decoding unit 102 performs a process of obtaining the sequence of phonetic symbols that has the maximum similarity to the inputted speech signal. In the phonemic decoding process, ambient noise, an inaccurate acoustic model, and so on may cause errors to be included in the decoded phonemic string. Accordingly, the lexical decoding unit 104 outputs the N best words, i.e., the N recognition object words having the smallest edit distance to the decoded phonemic string produced by the phonemic decoding unit 102. Further, the N best words outputted from the lexical decoding unit 104 are sent to the acoustic rescoring unit 106, which rescores the N words using a sophisticated acoustic model to re-adjust the recognition priority and output the recognition result.
[9] In the conventional human speech recognition system operating as described above, the word or word sequence with the maximum similarity to the speech inputted by the user is outputted through phonemic decoding, lexical decoding, and acoustic rescoring. Even though such multi-stage processing performs speech recognition faster than a conventional speech recognition system based on a continuous-density hidden Markov model (CDHMM), it still takes too much time to be practical when recognizing more than hundreds of thousands of words. Therefore, the present invention proposes a method to speed up the decoding time for recognition tasks with a huge number of words. [10]
Disclosure of Invention
Technical Problem
[11] It is, therefore, a primary object of the present invention to provide a lexical decoding method and apparatus capable of improving lexical decoding speed using a word clustering scheme and length information for an input phoneme string which is a result of phonemic decoding.
[12] It is, therefore, another object of the present invention to provide a lexical decoding method and apparatus capable of constraining the search words upon lexical decoding based on the length of a phonemic string outputted via phonemic decoding, and of imposing a path constraint so that, in measuring the edit distance, only paths whose phoneme-length difference is small are searched.
[13] It is, therefore, still another object of the present invention to provide a lexical decoding method and apparatus for dividing all recognition object words into clusters representative of the words using an edit distance-based clustering scheme, measuring the similarity between an inputted phonemic string and the words representative of the clusters to select the N best clusters, and then performing the lexical decoding on only the words forming those clusters.
[14]
Technical Solution
[15] In accordance with one aspect of the invention, there is provided a lexical decoding method including detecting a length of a phonemic string outputted through phonemic decoding of an inputted speech signal; selecting recognition object words whose phonemic-string lengths are similar to the detected length; measuring an edit distance from the phonemic string for the selected recognition object words; and outputting at least one recognition object word having the smallest edit distance from the phonemic string as a result of the edit distance measurement.
[16] It is desirable that the phonemic decoding generates a phoneme sequence that is as accurate as possible for the inputted speech signal.
[17] It is also desirable that the method further includes determining a space to be searched depending on the number of phonemes in measuring the edit distance.
[18] In accordance with another aspect of the invention, there is provided a lexical decoding method including creating clusters by clustering recognition object words having a high similarity, classified by the edit distances among all recognition object words; selecting words representative of the created clusters; measuring a first similarity between the selected words and a phonemic string outputted through phonemic decoding of a speech signal, to select the cluster corresponding to the selected word having an optimal similarity; measuring a second similarity between only the words forming the selected cluster and the phonemic string; and outputting the optimal words through the similarity measurement.
[19] It is preferable that the phonemic decoding generates a phoneme sequence that is as accurate as possible for the inputted speech signal.
[20] In accordance with still another aspect of the invention, there is provided a lexical decoding apparatus including a detector for detecting a length of a phonemic string outputted through phonemic decoding of an inputted speech signal; a word selector for selecting recognition object words whose phonemic-string lengths are similar to the detected length; and an edit distance measurer for measuring an edit distance from the phonemic string for the selected recognition object words, and outputting at least one recognition object word having the smallest edit distance from the phonemic string as a result of the edit distance measurement.
[21] It is preferable that the phonemic decoding generates a phoneme sequence that is as accurate as possible for the inputted speech signal.
[22] It is also preferred that the edit distance measurer determines the space to be searched in accordance with the number of phonemes in the edit distance measurement.
[23] In accordance with still another aspect of the invention, there is provided a lexical decoding apparatus including a clustering unit for creating clusters by clustering recognition object words having a high similarity, classified by the edit distances among all recognition object words; a representative word selector for selecting words representative of the created clusters; a first similarity measurer for measuring a similarity between the selected words and a phonemic string outputted through phonemic decoding of a speech signal, and selecting the cluster corresponding to the word having an optimal similarity; and a second similarity measurer for measuring a similarity between only the words forming the selected cluster and the phonemic string, to output the optimal words through the similarity measurement.
[24] It is preferable that the phonemic decoding generates a phoneme sequence that is as accurate as possible for the inputted speech signal.
[25]
Advantageous Effects
[26] The present invention has the following advantages: in the human speech recognition system, the lexical decoding process is performed by using the word clustering scheme or the length information for the input phoneme string, whereby the speed of the lexical decoding process can be improved and the total speech recognition time can be reduced without degrading speech recognition performance.
Brief Description of the Drawings
[27] The above and other objects and features of the present invention will become apparent from the following description of embodiments given in conjunction with the accompanying drawings, in which:
[28] Fig. 1 shows a block diagram representing a typical human speech recognition system;
[29] Fig. 2 depicts a block diagram representing a lexical decoding structure that employs a scheme of reducing a search space using length information for a phonemic string in accordance with a preferred embodiment of the present invention;
[30] Fig. 3 illustrates a block diagram showing a 2-step lexical decoding structure using a word clustering scheme in accordance with a preferred embodiment of the present invention;
[31] Fig. 4 illustrates a lexical decoding process based on phonemic length in accordance with a preferred embodiment of the present invention;
[32] Fig. 5 describes how a search space is restricted upon edit distance measurement based on phonemic length in accordance with a preferred embodiment of the present invention; and
[33] Fig. 6 shows a flowchart representing a word clustering algorithm based on an edit distance between words depending on a phonemic length in accordance with a preferred embodiment of the present invention.
[34]
Best Mode for Carrying Out the Invention
[35] Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings so that they can be readily implemented by one of ordinary skill in the art.
[36] The present invention is directed to improvement of lexical decoding speed using a word clustering scheme and length information for an input phoneme string. The search words are restricted upon lexical decoding based on the length of the phonemic string outputted through phonemic decoding, and a constraint is imposed so that, upon edit distance measurement, the search covers only a band proportional to the difference in phoneme number. Furthermore, all recognition object words are divided into clusters that are representative of all the recognition object words using an edit distance-based clustering scheme, the similarity between the inputted phonemic string and the words representative of the clusters is measured to select the N best clusters, and the lexical decoding is performed on only the words forming those clusters.
[37] [38] (Embodiments)
[39] In the human speech recognition system 100 described in Fig. 1, human speech recognition is performed by the phonemic decoding unit 102, the lexical decoding unit 104, and the acoustic rescoring unit 106.
[40] The phonemic decoding unit 102 converts an inputted speech signal into a phonemic string through a phonemic decoding process, and the lexical decoding unit 104 recognizes recognition object words from the phonemic string converted through the phonemic decoding process.
[41] In the phonemic decoding process, a phonemic string P = p_1, p_2, ..., p_N having the maximum posterior probability is obtained from a feature vector string X = x_1, x_2, ..., x_T of a given speech signal, as represented by Equation 1.
[42] MathFigure 1
[Math.1]
\hat{P} = \arg\max_P \Pr(P \mid X)
[43]
= \arg\max_P \frac{\Pr(X \mid P)\,\Pr(P)}{\Pr(X)}
[44]
= \arg\max_P \Pr(X \mid P)\,\Pr(P)
[45] To solve Equation 1, two probability models, Pr(X|P) and Pr(P), are necessary. Pr(X|P) denotes the conditional probability of the acoustic feature vectors observed from the phonemes and is typically modeled using a Hidden Markov Model (HMM). Pr(P) is defined as a probability model representing the connection relationship between the phonemes forming a word and is called a language model. Pr(P) is represented by Equation 2.
[46] MathFigure 2
[Math.2]
\Pr(p_1, \ldots, p_t) = \Pr(p_t \mid p_{t-1}, p_{t-2}, \ldots, p_1)\,\Pr(p_{t-1} \mid p_{t-2}, \ldots, p_1) \cdots \Pr(p_1)
[47]
= \prod_{i=1}^{t} \Pr(p_i \mid p_{i-1}, \ldots, p_1)
[48] However, most actual speech recognition systems use an n-gram language model as represented by Equation 3, which approximates Equation 2 on the assumption that the current phoneme is affected by only the previous (n-1) phonemes. A 2-gram or 3-gram is typically used.
[49] MathFigure 3
[Math.3]
\Pr(p_1, \ldots, p_t) \approx \prod_{i=1}^{t} \Pr(p_i \mid p_{i-1}, \ldots, p_{i-n+1})
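For illustration, a toy bigram (n = 2) version of Equation 3 might look like the following minimal sketch; the probability table, the "<s>" start symbol, and the floor value are invented for the example and are not part of the patent.

```python
import math

# Hypothetical bigram probabilities Pr(p_i | p_{i-1}); "<s>" marks the string start.
bigram = {("<s>", "g"): 0.4, ("g", "o"): 0.5, ("o", "r"): 0.3}

def log_prob(phonemes, lm, floor=1e-6):
    """log Pr(p_1, ..., p_t) under the 2-gram approximation of Equation 3."""
    prev, total = "<s>", 0.0
    for p in phonemes:
        total += math.log(lm.get((prev, p), floor))  # unseen pairs fall back to a small floor
        prev = p
    return total

print(log_prob(["g", "o", "r"], bigram))
```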
[50] To decode a word string from the recognized phonemic string, the lexical decoding process takes the decoded phoneme sequence, which may include mis-recognized phonemes, as shown in Fig. 2, and applies the dynamic programming algorithm of Equation 4, which obtains the word whose correct phonemic string has the smallest edit distance to the decoded string among the recognition object words.
[51] MathFigure 4
[Math.4]
Q(x, y) = \min \begin{cases} Q(x-1, y-1) + C(c_x, t_y) \\ Q(x-1, y) + C(c_x, \varepsilon) \\ Q(x, y-1) + C(\varepsilon, t_y) \end{cases}
[52] In Equation 4, Q(x,y) denotes the accumulated distance, C(c_x, t_y) denotes the cost when a substitution error from a reference phoneme c_x to a test phoneme t_y occurs, C(c_x, ε) denotes the cost when a deletion error of the reference phoneme c_x occurs, and C(ε, t_y) denotes the cost when an insertion error of t_y occurs. The costs are represented by the negative logarithm of the phoneme confusion probability, as in Equation 5.
[53] MathFigure 5
[Math.5]
C(c_x, t_y) = -\log \Pr(t_y \mid c_x)
C(c_x, \varepsilon) = -\log \Pr(\varepsilon \mid c_x)
C(\varepsilon, t_y) = -\log \Pr(t_y \mid \varepsilon)
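A minimal sketch of the dynamic program of Equations 4 and 5 follows. The function name, the confusion-probability callable, and the toy model at the bottom are illustrative assumptions, not the patent's implementation; the empty string stands in for the epsilon symbol.

```python
import math

def edit_distance(ref, hyp, confusion):
    """Accumulated distance Q of Equation 4 between a reference phonemic string
    `ref` and a decoded (test) phonemic string `hyp`. `confusion(a, b)` returns
    Pr(b | a); "" stands for epsilon (deletion/insertion)."""
    cost = lambda a, b: -math.log(max(confusion(a, b), 1e-10))  # Equation 5
    n, m = len(ref), len(hyp)
    Q = [[math.inf] * (m + 1) for _ in range(n + 1)]
    Q[0][0] = 0.0
    for x in range(n + 1):
        for y in range(m + 1):
            if x > 0 and y > 0:   # substitution (or exact match)
                Q[x][y] = min(Q[x][y], Q[x - 1][y - 1] + cost(ref[x - 1], hyp[y - 1]))
            if x > 0:             # deletion of the reference phoneme
                Q[x][y] = min(Q[x][y], Q[x - 1][y] + cost(ref[x - 1], ""))
            if y > 0:             # insertion of the test phoneme
                Q[x][y] = min(Q[x][y], Q[x][y - 1] + cost("", hyp[y - 1]))
    return Q[n][m]

# Toy confusion model: high probability for exact matches, small otherwise.
toy = lambda a, b: 0.9 if a == b and a != "" else 0.01
print(edit_distance(["g", "o", "r", "jv"], ["g", "o", "r", "E"], toy))
```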
[54] A scheme using length information for the input phoneme string and a word clustering scheme are used to improve lexical decoding speed. A method for improving lexical decoding speed through the scheme using phonemic length information will be described first.
[55] Fig. 2 depicts a block diagram representing a lexical decoding structure that employs a scheme of reducing a search space using length information of a phonemic string in accordance with an embodiment of the present invention.
[56] Referring to Fig. 2, if a phonemic string of N phonemes is outputted as a result of decoding the phonemes, there is a high possibility that the recognition object word having the smallest edit distance consists of about N phonemes. Accordingly, a detector 200 of the phonemic number of the phonemic string in the lexical decoding unit 104 detects the number of phonemes within the inputted phonemic string. The detected phonemic number is sent to the selector 204 based on the phonemic number. The selector 204 selects, from the recognition object words 202, those words whose phonemic strings have a phonemic number similar to that of the received phonemic string, and sends the selected recognition object words 202 to an edit distance measurer 206 based on the length of the phonemic string, which operates under the restriction of the number of phonemes.
[57] The edit distance measurer 206 measures the edit distance between the phonemic string and the pronunciation string of each selected recognition object word 202, and outputs the N words having the smallest edit distances as the result of lexical decoding. In principle, the dynamic program shown in Equation 4 must be applied to all the words in order to obtain the edit distance between the phonemic recognition result and each word. However, the total search time can be reduced by constraining the search words to those whose phonemic number is similar to the number of decoded phonemes, through the selector 204, and by imposing a constraint on the search space even when Equation 4 is applied to obtain the edit distance for the selected words.
[58] Fig. 4 illustrates a lexical decoding process based on phonemic length in accordance with an embodiment of the present invention.
[59] Referring to Fig. 4, if "g o r jv g E b a xl" 400, which consists of nine phonemes, is outputted as a result of phoneme decoding for speech pronounced by a user, the respective edit distances between this string and the two words "G O R YEO G EU R EU T M U L R YU S E N T EO" 404 and "G O R YEO G EO N EO P" 402 are measured. In this case, there is a high possibility that "G O R YEO G EO N EO P" 402 has the shorter edit distance. This is because its phonemic number of 9 is closer to the input phonemic number than the phonemic number of 19 of "G O R YEO G EU R EU T M U L R YU S E N T EO" 404, and accordingly "G O R YEO G EO N EO P" 402 incurs a relatively smaller insertion cost. Given the phonemic number obtained from phoneme decoding, only the words whose phonemic number differs from the inputted phonemic number by at most a predetermined number (delta) become search objects among all recognition object words. Thus, the N words having the shortest edit distances can be obtained without computing edit distances for all the words.
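As a rough sketch of this length-based selection, the snippet below filters a lexicon by phoneme count; the delta value, the lexicon entries, and the romanized word keys are illustrative assumptions rather than values taken from the patent.

```python
def select_by_length(lexicon, n_input_phonemes, delta=2):
    """Keep only recognition object words whose pronunciation length is
    within `delta` phonemes of the decoded phonemic-string length."""
    return [w for w, phones in lexicon.items()
            if abs(len(phones) - n_input_phonemes) <= delta]

lexicon = {
    "GORYEOGEONEOP": ["G", "O", "R", "YEO", "G", "EO", "N", "EO", "P"],  # 9 phonemes
    "GORYEOGEUREUTMULRYUSENTEO": ["G", "O", "R", "YEO", "G", "EU", "R", "EU", "T",
                                  "M", "U", "L", "R", "YU", "S", "E", "N", "T", "EO"],  # 19 phonemes
}
decoded = ["g", "o", "r", "jv", "g", "E", "b", "a", "xl"]  # 9 phonemes, as in Fig. 4
print(select_by_length(lexicon, len(decoded)))             # only the 9-phoneme word survives
```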
[60] As described above, for the words selected by phonemic length, the edit distance between two phonemic strings is obtained using the dynamic program shown in Equation 4. In this process, a global constraint is applied. In obtaining the edit distance for two phonemic strings having the same length, the optimal edit distance over the entire search space formed by the two given phonemic strings is obtained through the dynamic program, as in Fig. 5.
[61] Fig. 5 describes how the search space is restricted upon edit distance measurement based on phonemic length in accordance with an embodiment of the present invention.
[62] Referring to Fig. 5, when it is assumed that the two phonemic strings each consist of four phonemes, the search would normally be performed on all nodes in the entire 4x4 search space. However, the same performance is achieved even when the search space is constrained by the difference in length between the two phonemic strings.
[63] For example, when the two phonemic strings have the same length, the search may be performed only in the search space corresponding to the slant-line area 500 indicated by a solid line in Fig. 5. However, when there is a difference in length between the two phonemic strings, the search space must widen from the diagonal line in proportion to that difference. An area 502 indicated by dotted lines corresponds to a length difference of 1 between the two phonemic strings, and an area 504 indicated by thick dotted lines corresponds to a length difference of 2. Thus, the search space is constrained, i.e., determined in proportion to the difference in length between the two phonemic strings whose edit distance is to be obtained, so that the search time can be greatly reduced compared with a conventional scheme that searches the entire area.
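A sketch of this global path constraint, under the assumption that the diagonal band grows with the length difference of the two strings (plus an optional extra margin); it reuses the same cost convention as the earlier edit_distance sketch and is illustrative only.

```python
import math

def banded_edit_distance(ref, hyp, confusion, extra_band=0):
    """Equation 4 restricted to a diagonal band, as in Fig. 5: cells are
    evaluated only where |x - y| <= |len(ref) - len(hyp)| + extra_band."""
    cost = lambda a, b: -math.log(max(confusion(a, b), 1e-10))
    n, m = len(ref), len(hyp)
    band = abs(n - m) + extra_band          # search space widens with the length gap
    Q = [[math.inf] * (m + 1) for _ in range(n + 1)]
    Q[0][0] = 0.0
    for x in range(n + 1):
        for y in range(max(0, x - band), min(m, x + band) + 1):
            if x > 0 and y > 0:
                Q[x][y] = min(Q[x][y], Q[x - 1][y - 1] + cost(ref[x - 1], hyp[y - 1]))
            if x > 0:
                Q[x][y] = min(Q[x][y], Q[x - 1][y] + cost(ref[x - 1], ""))
            if y > 0:
                Q[x][y] = min(Q[x][y], Q[x][y - 1] + cost("", hyp[y - 1]))
    return Q[n][m]
```

With equal-length strings and extra_band = 0, only the diagonal (area 500) is searched; a length difference of 1 or 2 widens the band to areas 502 and 504, respectively.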
[64] Meanwhile, another scheme for improving the lexical decoding speed includes performing a lexical decoding process in two steps.
[65] Fig. 3 illustrates a block diagram showing a 2-step lexical decoding structure using a word clustering scheme in accordance with an embodiment of the present invention.
[66] Referring to Fig. 3, in the lexical decoding unit 104, a unit 306 for clustering words having high confusion based on the edit distance groups recognition object words 304 having a high similarity into clusters. A selector 308 of words representative of the clusters selects a word representative of each cluster. A similarity measurer 300 between the representative words and the inputted phonemic string measures their similarity to select the N best clusters. A similarity measurer 302 among all words within the N best clusters performs the second step of the lexical decoding on all words included in the N best clusters, thereby improving the total speed of lexical decoding.
[67] In this scheme, creation of N clusters representative of all recognition object words must be followed by selection of words representative of the N clusters. This is a process of obtaining N representative clusters from all the words through a process of dividing clusters as shown in Fig. 6.
[68] Fig. 6 shows a flowchart representing a word clustering algorithm based on an edit distance between words depending on a phonemic length in accordance with an embodiment of the present invention.
[69] In Fig. 6, the operations of the clustering unit 306 and the selector 308 of representative words are illustrated. In step 600, all recognition object words are inputted, and in step 602, the edit distances between all recognition object words are measured. In step 604, the words having the smallest edit distances are incorporated into their respective clusters. In step 606, the total edit distance is measured and a determination is made as to whether it is greater than a threshold value. If the total edit distance is greater than the threshold value, the process proceeds to step 608. In step 608, the cluster having the greatest total edit distance is divided and the process returns to step 602, where the edit distances to the clusters are measured and the words are again incorporated into clusters. This procedure is repeated, and clustering ends only when the measured total edit distance is smaller than the threshold value.
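A rough sketch of the divisive clustering loop of Fig. 6, under two assumptions the patent does not fix: a cluster's representative is taken to be its medoid (the member with the smallest summed edit distance to the others), and `edit_distance` is any phoneme-level edit distance such as the one sketched earlier. Names and the splitting heuristic are illustrative.

```python
def cluster_words(lexicon, edit_distance, threshold):
    """Split clusters until the total within-cluster edit distance drops below
    `threshold` (steps 600-608 of Fig. 6). `lexicon` maps each word to its
    phoneme list. Returns the clusters and one representative word per cluster."""
    def dist(a, b):
        return edit_distance(lexicon[a], lexicon[b])

    def medoid(cluster):                       # representative = word closest to the others
        return min(cluster, key=lambda w: sum(dist(w, v) for v in cluster))

    clusters = [list(lexicon)]                 # step 600: all recognition object words
    while True:
        reps = [medoid(c) for c in clusters]
        # steps 602-604: assign every word to the nearest representative
        assigned = [[] for _ in reps]
        for w in lexicon:
            assigned[min(range(len(reps)), key=lambda i: dist(reps[i], w))].append(w)
        clusters = [c for c in assigned if c]
        reps = [medoid(c) for c in clusters]
        totals = [sum(dist(r, v) for v in c) for c, r in zip(clusters, reps)]
        # step 606: stop when the total edit distance is small enough
        if sum(totals) <= threshold or max(len(c) for c in clusters) < 2:
            return clusters, reps
        worst = max(range(len(clusters)), key=lambda i: totals[i])
        half = max(1, len(clusters[worst]) // 2)   # step 608: split the worst cluster in two
        clusters = (clusters[:worst]
                    + [clusters[worst][:half], clusters[worst][half:]]
                    + clusters[worst + 1:])
```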
[70] Thus, the process of obtaining the N clusters representative of all the words is referred to as a quantization process of the recognition object words. In the lexical decoding process for the quantized words, the similarity measurer 300 measures the similarity between the words representative of the clusters and the inputted phonemic string to select the N clusters having the smallest distance value, i.e., the highest similarity. The similarity measurer 302 for the selected clusters then performs the lexical decoding again on all words forming those clusters, to output the result of the final lexical decoding.
[71] This 2-step lexical decoding process first finds the clusters, which are sets of words, and then finds the words belonging to those clusters. This can greatly increase the decoding speed compared with a conventional scheme that searches all words at once.
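A sketch of the two-step search of Fig. 3, reusing the hypothetical helpers above (`cluster_words` output and an edit-distance-based similarity); the N-best sizes and function names are assumptions made for illustration.

```python
def two_step_decode(phoneme_string, lexicon, clusters, reps, edit_distance,
                    n_best_clusters=3, n_best_words=5):
    """Step 1: compare the decoded phonemic string only with the cluster
    representatives and keep the N best clusters. Step 2: run lexical decoding
    only over the words inside those clusters."""
    sim = lambda w: edit_distance(lexicon[w], phoneme_string)   # smaller = more similar
    best = sorted(range(len(reps)), key=lambda i: sim(reps[i]))[:n_best_clusters]
    candidates = [w for i in best for w in clusters[i]]
    return sorted(candidates, key=sim)[:n_best_words]
```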
[72] As described above, according to the present invention, the lexical decoding speed can be improved by using the word clustering scheme and the information on the length of the inputted phonemic string. The search words are constrained upon lexical decoding based on the length of the phonemic string outputted via phonemic decoding, and a constraint is imposed so that, upon edit distance measurement, the search covers only a band proportional to the difference in phonemic number. Furthermore, all recognition object words are divided into clusters representative of the words using the clustering scheme based on the edit distance, the similarity between the inputted phonemic string and the words representative of the clusters is then measured to select the N best clusters, and the lexical decoding is performed again on only the words forming those clusters.
[73] While the invention has been shown and described with respect to the embodiments, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the scope of the invention as defined in the following claims.

Claims

[1] A lexical decoding method comprising: detecting a length of a phonemic string outputted through phonemic decoding of an inputted speech signal; selecting recognition object words whose phonemic-string lengths are similar to the detected length of the phonemic string; measuring an edit distance from the phonemic string for the selected recognition object words; and outputting at least one recognition object word having a smallest edit distance from the phonemic string through the edit distance measurement.
[2] The method of claim 1, wherein the phonemic decoding generates a phoneme sequence that is as accurate as possible for the inputted speech signal.
[3] The method of claim 1, further comprising determining a space to be searched depending on the number of phonemes in measuring the edit distance.
[4] A lexical decoding method comprising: creating clusters by clustering recognition object words having a high similarity, classified by the edit distances among all recognition object words; selecting words representative of the created clusters; measuring a first similarity between the selected words and a phonemic string outputted through phonemic decoding of a speech signal, to select the cluster corresponding to the selected word having an optimal similarity; measuring a second similarity between only the words forming the selected cluster and the phonemic string; and outputting an optimal word through the similarity measurement.
[5] The method of claim 4, wherein the phonemic decoding generates a phoneme sequence that is as accurate as possible for the inputted speech signal.
[6] A lexical decoding apparatus comprising: a detector for detecting a length of a phonemic string outputted through phonemic decoding of an inputted speech signal; a word selector for selecting recognition object words whose phonemic-string lengths are similar to the detected length of the phonemic string; and an edit distance measurer for measuring an edit distance from the phonemic string for the selected recognition object words, to output at least one recognition object word having a smallest edit distance from the phonemic string through the edit distance measurement.
[7] The apparatus of claim 6, wherein the phonemic decoding generates a phoneme sequence that is as accurate as possible for the inputted speech signal.
[8] The apparatus of claim 6, wherein the edit distance measurer determines a space to be searched in accordance with the number of phonemes in the edit distance measurement.
[9] A lexical decoding apparatus comprising: a clustering unit for creating clusters by clustering recognition object words having a high similarity, classified by the edit distances among all recognition object words; a representative word selector for selecting words representative of the created clusters; a first similarity measurer for measuring a similarity between the selected words and a phonemic string outputted through phonemic decoding of a speech signal, and selecting the cluster corresponding to the word having an optimal similarity; and a second similarity measurer for measuring a similarity between only the words forming the selected cluster and the phonemic string, to output an optimal word through the similarity measurement.
[10] The apparatus of claim 9, wherein the phonemic decoding generates a phoneme sequence that is as accurate as possible for the inputted speech signal.
PCT/KR2008/007481 2007-12-17 2008-12-17 Method and apparatus for lexical decoding WO2009078665A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
KR1020070132546A KR20090065102A (en) 2007-12-17 2007-12-17 Method and apparatus for lexical decoding
KR10-2007-0132546 2007-12-17

Publications (1)

Publication Number Publication Date
WO2009078665A1 true WO2009078665A1 (en) 2009-06-25

Family

ID=40795704

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2008/007481 WO2009078665A1 (en) 2007-12-17 2008-12-17 Method and apparatus for lexical decoding

Country Status (2)

Country Link
KR (1) KR20090065102A (en)
WO (1) WO2009078665A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107742516A (en) * 2017-09-29 2018-02-27 上海与德通讯技术有限公司 Intelligent identification Method, robot and computer-readable recording medium
EP3113176B1 (en) * 2015-06-30 2019-04-03 Samsung Electronics Co., Ltd. Speech recognition

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101579544B1 (en) * 2014-09-04 2015-12-23 에스케이 텔레콤주식회사 Apparatus and Method for Calculating Similarity of Natural Language
KR20210016767A (en) 2019-08-05 2021-02-17 삼성전자주식회사 Voice recognizing method and voice recognizing appratus

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0703566A1 (en) * 1994-09-23 1996-03-27 Aurelio Oskian Device for recognizing speech
US6101470A (en) * 1998-05-26 2000-08-08 International Business Machines Corporation Methods for generating pitch and duration contours in a text to speech system
US6501833B2 (en) * 1995-05-26 2002-12-31 Speechworks International, Inc. Method and apparatus for dynamic adaptation of a large vocabulary speech recognition system and for use of constraints from a database in a large vocabulary speech recognition system

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0703566A1 (en) * 1994-09-23 1996-03-27 Aurelio Oskian Device for recognizing speech
US6501833B2 (en) * 1995-05-26 2002-12-31 Speechworks International, Inc. Method and apparatus for dynamic adaptation of a large vocabulary speech recognition system and for use of constraints from a database in a large vocabulary speech recognition system
US6101470A (en) * 1998-05-26 2000-08-08 International Business Machines Corporation Methods for generating pitch and duration contours in a text to speech system

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3113176B1 (en) * 2015-06-30 2019-04-03 Samsung Electronics Co., Ltd. Speech recognition
CN107742516A (en) * 2017-09-29 2018-02-27 上海与德通讯技术有限公司 Intelligent identification Method, robot and computer-readable recording medium
CN107742516B (en) * 2017-09-29 2020-11-17 上海望潮数据科技有限公司 Intelligent recognition method, robot and computer readable storage medium

Also Published As

Publication number Publication date
KR20090065102A (en) 2009-06-22

Similar Documents

Publication Publication Date Title
US20240161732A1 (en) Multi-dialect and multilingual speech recognition
CN110534095B (en) Speech recognition method, apparatus, device and computer readable storage medium
US9934777B1 (en) Customized speech processing language models
US10210862B1 (en) Lattice decoding and result confirmation using recurrent neural networks
US7725319B2 (en) Phoneme lattice construction and its application to speech recognition and keyword spotting
JP4215418B2 (en) Word prediction method, speech recognition method, speech recognition apparatus and program using the method
KR100845428B1 (en) Speech recognition system of mobile terminal
US20110077943A1 (en) System for generating language model, method of generating language model, and program for language model generation
EP2685452A1 (en) Method of recognizing speech and electronic device thereof
JP2007047818A (en) Method and apparatus for speech recognition using optimized partial mixture tying of probability
Alon et al. Contextual speech recognition with difficult negative training examples
JP2004362584A (en) Discrimination training of language model for classifying text and sound
KR20040073398A (en) Method and apparatus for predicting word error rates from text
WO2012001458A1 (en) Voice-tag method and apparatus based on confidence score
JP2001092496A (en) Continuous voice recognition device and recording medium
Karita et al. Sequence training of encoder-decoder model using policy gradient for end-to-end speech recognition
Hu et al. Phoneme-based contextualization for cross-lingual speech recognition in end-to-end models
WO2012004955A1 (en) Text correction method and recognition method
JP5180800B2 (en) Recording medium for storing statistical pronunciation variation model, automatic speech recognition system, and computer program
CN117099157A (en) Multitasking learning for end-to-end automatic speech recognition confidence and erasure estimation
JP5688761B2 (en) Acoustic model learning apparatus and acoustic model learning method
Shaik et al. Hierarchical hybrid language models for open vocabulary continuous speech recognition using WFST.
WO2009078665A1 (en) Method and apparatus for lexical decoding
KR20230156125A (en) Lookup table recursive language model
KR101483947B1 (en) Apparatus for discriminative training acoustic model considering error of phonemes in keyword and computer recordable medium storing the method thereof

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 08861989

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 08861989

Country of ref document: EP

Kind code of ref document: A1