KR20090065102A

KR20090065102A - Method and apparatus for lexical decoding

Info

Publication number: KR20090065102A
Application number: KR1020070132546A
Authority: KR
Inventors: 정훈; 이윤근; 박전규; 정호영; 전형배
Original assignee: 한국전자통신연구원
Priority date: 2007-12-17
Filing date: 2007-12-17
Publication date: 2009-06-22
Also published as: WO2009078665A1

Abstract

A lexical decoding method and apparatus are provided that shorten the overall execution time of speech recognition without degrading recognition performance. The length of a phoneme string output by phoneme decoding of an input speech signal is detected. Based on the detected length, recognition target words (202) whose lengths are similar to that of the phoneme string are selected. The edit distance to the phoneme string is measured for the selected recognition target words. Through this edit distance measurement, at least one recognition target word having the minimum edit distance to the phoneme string is output.

Description

Method and apparatus for lexical decoding {METHOD AND APPARATUS FOR LEXICAL DECODING}

The present invention relates to human speech recognition (HSR) technology and, more particularly, to a lexical decoding method and apparatus suited to improving lexical decoding speed by using word clustering and input phoneme length information.

The present invention is derived from research conducted as part of the IT Growth Engine Technology Development Project of the Ministry of Information and Communication and the Institute for Information Technology Advancement [Project management number: 2006-S-036-02, Project title: Development of large-capacity interactive distributed-processing speech interface technology for new growth engine industries].

In general, HSR-based speech recognition is performed through automatic phone decoding followed by lexical decoding.

A detailed description follows with reference to the drawings.

FIG. 1 is a block diagram showing the structure of a general user speech recognition system.

Referring to FIG. 1, the user speech recognition system 100 includes a phoneme decoding unit 102, a lexical decoding unit 104, and an acoustic rescoring unit 106.

The phoneme decoding unit 102 finds the sequence of phoneme models that yields the maximum likelihood for the input speech signal. Because of ambient noise, inaccurate acoustic models, and similar causes, the decoded phoneme string may contain errors. The lexical decoding unit 104 therefore corrects such errors in the phoneme string received from the phoneme decoding unit 102 and outputs the N-best words, that is, the N recognition target words whose edit distance to the phoneme string is smallest. The N-best words are then passed to the acoustic rescoring unit 106, which rescores the N candidate words with a detailed acoustic model, reorders the recognition ranking, and outputs the recognition result.

The conventional user speech recognition system operating as described above outputs the best match for the user's speech through phoneme decoding, lexical decoding, and acoustic rescoring. Because the input passes through these several stages, each stage takes a certain amount of time before recognition completes. Recognition can be made faster than this, but in the conventional system only at the cost of degraded recognition performance.

Accordingly, the present invention provides a lexical decoding method and apparatus capable of improving lexical decoding speed by using word clustering and input phoneme length information.

The present invention also provides a lexical decoding method and apparatus that restrict the words searched during lexical decoding based on the length of the phoneme string output by phoneme decoding, and that restrict the search during edit distance measurement to a region determined by the difference in the number of phonemes.

The present invention also provides a lexical decoding method and apparatus that partition all recognition target words into clusters using an edit-distance-based clustering technique, first measure the similarity between the input phoneme string and the word representing each cluster to select the N-best clusters, and then perform lexical decoding again on only the words belonging to those clusters.

A method according to one embodiment of the present invention includes: detecting the length of a phoneme string output by phoneme decoding of an input speech signal; selecting, based on the detected length, recognition target words whose length is similar to that of the phoneme string; measuring the edit distance between the phoneme string and the selected recognition target words; and outputting at least one recognition target word whose edit distance to the phoneme string is minimal, as determined by the edit distance measurement.

A method according to another embodiment of the present invention includes: generating clusters by clustering all recognition target words so that words that are highly similar in terms of edit distance are grouped together; selecting a word representing each generated cluster; measuring a first similarity between the selected representative words and a phoneme string output by phoneme decoding of a speech signal, and selecting the cluster whose representative word has the best similarity; measuring a second similarity between the phoneme string and only the words constituting the selected cluster; and outputting the best word as determined by the similarity measurement.

An apparatus according to one embodiment of the present invention includes: a detection unit that detects the length of a phoneme string output by phoneme decoding of an input speech signal; a word selection unit that selects recognition target words whose length is similar to the detected length of the phoneme string; and an edit distance measurement unit that measures the edit distance between the phoneme string and the selected recognition target words and outputs at least one recognition target word whose edit distance to the phoneme string is minimal.

An apparatus according to another embodiment of the present invention includes: a clustering unit that generates clusters by clustering all recognition target words so that words that are highly similar in terms of edit distance are grouped together; a representative word selection unit that selects a word representing each generated cluster; a first similarity measurement unit that measures the similarity between the selected representative words and a phoneme string output by phoneme decoding of a speech signal, and selects the cluster whose representative word has the best similarity; and a second similarity measurement unit that measures the similarity between the phoneme string and only the words constituting the selected cluster, and outputs the best word as determined by that measurement.

The effects obtained by representative aspects of the disclosed invention are briefly described as follows.

When lexical decoding is performed in a user speech recognition system, the present invention performs it using word clustering or input phoneme length information. This speeds up lexical decoding and shortens the overall speech recognition time without degrading recognition performance.

The operating principle of the present invention is described in detail below with reference to the accompanying drawings. In the following description, detailed explanations of well-known functions or configurations are omitted where they would unnecessarily obscure the subject matter of the invention. The terms used below are defined in consideration of their functions in the present invention and may vary according to the intention or custom of users and operators. Their definitions should therefore be based on the content of this specification as a whole.

The present invention improves lexical decoding speed by using word clustering and input phoneme length information. In one approach, the words searched during lexical decoding are restricted based on the length of the phoneme string output by phoneme decoding, and the edit distance computation is restricted to a region determined by the difference in the number of phonemes. In the other approach, all recognition target words are partitioned into clusters using an edit-distance-based clustering technique; the input phoneme string is first compared with the word representing each cluster to select the N-best clusters, and lexical decoding is then performed again on only the words belonging to those clusters.

As described with reference to FIG. 1, user speech recognition in the user speech recognition system 100 is carried out by the phoneme decoding unit 102, the lexical decoding unit 104, and the acoustic rescoring unit 106.

The phoneme decoding unit 102 converts the input speech signal into a phoneme string through phoneme decoding, and the lexical decoding unit 104 recognizes recognition target words from that phoneme string through lexical decoding.

Phoneme decoding is the process of finding, for the feature vector sequence $X$ of a given speech signal, the phoneme string $\hat{P}$ with the maximum posterior probability, as in Equation 1.

(Equation 1)
$$\hat{P} = \arg\max_{P} \Pr(P \mid X) = \arg\max_{P} \Pr(X \mid P)\,\Pr(P)$$

Solving Equation 1 requires two probability models, $\Pr(X \mid P)$ and $\Pr(P)$.

$\Pr(X \mid P)$ is the conditional probability of the observed acoustic feature vectors given the phonemes and is generally modeled with a hidden Markov model (HMM).

$\Pr(P)$ defines the connections between the phonemes that make up a word as a probability model. It is called the language model and, for a phoneme string $P = p_1 p_2 \cdots p_M$, is given by Equation 2.

(Equation 2)
$$\Pr(P) = \prod_{i=1}^{M} \Pr(p_i \mid p_1, \ldots, p_{i-1})$$

In most practical speech recognition systems, however, Equation 2 is approximated by an n-gram as in Equation 3, under the assumption that the current phoneme depends only on the preceding (n-1) phonemes. Here $p_i$ is the $i$-th phoneme of the string $P = p_1 p_2 \cdots p_M$.

(Equation 3)
$$\Pr(P) \approx \prod_{i=1}^{M} \Pr(p_i \mid p_{i-n+1}, \ldots, p_{i-1})$$

A 2-gram or 3-gram is typically used.
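As a rough illustration of the n-gram approximation, the following Python sketch estimates a 2-gram phoneme model from a handful of pronunciations and scores a phoneme string with it. The training strings, the `^` start symbol, and the add-one smoothing are assumptions made for this example; the patent does not specify how the language model is trained or smoothed.

```python
import math
from collections import Counter

START = "^"  # hypothetical start-of-string symbol


def train_bigram(pronunciations):
    """Count phoneme contexts and bigrams over a list of phoneme sequences."""
    context_counts = Counter()
    bigram_counts = Counter()
    vocab = set()
    for phones in pronunciations:
        prev = START
        for p in phones:
            context_counts[prev] += 1
            bigram_counts[(prev, p)] += 1
            vocab.add(p)
            prev = p
    return context_counts, bigram_counts, vocab


def log_prob(phones, model):
    """log Pr(P) under the 2-gram approximation, with add-one smoothing."""
    context_counts, bigram_counts, vocab = model
    total = 0.0
    prev = START
    for p in phones:
        numerator = bigram_counts[(prev, p)] + 1
        denominator = context_counts[prev] + len(vocab)
        total += math.log(numerator / denominator)
        prev = p
    return total


# Toy training data (made up for illustration).
model = train_bigram([["g", "o", "r", "jv"], ["g", "o", "g", "E"], ["b", "a", "xl"]])
seen = log_prob(["g", "o", "r"], model)    # follows bigrams seen in training
unseen = log_prob(["r", "g", "b"], model)  # uses only unseen bigrams
```

A string built from frequently observed phoneme pairs receives a higher log probability than one built from unseen pairs, which is the property the phoneme decoder relies on.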

Lexical decoding, which decodes a word sequence from the recognized phoneme string, is based on a dynamic programming algorithm such as Equation 4. As in FIG. 2, it finds the word for which the edit distance between the phoneme decoding result, which may contain misrecognized phonemes, and the correct phoneme string of the recognition target word is smallest.

(Equation 4)
$$Q(x, y) = \min \begin{cases} Q(x-1, y-1) + C(c_x, t_y) \\ Q(x-1, y) + C(c_x, \varepsilon) \\ Q(x, y-1) + C(\varepsilon, t_y) \end{cases}$$

Here, $Q(x, y)$ is the cumulative distance, $C(c_x, t_y)$ is the cost function for a substitution error in which the reference phoneme $c_x$ is replaced by $t_y$, $C(c_x, \varepsilon)$ is the cost of a deletion error on the reference phoneme $c_x$, and $C(\varepsilon, t_y)$ is the cost of an insertion error in which $t_y$ is inserted at the reference phoneme $c_x$. These costs are expressed as the negative logarithm of the phoneme confusion probability, as in Equation 5.

(Equation 5)
$$C(c_x, t_y) = -\log P(t_y \mid c_x), \qquad C(c_x, \varepsilon) = -\log P(\varepsilon \mid c_x), \qquad C(\varepsilon, t_y) = -\log P(t_y \mid \varepsilon)$$

The following describes two ways to improve lexical decoding speed: a method using input phoneme length information and a word clustering method. The method using phoneme length information is described first.

FIG. 2 is a block diagram showing a lexical decoding structure that reduces the search space using the length information of the phoneme string, according to a preferred embodiment of the present invention.

Referring to FIG. 2, if phoneme decoding outputs a phoneme string of N phonemes, the recognition target word with the smallest edit distance is likely to consist of about N phonemes as well. The phoneme count detector 200 in the lexical decoding unit 104 therefore first detects the number of phonemes in the input phoneme string. The detected count is passed to the phoneme-count-based recognition target word selector 204, which selects from the recognition target words 202 those whose pronunciations have a similar number of phonemes and passes them to the edit distance measurer 206, which operates under a phoneme string length constraint.

The edit distance measurer 206 measures the edit distance between the phoneme string and the pronunciation of each selected recognition target word 202 and outputs the N words with the smallest edit distance as the result of lexical decoding. Without any restriction, finding the edit distance between the phoneme recognition result and each word would require applying dynamic programming as in Equation 4 to every word. Here the overall search time is reduced in two ways: the recognition target word selector 204 limits the words to be searched according to the number of phonemes in the decoding result, and the search space is also constrained when Equation 4 is applied to compute the edit distance for those words.

FIG. 4 is a diagram illustrating lexical decoding according to phoneme length, according to a preferred embodiment of the present invention.

Referring to FIG. 4, suppose the phoneme decoding result for the user's utterance is the nine-phoneme string "g o r jv g E b a xl" (400). If the edit distance is measured between this string and each of the two words "고려그릇물류센터" (404) and "고려건업" (402), "고려건업" (402) is likely to have the smaller edit distance. The reason is that "고려건업" (402) consists of 9 phonemes, which is closer to the length of the input string than the 19 phonemes of "고려그릇물류센터" (404), so the edit distance contributed by insertions is relatively small. Therefore, given the number of phonemes in the decoding result, it suffices to search only those recognition target words whose phoneme count differs from that of the input string by no more than a preset number (delta). The N words with the smallest edit distance can then be found without computing the edit distance for every word.
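The length-based candidate selection can be sketched in Python as below. The lexicon entries and their pronunciations are hypothetical placeholders (they are not the pronunciations of the words in FIG. 4), and a plain unit-cost edit distance stands in for the confusion-weighted one.

```python
def edit_distance(a, b):
    """Plain unit-cost edit distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, start=1):
        curr = [i]
        for j, pb in enumerate(b, start=1):
            curr.append(min(prev[j - 1] + (pa != pb),  # match or substitution
                            prev[j] + 1,               # deletion
                            curr[j - 1] + 1))          # insertion
        prev = curr
    return prev[-1]


def select_by_length(lexicon, decoded, delta):
    """Keep only words whose phoneme count is within delta of the decoded string."""
    n = len(decoded)
    return {w: p for w, p in lexicon.items() if abs(len(p) - n) <= delta}


def n_best(lexicon, decoded, delta, n):
    """Return the n closest words among the length-filtered candidates."""
    candidates = select_by_length(lexicon, decoded, delta)
    scored = sorted((edit_distance(p, decoded), w) for w, p in candidates.items())
    return [(w, d) for d, w in scored[:n]]


# Hypothetical lexicon: word labels and phoneme strings are made up.
lexicon = {
    "nine_phoneme_word": ["g", "o", "r", "jv", "g", "v", "n", "v", "b"],
    "nineteen_phoneme_word": ["g", "o", "r", "jv"] + ["x"] * 15,
    "two_phoneme_word": ["g", "o"],
}
decoded = "g o r jv g E b a xl".split()
result = n_best(lexicon, decoded, delta=2, n=3)
```

With `delta=2`, only the nine-phoneme entry survives the filter, so the edit distance is computed once instead of three times. The choice of delta trades speed against the risk of discarding the correct word when the decoder inserts or deletes many phonemes.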

For the words selected by phoneme length, the edit distance between the two phoneme strings is computed by dynamic programming as in Equation 4, and a global constraint is applied in this process. Without such a constraint, computing the edit distance between two phoneme strings of equal length means finding the optimal edit distance by dynamic programming over the entire search space spanned by the two strings, as in FIG. 5.

FIG. 5 is a diagram showing how the search space is restricted when measuring the edit distance according to phoneme length, according to a preferred embodiment of the present invention.

Referring to FIG. 5, if both phoneme strings consist of four phonemes, for example, a full search must visit every node of the 4x4 search space. However, since the phoneme lengths are known, restricting the search space according to the difference in length between the two phoneme strings does not affect recognition performance.

For example, if the two phoneme strings have the same length, it is sufficient to search only the diagonal section (500) drawn with a solid line in FIG. 5. If the numbers of phonemes differ, the search space must be widened around the diagonal in proportion to the difference. The section drawn with a dotted line (502) corresponds to a length difference of 1, and the section drawn with a thick dotted line (504) corresponds to a length difference of 2. By limiting the space to be searched in proportion to the length difference between the two phoneme strings, the search time can be reduced considerably compared with searching the entire space.
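One way to realize this restricted search is a banded dynamic program, sketched below in Python. The band half-width is taken to be the length difference between the two strings plus an optional `margin`; the `margin` parameter and the unit costs are assumptions added for the example. Note that a banded search only considers alignments inside the band, so in general it yields an upper bound on the unrestricted edit distance.

```python
def banded_edit_distance(a, b, margin=0):
    """Unit-cost edit distance restricted to a band around the diagonal.

    The band half-width is the length difference plus an optional margin, so
    equal-length strings with margin=0 are compared along the diagonal only.
    Returns the distance and the number of cells evaluated.
    """
    X, Y = len(a), len(b)
    band = abs(X - Y) + margin
    INF = float("inf")
    Q = [[INF] * (Y + 1) for _ in range(X + 1)]
    Q[0][0] = 0
    visited = 0
    for x in range(X + 1):
        # Only cells with |x - y| <= band are evaluated; the rest stay at INF.
        for y in range(max(0, x - band), min(Y, x + band) + 1):
            if x == 0 and y == 0:
                continue
            visited += 1
            best = INF
            if x > 0 and y > 0:
                best = min(best, Q[x - 1][y - 1] + (a[x - 1] != b[y - 1]))
            if x > 0:
                best = min(best, Q[x - 1][y] + 1)  # deletion
            if y > 0:
                best = min(best, Q[x][y - 1] + 1)  # insertion
            Q[x][y] = best
    return Q[X][Y], visited


same_length = banded_edit_distance(["g", "o", "r", "jv"], ["g", "o", "r", "E"])
longer = banded_edit_distance(["g", "o", "r", "jv"], ["g", "o", "r", "jv", "b", "a"])
```

For two four-phoneme strings the band collapses to the diagonal, so only 4 cells are evaluated instead of the full grid; a length difference of 2 widens the band by two cells on each side.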

Another way to improve lexical decoding speed is to perform lexical decoding in two stages.

FIG. 3 is a block diagram showing a two-stage lexical decoding structure based on word clustering, according to a preferred embodiment of the present invention.

Referring to FIG. 3, the edit-distance-based word clustering unit 306 in the lexical decoding unit 104 groups highly similar words among all recognition target words 304 into clusters, and the word cluster representative word selection unit 308 selects a word to represent each cluster. The similarity measurement unit 300, which compares the phoneme string with the cluster representative words, then measures the similarity between the input phoneme string and each representative word and selects the N best clusters. The similarity measurement unit 302, which compares the phoneme string with all words in a cluster, then performs lexical decoding on all words contained in the selected clusters. Performing lexical decoding in these two stages improves the overall lexical decoding speed.

In this approach, N clusters representing all recognition target words must first be generated and a representative word selected for each. The N clusters are obtained from the full word list through the cluster splitting process shown in FIG. 6.

FIG. 6 is a flowchart showing a word clustering algorithm based on the edit distance between words according to phoneme length, according to a preferred embodiment of the present invention.

FIG. 6 shows the operating procedure of the word clustering unit 306 and the representative word selection unit 308. In step 600, all recognition target words are received. In step 602, the edit distances between the recognition target words are measured. In step 604, each word is assigned to the cluster to which it has the minimum edit distance. In step 606, the total edit distance is measured and compared with a threshold. If the total edit distance is higher than the threshold, the procedure moves to step 608, splits the cluster with the highest total edit distance, and returns to step 602 to measure the edit distances to the clusters. Words are then reassigned to clusters, and the procedure repeats; clustering ends only when the measured total edit distance falls below the threshold.
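The loop in FIG. 6 leaves several details open, so the following Python sketch fills them in with explicit assumptions: each cluster is represented by one of its words, the first representative is the word with the smallest summed distance to all others, and a cluster is split by promoting its farthest member to a new representative. The lexicon is hypothetical, and the sketch stops when the total distance is at or below the threshold.

```python
def edit_distance(a, b):
    """Plain unit-cost edit distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, start=1):
        curr = [i]
        for j, pb in enumerate(b, start=1):
            curr.append(min(prev[j - 1] + (pa != pb), prev[j] + 1, curr[j - 1] + 1))
        prev = curr
    return prev[-1]


def cluster_words(lexicon, threshold):
    """Split clusters until the total word-to-representative distance <= threshold."""
    words = sorted(lexicon)

    def dist(w, v):
        return edit_distance(lexicon[w], lexicon[v])

    # Assumed initialization: start from the word closest to all others.
    reps = [min(words, key=lambda w: sum(dist(w, v) for v in words))]
    while True:
        # Assign each word to the representative at minimum edit distance.
        clusters = {r: [] for r in reps}
        for w in words:
            clusters[min(reps, key=lambda r: dist(w, r))].append(w)
        # Measure the total edit distance per cluster and overall.
        cost = {r: sum(dist(w, r) for w in ws) for r, ws in clusters.items()}
        if sum(cost.values()) <= threshold or len(reps) >= len(words):
            return clusters
        # Split the worst cluster: its farthest member becomes a new representative.
        worst = max(reps, key=lambda r: cost[r])
        reps.append(max(clusters[worst], key=lambda w: dist(w, worst)))


# Hypothetical lexicon with two obvious groups.
lexicon = {
    "w1": ["g", "o", "r", "jv"],
    "w2": ["g", "o", "r", "E"],
    "w3": ["b", "a", "xl"],
    "w4": ["b", "a", "n"],
}
clusters = cluster_words(lexicon, threshold=2)
groups = sorted(sorted(members) for members in clusters.values())
```

On this toy lexicon a single cluster has a total distance of 9, so it is split once, after which the two clusters each contribute a distance of 1 and the loop stops. The returned dictionary maps each representative word to its member words.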

The process of obtaining N representative clusters for all words is called quantization of the recognition target words. In lexical decoding of the quantized vocabulary, the similarity measurement unit 300 first measures the similarity between the input phoneme string and the representative word of each cluster and selects the N clusters with the smallest similarity (distance) values. For the selected clusters, the similarity measurement unit 302 then performs lexical decoding again on all words constituting those clusters and outputs the final lexical decoding result.

Because this two-stage lexical decoding first finds clusters, which are sets of words, and then finds the word within those clusters, a substantial speed-up can be expected compared with the conventional approach of searching all words at once.
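The two-stage search can be sketched in Python as follows, assuming the clusters are already available as a mapping from each representative word to its members. The lexicon, the cluster contents, and the unit-cost edit distance used as the similarity measure are assumptions made for the example.

```python
def edit_distance(a, b):
    """Plain unit-cost edit distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, start=1):
        curr = [i]
        for j, pb in enumerate(b, start=1):
            curr.append(min(prev[j - 1] + (pa != pb), prev[j] + 1, curr[j - 1] + 1))
        prev = curr
    return prev[-1]


def two_stage_decode(decoded, clusters, lexicon, n_clusters, n_words):
    """clusters maps each representative word to the list of its member words."""
    # Stage 1: rank clusters by the distance between the decoded string
    # and each cluster's representative word.
    ranked = sorted(clusters, key=lambda r: edit_distance(lexicon[r], decoded))
    # Stage 2: score only the members of the best n_clusters clusters.
    members = [w for r in ranked[:n_clusters] for w in clusters[r]]
    scored = sorted((edit_distance(lexicon[w], decoded), w) for w in members)
    return [w for _, w in scored[:n_words]]


# Hypothetical lexicon and precomputed clusters.
lexicon = {
    "w1": ["g", "o", "r", "jv"],
    "w2": ["g", "o", "r", "E"],
    "w3": ["b", "a", "xl"],
    "w4": ["b", "a", "n"],
}
clusters = {"w1": ["w1", "w2"], "w3": ["w3", "w4"]}
best = two_stage_decode(["b", "a", "n"], clusters, lexicon, n_clusters=1, n_words=1)
```

Here stage 1 compares the input with two representatives and stage 2 with two members, so the words in the rejected cluster are never scored. With a large vocabulary the saving grows with the number of words per cluster, at the risk of missing a word whose cluster was not among the N best.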

As described above, the present invention improves lexical decoding speed by using word clustering and input phoneme length information. It either restricts the words searched during lexical decoding based on the length of the phoneme string output by phoneme decoding and restricts the edit distance computation to a region determined by the difference in the number of phonemes, or partitions all recognition target words into clusters using an edit-distance-based clustering technique, first measures the similarity between the input phoneme string and the word representing each cluster to select the N-best clusters, and then performs lexical decoding again on only the words belonging to those clusters.

Although specific embodiments have been described in the detailed description of the present invention, various modifications are possible without departing from the scope of the invention. The scope of the invention should therefore not be limited to the described embodiments but should be defined by the claims below and their equivalents.

FIG. 1 is a block diagram showing the structure of a general user speech recognition system;

FIG. 2 is a block diagram showing a lexical decoding structure that reduces the search space using the length information of the phoneme string, according to a preferred embodiment of the present invention;

FIG. 3 is a block diagram showing a two-stage lexical decoding structure based on word clustering, according to a preferred embodiment of the present invention;

FIG. 4 is a diagram illustrating lexical decoding according to phoneme length, according to a preferred embodiment of the present invention;

FIG. 5 is a diagram showing how the search space is restricted when measuring the edit distance according to phoneme length, according to a preferred embodiment of the present invention; and

FIG. 6 is a flowchart showing a word clustering algorithm based on the edit distance between words according to phoneme length, according to a preferred embodiment of the present invention.

Description of reference numerals for the main parts of the drawings

100: user speech recognition system

102: phoneme decoding unit

104: lexical decoding unit

106: acoustic rescoring unit

200: phoneme count detector

202, 304: recognition target words

204: recognition target word selector based on the number of phonemes

206: edit distance measurer based on a phoneme string length constraint

300: similarity measurement unit between the phoneme string and the cluster representative words

302: similarity measurement unit between the phoneme string and all words in a cluster

306: word clustering unit based on edit distance

308: word cluster representative word selection unit

Claims

A lexical decoding method comprising:

detecting the length of a phoneme string output by phoneme decoding of an input speech signal;

selecting, based on the detected length of the phoneme string, recognition target words whose length is similar to that of the phoneme string;

measuring the edit distance between the phoneme string and the selected recognition target words; and

outputting at least one recognition target word whose edit distance to the phoneme string is minimal, as determined by the edit distance measurement.

The method of claim 1,

wherein the phoneme decoding obtains the sequence of phoneme models that yields the maximum likelihood for the input speech signal.

The method of claim 1,

further comprising determining the space to be searched according to the number of phonemes in the phoneme string when the edit distance measurement is performed.

A lexical decoding method comprising:

generating clusters by clustering all recognition target words so that words that are highly similar in terms of edit distance are grouped together;

selecting a word representing each generated cluster;

measuring a first similarity between the selected words and a phoneme string output by phoneme decoding of a speech signal, and selecting the cluster of the word having the best similarity;

measuring a second similarity between the phoneme string and only the words constituting the selected cluster; and

outputting the best word as determined by the similarity measurement.

The method of claim 4,

wherein the phoneme decoding obtains the sequence of phoneme models that yields the maximum likelihood for the input speech signal.

A lexical decoding apparatus comprising:

a detection unit that detects the length of a phoneme string output by phoneme decoding of an input speech signal;

a word selection unit that selects recognition target words whose length is similar to the detected length of the phoneme string; and

an edit distance measurement unit that measures the edit distance between the phoneme string and the selected recognition target words and outputs at least one recognition target word whose edit distance to the phoneme string is minimal.

The apparatus of claim 6,

The phoneme decoding is,

The apparatus of claim 6,

The edit distance measurement unit,

Lexical decoding apparatus further comprising.

A lexical decoding apparatus comprising:

a clustering unit that generates clusters by clustering all recognition target words so that words that are highly similar in terms of edit distance are grouped together;

a representative word selection unit that selects a word representing each generated cluster;

a first similarity measurement unit that measures the similarity between the selected words and a phoneme string output by phoneme decoding of a speech signal, and selects the cluster of the word having the best similarity; and

a second similarity measurement unit that measures the similarity between the phoneme string and only the words constituting the selected cluster, and outputs the best word as determined by the similarity measurement.

The apparatus of claim 9,

wherein the phoneme decoding obtains the sequence of phoneme models that yields the maximum likelihood for the input speech signal.