KR101846461B1

KR101846461B1 - Maximum Likelihood-based Automatic Lexicon Generation

Info

Publication number: KR101846461B1
Application number: KR1020160079411A
Authority: KR
Inventors: 김지환; 이동현; 박재현
Original assignee: 서강대학교산학협력단
Priority date: 2016-06-24
Filing date: 2016-06-24
Publication date: 2018-04-06
Also published as: KR20180000980A

Abstract

The present invention relates to an automatic vocabulary generation method comprising: setting the syllables of the language of the training data as an initial vocabulary; performing syllable-level segmentation of the training data; replacing the spaces in the training data with space marks; generating a language model from the initial segmentation result in which the spaces have been replaced with space marks; combining words using the current vocabulary entries included in the language model; and, among the combined words, deriving the word that maximizes the likelihood when it is included in the current vocabulary and adding it to the current vocabulary. The combining step and the deriving-and-adding step are repeated within a limit on the number of vocabulary words. As a result, a vocabulary with a preset number of words can be generated automatically, and a language model can be successfully generated from that vocabulary.

Description

Maximum Likelihood-based Automatic Lexicon Generation

The present invention relates to an automatic vocabulary generation method and, more particularly, to a method and an apparatus that automatically generate a vocabulary using maximum likelihood.

Unlimited-vocabulary speech recognition requires that arbitrary sentences can be produced by combining the words in the vocabulary. Existing vocabulary generation methods have two problems: out-of-lexicon words are not covered by the vocabulary, and morphological-analysis errors and segmentation errors propagate into speech recognition errors.

Korean Patent Laid-Open Publication No. 10-2000-0019150, "Method of Generating a Speech Model"

The first problem to be solved by the present invention is to provide an automatic vocabulary generation method that automatically generates a vocabulary using maximum likelihood.

The second problem to be solved by the present invention is to provide an automatic vocabulary generation apparatus that automatically generates a vocabulary using maximum likelihood.

To solve the first problem, the present invention provides an automatic vocabulary generation method comprising: setting the syllables of the language of the training data as an initial vocabulary; performing syllable-level segmentation of the training data; replacing the spaces in the training data with space marks; generating a language model from the initial segmentation result in which the spaces have been replaced with space marks; combining words using the current vocabulary entries included in the language model; and, among the combined words, deriving the word that maximizes the likelihood when it is included in the current vocabulary and adding it to the current vocabulary, wherein the combining step and the deriving-and-adding step are repeated within a limit on the number of vocabulary words.

According to an embodiment of the present invention, setting the syllables as the initial vocabulary may include replicating each syllable of the language of the training data into a syllable without a space mark, a syllable with a space mark on the left, a syllable with a space mark on the right, and a syllable with space marks on both sides.

According to an embodiment of the present invention, replacing the spaces in the training data with space marks may include replacing each segmented syllable of the training data that has a space on its left or right with the replicated syllable that carries a space mark on the corresponding side.

According to an embodiment of the present invention, generating the language model may include: calculating syllable-unit counts from the current segmentation result; and generating a language model for the current vocabulary using the syllable-unit counts.

According to an embodiment of the present invention, the syllable-unit counts may include at least one of bi-gram counts and tri-gram counts.

According to an embodiment of the present invention, adding the derived word to the current vocabulary may include: updating the segmentation result by adding the derived word to the current segmentation result; and updating the language model from the updated segmentation result.

To solve the second problem, the present invention provides an automatic vocabulary generation apparatus comprising: a processing unit that combines words using the current vocabulary entries, which are set using the syllables of the language of the training data and the syllable-level segmentation of the training data, and that, among the combined words, derives the word that maximizes the likelihood when it is included in the current vocabulary and adds it to the current vocabulary; and a storage unit that stores the training data and a language model composed of the current vocabulary entries, wherein the processing unit repeats the combining and the deriving-and-adding within a limit on the number of vocabulary words.

According to the present invention, because the vocabulary is generated using maximum likelihood, a vocabulary is generated automatically and arbitrary sentences can be produced by combining its words. The invention also solves the problems that out-of-lexicon words are not covered by the vocabulary and that morphological-analysis and space-segmentation errors propagate into speech interface errors.

FIG. 1 is a flowchart of an automatic vocabulary generation method according to an embodiment of the present invention.
FIGS. 2 to 5 are flowcharts of the automatic vocabulary generation method according to embodiments of the present invention.
FIG. 6 is a block diagram of an automatic vocabulary generation apparatus according to an embodiment of the present invention.

Before the detailed description, an outline of the solution to the problem addressed by the present invention, that is, the core of its technical idea, is presented first to aid understanding.

An automatic vocabulary generation method according to an embodiment of the present invention comprises: setting the syllables of the language of the training data as an initial vocabulary; performing syllable-level segmentation of the training data; replacing the spaces in the training data with space marks; generating a language model from the initial segmentation result in which the spaces have been replaced with space marks; combining words using the current vocabulary entries included in the language model; and, among the combined words, deriving the word that maximizes the likelihood when it is included in the current vocabulary and adding it to the current vocabulary. The combining step and the deriving-and-adding step are repeated within a limit on the number of vocabulary words.

Hereinafter, embodiments that a person of ordinary skill in the art to which the present invention pertains can readily carry out are described in detail with reference to the accompanying drawings. These embodiments are intended only to describe the invention more concretely, and it will be apparent to those skilled in the art that the scope of the invention is not limited by them.

The configuration of the invention is described in detail on the basis of preferred embodiments and with reference to the accompanying drawings. The same reference numerals are assigned to the same components even when they appear in different drawings, and components of other drawings may be cited where needed in describing a given drawing. In describing the operating principles of the preferred embodiments, detailed descriptions of known functions or configurations and of other matters are omitted where they could unnecessarily obscure the gist of the invention.

FIG. 1 is a flowchart of an automatic vocabulary generation method according to an embodiment of the present invention, and FIGS. 2 to 5 are flowcharts of the automatic vocabulary generation method according to embodiments of the present invention.

The automatic vocabulary generation method according to an embodiment of the present invention starts from the syllables of the target language and automatically generates a vocabulary by adding words using maximum likelihood. The generated vocabulary takes spaces into account. The detailed process is described below with reference to FIGS. 1 to 6.

In step 110, the syllables of the language of the training data are set as the initial vocabulary.

More specifically, the syllables of the language for which a vocabulary is to be generated, that is, the language of the training data, are set as the initial vocabulary. The method can be applied to various languages and is particularly suited to automatically generating vocabularies for East Asian languages. To generate a Korean vocabulary automatically, the Korean syllables defined in EUC-KR are first set as the initial vocabulary. The set of Korean syllables defined in EUC-KR consists of about 3,900 syllables.

In Korean, the meaning can change depending on whether a space (word spacing) falls on the left or the right of a syllable. Space marks can therefore be used to eliminate errors caused by spacing.

To this end, as in step 210 of FIG. 2, the method may further include replicating each syllable of the language of the training data into four forms: a syllable without a space mark, a syllable with a space mark (underscore) on the left, a syllable with a space mark on the right, and a syllable with space marks on both sides. For example, the syllable [가] is represented as the syllable without a space mark (the original syllable) [가], the syllable with a space mark on the left [_가], the syllable with a space mark on the right [가_], and the syllable with space marks on both sides [_가_]. Marking spaces in this way makes it possible to generate an accurate vocabulary and to use it for recognizing the corresponding speech.
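The four-way replication of step 210 can be illustrated with a minimal sketch. The names `SPACE_MARK`, `space_mark_variants`, and `initial_vocabulary` are hypothetical and do not come from the patent; the only thing taken from the description is the underscore as the space mark.

```python
SPACE_MARK = "_"  # the description uses an underscore as the space mark


def space_mark_variants(syllable):
    """Return the four space-marked copies of one syllable."""
    return [
        syllable,                            # no space mark (original syllable)
        SPACE_MARK + syllable,               # space mark on the left
        syllable + SPACE_MARK,               # space mark on the right
        SPACE_MARK + syllable + SPACE_MARK,  # space marks on both sides
    ]


def initial_vocabulary(syllables):
    """Build an initial vocabulary from an iterable of syllables (assumed helper)."""
    vocabulary = set()
    for syllable in syllables:
        vocabulary.update(space_mark_variants(syllable))
    return vocabulary


print(space_mark_variants("가"))
```

Under this scheme the initial vocabulary holds four entries per syllable of the language.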

After the initial vocabulary is set, syllable-level segmentation of the training data is performed in step 120.

More specifically, to generate a vocabulary automatically by learning from the training data, syllable-level segmentation is performed on the training data, which breaks the training data down into syllable units.

After the training data is segmented, the spaces in the training data are replaced with space marks in step 130.

To replace spaces with space marks, as shown in FIG. 3, in step 310 each segmented syllable of the training data that has a space on its left or right is replaced with the replicated syllable that carries a space mark on the corresponding side.

In the same way that the syllables of the language were replicated with space marks, when a segmented syllable of the training data has a space on its left or right, it is replaced with the replicated syllable that has a space mark on the corresponding side. Rather than using the bare syllable, the space on its left or right is thereby recorded together with the syllable. Once every segmented syllable with an adjacent space has been replaced by its replicated syllable, all spaces in the training data are represented by space marks.
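Steps 120 and 130 can be sketched together as follows. This is an illustration under stated assumptions, not the patent's implementation: the function name `segment_with_space_marks` is hypothetical, each character is assumed to be one precomposed Hangul syllable, and sentence boundaries are treated as spaces (which matches the recognition-result example later in the description, where the first and last tokens carry outer space marks).

```python
SPACE_MARK = "_"  # underscore used as the space mark


def segment_with_space_marks(sentence):
    """Split a sentence into syllable tokens and attach space marks.

    Assumptions: one character equals one syllable, and the start and end
    of the sentence count as spaces.
    """
    tokens = []
    for word in sentence.split():          # words are separated by whitespace
        syllables = list(word)             # syllable-level segmentation
        syllables[0] = SPACE_MARK + syllables[0]      # space on the left
        syllables[-1] = syllables[-1] + SPACE_MARK    # space on the right
        tokens.extend(syllables)
    return tokens


print(segment_with_space_marks("나는 친구가 좋다"))
```

A one-syllable word receives both marks, so it maps to the fourth replicated form of that syllable.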

After the spaces are replaced with space marks, a language model is generated in step 130 from the initial segmentation result in which the spaces have been replaced with space marks.

More specifically, segmentation in which the spaces of the training data have been replaced with space marks is performed, and a language model is then generated from that segmentation result. As shown in FIG. 4, generating the language model may include step 410 of calculating syllable-unit counts from the current segmentation result and step 420 of generating a language model for the current vocabulary using the syllable-unit counts.

First, syllable-unit counts (n-gram counts) are calculated for the initial segmentation result in which spaces have been replaced with space marks, and a language model for the current vocabulary is generated from the current syllable-unit counts. The syllable-unit counts may include at least one of bi-gram counts and tri-gram counts. A bi-gram count counts sequences of two syllables and a tri-gram count counts sequences of three syllables; other n-gram counts may also be used. For accurate vocabulary generation, both bi-gram and tri-gram counts may be used.
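The syllable-unit counts of step 410 can be sketched with the standard library. The helper name `ngram_counts` is hypothetical, and the token list is assumed to be a segmentation result in which spaces have already been replaced with space marks.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens in one segmented sentence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


tokens = ["_나", "는_", "_친", "구", "가_", "_좋", "다_"]
bigram_counts = ngram_counts(tokens, 2)    # runs of two syllables
trigram_counts = ngram_counts(tokens, 3)   # runs of three syllables
print(bigram_counts[("_친", "구")], trigram_counts[("_친", "구", "가_")])
```

For a whole corpus, the per-sentence counters can simply be added together, since `Counter` supports addition.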

In this way, a language model for the current vocabulary can be generated from the syllables and the initial segmentation result of the training data. The following process is then carried out to generate the vocabulary automatically from this language model.

In step 140, words are combined using the current vocabulary entries included in the language model. In step 150, among the combined words, the word that maximizes the likelihood when it is included in the current vocabulary is derived and added to the current vocabulary.

More specifically, all combinations of vocabulary words are generated from the current vocabulary, and among the combined words, the word that maximizes the likelihood when it is included in the current vocabulary is derived.
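The combination step 140 can be sketched in two ways. The first function enumerates every ordered two-word combination of the current vocabulary, which is the literal reading of the description. The second is an assumed shortcut, not stated in the patent, that keeps only combinations whose two parts actually occur side by side in the segmented corpus. Both function names are hypothetical.

```python
from itertools import product


def all_two_word_combinations(vocabulary):
    """Every ordered pair of current vocabulary entries, concatenated."""
    return {left + right for left, right in product(vocabulary, repeat=2)}


def observed_two_word_combinations(segmented_corpus):
    """Assumed shortcut: only pairs that are adjacent somewhere in the corpus."""
    return {
        sentence[i] + sentence[i + 1]
        for sentence in segmented_corpus
        for i in range(len(sentence) - 1)
    }


vocabulary = {"_친", "구", "가_"}
corpus = [["_친", "구", "가_"]]
print(len(all_two_word_combinations(vocabulary)))
print(sorted(observed_two_word_combinations(corpus)))
```

The full enumeration grows with the square of the vocabulary size, which is why a restriction of this kind may be attractive in practice.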

To describe in detail how a word is added using maximum likelihood, first assume that the training data is a corpus of n sentences, $s_1$ to $s_n$. Let $f$ be the language model generated from this corpus. Let $\alpha$ be the set of all possible words that can be formed by combining two words from the current lexicon, and let $\theta$ be an element of $\alpha$. The language model generated from the corpus when a given $\theta$ is included in the lexicon can be defined as

$$f(s_1, \ldots, s_n \mid \theta).$$

Because the sentences are independent and fixed, the language model can be expressed as

$$f(s_1, \ldots, s_n \mid \theta) = \prod_{j=1}^{n} f(s_j \mid \theta).$$

Here $s_1, \ldots, s_n$ are fixed and $\theta$ is variable. The corresponding likelihood $L$ is expressed as

$$L(\theta \mid s_1, \ldots, s_n) = \prod_{j=1}^{n} f(s_j \mid \theta).$$

For computational convenience, the logarithm is taken and the likelihood is expressed as

$$\log L(\theta \mid s_1, \ldots, s_n) = \sum_{j=1}^{n} \log f(s_j \mid \theta).$$

The proposed ML estimation method finds $\hat{\theta}$ as

$$\hat{\theta} = \arg\max_{\theta \in \alpha} \log L(\theta \mid s_1, \ldots, s_n).$$

Once $\hat{\theta}$ is found, $\alpha$ and $f$ are updated. This process is repeated as long as the likelihood increases.

The likelihood can be estimated using tri-grams. When a sentence $s_j$ consists of a sequence of $J$ words, $l_{w_1}, \ldots, l_{w_J}$, the likelihood of $s_j$ is calculated as

$$f(s_j \mid \theta) = \prod_{k=1}^{J} P(l_{w_k} \mid l_{w_{k-2}}, l_{w_{k-1}}).$$

$P(l_{w_k} \mid l_{w_{k-2}}, l_{w_{k-1}})$ is calculated as

$$P(l_{w_k} \mid l_{w_{k-2}}, l_{w_{k-1}}) = \frac{C(l_{w_{k-2}}\, l_{w_{k-1}}\, l_{w_k})}{C(l_{w_{k-2}}\, l_{w_{k-1}})},$$

where $C(\cdot)$ denotes the n-gram count in the current segmentation result.
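The maximum-likelihood selection can be sketched as below. This is an illustration under assumptions, not the patent's exact computation: tri-gram probabilities are taken as unsmoothed relative frequencies of the counts, sentences are padded with a hypothetical start symbol `BOS`, and the caller is assumed to supply, for each candidate word θ, the corpus re-segmented with θ in the lexicon. The function names are hypothetical.

```python
import math
from collections import Counter

BOS = "<s>"  # hypothetical sentence-start padding symbol (an assumption)


def trigram_log_likelihood(corpus):
    """Sum over sentences of log f(s_j), with f(s_j) a product of tri-gram probabilities."""
    trigram, history = Counter(), Counter()
    for sentence in corpus:
        padded = [BOS, BOS] + sentence
        for k in range(2, len(padded)):
            trigram[tuple(padded[k - 2:k + 1])] += 1   # C(w_{k-2} w_{k-1} w_k)
            history[tuple(padded[k - 2:k])] += 1       # C(w_{k-2} w_{k-1})
    # Each tri-gram type contributes count * log(relative frequency).
    return sum(c * math.log(c / history[t[:2]]) for t, c in trigram.items())


def argmax_candidate(resegmented):
    """Pick the candidate whose re-segmented corpus has the highest log-likelihood."""
    return max(resegmented, key=lambda theta: trigram_log_likelihood(resegmented[theta]))


base = [["a", "b", "c", "x"], ["d", "b", "c", "y"]]
with_bc = [["a", "bc", "x"], ["d", "bc", "y"]]      # corpus re-segmented with "bc"
print(argmax_candidate({"xq": base, "bc": with_bc}))
```

In this toy corpus, adding "bc" to the lexicon lets the tri-gram context reach back to the first token of each sentence, which removes the ambiguity between "x" and "y" and raises the log-likelihood.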

As described above, the step of combining words and the step of deriving the likelihood-maximizing word and adding it to the current vocabulary are repeated within the limit on the number of vocabulary words, so that a vocabulary of the desired size is generated automatically. Limiting the number of vocabulary words yields exactly as many entries as needed, and repeating while the likelihood increases improves the accuracy of vocabulary generation.

After the derived word is added to the current vocabulary, as shown in FIG. 5, the method may further include step 510 of updating the segmentation result by adding the derived word to the current segmentation result and step 520 of updating the language model from the updated segmentation result. The parts of the current segmentation that coincide with the newly added word are replaced with that word, so that accurate entries are included within the limited number of words. Once the segmentation result is updated in this way, syllable-unit counts are calculated for the updated segmentation result and the language model is updated.
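The repeated combine, select, and update cycle can be sketched as a greedy loop. Several simplifications here are assumptions rather than the patent's method: the scorer is a unigram log-likelihood used as a short stand-in for the tri-gram model, the starting vocabulary is just the set of tokens seen in the corpus, and candidates are limited to adjacent pairs. All function names are hypothetical.

```python
import math
from collections import Counter


def unigram_log_likelihood(corpus):
    """Stand-in scorer (assumption): unigram log-likelihood of the segmented corpus."""
    counts = Counter(token for sentence in corpus for token in sentence)
    total = sum(counts.values())
    return sum(c * math.log(c / total) for c in counts.values())


def merge_pair(sentence, left, right):
    """Update one segmented sentence: replace each adjacent (left, right) with the new word."""
    merged, i = [], 0
    while i < len(sentence):
        if i + 1 < len(sentence) and sentence[i] == left and sentence[i + 1] == right:
            merged.append(left + right)
            i += 2
        else:
            merged.append(sentence[i])
            i += 1
    return merged


def grow_vocabulary(corpus, max_words, score=unigram_log_likelihood):
    """Add one likelihood-maximizing word per round until the size limit is reached."""
    vocabulary = {token for sentence in corpus for token in sentence}
    current = score(corpus)
    while len(vocabulary) < max_words:
        pairs = sorted({(s[i], s[i + 1]) for s in corpus for i in range(len(s) - 1)})
        best = None
        for left, right in pairs:
            updated = [merge_pair(s, left, right) for s in corpus]
            value = score(updated)          # score of the re-segmented corpus
            if best is None or value > best[0]:
                best = (value, left + right, updated)
        if best is None or best[0] <= current:
            break  # no candidate raises the likelihood any further
        current, new_word, corpus = best
        vocabulary.add(new_word)
    return vocabulary, corpus


vocabulary, segmentation = grow_vocabulary([["a", "b"], ["a", "b"], ["a", "b"]], max_words=10)
print(sorted(vocabulary), segmentation)
```

The loop has the two stopping conditions named in the description: the word-count limit and the point at which the likelihood stops increasing. Passing a tri-gram scorer as `score` would bring the sketch closer to the described estimation.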

Through this process, a vocabulary with the number of words specified by the developer can be generated, and a language model can be successfully generated from that vocabulary. Using the resulting vocabulary and language model, a speech recognizer can produce unlimited-vocabulary sentences for East Asian languages. Because the entire process is automatic and requires no prior knowledge (pre-knowledge), it avoids 1) the failure to include out-of-lexicon words in the lexicon and 2) the propagation of morphological-analysis and segmentation errors into speech recognition errors.

The vocabulary generated by this process contains space marks, so the output of a speech recognizer based on it may also contain space marks. Because space marks can reduce readability, post-processing may additionally be performed. A space in the recognition result is produced by a vocabulary word that carries a space mark on its left or right. Where space marks are duplicated, the first space mark comes from the vocabulary word on the left and the second from the vocabulary word on the right. A conservative rule can be applied to handle this: of two consecutive space marks, one is removed so that they become a single space mark. For example, [_나는_ _친구 가_ _좋다_] is converted to the final recognition result [나는 친구가 좋다].
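The post-processing rule can be sketched as follows. The function name `restore_spaces` is hypothetical, and two choices are assumptions that go slightly beyond the rule as stated: any run of space marks (not only a pair) is collapsed into one space, and the space marks at the very start and end of the result are dropped.

```python
import re

SPACE_MARK = "_"  # underscore used as the space mark


def restore_spaces(tokens):
    """Join recognized vocabulary words and turn space marks back into spaces."""
    joined = "".join(tokens)
    # Collapse each run of space marks into a single ordinary space.
    spaced = re.sub(re.escape(SPACE_MARK) + "+", " ", joined)
    return spaced.strip()  # drop the spaces at the sentence boundaries


print(restore_spaces(["_나는_", "_친구", "가_", "_좋다_"]))
```

Here the tokens "_친구" and "가_" have no space mark between them, so they are joined into "친구가" without a space.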

FIG. 6 is a block diagram of an automatic vocabulary generation apparatus according to an embodiment of the present invention.

The automatic vocabulary generation apparatus 600 according to an embodiment of the present invention consists of a processing unit 610 and a storage unit 620, and may further include a communication unit.

The processing unit 610 combines words using the current vocabulary entries, which are set using the syllables of the language of the training data and the syllable-level segmentation of the training data; among the combined words, derives the word that maximizes the likelihood when it is included in the current vocabulary and adds it to the current vocabulary; and repeats the combining and the deriving-and-adding within the limit on the number of vocabulary words.

The processing unit 610 may replace each segmented syllable of the training data that has a space on its left or right with the replicated syllable that carries an underscore mark on the corresponding side, may generate a language model for the current vocabulary using syllable-unit counts calculated from the current segmentation result, and may update the segmentation result by adding the derived word to the current segmentation result.

The process by which the processing unit 610 generates the vocabulary corresponds to the automatic vocabulary generation method of FIGS. 1 to 5, so redundant description is omitted.

The storage unit 620 may store the training data and a language model composed of the current vocabulary entries. The communication unit may receive data on the syllables of the language and the training data from outside, and may transmit the generated vocabulary and language model to the outside.

Embodiments of the present invention may be implemented as program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and constructed for the present invention, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler but also high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules to perform the operations of the present invention, and vice versa.

Although the present invention has been described above with reference to specific matters such as particular components and to limited embodiments and drawings, these are provided only to aid a more general understanding of the invention. The invention is not limited to the above embodiments, and a person of ordinary skill in the art to which the invention pertains can make various modifications and variations from this description.

Accordingly, the spirit of the present invention should not be confined to the described embodiments, and not only the claims below but also everything equal or equivalent to those claims falls within the scope of the spirit of the invention.

600: Automatic vocabulary generation apparatus
610: Processing unit
620: Storage unit

Claims

1. An automatic vocabulary generation method comprising:
setting the syllables of the language of the training data as an initial vocabulary;
performing syllable-level segmentation of the training data;
replacing the spaces in the training data with space marks;
generating a language model from the initial segmentation result in which the spaces have been replaced with space marks;
combining all words that can be generated using the current vocabulary entries included in the language model; and
among the combined words, deriving the word that maximizes the likelihood when it is included in the current vocabulary, and adding it to the current vocabulary,
wherein the combining step and the deriving-and-adding step are repeated within a limit on the number of vocabulary words, and wherein the likelihood is calculated using the product of the language models generated from each of the independent sentences that make up the corpus of the training data.

2. The method of claim 1,
wherein setting the syllables as the initial vocabulary comprises:
replicating each syllable of the language of the training data into a syllable without a space mark, a syllable with a space mark on the left, a syllable with a space mark on the right, and a syllable with space marks on both sides.

3. The method of claim 1,
wherein replacing the spaces in the training data with space marks comprises:
replacing each segmented syllable of the training data that has a space on its left or right with the replicated syllable that carries a space mark on the corresponding side.

4. The method of claim 1,
wherein generating the language model comprises:
calculating syllable-unit counts from the current segmentation result; and
generating a language model for the current vocabulary using the syllable-unit counts.

5. The method of claim 4,
wherein the syllable-unit counts include
at least one of bi-gram counts and tri-gram counts.

6. The method of claim 1,
wherein adding the derived word to the current vocabulary comprises:
updating the segmentation result by adding the derived word to the current segmentation result; and
updating the language model from the updated segmentation result.

7. A computer-readable recording medium storing a program for causing a computer to execute the method according to any one of claims 1 to 6.

8. An automatic vocabulary generation apparatus comprising:
a processing unit that combines all words that can be generated using the current vocabulary entries, which are set using the syllables of the language of the training data and the syllable-level segmentation of the training data, and that, among the combined words, derives the word that maximizes the likelihood when it is included in the current vocabulary and adds it to the current vocabulary; and
a storage unit that stores the training data and a language model composed of the current vocabulary entries,
wherein the processing unit
repeats the combining and the deriving-and-adding within a limit on the number of vocabulary words, and calculates the likelihood using the product of the language models generated from each of the independent sentences that make up the corpus of the training data.

9. The apparatus of claim 8,
wherein the processing unit
replaces each segmented syllable of the training data that has a space on its left or right with the replicated syllable that carries a space mark on the corresponding side.

10. The apparatus of claim 8,
wherein the processing unit
generates the language model for the current vocabulary using syllable-unit counts calculated from the current segmentation result.

11. The apparatus of claim 8,
wherein the processing unit
updates the segmentation result by adding the derived word to the current segmentation result.

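12. The method of claim 1,
wherein the likelihood is calculated,
using the language model generated from a corpus consisting of n independent sentences $s_1, \ldots, s_n$,

$$f(s_1, \ldots, s_n \mid \theta) = \prod_{j=1}^{n} f(s_j \mid \theta)$$

(where $\alpha$ is the set of all possible words and $\theta$ is an element of $\alpha$),

as

$$L(\theta \mid s_1, \ldots, s_n) = \prod_{j=1}^{n} f(s_j \mid \theta).$$

13. The apparatus of claim 8,
wherein the likelihood is calculated,
using the language model generated from a corpus consisting of n independent sentences $s_1, \ldots, s_n$,

$$f(s_1, \ldots, s_n \mid \theta) = \prod_{j=1}^{n} f(s_j \mid \theta)$$

(where $\alpha$ is the set of all possible words and $\theta$ is an element of $\alpha$),

as

$$L(\theta \mid s_1, \ldots, s_n) = \prod_{j=1}^{n} f(s_j \mid \theta).$$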