KR101388569B1

KR101388569B1 - Apparatus and method for adding new proper nouns to language model in a continuous speech recognition system

Info

Publication number: KR101388569B1
Application number: KR1020110079586A
Authority: KR
Inventors: 왕지현
Original assignee: 한국전자통신연구원
Priority date: 2011-08-10
Filing date: 2011-08-10
Publication date: 2014-04-23
Also published as: KR20130017260A

Abstract

The present invention relates to a technique for adding proper nouns to a language model in a continuous speech recognition system. New proper nouns and class words are collected; candidate sentences are selected from a text corpus using the collected words; candidate sentence frames are extracted from the candidate sentences; n-grams of the candidate sentence frames are generated; each candidate sentence frame is scored and ranked using a statistical formula; and the top-ranked sentence frames are applied to the proper nouns, after which the expanded n-grams are reflected in the language model. According to the present invention, new proper nouns that are absent from the language model for speech recognition can be added in the form of n-grams that reflect the varied expressions of sentences, so higher speech recognition performance for proper nouns can be obtained than with a method that simply adds the proper nouns alone.

Description

Apparatus and method for adding proper nouns to a language model in a continuous speech recognition system {APPARATUS AND METHOD FOR ADDING NEW PROPER NOUNS TO LANGUAGE MODEL IN A CONTINUOUS SPEECH RECOGNITION SYSTEM}

The present invention relates to a technique for adding proper nouns to a language model for continuous speech recognition, and more particularly to an apparatus and method for adding proper nouns to a language model in a continuous speech recognition system, suitable for reflecting in the language model independently collected new proper nouns that are present neither in the speech recognition vocabulary dictionary nor in the training corpus for the language model.

Speech recognition methods are generally divided into several types according to the manner of utterance: isolated word recognition, connected word recognition, continuous speech recognition, and keyword spotting. Unlike isolated word recognition, which recognizes individual words, continuous speech recognition searches for the sentence or continuous word sequence that corresponds to the speech signal. As the number of words in the vocabulary dictionary increases, the number of possible word sequences that can form a sentence grows sharply, and, because of pronunciation variation between adjacent words, the probability of misrecognizing a word as a similar-sounding word also increases with the vocabulary size.

A language model in speech recognition is a model built by statistically collecting the connectivity between words from a text corpus, so that a sentence uttered by the user is recognized as a correct sentence. Unigrams (1-grams), bigrams (2-grams), and trigrams (3-grams) are widely used as language models. A unigram uses the probability of the word itself and does not use the preceding words. Bigrams and trigrams use probabilities conditioned on the one and the two immediately preceding words, respectively. Using such a language model ensures that grammatically valid word sequences are recognized and reduces the search space of words and sentences, which improves recognition performance and shortens search time.
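As an illustration of the n-gram probabilities described above, the following is a minimal sketch of a maximum-likelihood bigram estimate. It is not taken from the patent: the function names, the toy English corpus, the whitespace tokenization, and the made-up proper noun "Zorblat" are all assumptions chosen for the example.

```python
from collections import Counter

def train_bigram(sentences):
    """Count unigrams and bigrams over whitespace-tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def bigram_prob(unigrams, bigrams, prev, word):
    """Maximum-likelihood estimate of P(word | prev); 0.0 if prev is unseen."""
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / unigrams[prev]

# Hypothetical toy corpus (assumption, not from the patent).
corpus = ["call the office", "call the hotel", "find the hotel"]
uni, bi = train_bigram(corpus)

# "the" occurs three times and is followed by "hotel" twice.
p_hotel = bigram_prob(uni, bi, "the", "hotel")
# A proper noun that never appears in the corpus has no bigram evidence at all.
p_new = bigram_prob(uni, bi, "the", "Zorblat")
```

The second call shows the problem the document goes on to address: without n-grams that place a new proper noun in context, an unsmoothed model assigns it no probability after any word.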

Because the text corpus used to train the language model does not contain enough proper nouns, recognizing natural-language sentences that contain new, unregistered proper nouns has required either deliberately writing sentences that contain the new proper nouns and adding them to the text corpus, or entering each proper noun repeatedly. The former is very inefficient when many proper nouns are to be added, because sentences covering the various expressions must be written by hand for each proper noun. The latter enters only the proper noun itself in unigram form, so information about the words that appear adjacent to the proper noun within a sentence is not reflected in the language model, and the recognition rate inevitably drops.

Korean Laid-Open Patent Publication No. 10-2006-0092683, 2006.08.23

In the conventional methods of adding proper nouns described above, the n-grams of a proper noun can be extracted from the text corpus for the language model. However, most proper nouns do not appear in the text corpus at all; because the corpus consists largely of arbitrarily collected sentences, the various types of proper nouns do not appear evenly; and the contexts in which the individual proper nouns appear all differ.

Accordingly, an embodiment of the present invention can provide an apparatus and method for adding proper nouns to a language model in a continuous speech recognition system, capable of adding new proper nouns to the language model in n-gram form.

An embodiment of the present invention can also provide an apparatus and method for adding proper nouns to a language model in a continuous speech recognition system, in which general contexts usable with a variety of proper nouns are extracted from a text corpus and used to generate n-grams for the proper nouns.

An embodiment of the present invention can further provide an apparatus and method for adding proper nouns to a language model in a continuous speech recognition system that collect new proper nouns and class words, select candidate sentences from a text corpus using the collected words, extract candidate sentence frames from the candidate sentences, generate n-grams of the candidate sentence frames, score and rank each candidate sentence frame using a statistical formula, apply the top-ranked sentence frames to the proper nouns, and reflect the expanded n-grams in the language model.

An apparatus for adding proper nouns to a language model in a continuous speech recognition system according to an embodiment of the present invention may include: a collection unit that collects new proper nouns and class words; a candidate sentence selection unit that searches a text corpus for the new proper nouns and class words and selects the matching sentences as candidate sentences; a candidate sentence frame extraction unit that extracts candidate sentence frames from the candidate sentences; an n-gram candidate sentence frame generation unit that extracts n-gram-form candidate sentence frames from the candidate sentence frames; a ranking unit that calculates a score for each n-gram-form candidate sentence frame and ranks the frames from the highest score; an n-gram expansion unit that performs n-gram expansion by substituting the collected proper nouns into the ranked n-gram-form candidate sentence frames; and a reflection unit that adds the n-grams generated by the n-gram expansion to the language model together with their frequencies.

The collection unit may collect proper nouns that do not appear in the training text corpus, and may assign, as class words, the categories into which the proper nouns are conceptually divided.

The candidate sentence frame extraction unit may extract, in the form of a sentence frame, a local context of fixed length that includes the proper noun in the candidate sentence.

The ranking unit may, for each n-gram-form candidate sentence frame, measure the suitability of the sentence frame based on the number of occurrences of the frame that contain a proper noun and the number of occurrences of the frame in the entire corpus, and may calculate the entropy of the sentence frame based on the number of occurrences of the frame that contain a proper noun and the relative frequency of the frame.

The n-gram expansion unit may select at least one top-scoring sentence frame and substitute each collected proper noun into the corresponding position of the frame.

A method for adding proper nouns to a language model in a continuous speech recognition system according to an embodiment of the present invention may include: collecting, in an apparatus for adding proper nouns to a language model, new proper nouns and class words; searching a text corpus for the collected new proper nouns and class words and selecting the matching sentences as candidate sentences; extracting candidate sentence frames from the selected candidate sentences; extracting n-gram-form candidate sentence frames from the extracted candidate sentence frames; calculating a score for each extracted n-gram-form candidate sentence frame and ranking the frames from the highest score; performing n-gram expansion by substituting the collected proper nouns into the ranked n-gram-form sentence frames; and adding the n-grams generated by the n-gram expansion to the language model together with their frequencies.

The collecting may collect proper nouns that do not appear in the training text corpus, and may assign, as class words, the categories into which the proper nouns are conceptually divided.

The extracting of the candidate sentence frames may extract, in the form of a sentence frame, a local context of fixed length that includes the proper noun in the candidate sentence.

The ranking may include, for each n-gram-form candidate sentence frame, measuring the suitability of the sentence frame based on the number of occurrences of the frame that contain a proper noun and the number of occurrences of the frame in the entire corpus, and calculating the entropy of the sentence frame based on the number of occurrences of the frame that contain a proper noun and the relative frequency of the frame.

The performing of the n-gram expansion may select at least one top-scoring sentence frame and substitute each collected proper noun into the corresponding position of the frame.

The apparatus and method for adding proper nouns to a language model in a continuous speech recognition system according to the embodiment of the present invention described above provide one or more of the following effects.

According to the apparatus and method, new proper nouns that are absent from the language model for speech recognition can be added in the form of n-grams that reflect the varied expressions of sentences, so higher speech recognition performance for proper nouns can be obtained than with a method that simply adds the proper nouns alone.

FIG. 1 is a block diagram illustrating the structure of an apparatus for adding proper nouns to a language model according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating the operating procedure of the apparatus for adding proper nouns to a language model according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating an example list of new proper nouns and their corresponding class words according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating examples of candidate sentences in which a proper noun or class word was found according to an embodiment of the present invention;
FIG. 5 is a diagram illustrating examples of candidate sentence frames extracted from candidate sentences according to an embodiment of the present invention;
FIG. 6 is a diagram illustrating an example of generating n-gram-form sentence frames from candidate sentence frames according to an embodiment of the present invention;
FIG. 7 is a diagram illustrating an example of performing n-gram expansion by substituting proper nouns into n-grams according to an embodiment of the present invention.

The advantages and features of the present invention, and the methods of achieving them, will become apparent from the embodiments described in detail below together with the accompanying drawings. The present invention is not, however, limited to the embodiments disclosed below and may be implemented in various different forms. The embodiments are provided only so that the disclosure of the present invention is complete and so that the scope of the invention is fully conveyed to those of ordinary skill in the art to which the invention pertains; the invention is defined only by the scope of the claims. Like reference numerals refer to like elements throughout the specification.

In describing the embodiments of the present invention, detailed descriptions of well-known functions or configurations are omitted where they would unnecessarily obscure the subject matter of the invention. The terms used below are defined in consideration of their functions in the embodiments of the present invention and may vary according to the intention or custom of users and operators. Their definitions should therefore be based on the content of this specification as a whole.

Combinations of each block of the accompanying block diagrams and each step of the flowcharts may be performed by computer program instructions. These computer program instructions may be loaded onto a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing equipment, so that the instructions executed by the processor of the computer or other programmable data processing equipment create means for performing the functions described in each block of the block diagram or each step of the flowchart. These computer program instructions may also be stored in a computer-usable or computer-readable memory that can direct a computer or other programmable data processing equipment to implement a function in a particular manner, so that the instructions stored in the computer-usable or computer-readable memory can produce an article of manufacture containing instruction means that perform the functions described in each block of the block diagram or each step of the flowchart. The computer program instructions may also be loaded onto a computer or other programmable data processing equipment, so that a series of operational steps is performed on the computer or other programmable data processing equipment to produce a computer-executed process, whereby the instructions executed on the computer or other programmable data processing equipment provide steps for executing the functions described in each block of the block diagram and each step of the flowchart.

In addition, each block or each step may represent a module, segment, or portion of code that includes one or more executable instructions for performing the specified logical function(s). It should also be noted that in some alternative embodiments the functions mentioned in the blocks or steps may occur out of order. For example, two blocks or steps shown in succession may in fact be performed substantially concurrently, or may sometimes be performed in reverse order, depending on the function involved.

To add proper nouns to a language model in a continuous speech recognition system, an embodiment of the present invention collects new proper nouns and class words, selects candidate sentences from a text corpus using the collected words, extracts candidate sentence frames from the candidate sentences, generates n-grams of the candidate sentence frames, scores and ranks each candidate sentence frame using a statistical formula, applies the top-ranked sentence frames to the proper nouns, and reflects the expanded n-grams in the language model.

Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings.

FIG. 1 is a block diagram illustrating the structure of an apparatus for adding proper nouns to a language model according to an embodiment of the present invention.

Referring to FIG. 1, the apparatus 100 for adding proper nouns to a language model may include a collection unit 102, a candidate sentence selection unit 106, a candidate sentence frame extraction unit 110, an n-gram candidate sentence frame generation unit 112, a ranking unit 114, an n-gram expansion unit 116, a reflection unit 118, and the like.

Specifically, the collection unit 102 collects new proper nouns and class words. As shown in FIG. 3, which illustrates an example list of new proper nouns and their corresponding class words, proper nouns 301 that do not appear in the training text corpus 104 may be collected independently, and the names of the categories into which they are conceptually divided may be compiled as class words 300. Some proper nouns may not be assignable to any class word; in that case the class word may be marked 'X', as indicated by reference numeral 302.

The candidate sentence selection unit 106 may search the text corpus 108 for the proper nouns and class words collected by the collection unit 102 and select the matching sentences as candidate sentences. FIG. 4 shows examples of candidate sentences in which a proper noun or class word was found; the string between '{' and '}' is the matched proper noun or class word.
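The selection step can be sketched as a simple corpus search that marks each match with braces, as in the FIG. 4 notation. This is a minimal sketch under stated assumptions: whitespace tokenization and exact token matching are simplifications (Korean text would normally need morphological analysis to separate particles from nouns), and the function name and the English toy sentences are hypothetical.

```python
def select_candidates(corpus, terms):
    """Return sentences containing any term, with each matched token wrapped in braces."""
    term_set = set(terms)
    candidates = []
    for sentence in corpus:
        tokens = sentence.split()
        if any(tok in term_set for tok in tokens):
            marked = ["{" + tok + "}" if tok in term_set else tok for tok in tokens]
            candidates.append(" ".join(marked))
    return candidates

# Hypothetical corpus and search terms (assumptions for illustration only).
corpus = [
    "book a room at the hotel tonight",
    "what is the weather today",
    "find a restaurant near the hotel",
]
terms = ["hotel", "restaurant"]
candidates = select_candidates(corpus, terms)
```

Sentences without any match are dropped, so only contexts in which a collected word actually occurs go on to the later steps.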

The candidate sentence frame extraction unit 110 extracts candidate sentence frames from the candidate sentences. A 'sentence frame' here means a local context of fixed length that includes the proper noun.

FIG. 5 illustrates examples of candidate sentence frames extracted from candidate sentences according to an embodiment of the present invention. In a candidate sentence 500, up to two words on each side of the matched string may be extracted as the sentence frame. Examples of extracted candidate sentence frames are shown at reference numeral 502, where 'X' marks the position at which the proper noun or class word was matched. Within a sentence frame, '<s>' marks the beginning of the sentence and '</s>' marks the end of the sentence.
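The frame extraction described above (up to two words on each side of the match, with sentence-boundary markers) might be sketched as follows. The function name, the brace-marked input format, and the decision to keep other matches in the window in their surface form are assumptions made for this example, not details confirmed by the source.

```python
def extract_frames(marked_sentence, window=2):
    """Extract one frame per braced match: up to `window` tokens per side, slot shown as 'X'."""
    tokens = ["<s>"] + marked_sentence.split() + ["</s>"]
    frames = []
    for i, tok in enumerate(tokens):
        if tok.startswith("{") and tok.endswith("}"):
            # Neighbouring matches keep their surface form; only the centre becomes the slot.
            left = [t.strip("{}") for t in tokens[max(0, i - window):i]]
            right = [t.strip("{}") for t in tokens[i + 1:i + 1 + window]]
            frames.append(tuple(left + ["X"] + right))
    return frames

# Hypothetical brace-marked candidate sentences (assumptions for illustration only).
frames_mid = extract_frames("find a {restaurant} near the {hotel}")
frames_start = extract_frames("{hotel} is full")
```

When the match sits near a sentence boundary, the window is simply shorter on that side and the `<s>` or `</s>` marker becomes part of the frame.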

The n-gram candidate sentence frame generation unit 112 generates, from the candidate sentence frames extracted by the candidate sentence frame extraction unit 110, n-gram-form candidate sentence frames that always contain 'X'.

FIG. 6 illustrates an example of generating n-gram-form sentence frames from candidate sentence frames according to an embodiment of the present invention. For example, from candidate sentence frames such as 600 or 604 in FIG. 6, unigrams (1-grams), bigrams (2-grams), and trigrams (3-grams) centered on 'X' may be generated, as shown at reference numerals 602 and 606.
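Generating the n-gram-form frames amounts to enumerating every 1-, 2-, and 3-gram of a frame and keeping those that contain the slot. The sketch below assumes frames are tuples of tokens with a single 'X' slot; the function name and example frame are hypothetical.

```python
def ngram_frames(frame, max_n=3):
    """Return all n-grams (1 <= n <= max_n) of a frame that contain the slot 'X'."""
    result = []
    for n in range(1, max_n + 1):
        for start in range(len(frame) - n + 1):
            gram = tuple(frame[start:start + n])
            if "X" in gram:
                result.append(gram)
    return result

# Hypothetical candidate frame (assumption for illustration only).
frame = ("find", "a", "X", "near", "the")
grams = ngram_frames(frame)
```

For this five-token frame the result is one unigram, two bigrams, and three trigrams, each of which includes 'X'.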

The ranking unit 114 calculates a score for each n-gram candidate sentence frame generated by the n-gram candidate sentence frame generation unit 112 and ranks the frames in descending order of score. The score is computed by formulating the following two factors.

First, the suitability of the sentence frame, which measures how suitable the n-gram is as a sentence frame, as given by <Equation 1> below.

Here, t_i denotes the number of occurrences of sentence frame i that contain any proper noun, and m_i denotes the number of occurrences of sentence frame i in the entire corpus.

Second, the generality of the sentence frame, which measures how widely the frame is used across the various types of proper nouns, as given by <Equation 2> below.

Here, T_ij is the number of occurrences of sentence frame i that contain proper noun j, f_ij is the relative frequency of sentence frame i with proper noun j, and G_i is the entropy of sentence frame i (0 ≤ G_i ≤ 1).

Combining the two, the overall score Score_i is given by <Equation 3> below.
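The equations themselves are not reproduced in this text, so the following is only one plausible reading that is consistent with the symbol definitions, not the patent's actual formulas. It assumes suitability is the ratio t_i / m_i, that f_ij = T_ij / Σ_j T_ij, that G_i is the entropy of the f_ij normalized by the logarithm of the number of nouns so that it falls in [0, 1], and that the overall score is the product of the two factors. All function names and counts are hypothetical.

```python
import math

def suitability(t_i, m_i):
    """Assumed form: share of the frame's corpus occurrences that contain a proper noun."""
    return t_i / m_i if m_i else 0.0

def generality(counts_by_noun):
    """Assumed form: entropy of f_ij = T_ij / sum_j T_ij, normalized to the range [0, 1]."""
    total = sum(counts_by_noun.values())
    k = len(counts_by_noun)
    if total == 0 or k < 2:
        return 0.0
    entropy = 0.0
    for count in counts_by_noun.values():
        if count > 0:
            f = count / total
            entropy -= f * math.log(f)
    return entropy / math.log(k)

def score(t_i, m_i, counts_by_noun):
    """Assumed combination: product of suitability and generality."""
    return suitability(t_i, m_i) * generality(counts_by_noun)

# Hypothetical counts per frame: (t_i, m_i, {noun j: T_ij}).
frames = {
    ("near", "the", "X"): (10, 20, {"hotel": 5, "restaurant": 5}),
    ("the", "X", "chain"): (10, 10, {"hotel": 10, "restaurant": 0}),
}
ranked = sorted(frames, key=lambda f: score(*frames[f]), reverse=True)
```

Under these assumptions a frame that co-occurs with only one noun scores zero however often it appears, which matches the stated intent of preferring frames used across many types of proper nouns.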

The n-gram expansion unit 116 may select the N top-ranked sentence frames, substitute each collected proper noun into the 'X' position of each frame, and thereby perform n-gram expansion.

FIG. 7 illustrates an example of performing n-gram expansion by substituting proper nouns into n-grams according to an embodiment of the present invention: each proper noun 702 is substituted into the sentence frame 700, yielding the expansion shown at reference numeral 704.

When the sentence frame is a trigram (3-gram) frame of length 3, such as frame 700, substituting a proper noun produces a trigram of length 3, as shown at reference numeral 704. When the sentence frame is a bigram (2-gram) frame of length 2, such as frame 706, the n-gram produced by substituting a proper noun is likewise a bigram.
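The expansion step is a straightforward substitution of every collected proper noun into the slot of every selected frame. In this minimal sketch the function name, the frames, and the made-up proper nouns are assumptions for illustration.

```python
def expand(frames, proper_nouns):
    """Substitute each proper noun into the 'X' slot of each frame, preserving n-gram length."""
    expanded = []
    for frame in frames:
        for noun in proper_nouns:
            expanded.append(tuple(noun if tok == "X" else tok for tok in frame))
    return expanded

# Hypothetical top-ranked frames and new proper nouns (assumptions for illustration only).
top_frames = [("near", "the", "X"), ("X", "is")]
nouns = ["Zorblat", "Quillon"]
ngrams = expand(top_frames, nouns)
```

Because only the slot token is replaced, a trigram frame yields trigrams and a bigram frame yields bigrams, as the text describes.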

The reflection unit 118 adds the n-grams generated by the n-gram expansion unit 116 to the language model as they are, together with their frequencies.
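If the language model is represented as a table of n-gram counts, reflecting the expanded n-grams could look like the sketch below. The source does not state how the frequency attached to each new n-gram is chosen, so the fixed `frequency` parameter here is purely an assumption, as are the function name and the toy counts.

```python
from collections import Counter

def reflect(lm_counts, expanded_ngrams, frequency=1):
    """Return a copy of the n-gram count table with each expanded n-gram added at `frequency`."""
    updated = Counter(lm_counts)
    for gram in expanded_ngrams:
        updated[gram] += frequency
    return updated

# Hypothetical existing counts and expanded n-grams (assumptions for illustration only).
lm = Counter({("near", "the", "hotel"): 4})
new_lm = reflect(lm, [("near", "the", "Zorblat"), ("Zorblat", "is")], frequency=2)
```

Returning a new table rather than mutating the input keeps the original model available for comparison; an in-place update would work equally well.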

FIG. 2 is a flowchart illustrating the operating procedure of the apparatus for adding proper nouns to a language model according to an embodiment of the present invention.

Referring to FIG. 2, in step 200 the apparatus 100 for adding proper nouns to a language model in a continuous speech recognition system collects new proper nouns and class words through the collection unit 102. In step 202, the candidate sentence selection unit 106 searches the text corpus 108 for the collected proper nouns and class words and selects the matching sentences as candidate sentences.

In step 204, the candidate sentence frame extraction unit 110 extracts candidate sentence frames from the selected candidate sentences, and in step 206 the n-gram candidate sentence frame generation unit 112 generates, from the extracted candidate sentence frames, n-gram-form candidate sentence frames that always contain 'X'.

In step 208, the ranking unit 114 calculates a score for each generated n-gram-form candidate sentence frame, based on the suitability and generality of the frame, and ranks the frames from the highest score.

Then, in step 210, the n-gram expansion unit 116 selects the N top-ranked sentence frames (at least one), substitutes each collected proper noun into the 'X' position of each frame, and performs n-gram expansion. In step 212, the reflection unit 118 reflects the n-grams generated by the n-gram expansion unit 116 in the language model 120 together with their frequencies.

이상 설명한 바와 같이, 본 발명의 실시예에 따른 연속어 음성인식 시스템에서 언어모델의 고유 명사 추가 장치 및 방법은, 연속어 음성인식 시스템에서 언어모델에 고유 명사의 추가를 위해, 신규 고유명사와 분류어를 수집하고, 수집된 단어들을 이용하여 텍스트 코퍼스로부터 후보 문장을 선정하며, 후보 문장으로부터 후보 문틀을 추출하고, 후보 문틀의 엔그램을 생성한 후, 각 후보 문틀을 통계적인 계산식을 이용하여 점수화 및 순위화를 수행하고, 상위의 문틀을 고유명사에 적용한 후 엔그램을 확장하여 언어모델에 반영한다. As described above, the apparatus and method for adding a proper noun of a language model in the continuous speech recognition system according to the embodiment of the present invention are classified into a new proper noun and classification for adding a proper noun to the language model in the continuous speech recognition system. Collect words, select candidate sentences from the text corpus using the collected words, extract candidate door frames from candidate sentences, generate engrams of candidate door frames, and then score each candidate door frame using statistical calculations. And ranking, apply the upper door frame to proper nouns, and expand the engram to reflect in the language model.

Although specific embodiments have been described in the detailed description of the present invention, various modifications are of course possible without departing from the scope of the invention. Therefore, the scope of the present invention should not be limited to the described embodiments, but should be determined by the claims set out below and by their equivalents.

100: apparatus for adding proper nouns to a language model 102: collecting unit
104: training text corpus 106: candidate sentence selection unit
108: text corpus 110: candidate sentence frame extraction unit
112: n-gram candidate sentence frame generation unit 114: ranking unit
116: n-gram expansion unit 118: reflecting unit
120: language model

Claims

An apparatus for adding proper nouns to a language model in a continuous speech recognition system, comprising:
a collecting unit that collects new proper nouns and classification words;
a candidate sentence selection unit that searches a text corpus for the new proper nouns and classification words and selects matching sentences as candidate sentences;
a candidate sentence frame extraction unit that extracts candidate sentence frames from the candidate sentences;
an n-gram candidate sentence frame generation unit that extracts n-gram-form candidate sentence frames from the candidate sentence frames;
a ranking unit that calculates a score for each of the n-gram-form candidate sentence frames and ranks them in descending order of score;
an n-gram expansion unit that performs n-gram expansion by substituting the collected proper nouns into the n-gram-form candidate sentence frames; and
a reflecting unit that adds the n-grams generated by the n-gram expansion to the language model together with their frequencies.

The apparatus of claim 1,
wherein the collecting unit
collects proper nouns that do not appear in the training text corpus, and designates, as the classification word, a category that conceptually classifies each proper noun.

The apparatus of claim 1,
wherein the candidate sentence frame extraction unit
extracts, in the form of a sentence frame, a local context of a predetermined length that includes the proper noun in the candidate sentence.

The apparatus of claim 1,
wherein the ranking unit,
for each n-gram-form candidate sentence frame, measures the suitability of the sentence frame based on the number of sentence frames that include proper nouns and the number of sentence frames in the entire corpus, and
calculates the entropy of the sentence frame based on the number of sentence frames that include proper nouns and the state frequency of the sentence frame.

The apparatus of claim 1,
wherein the n-gram expansion unit
selects at least one top-scoring sentence frame and substitutes each collected proper noun at the corresponding position of the sentence frame.

A method for adding proper nouns to a language model in a continuous speech recognition system, comprising:
collecting new proper nouns and classification words;
searching a text corpus for the collected new proper nouns and classification words and selecting matching sentences as candidate sentences;
extracting candidate sentence frames from the selected candidate sentences;
extracting n-gram-form candidate sentence frames from the extracted candidate sentence frames;
calculating a score for each of the extracted n-gram-form candidate sentence frames and ranking them in descending order of score;
performing n-gram expansion by substituting the collected proper nouns into the ranked n-gram-form sentence frames; and
adding the n-grams generated by the n-gram expansion to the language model together with their frequencies.

The method according to claim 6,
wherein the collecting
comprises collecting proper nouns that do not appear in a training text corpus, and designating, as the classification word, a category that conceptually classifies each proper noun.

The method according to claim 6,
wherein the extracting of the candidate sentence frames
comprises extracting, in the form of a sentence frame, a local context of a predetermined length that includes the proper noun in the candidate sentence.

The method according to claim 6,
wherein the ranking by score comprises:
measuring, for each n-gram-form candidate sentence frame, the suitability of the sentence frame based on the number of sentence frames that include proper nouns and the number of sentence frames in the entire corpus; and
calculating the entropy of the sentence frame based on the number of sentence frames that include proper nouns and the state frequency of the sentence frame.

The method according to claim 6,
wherein the performing of the n-gram expansion
comprises selecting at least one top-scoring sentence frame and substituting each collected proper noun at the corresponding position of the sentence frame.