KR100950971B1

KR100950971B1 - Cognitive neuro computational method of automatic lexical acquisition

Info

Publication number: KR100950971B1
Application number: KR1020080059384A
Authority: KR
Inventors: 임희석
Original assignee: 한신대학교 산학협력단
Priority date: 2008-06-24
Filing date: 2008-06-24
Publication date: 2010-04-01
Also published as: KR20100000047A

Abstract

An object of the present invention is to provide a computational method of automatic lexical acquisition that reflects the principles of cognitive-neuroscientific language processing.

The cognitive-neuroscience-based automatic lexical acquisition method of the present invention comprises: a word acquisition function that stores a word (eojeol) extracted from a corpus in a word dictionary when its appearance frequency or its recency frequency exceeds the corresponding first or second threshold; and a morpheme acquisition function that stores the head string and tail string of a word extracted from the corpus in a morpheme dictionary when their frequencies exceed the corresponding third and fourth thresholds, respectively.

Mental lexicon, mental lexicon dictionary, electronic dictionary, cognition, neuroscience, natural language processing

Description

COGNITIVE NEURO COMPUTATIONAL METHOD OF AUTOMATIC LEXICAL ACQUISITION

The present invention relates to a cognitive-neuroscience-based automatic lexical acquisition method and, more particularly, to a computational automatic lexical acquisition method that reflects the principles of human cognitive-neuroscientific language processing.

Cognitive neuroscience is the field of research that explains the mechanisms and principles of human cognitive functions such as language, memory, learning, and attention in terms of the neurological and structural principles of the brain. In other words, it is interdisciplinary research combining cognitive science, which seeks to reveal the basis and principles of human cognitive function, with neuroscience, which studies the structure, function, and development of the nervous system.

Research in cognitive neuroscience, and its importance, has recently been growing with rising intellectual interest in the principles of human cognition and the increasing need for brain science in the development of intelligent robots. To characterize information processing in the brain, to understand and model the brain, and to apply the results to intelligent engineering systems, multidisciplinary research is needed that combines not only cognitive science and neuroscience but also neuropsychology, computer science, electronic engineering, statistics, and physics.

The necessity and importance of cognitive neural computation are as follows.

First, analyzing and simulating cognitive-neuroscience-based models helps verify, and deepen the understanding of, existing theories obtained through behavioral experiments on human subjects.

Second, manipulating the variables and changing the parameters of computational models and systems makes it possible to run predictive experiments that are difficult to carry out as behavioral experiments with human subjects.

Third, by deliberately damaging a computational model and then observing and analyzing the errors the damaged model produces, lesion studies can be carried out that are impossible with human subjects.

Meanwhile, the need to develop an automatic lexical acquisition model that reflects the principles of human language processing is as follows.

First, understanding the lexical acquisition process is important for understanding human lexical processing, and a computational model can provide a simulation environment for the lexical acquisition process in the brain.

Second, lexical acquisition and lexical representation, both of which play an important role in understanding language processing, are closely related, so research on lexical acquisition is also very important for research on mental lexicon representation.

Third, a computational automatic lexical acquisition model makes it possible to verify the human lexical acquisition process, to run predictive experiments, and to conduct in-depth research through lesion studies.

Fourth, research on automatic lexical acquisition can contribute to the development of the lexical processing component of automatic language processing systems for intelligent robots. Over the past several decades, the fields of artificial intelligence and natural language processing have pursued continuous research toward automated language processing systems, and as a result systems have been developed that can perform semantic analysis restricted to a limited grammar within a given domain.

However, current natural language processing technology remains very weak compared with human language processing, and the development of a human-like intelligent language processing system is still a distant prospect. To improve the performance of language processing systems, both rule-based and statistical approaches have pursued improvements only from a computational standpoint, such as efficient language processing algorithms and electronic dictionary structures, and it has been pointed out that such efforts can improve system performance only up to a limit.

Therefore, to overcome the limitations of existing language processing systems and to develop a natural language processing system that can emulate human language processing ability, it is necessary to identify the principles of human language processing and to develop a cognitive-neuroscience-based natural language processing system that incorporates them.

Preferably, the word acquisition function comprises: extracting character strings from the corpus in word units; registering the appearance frequency information and recency information of each extracted string in a database; comparing the appearance frequency of the extracted string with an appearance frequency threshold and its recency frequency with a recency threshold; and registering the extracted string in the word dictionary if its appearance frequency is higher than the appearance frequency threshold or its recency frequency is higher than the recency threshold.

Preferably, the morpheme acquisition function comprises: dividing a word input from the corpus into a head string and a tail string at a syllable boundary of the word; extracting the appearance frequency and co-occurring string information of the head string and the tail string; comparing the frequency of the head string or the tail string with the corresponding threshold; if the frequency of the head string or the tail string exceeds the corresponding threshold, calculating the occurrence entropy up to the last syllable of the head string and the occurrence entropy up to the first syllable of the tail string; and, if the entropy at the last syllable of the head string is determined to have increased over the entropy at the immediately preceding syllable and the entropy at the first syllable of the tail string to have increased over the entropy at the immediately following syllable, designating the head string as a head morpheme and the tail string as a tail morpheme and storing them in the morpheme dictionary.

According to the present invention, given only a corpus of the language to be learned, the combined (whole-word) forms and the morphemes of that language can be acquired automatically.

In an experiment according to the present invention using a Korean corpus of 10 million words (eojeols), 2,097 words whose frequencies together account for 38.63% of the whole corpus were learned, and 3,488 morphemes were extracted with an accuracy of 99.87%.

According to the present invention, a computational automatic lexical acquisition method that reflects the principles of cognitive-neuroscientific language processing can be provided.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings. Before that, the terms and words used in this specification and the claims should not be construed as limited to their ordinary or dictionary meanings; rather, on the principle that an inventor may appropriately define the concepts of terms in order to describe the invention in the best way, they should be interpreted with the meanings and concepts consistent with the technical idea of the present invention. Accordingly, the embodiments described in this specification and the configurations shown in the drawings are merely the most preferred embodiment of the present invention and do not represent all of its technical idea, so it should be understood that various equivalents and modifications capable of replacing them may exist at the time of filing.

FIG. 1 is a block diagram of the automatic lexical acquisition model according to the cognitive-neuroscience-based automatic lexical acquisition method of the present invention.

The automatic lexical acquisition model according to the method of the present invention is a hybrid model that takes a corpus as input and builds both the word dictionary of a combined (whole-word) model and the morpheme dictionary of a decomposition model. The corpus fed to the model for lexical learning models the language that a person hears and reads over the course of daily life.

The word acquisition model acquires words on two principles: acquisition based on frequency information and acquisition based on recency information. In language use, high-frequency words show a frequency effect: they are recognized faster than low-frequency words. One reason for this effect is that such a word is recognized as a single unit, rather than being decomposed so that each morpheme is recognized first and the word is then recognized as their combination. Storing the whole word in the dictionary is also justified on grounds of recognition efficiency: the higher a word's frequency, the more efficient it is for language understanding to look the whole word up in the dictionary and recognize it quickly. Acquiring high-frequency words by storing them in the word dictionary (Fdic) reflects this principle. The recency principle states that a word encountered intensively and repeatedly during a particular period is stored in the word dictionary (Fdic) in its combined form, because a word that has received relatively strong stimulation and input tends to be understood as a combined form rather than by decomposition.

The word acquisition model uses two parameters: the combined-form frequency threshold (ft) and the recency threshold (rt). Each threshold may be set differently according to the characteristics of the language being acquired.

To this end, the automatic lexical acquisition model according to the method of the present invention comprises a corpus (100), a word acquisition model (200), a morpheme acquisition model (300), a word dictionary (400), and a morpheme dictionary (500).

When the corpus (100) is input, the words delimited by spaces are fed to the word acquisition model (200) and the morpheme acquisition model (300).

The word acquisition model (200) comprises a word frequency information extraction module (210), a word frequency information DB (220), a word threshold (230), a word frequency threshold verification module (240), a word recency information extraction module (250), a word recency information DB (260), a recency threshold (270), and a word-frequency recency verification module (280), and stores words with high frequency or high recency in the word dictionary.

The morpheme acquisition model (300) comprises a morpheme candidate extraction module (310), a morpheme candidate appearance information DB (320), an entropy threshold (330), and a morpheme candidate verification module (340), and stores the extracted morphemes in the morpheme dictionary.

FIG. 2 is a flowchart of word acquisition according to the word acquisition model of the present invention.

In the word acquisition model (200) according to the present invention, the word frequency information extraction module (210) extracts words input from the corpus (100) and stores them in the word frequency information DB (220), and the word frequency threshold verification module (240) treats any word whose frequency in the word frequency information DB (220) is higher than the appearance frequency threshold (ft, 230) as a high-frequency word and stores it in the word dictionary (400).

The word acquisition model (200) also checks whether an extracted word has appeared repeatedly within a span of words of a particular size and, if so, stores it in the word dictionary. The word recency information extraction module (250) extracts words input from the corpus (100) and stores them in the word recency information DB (260), and the word-frequency recency verification module (280) treats any word whose recency frequency in the word recency information DB (260) is higher than the recency threshold (rt, 270) as a high-recency word and stores it in the word dictionary (400).

That is, a string is extracted from the corpus in word units (S210), and the appearance frequency information and recency information of the extracted string are registered in the DB (S220). If the appearance frequency of the extracted string is higher than the appearance frequency threshold (ft, 230), or its recency frequency is higher than the recency threshold (rt, 270) (S230), the string is registered in the word dictionary (S240). If neither condition holds (S230), the process returns to step S210.
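Steps S210 to S240 can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function name `acquire_words` and the `window` parameter are hypothetical, and the "recency frequency" is assumed here to be the number of occurrences of a word among the most recent `window` words, which is one reading of "appeared repeatedly within a span of words of a particular size".

```python
from collections import Counter, deque

def acquire_words(eojeols, ft, rt, window):
    """Return the set of words stored in the word dictionary.

    ft: appearance frequency threshold, rt: recency threshold,
    window: assumed size (in words) of the recency span (hypothetical).
    """
    total = Counter()    # appearance frequency over the corpus read so far
    recent = Counter()   # frequency inside the sliding recency window
    last = deque()       # the most recent `window` words
    word_dic = set()
    for e in eojeols:                          # S210: extract one word
        total[e] += 1                          # S220: register frequency info
        last.append(e)
        recent[e] += 1                         # S220: register recency info
        if len(last) > window:                 # forget the word leaving the window
            recent[last.popleft()] -= 1
        if total[e] > ft or recent[e] > rt:    # S230: compare with thresholds
            word_dic.add(e)                    # S240: register in the dictionary
    return word_dic
```

With `ft=2`, a word enters the dictionary on its third occurrence anywhere in the corpus; with `rt=1` and `window=2`, a word enters as soon as it occurs twice within two consecutive positions, even if it is rare overall.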

According to the present invention, a word is divided into two strings: the morpheme or morpheme sequence formed by the front string is defined as the head morpheme, and the morpheme or morpheme sequence formed by the rear string is defined as the tail morpheme. Strings that appear frequently in the words of the corpus are detected, verified as possible morphemes, and stored in the morpheme dictionary, Mdic. Whether a particular string can be a morpheme is verified using the entropy of the syllable that follows it (successor entropy) and the entropy of the syllable that precedes it.

Morpheme verification of a particular string using entropy is explained below by example.

FIG. 3 is a flowchart of morpheme acquisition according to the morpheme acquisition model of the present invention.

In the morpheme acquisition model (300) of the present invention, an input word is split into a head string and a tail string at every possible syllable boundary (S310), and the appearance frequency and co-occurring string information of each head string and tail string are extracted (S320).

The frequency of each extracted head string or tail string is compared with the threshold (S330); if it exceeds the threshold, the occurrence entropy up to the last syllable of the head string (forward entropy) and the occurrence entropy up to the first syllable of the tail string (backward entropy) are calculated (S340).

If the entropy at the last syllable of the head string has increased over the entropy at the preceding syllable, and the entropy at the first syllable of the tail string has increased over the entropy at the following syllable (S350), the head string is stored in the morpheme dictionary (500) as a head morpheme and the tail string as a tail morpheme (S360). That is, when the changes in the calculated forward and backward entropy values indicate a likely morpheme boundary, the head string and tail string are extracted as a head morpheme and a tail morpheme, respectively, and stored in the morpheme dictionary (500).
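The splitting and counting steps (S310 to S330) can be illustrated with a short Python sketch. The function names and the two threshold parameters are hypothetical, and the input is assumed to be a mapping from each word to its corpus frequency, with each Hangul syllable stored as one precomposed character.

```python
from collections import Counter

def head_tail_splits(eojeol):
    """S310: split a word at every internal syllable boundary."""
    return [(eojeol[:i], eojeol[i:]) for i in range(1, len(eojeol))]

def frequent_candidates(word_freq, head_threshold, tail_threshold):
    """S320-S330: count head and tail strings, keep those above threshold."""
    heads, tails = Counter(), Counter()
    for word, freq in word_freq.items():
        for head, tail in head_tail_splits(word):
            heads[head] += freq
            tails[tail] += freq
    frequent_heads = {s for s, n in heads.items() if n > head_threshold}
    frequent_tails = {s for s, n in tails.items() if n > tail_threshold}
    return frequent_heads, frequent_tails
```

For example, `head_tail_splits("아버지가")` yields the pairs 아/버지가, 아버/지가, and 아버지/가. Note that in this sketch a whole word is never counted as its own head or tail, because the split requires both parts to be non-empty.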

Table 1 is an example showing the words in the corpus that begin with the string "아버지" ("father") and the appearance frequency of each word.

Word | Frequency
아버지께서도 | 8
아버지만 | 4
아버지께도 | 4
아버지 | 16
아버지라도 | 8
Total | 40

In the words of Table 1, the strings "아", "아버", and "아버지" each appear 40 times, far more often than the other substrings, so all three are candidates for a head morpheme. However, a morpheme by nature combines with other morphemes to form words, so if a string is a head morpheme, a variety of tail morphemes should be able to follow it. For example, the only string that can follow "아버" is "지", whereas the string "아버지", which can be a head morpheme, can be followed by a variety of tail morphemes.

The present invention therefore uses the occurrence entropy of the syllable that follows a string to examine whether a particular high-frequency string is likely to be a head morpheme. The entropy of the syllables that can appear after a particular string is defined as in Equation 1.

Here, p_i denotes the probability that the syllable C_i appears after the string S.
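Equation 1 itself is not reproduced in this text. Assuming it is the standard Shannon entropy of the distribution of successor syllables (an assumption; the logarithm base is not stated in the source), it would take the following form:

```latex
\mathrm{Entropy}(S) = -\sum_{i} p_i \log p_i ,
\qquad
p_i = P(C_i \mid S)
```

Under this form, a string followed by a single syllable in every occurrence has entropy 0, and the entropy grows as the following syllables become more varied and more evenly distributed.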

The higher the value of Entropy(S), the more likely the string S is a morpheme. For the example of Table 1, FIG. 4 plots the entropy of "아", "아버", "아버지", and "아버지는" along the high-frequency string "아버지".

FIG. 4 is an entropy graph of forward strings according to an embodiment of the present invention. The entropy value decreases from Entropy(아) to Entropy(아버) and then rises at the string "아버지", which can be a head morpheme. In general, the entropy value falls along strings that are still in the middle of forming a morpheme and rises at the position where a morpheme is completed.
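The forward entropy can be computed for the words of Table 1 with a few lines of Python. This is a sketch under stated assumptions: the entropy is taken to be Shannon entropy in bits (base-2 logarithm), the function name `successor_entropy` is hypothetical, and the end of a word is not counted as a successor.

```python
import math
from collections import Counter, defaultdict

def successor_entropy(word_freq):
    """Map each proper prefix S to the entropy of the syllable following it."""
    succ = defaultdict(Counter)            # prefix -> counts of the next syllable
    for word, freq in word_freq.items():
        for i in range(1, len(word)):      # only prefixes that have a successor
            succ[word[:i]][word[i]] += freq
    entropy = {}
    for prefix, nxt in succ.items():
        total = sum(nxt.values())
        entropy[prefix] = sum(-(c / total) * math.log2(c / total)
                              for c in nxt.values())
    return entropy

TABLE_1 = {"아버지께서도": 8, "아버지만": 4, "아버지께도": 4,
           "아버지": 16, "아버지라도": 8}
h = successor_entropy(TABLE_1)
```

On Table 1 alone, "아" and "아버" each have a single successor, so their entropy is 0, while "아버지" is followed by three different syllables (께, 만, 라) and has an entropy of about 1.46 bits. The drop from "아" to "아버" described for FIG. 4 would only appear on a full corpus, where "아" is followed by many different syllables.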

The present invention finds the string at the position where the entropy value rises, takes it as a head morpheme candidate, and then examines how likely the candidate is to be a head morpheme. A string at which entropy rises cannot be accepted as a head morpheme immediately, because different morphemes may share the same initial string. For example, in Table 2 the substring "아" appears repeatedly, and the syllables that follow it are varied: "이", "가", "기", "침", and so on.

Word | Frequency
아이스크림도 | 4
아가도 | 8
아기를 | 8
아침에 | 8
아이스크림을 | 4

As a result, the entropy of "아" is calculated to be high, and "아" could be wrongly registered as a head morpheme. If "아" were a head morpheme, then "이스크림도", "가도", "기를", "침에", and "이스크림을" would have to be tail morphemes; if these have no likelihood of being tail morphemes, the substring "아" cannot be a head morpheme either.

According to the present invention, the likelihood that a candidate is a head morpheme is examined by checking whether the remainder of the word, after the head morpheme candidate is removed, is likely to be a tail morpheme. The likelihood of a tail morpheme is measured with entropy in the same way as for head morpheme candidates: using the frequencies of the strings built backward from the end of the word, the entropy of the syllable that can precede each string is calculated. For example, for the string "아버지" to be a head morpheme in "아버지께서도", "께서도" must be likely to be a tail morpheme. Accordingly, the entropy of the syllable that appears next after the reversed string "도서께" should be high.

FIG. 5 is an entropy graph of backward strings according to an embodiment of the present invention. As in FIG. 4, the entropy value rises at the position of the string that forms a tail morpheme.
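The tail check can be illustrated on the words of Table 2. The Python sketch below assumes Shannon entropy in bits, and the helper names are hypothetical. It shows a head candidate "아" with high forward entropy whose leftover tails have no backward support.

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy (bits) of a Counter of observed syllables."""
    total = sum(counts.values())
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())

def forward_entropy(word_freq, head):
    """Entropy of the syllable that follows `head`."""
    nxt = Counter()
    for word, freq in word_freq.items():
        if word.startswith(head) and len(word) > len(head):
            nxt[word[len(head)]] += freq
    return entropy(nxt) if nxt else 0.0

def backward_entropy(word_freq, tail):
    """Entropy of the syllable that precedes `tail`."""
    prev = Counter()
    for word, freq in word_freq.items():
        if word.endswith(tail) and len(word) > len(tail):
            prev[word[-len(tail) - 1]] += freq
    return entropy(prev) if prev else 0.0

TABLE_2 = {"아이스크림도": 4, "아가도": 8, "아기를": 8,
           "아침에": 8, "아이스크림을": 4}
head = "아"
head_score = forward_entropy(TABLE_2, head)
tail_scores = {w[len(head):]: backward_entropy(TABLE_2, w[len(head):])
               for w in TABLE_2}
```

Here `head_score` is 2.0 bits, because four different syllables follow "아" equally often, yet every leftover tail ("이스크림도", "가도", "기를", "침에", "이스크림을") is preceded only by "아" and scores 0. Under the rule described above, "아" would therefore be rejected as a head morpheme.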

The morpheme acquisition algorithm using the forward-string entropy and backward-string entropy described above can be expressed in pseudocode as follows.

// input: a corpus (a period of language experience) consisting of words (E)
// a word E is a string of n+1 syllables (C): E = C_0 C_1 C_2 ... C_n
// S_i = the substring C_0 C_1 ... C_i
// output: Morpheme-Dic (Mdic)

/* forward entropy increase check */
set tmp_dic to NULL    // tmp_dic is a table storing the cumulative frequency and entropy of temporary morphemes
// SUCC_k(S_i) is a syllable C_k that appeared after S_i
// tmp_dic is a table storing {S_i, SUCC_k(S_i), frequency of SUCC_k(S_i)}
split the input word into syllables (C_0, C_1, C_2, ..., C_n);    // e.g. 아/버/지/가
for each i = 0 ~ (n-1)
{
    build the string S_i from the syllables C_0 + C_1 + C_2 + ... + C_i;    // e.g. 아, 아버, 아버지
    build the string S_j from the syllables C_(i+1) + C_(i+2) + ... + C_n;    // e.g. 버지가, 지가, 가
    if (S_i is registered in tmp_dic) then increase the frequency of S_i in tmp_dic by 1;    // e.g. 아, 아버, 아버지
    else register S_i in tmp_dic;
    if (SUCC_k(S_i) is registered in tmp_dic) then increase the frequency of SUCC_k(S_i) by 1;    // e.g. 아/버지가, 아버/지가, 아버지/가
    else register SUCC_k(S_i) in tmp_dic;

    for each j = 0 ~ (n-1)
    {
        Forward_E(S_j) = Entropy(S_j) as defined in Equation 1;
        // the sum runs over k = 1 ~ m, where m is the number of SUCC_k(S_j) in tmp_dic
    }
}
for each i = 1 ~ (number of syllables - 1)
{
    if (entropy of S_(i-1) < entropy of S_i) temporarily store S_i (H_i);
}

/* Check for entropy increase in the reverse direction */

23. Set Rtmp_dic to NULL; // Rtmp_dic is a table that stores the cumulative frequency and entropy value of temporary morphemes

// RSUCC_k(RS_i) is the RC_k that appears after RS_i

// Rtmp_dic is a table that stores {RS_i, RSUCC_k(RS_i), frequency of RSUCC_k(RS_i)}

24. Split the input word into syllables in reverse order (RC_0, RC_1, RC_2 ... RC_n); // ex) 가/지/버/아

25. for each i = 0 ~ (n-1)

26. {

27. Rebuild the string (S_i) from the characters RC_0+RC_1+RC_2+...+RC_i; // ex) 가, 가지, 가지버

28. Rebuild the string (S_j) from the characters RC_(i+1)+RC_(i+2)+...+RC_n; // ex) 지버아, 버아, 아

29. if (RS_i is registered in Rtmp_dic) then increase the frequency of S_i in Rtmp_dic by 1;

30. // ex) 가, 가지, 가지버

31. else register S_i in Rtmp_dic;

32. if (RSUCC_k(S_i) is registered in Rtmp_dic) then increase the frequency of RSUCC_k(S_i) by 1;

// ex) 가/지버아, 가지/버아, 가지버/아

33. else register RSUCC_k(S_i) in Rtmp_dic;

34.

35. for each j = 0 ~ (n-1)

36. {

37. Forward_E(RS_j) = -\sum_{k=1}^{Rm} p_k \log p_k;

38. // p_k is the probability that RSUCC_k(S_j) appears after RS_j; Rm is the number of RSUCC_k(S_j) entries in Rtmp_dic

39. }

40. }

41. for each i = 1 ~ (number of characters - 1)

42. {

43. if (entropy of RK_(i-1) < entropy of RK_i) temporarily store RS_i (RH_i);

44. }
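The reverse pass in steps 23 to 44 mirrors the forward pass on the reversed syllable sequence. The sketch below is illustrative only. It assumes base-2 Shannon entropy and one Unicode character per syllable, and it builds the table from a given word list. The function name `reverse_candidates` and its parameters are hypothetical.

```python
import math
from collections import Counter, defaultdict


def reverse_candidates(word, corpus):
    """Return the tail strings of `word` at which the reverse-direction entropy rises."""
    # Reversed prefix RS_i -> counts of the next syllable in reversed order.
    table = defaultdict(Counter)
    for w in corpus:
        r = w[::-1]
        for i in range(1, len(r)):
            table[r[:i]][r[i]] += 1

    def entropy(counter):
        total = sum(counter.values())
        return -sum((c / total) * math.log2(c / total) for c in counter.values())

    r = word[::-1]
    # ents[i] is the entropy after the reversed prefix of length i + 1.
    ents = [entropy(table[r[:i]]) for i in range(1, len(r))]
    # Flip each selected reversed prefix back into reading order.
    return [r[:i + 1][::-1] for i in range(1, len(ents)) if ents[i - 1] < ents[i]]
```

For the hypothetical word list 학교에서는, 집에서는, 회사에서는, the call `reverse_candidates("학교에서는", corpus)` returns the single tail string 에서는, since that is the point where the preceding syllable stops being predictable.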

As described above, although the present invention has been described with reference to limited embodiments and drawings, it is not limited to them. Those of ordinary skill in the art to which the present invention pertains may make various modifications and variations within the technical idea of the invention and the scope of equivalents of the claims set forth below.

FIG. 1 is a block diagram of an automatic vocabulary acquisition model according to the cognitive neural-based automatic vocabulary acquisition method of the present invention;

FIG. 2 is a flowchart of word acquisition according to the word acquisition model of the present invention;

FIG. 3 is a flowchart of morpheme acquisition according to the morpheme acquisition model of the present invention;

FIG. 4 is an entropy graph of a forward string according to an embodiment of the present invention; and

FIG. 5 is an entropy graph of a reverse string according to an embodiment of the present invention.

* Description of the main reference numerals in the drawings *

100: Corpus  200: Word acquisition model

300: Morpheme acquisition model  400: Word dictionary

500: Morpheme dictionary

Claims

(Deleted)

A cognitive neural-based automatic vocabulary acquisition method comprising:

a word acquisition function for storing a word extracted from a corpus in a word dictionary if its occurrence frequency or recency frequency exceeds the corresponding first or second threshold value; and

a morpheme acquisition function for storing the head string and the tail string of a word extracted from the corpus in a morpheme dictionary if their frequencies exceed the corresponding third and fourth threshold values, respectively,

wherein the word acquisition function comprises: extracting a character string from the corpus in word units; registering occurrence frequency information and recency information of the extracted string in a database; comparing the occurrence frequency of the extracted string with an occurrence frequency threshold, and comparing its recency frequency with a recency threshold; and registering the extracted string in the word dictionary if its occurrence frequency is higher than the occurrence frequency threshold or its recency frequency is higher than the recency threshold.
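The word acquisition steps recited above could be sketched as follows. The claim does not define how the recency frequency is measured, so this sketch assumes it is the number of times a word occurs within a sliding window of the most recent tokens. The function name `acquire_words`, the window size, and the threshold defaults are all hypothetical values chosen for illustration.

```python
from collections import Counter, deque


def acquire_words(tokens, freq_threshold=3, recency_threshold=2, window=5):
    """Return the set of words admitted to the word dictionary."""
    freq = Counter()                 # cumulative occurrence frequency per word
    recent = deque(maxlen=window)    # assumed recency model: the latest `window` tokens
    dictionary = set()
    for tok in tokens:
        freq[tok] += 1
        recent.append(tok)
        recency = recent.count(tok)
        # A word is registered when either measure exceeds its threshold.
        if freq[tok] > freq_threshold or recency > recency_threshold:
            dictionary.add(tok)
    return dictionary
```

Under these assumptions a word can enter the dictionary either because it is seen often over the whole corpus or because it is seen several times in quick succession, which corresponds to the two alternative conditions in the claim.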

The method of claim 2, wherein the morpheme acquisition function comprises:

dividing a word input from the corpus into a head string and a tail string at a syllable boundary of the word;

extracting the occurrence frequency and co-occurring string information of the head string and the tail string;

comparing the frequency of the head string or the tail string with the corresponding threshold value;

calculating the occurrence entropy up to the last syllable of the head string and the occurrence entropy up to the first syllable of the tail string when the frequency of the head string or the tail string exceeds the corresponding threshold value; and

storing the head string as a head morpheme and the tail string as a tail morpheme in the morpheme dictionary if the entropy at the last syllable of the head string is higher than the entropy at the preceding syllable and the entropy at the first syllable of the tail string is higher than the entropy at the immediately following syllable.

The method of claim 3, wherein the entropy satisfies the following equation:

E(S) = -\sum_{i} p_i \log p_i

where p_i is the probability that the syllable C_i appears after the string S.
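As a worked illustration, assume the entropy is the usual Shannon entropy with a base-2 logarithm, and suppose a string S is followed by three different syllables with hypothetical counts 2, 1, and 1 (the symbol E(S) and the counts are assumptions made for this example only):

```latex
\begin{aligned}
p_1 &= \tfrac{2}{4}, \quad p_2 = \tfrac{1}{4}, \quad p_3 = \tfrac{1}{4} \\
E(S) &= -\sum_{i=1}^{3} p_i \log_2 p_i
      = \tfrac{1}{2}\cdot 1 + \tfrac{1}{4}\cdot 2 + \tfrac{1}{4}\cdot 2
      = 1.5 \text{ bits}
\end{aligned}
```

A string that is always followed by the same syllable would give E(S) = 0, so a rise in this value from one syllable to the next indicates a point where the continuation becomes less predictable.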