KR101126186B1

KR101126186B1 - Apparatus and Method for disambiguation of morphologically ambiguous Korean verbs, and Recording medium thereof

Info

Publication number: KR101126186B1
Application number: KR1020100086449A
Authority: KR
Inventors: 김선호; 윤준태; 박석; 서정연
Original assignee: 서강대학교산학협력단
Priority date: 2010-09-03
Filing date: 2010-09-03
Publication date: 2012-03-22
Also published as: KR20120023387A

Abstract

Disclosed are an apparatus and a method for analyzing morphologically ambiguous verbs, and a recording medium thereof. An apparatus for analyzing morphologically ambiguous verbs according to an embodiment of the present invention includes: a basic data collection unit that collects, from an unlabeled raw corpus, training data comprising unambiguous first verb conjugations whose base forms are known and the context features related to the first verb conjugations; a training unit that learns a classifier between base forms and contexts on the basis of the training data; a verb analysis unit that extracts context features around a second verb conjugation in the corpus whose base form is ambiguous and identifies its base form using the learned classifier; a hard example determination unit that identifies examples that are difficult to decide with the classifier derived from the existing training data; a communication unit that collects web counts for deciding the base forms of the hard examples; and an additional training unit that adds the hard examples and retrains the classifier. According to embodiments of the present invention, the accuracy and efficiency of text analysis can be improved without manual annotation of the corpus, and the entire text analysis process, including database construction, can be automated.

Description

Apparatus and Method for disambiguation of morphologically ambiguous Korean verbs, and Recording medium

The present invention relates to morphological analysis and tagging of text, and more particularly, to an apparatus and a method for analyzing morphologically ambiguous verbs, and a recording medium thereof.

Machine learning methods based on probabilistic models that use corpora have recently been applied successfully to natural language processing problems. For example, statistical part-of-speech tagging systems show an accuracy of about 97% regardless of the language. However, even an error rate of 3% cannot be overlooked in real applications that require high-performance, high-accuracy language processing. Most of these errors cannot be resolved with the limited context and word information considered inside existing tagging systems, and even when rule-based post-processing is performed, it is difficult to improve performance because the rules themselves are generalized from error-prone cases.

Among these morphological analysis errors, errors caused by verb-verb morphological ambiguity are recognized as the most difficult and important problem in Korean morphological analysis and tagging. This ambiguity arises from the morphological changes that occur when a verb is conjugated. Conjugation refers to verb inflection, that is, attaching endings in various forms to a word in order to mark its grammatical function within a sentence. In Korean, morphological changes frequently occur when a stem and an ending are combined.

In general, natural language processing applications such as search, translation, and mining use morphological analysis and tagging, which find the base form and the part of speech of each word in a sentence or text, as their most basic process. Verb-verb morphological ambiguity means that, when the base form of a word is to be recovered from a conjugated form appearing in the text, more than one base form is possible. For example, when the word (eojeol) "파는" appears, its base form may be either "팔다" (to sell) or "파다" (to dig), and the correct base form is determined by the surrounding context of the sentence or the topic of the document.

The problem of finding the correct base form of verbs that cause such morphological ambiguity is hereinafter referred to as verb-verb morphological disambiguation, abbreviated VVMD, and verb-verb morphological ambiguity is abbreviated VVMA.

In addition, most verbs that cause morphological ambiguity when conjugated are also polysemous in terms of sense and therefore appear frequently in text, so handling them correctly has a large effect on actual language processing performance.

The first technical object of the present invention is to provide a verb analysis apparatus that can improve morphological analysis performance by automatically analyzing morphologically ambiguous verbs using an ordinary unlabeled corpus and web counts, so that the training corpus required by the machine learning technique applied to base-form recognition is obtained without manual classification or annotation of class (label) information.

The second technical object of the present invention is to provide a machine learning method for effective verb analysis that is applied to the above apparatus for analyzing morphologically ambiguous verbs.

The third technical object of the present invention is to provide a computer-readable recording medium on which a program for executing the above method for analyzing morphologically ambiguous verbs on a computer is recorded.

To achieve the first technical object, a Korean verb analysis apparatus according to an embodiment of the present invention finds the base form of a verb using surrounding context information. The context data used for learning base forms is extracted from examples of conjugations from which a single base form can be derived without ambiguity, together with their surrounding contexts. The apparatus maps the ambiguous problem onto an unambiguous problem in the same domain: because the class of a verb, that is, its base form, is known in advance for conjugated forms that are not ambiguous, even an unlabeled raw corpus can be used as if it were a training corpus with class information already attached.

To achieve the second technical object, a method for analyzing morphologically ambiguous verbs according to an embodiment of the present invention includes: defining a set V of verb base forms; defining a set E_A of verb conjugations (eojeols) that are morphologically ambiguous when conjugated and a set E_u of conjugations (eojeols) of the same base forms that are not ambiguous; a basic data collection step of extracting the sentences related to the examples in E_u to build training data D_train; implementing, according to a machine learning method, a classifier that assigns a base-form class using the surrounding context that appears with each base form in the collected data as learning features; and building a test data set D_test from sentences that contain eojeols in E_A.
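The set definitions above can be illustrated with a minimal sketch. It assumes whitespace-separated eojeols, and the contents of `V`, `E_U`, and `E_A` as well as the sample sentences are toy values built from the "파는" example in this description, not data from the patent; `split_corpus` is a hypothetical helper name.

```python
# Toy sets built from the "파는" example; a real system would cover all target verbs.
V = {"팔다", "파다"}                              # base forms
E_U = {"팔고": "팔다", "파고": "파다"}            # unambiguous forms -> single base form
E_A = {"파는": {"팔다", "파다"}}                  # ambiguous forms -> candidate base forms


def split_corpus(sentences):
    """Split raw sentences into D_train (labeled for free) and D_test (to be resolved)."""
    d_train, d_test = [], []
    for sentence in sentences:
        tokens = sentence.split()
        for i, token in enumerate(tokens):
            if token in E_U:
                # The base form is known, so the sentence acts as a labeled example.
                d_train.append((tokens, i, E_U[token]))
            elif token in E_A:
                # The base form is ambiguous; it is left for the classifier.
                d_test.append((tokens, i))
    return d_train, d_test
```

Calling `split_corpus(["사과를 팔고 있다", "땅을 파는 사람"])` would place the first sentence in `d_train` with the label "팔다" and the second in `d_test`.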

A method for analyzing morphologically ambiguous verbs according to another embodiment of the present invention further includes: a selective sampling step of extracting, using the classifier derived from the training data collected so far, a set D_hard of hard examples in the test data that are difficult to classify; and a step of obtaining web counts to extract the information needed for D_hard, adding D_hard to D_train, and retraining the classifier.

According to embodiments of the present invention, a verb analysis method that achieves the effect of supervised learning without manual annotation of the corpus is provided, so that the accuracy and efficiency of text analysis can be improved. In addition, when the base-form classifier derived from the data learned so far is applied to new data, only the hard examples, that is, those that differ from the data already learned or for which the classifier's decision is uncertain, are added to the existing training set. By combining this incremental learning, which gradually grows the training set, with the use of web data, the entire verb analysis process can be automated.

FIG. 1A is a diagram of an apparatus for analyzing morphologically ambiguous verbs according to an embodiment of the present invention.
FIG. 1B is a diagram of an apparatus for analyzing morphologically ambiguous verbs according to another embodiment of the present invention.
FIG. 2 is a flowchart of a method for analyzing morphologically ambiguous verbs according to an embodiment of the present invention.

Hereinafter, preferred embodiments of the present invention are described with reference to the drawings. However, the embodiments illustrated below may be modified in various other forms, and the scope of the present invention is not limited to the embodiments described below.

The conjugated form of a verb varies with tense, aspect, and other grammatical categories, and the variation is most noticeable in the verb ending. Table 1 shows verb forms with various endings.

Word | Morphological analysis
갔다 (went) | 가 (verb) + 었 (past-tense prefinal ending) + 다 (sentence-final ending)
가시었다 (went, honorific) | 가 (verb) + 시었 (honorific past-tense prefinal ending) + 다 (sentence-final ending)
간다 (goes) | 가 (verb) + ㄴ (present-tense prefinal ending) + 다 (sentence-final ending)
가는 (going, adnominal) | 가 (verb) + 는 (adnominal transformative ending)
가기 (going, nominal) | 가 (verb) + 기 (nominalizing transformative ending)
가고 (go and) | 가 (verb) + 고 (coordinating connective ending)

Table 2 shows examples of conjugated verb forms whose analysis is morphologically ambiguous, their possible morphological analyses, and unambiguous conjugated forms of the corresponding base forms.

The number of base forms that show morphological ambiguity when conjugated is itself small (about 12), but their conjugated forms account for a considerable share of text, and because these verbs are used in many different senses, many of them are important words that appear frequently in Korean text.

Although VVMA is a morphological ambiguity, it is closely related to resolving the sense ambiguity of verbs. In general, within a given context or discourse a word expresses only one sense (Yarowsky 1995), so the words in the surrounding context can provide clues for resolving the ambiguity of the word.

Moreover, not every conjugated verb form is morphologically ambiguous. For example, "파는" is ambiguous, but the conjugated forms "팔고" and "파고" are not: their base forms are "팔다" and "파다", respectively. Accordingly, an embodiment of the present invention extracts the surrounding context of examples whose base form can be derived from an unambiguous conjugated form and uses it to resolve the ambiguous conjugated forms.

In an embodiment of the present invention, the surrounding contexts and base forms of unambiguous conjugated forms are learned as training data from an unlabeled corpus; then the surrounding context of an ambiguous conjugated form is examined, and the base-form class whose learned context is most similar is assigned. In other words, unambiguous conjugated forms can serve as training data for words whose base form is unclear. In this process, the context information that appears with each base form can be collected.

FIG. 1A is a block diagram of an apparatus 100 for analyzing morphologically ambiguous verbs according to an embodiment of the present invention.

The basic data collection unit 120 collects, from the raw corpus 110, training data that includes context features related to conjugated verb forms whose base form is known (hereinafter, "first verb conjugation"). This step produces the basic training data (seed examples). The context features considered are the two eojeols that appear before the eojeol in which the ambiguity occurs, together with the content word and particle information of each of the two eojeols; these are used as a context feature vector and make up the training data 130.
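The context features described above can be sketched as follows. This is a simplified illustration, not the patent's implementation: `PARTICLES` is a tiny hypothetical particle list, suffix stripping stands in for a real morphological analyzer (it would mis-split many words), and the `far`/`near` feature names are chosen here only to avoid committing to a numbering of the two eojeols.

```python
# Toy particle list (assumption), ordered so that longer particles are tried first.
PARTICLES = ("에서", "으로", "을", "를", "가", "로")


def split_particle(eojeol):
    """Split an eojeol into (content word, particle) by naive suffix matching."""
    for particle in PARTICLES:
        if eojeol.endswith(particle) and len(eojeol) > len(particle):
            return eojeol[: -len(particle)], particle
    return eojeol, ""


def context_features(eojeols, verb_index):
    """Describe the two eojeols preceding the verb at verb_index."""
    window = eojeols[max(0, verb_index - 2):verb_index]
    window = [""] * (2 - len(window)) + window  # pad when fewer than two precede
    features = {}
    for name, word in zip(("far", "near"), window):
        content, particle = split_particle(word)
        features[f"word_{name}"] = word
        features[f"content_{name}"] = content
        features[f"josa_{name}"] = particle
    return features
```

For the token list `["점을", "이유로", "들어"]` with `verb_index=2`, this yields the content words "점" and "이유" with the particles "을" and "로".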

The training unit 140 uses a machine learning method based on the training data 130 to obtain the conditional probabilities between context information and verb base forms that are needed to determine each base form. The training unit 140 learns a classifier that determines the base form according to the context information most likely to be used together with that base form. In an embodiment of the present invention, a [...] learning method, which is convenient to retrain, is used.
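The name of the learning method is not legible in this text, so the following sketch substitutes a naive Bayes classifier with add-one smoothing purely as an assumption. It was chosen because it estimates conditional probabilities between context features and base forms and can be retrained by simply updating counts. `ContextClassifier` and the feature strings are hypothetical.

```python
import math
from collections import Counter, defaultdict


class ContextClassifier:
    """Naive Bayes over context features (a stand-in, not the patent's stated method)."""

    def __init__(self):
        self.class_counts = Counter()
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()

    def update(self, features, base_form):
        """Add one labeled example; retraining is just more calls to update()."""
        self.class_counts[base_form] += 1
        for feature in features:
            self.feature_counts[base_form][feature] += 1
            self.vocab.add(feature)

    def predict_proba(self, features):
        """Return {base_form: p(base_form | features)} with add-one smoothing."""
        total = sum(self.class_counts.values())
        log_scores = {}
        for base_form, count in self.class_counts.items():
            denom = sum(self.feature_counts[base_form].values()) + len(self.vocab)
            score = math.log(count / total)
            for feature in features:
                score += math.log((self.feature_counts[base_form][feature] + 1) / denom)
            log_scores[base_form] = score
        top = max(log_scores.values())
        weights = {c: math.exp(s - top) for c, s in log_scores.items()}
        norm = sum(weights.values())
        return {c: w / norm for c, w in weights.items()}
```

After `update(["content=이야기", "josa=를"], "듣")` and `update(["content=제작비", "josa=가"], "들")`, a query with the first feature list favors "듣".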

Table 3 shows examples of the context features in the training data used to distinguish the base forms "듣" and "들".

Content word (word_1) | Content word (word_2) | Word_1 | Word_2 | Case particle (word_1) | Case particle (word_2) | Class (output)
제작비 (production cost) | 많이 (a lot) | 제작비가 | 많이 | 가 (nominative particle) | - | 들
나 (I) | 이야기 (story) | 내 | 이야기를 | - | 를 | 듣
같 (same) | 이야기 (story) | 같은 | 이야기를 | - | 를 | 듣
점 (point) | 이유 (reason) | 점을 | 이유로 | 을 | 로 | 들
급격히 (rapidly) | 줄 (shrink) | 급격히 | 줄어 | - | - | 들

"Class" is the base form that is output when the context features in Table 3 are used.

The verb analysis unit 150 determines the base form of conjugated verb forms in the corpus 110 whose base form is ambiguous (hereinafter, "second verb conjugation"); it uses the classifier learned by the training unit 140 to find the base form of new examples. When arbitrary text is input, the verb analysis unit 150 extracts the vector needed for classification, such as particles, words, and content words, from the surrounding words that precede each verb in the input text.

Determining the base form of an ambiguous verb conjugation plays an important role in interpreting the meaning of text. The verb analysis apparatus 100 of FIG. 1A can therefore be applied to search engines, translators, and the like.

FIG. 1B is a block diagram of an apparatus for analyzing morphologically ambiguous verbs according to another embodiment of the present invention.

The basic data collection unit 120 collects, from the raw corpus 110, the training data 130 including context features related to the first verb conjugation; the training unit 140 learns a context-dependent base-form classifier on the basis of the training data 130; and the verb analysis unit 150 determines the base form of the second verb conjugation in the corpus 110.

In general, the more context information is collected by expanding the training data, the more accurate verb analysis can become. New training data is also often needed when the learning domain changes or new words appear. One way to extract context information in such cases is to use web documents, but web pages are too large to download. Table 4 shows example web counts for the ambiguous word "까는" and for "까고" and "깔고", unambiguous conjugated forms that can be used to resolve it. "Daterange" restricts the results to documents published in a specific period; Table 4 shows the results when the search is limited to documents within a one-month period.

Word | Total count | Daterange count
까고 | 492,000 | 70,500
깔고 | 1,200,000 | 122,000
까는 | 1,020,000 | 156,000
사고 | 37,000,000 | 246,000
살고 | 9,200,000 | 247,000
사는 | 21,500,000 | 248,000

Accordingly, the hard example determination unit 160 uses selective sampling: instead of expanding the training data for every example, it expands the training data only for the examples that the currently trained classifier finds difficult to classify. Another embodiment of the present invention can thereby avoid adding redundant data and prevent the data from growing unnecessarily while still raising the accuracy of base-form prediction.

In the following, the surrounding context of the second verb conjugation is extracted, and the previously trained classifier is used to identify a first base form, predicted with the highest probability, and a second base form, predicted with the second-highest probability. The hard example determination unit 160 calculates the difference between the highest and the second-highest probability values. The hard example determination unit 160 may be configured to decide, on the basis of this confidence value, whether to collect additional training data for the second verb conjugation. The value may be calculated in the form of Equation 1.

Here, p(c|x) is the probability that example x is predicted as category (base form) c, and x is the context feature set described above. The p value here is predicted, by way of example, with a [...] classifier.

In Equation 1, if c_i is the category (base form) closest to x, the context feature vector of the unanalyzed word, and c_j is the category (base form) second closest to x, then the confidence with which the trained classifier classifies that context is the Confidence value of Equation 1.

As Equation 1 shows, the larger the difference between these two probabilities, the greater the classifier's confidence for that data. In an embodiment of the present invention, examples (verbs) classified with low confidence are defined as hard examples, and the training data is expanded around them.
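Equation 1 is not reproduced in this text. Based on the surrounding definitions, the confidence is read here as p(c_i|x) minus p(c_j|x), the gap between the two most probable base forms. The sketch below implements that reading, and the function name `confidence` is an assumption.

```python
def confidence(probabilities):
    """Margin between the highest and second-highest class probabilities.

    probabilities: {base_form: p(base_form | x)} as produced by some classifier.
    """
    ranked = sorted(probabilities.values(), reverse=True)
    if not ranked:
        return 0.0
    if len(ranked) == 1:
        return ranked[0]  # only one candidate, so nothing competes with it
    return ranked[0] - ranked[1]
```

A prediction such as `{"까": 0.55, "깔": 0.45}` has a margin of about 0.1 and would be a candidate hard example, whereas `{"까": 0.95, "깔": 0.05}` has a margin of about 0.9.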

In addition, the hard example determination unit 160 may decide whether to add training data on the basis of the similarity between the training data to be added and the training data collected so far. This is defined as a diversity value and can be calculated as in Equation 2. The diversity value is inversely related to the similarity; when it is taken into account, examples that differ in form from the data learned so far receive a high value and can be judged to be hard examples.

Here, t_i is an example to be added to the training data, and K is a function that computes the similarity between two input examples. If S is the set of examples used in training so far, s_j is the example in the existing training data that is most similar to t_i. The diversity value reflects the similarity between t_i and s_j and thus measures how similar the data about to be added is to the existing training data.
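Equation 2 is likewise not reproduced in this text, so the exact formula is unknown. The sketch below is one possible reading under two labeled assumptions: the similarity function K is taken to be Jaccard similarity over feature sets, and diversity is taken to be one minus the similarity to the most similar existing example s_j.

```python
def jaccard(a, b):
    """Jaccard similarity of two feature collections (assumed choice for K)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def diversity(candidate, training_set, similarity=jaccard):
    """High when the candidate t_i is unlike every example already in S."""
    if not training_set:
        return 1.0  # nothing learned yet, so every candidate is new
    # s_j is the existing example most similar to the candidate.
    return 1.0 - max(similarity(candidate, s) for s in training_set)
```

With `S = [{"프로그램", "을"}, {"면전", "에서"}]`, a candidate identical to the first set scores 0.0, while `{"팬티엄", "에서"}` shares only one feature with the second set and scores about 0.67.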

In another embodiment of the present invention, the examples with the largest diversity value, that is, the examples that differ most from (are least similar to) the existing training data, are defined as hard examples, and the training data is expanded with them. About 3% of the test data is extracted as hard examples by combining the confidence and diversity values.
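How the confidence and diversity values are combined is not stated in this text. As an illustration only, the sketch below ranks examples by diversity minus confidence and keeps the top 3%; both the scoring rule and the name `select_hard_examples` are assumptions.

```python
import math


def select_hard_examples(examples, fraction=0.03):
    """Pick the hardest fraction of examples.

    examples: list of (example_id, confidence, diversity) tuples.
    The score diversity - confidence is an assumed way to combine the two values.
    """
    if not examples:
        return []
    k = max(1, math.ceil(len(examples) * fraction))
    ranked = sorted(examples, key=lambda e: e[2] - e[1], reverse=True)
    return [example_id for example_id, _, _ in ranked[:k]]
```

With 50 scored examples, the function returns the two with low confidence and high diversity.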

Table 5 shows hard examples extracted in the analysis of "까" and "깔", together with their associated context features.

Hard example | Context features | Correct category
보안 프로그램을 까는 | 보안 프로그램 보안 프로그램을 josa1=을 | 깔 (install)
그녀를 면전에서 까고 | 그녀 면전 그녀를 면전에서 josa2=를 josa1=에서 | 까 (slate)
팬티엄 4에서 | 팬티엄 4 팬티엄 4에서 josa1=에서 | 깔 (install)

When an example is judged to be a hard example by the hard example determination unit 160 and training data is to be added for the second verb conjugation, the communication unit 170 submits a phrase that combines the context features around the second verb conjugation with each possible base form as a query to a search engine connected over a network or the Internet. When the communication unit 170 receives the web counts from the search engine, it passes them to the additional training unit 180 so that the base form for the current (hard) example can be found.

Table 6 shows queries built from the context features of hard examples and the web counts returned by the Google search engine for those queries.

The web counts in Table 6 were computed only over documents published in the preceding three months. A query used for web count extraction is a phrase that combines a context feature of the hard example (for example, "프로그램") with an unambiguous conjugated verb form used in that context (for example, "깔고"). That is, conjugated forms such as "깔고", "까고", "까게", and "깔게" in Table 6 serve as examples for analyzing "까" or "깔". The query phrase "프로그램* 깔고", which contains a wildcard (*), can match various phrases such as "프로그램을 깔고", "프로그램만 깔고", and "프로그램만 무조건 깔고". In other words, examples containing conjugated forms such as "깔", "깔게", "깔기만", and "깔고" can be used to train the context of the corresponding base form.
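The query construction described above can be sketched as follows. No real search API is called: `web_count` is a caller-supplied function (hypothetical), and choosing the base form with the largest summed count is an assumption about how the counts are compared.

```python
def build_queries(context_word, unambiguous_forms):
    """Build wildcard phrase queries such as "프로그램* 깔고".

    unambiguous_forms: {base_form: [unambiguous conjugated forms]}.
    """
    return {
        base: [f'"{context_word}* {form}"' for form in forms]
        for base, forms in unambiguous_forms.items()
    }


def pick_base_form(queries, web_count):
    """Sum the web counts per base form and return (best base form, totals).

    web_count: function mapping a query string to a hit count; how it obtains
    the count (search engine, cache, date restriction) is left to the caller.
    """
    totals = {base: sum(web_count(q) for q in qs) for base, qs in queries.items()}
    best = max(totals, key=totals.get)
    return best, totals
```

For the hard example "보안 프로그램을 까는", the queries for "깔" would be `"프로그램* 깔고"` and `"프로그램* 깔게"`, and those for "까" would be `"프로그램* 까고"` and `"프로그램* 까게"`; the base form whose queries return more hits is taken as the label.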

The additional training unit 180 expands the training data for the hard examples. Using the web counts received through the communication unit 170, it generates training data for the second verb conjugation and adds it to the database 130.

The analysis apparatus 100 of FIG. 1B may also include the verb analysis unit 150, as in FIG. 1A.

FIG. 2 is a flowchart of a method for analyzing morphologically ambiguous verbs according to an embodiment of the present invention.

First, training data consisting of the context features related to the conjugated forms of the first verb conjugation is collected from the raw corpus (S210).

Next, the base form of the second verb conjugation in the corpus is determined on the basis of the training data (S220). More specifically, this step (S220) may determine the base form of the second verb conjugation according to the context information most likely to be used together with the second verb conjugation.

Then, the context features used in determining the base form of the second verb conjugation are added as training data for the second verb conjugation (S230).

Finally, when arbitrary text is input, the words preceding each verb in the text are compared with the training data to determine the base form of the verbs in the text (S240). The words used in this step (S240) are the two preceding eojeols, their particles, and their content words, that is, the context words related to each verb. For example, this step (S240) may extract the two nouns located closest before each verb and compare them with the training data to determine the base form of each verb.
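As a rough illustration of steps S210 and S240, the sketch below counts which base forms co-occur with each context word and lets the two preceding words vote. It is a simplification under stated assumptions: raw eojeols are used as context instead of extracted nouns and particles, and the count-based vote stands in for whatever comparison the classifier actually performs. `train` and `analyze` are hypothetical names.

```python
from collections import Counter, defaultdict


def train(examples):
    """S210: count base forms seen with each context word.

    examples: iterable of (context_words, base_form) from unambiguous forms.
    """
    table = defaultdict(Counter)
    for context_words, base_form in examples:
        for word in context_words:
            table[word][base_form] += 1
    return table


def analyze(eojeols, ambiguous, table):
    """S240: return {index: base_form or None} for each ambiguous eojeol.

    ambiguous: {conjugated form: candidate base forms}.
    """
    result = {}
    for i, word in enumerate(eojeols):
        if word not in ambiguous:
            continue
        votes = Counter()
        for context in eojeols[max(0, i - 2):i]:  # the two preceding eojeols
            for base_form, count in table.get(context, {}).items():
                if base_form in ambiguous[word]:
                    votes[base_form] += count
        result[i] = votes.most_common(1)[0][0] if votes else None
    return result
```

Trained on the Table 3 style pairs (`["이야기를"]` with "듣", `["제작비가", "많이"]` with "들"), the input `["내", "이야기를", "들어"]` resolves "들어" to "듣".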

The present invention may be implemented in software. Preferably, a program for executing on a computer the method for analyzing morphologically ambiguous verbs according to embodiments of the present invention may be provided recorded on a computer-readable recording medium. When implemented in software, the constituent means of the present invention are code segments that perform the necessary tasks. The program or code segments may be stored on a processor-readable medium, or transmitted as a computer data signal combined with a carrier wave over a transmission medium or communication network.

The computer-readable recording medium includes every kind of recording device that stores data readable by a computer system. Examples of computer-readable recording devices include ROM, RAM, CD-ROM, DVD±ROM, DVD-RAM, magnetic tape, floppy disks, hard disks, and optical data storage devices. The computer-readable recording medium may also be distributed over computer devices connected by a network, so that computer-readable code is stored and executed in a distributed manner.

Although the present invention has been described with reference to the embodiments shown in the drawings, these are merely exemplary, and those of ordinary skill in the art will understand that various modifications and variations of the embodiments are possible. Such modifications should be regarded as falling within the technical scope of protection of the present invention. Accordingly, the true technical scope of protection of the present invention should be determined by the technical spirit of the appended claims.

Claims

An apparatus for analyzing morphologically ambiguous verbs, comprising:
a basic data collection unit that collects, from an unlabeled raw corpus, training data including a first verb conjugation that is unambiguous and whose base form is identified, together with contextual features related to the first verb conjugation; and
a training unit that extracts contextual features of the two preceding words located closest to a second verb conjugation whose base form is ambiguous, and that learns, based on the extracted contextual features and the contextual features of the training data, a classifier for matching to the base form of the first verb conjugation.

The apparatus of claim 1,
further comprising a verb analysis unit that, when arbitrary text is input, extracts the contextual features preceding each verb in the text and determines the base form of each verb using the classifier.

The apparatus of claim 1,
wherein the training unit
determines the base form of the morphologically ambiguous second verb conjugation based on the conditional probability of the context information associated with each base form.

The apparatus of claim 1,
further comprising a difficult case determination unit that, when the training unit predicts the base form of an ambiguous verb, calculates a classification confidence from the difference between the probability of the first base form, predicted with the highest probability, and the probability of the second base form, predicted with the second-highest probability, calculates a similarity between the experimental data and the training data collected so far, and determines whether to add training data based on the confidence and the similarity.

The apparatus of claim 4, further comprising:
a communication unit that, in order to identify the base form of a difficult case that the difficult case determination unit has decided to add as training data, sends to a search engine a query phrase built from the contextual features of the two preceding words located closest to the difficult case, and receives a web count from the search engine; and
an additional training unit that extracts the base form of the difficult case using the web count, adds it as training data, and retrains the classifier.

A method for analyzing morphologically ambiguous verbs, comprising:
collecting, as training data, contextual features associated with the base form from examples of a first verb conjugation that is unambiguous and whose base form is identified, in an unlabeled raw corpus;
extracting, from the corpus, contextual features of a second verb conjugation, and determining the base form of the second verb conjugation by matching based on the extracted contextual features and the contextual features of the training data; and
adding the contextual features used in determining the base form as training data for extracting the base form of the second verb conjugation.

The method of claim 6,
further comprising, when arbitrary text is input, determining the base form of a verb conjugation in the text using the probability between the base form and the words appearing before the verb conjugation.

The method of claim 7,
wherein determining the base form of the verb conjugation
comprises extracting the content words of the context preceding the verb conjugation and determining the base form from them.

The method of claim 8,
wherein determining the base form of the verb conjugation
comprises determining the base form based on the two words located closest to the verb conjugation.

The method of claim 6,
wherein determining the base form of the second verb conjugation
comprises using a naive Bayes classifier that relies on the conditional probability between context information and the possible base forms of the verb conjugation.

The method of claim 6,
further comprising identifying, for the second verb conjugation, the first base form predicted with the highest probability and the second base form predicted with the second-highest probability, and determining whether to add training data for the second verb conjugation based on the difference between the highest and second-highest probabilities.
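The probability-difference test described in this claim can be sketched as a top-two margin check. The sketch assumes that a small margin marks a difficult case; the function names and the threshold value 0.2 are hypothetical, since the claim does not state a threshold or the direction of the comparison.

```python
from typing import Dict, Tuple


def top_two_margin(probs: Dict[str, float]) -> Tuple[str, str, float]:
    """Return (best base form, runner-up, probability difference)."""
    if len(probs) < 2:
        raise ValueError("need at least two candidate base forms")
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    (first, p1), (second, p2) = ranked[0], ranked[1]
    return first, second, p1 - p2


def is_difficult(probs: Dict[str, float], threshold: float = 0.2) -> bool:
    """Assumption: a margin below `threshold` marks the case as difficult."""
    _, _, margin = top_two_margin(probs)
    return margin < threshold


# Toy probabilities (illustrative only).
toy_probs = {"가다": 0.55, "갈다": 0.45}
toy_flag = is_difficult(toy_probs)
```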

The method of claim 6,
further comprising calculating a similarity between the training data to be added to identify the base form of the second verb conjugation and the training data collected so far, and determining whether to add the training data for the second verb conjugation based on the similarity.

The method of claim 12, further comprising:
sending to a search engine, in order to identify the base form of a difficult case for which adding training data was decided in the step of determining whether to add the training data, a query phrase built from the contextual features of the two preceding words located closest to the difficult case; and
extracting the base form of the difficult case using the web count returned from the search engine and adding it to the training data.
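The web-count step can be pictured with the sketch below. It assumes one query per candidate base form, formed by joining the preceding context words with the candidate, and picks the candidate with the largest count. The query format, the function name `resolve_with_web_counts`, and the injected `count_fn` callable are all hypothetical; no real search-engine API is used, and the counts shown are made-up toy values.

```python
from typing import Callable, Dict, List, Tuple


def resolve_with_web_counts(
    context: List[str],
    candidates: List[str],
    count_fn: Callable[[str], int],
) -> Tuple[str, Dict[str, int]]:
    """Pick the candidate base form whose query phrase has the highest count.

    `count_fn` stands in for a search-engine hit-count lookup; it is injected
    so that any backend (or a stub, as below) can be used.
    """
    counts = {base: count_fn(" ".join([*context, base])) for base in candidates}
    # Sorting first makes tie-breaking deterministic.
    best = max(sorted(counts), key=lambda base: counts[base])
    return best, counts


# Stub counts (made-up values for illustration only).
fake_counts = {"칼을 갈다": 900, "칼을 가다": 12}
best_base, all_counts = resolve_with_web_counts(
    ["칼을"], ["가다", "갈다"], lambda query: fake_counts.get(query, 0)
)
```

The resolved base form and its context could then be appended to the training data before the classifier is retrained.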

A computer-readable recording medium having recorded thereon a program for performing the method of claim 6.