KR100320348B1

KR100320348B1 - Unregistered word analysis method using syllable normal representation dictionary and morphological analysis method of a sentence having unregistered word

Info

Publication number: KR100320348B1
Application number: KR1019990044751A
Authority: KR
Inventors: 이근배; 이원일; 차정원
Original assignee: 정명식; 학교법인 포항공과대학교
Priority date: 1999-10-15
Filing date: 1999-10-15
Publication date: 2002-01-10
Also published as: KR20010037310A

Abstract

The present invention relates to a natural language processing method, and more particularly to a method for morphologically analyzing an input string that contains unregistered words by using a Korean syllable normalized expression dictionary.

The invention analyzes the syllables of Korean open-class words and converts the syllables of nouns, adverbs, verbs, and adjectives into normalized expressions, so that every morpheme can be represented by a syllable normalized expression and unregistered words can thereby be entered in a dictionary. In particular, because syllable normalized expressions are used to estimate irregular conjugations of verbs and adjectives, the base forms of verbs and adjectives that occur as unregistered words can be restored accurately, and connection information of the same level as that of registered morphemes can be provided. Later processing stages can therefore treat unregistered words like registered words without additional rules, which keeps the system simple while maintaining estimation accuracy.

Description

Unregistered word analysis method using syllable normal representation dictionary and morphological analysis method of a sentence having unregistered word

An unregistered word is a morpheme that is not registered in the morpheme dictionary used to morphologically analyze a string. In Korean, morphemes of parts of speech such as nouns, adverbs, verbs, and adjectives belong to the open class and can occur as unregistered words, whereas morphemes such as particles (조사) and endings (어미) belong to the closed class and do not. In any natural language processing system that handles open-domain documents, unregistered words are varied and frequent, so they are a major cause of morphological analysis failure and a serious obstacle to improving system performance.

For example, a spell checker basically detects errors by finding word phrases (어절) that cannot be morphologically analyzed. Because of unregistered words, however, it wrongly flags correct word phrases as errors.

The present invention was devised to solve the above problem. Its object is to provide an unregistered word analysis method using a syllable normalized expression dictionary that can process unregistered words effectively, so as to increase the stability and reliability of a natural language processing system; a morphological analysis method for sentences containing unregistered words; and computer-readable recording media on which an unregistered word analysis program and a morphological analysis program for sentences containing unregistered words, each implementable on a computer, are recorded.

FIG. 1A shows the normalized expression metasymbols that represent Korean syllables, and FIG. 1B shows examples that use the metasymbols of FIG. 1A.

FIG. 2 illustrates part of a syllable normalized expression dictionary built with syllable normalized expressions.

FIG. 3 is a flowchart of the process of restoring the base form of an input string using the syllable normalized expression dictionary.

FIG. 4 shows the formula for calculating the lexical probability of an unregistered word, which is used in part-of-speech tagging.

FIG. 5 is a flowchart of a morphological analysis process that searches for registered and unregistered words simultaneously.

To achieve the above object, the unregistered word analysis method using a syllable normalized expression dictionary according to the present invention comprises: (a) receiving a string that contains an unregistered word whose base form is to be restored; (b) restoring the base form of the unregistered word by using a syllable normalized expression dictionary, in which all syllables appearing in Korean substantives, predicates, and adverbs are represented as syllable normalized expressions, and generating one or more morpheme candidates with their corresponding connection information; (c) filtering overgenerated unregistered words out of the morpheme candidates by using predetermined heuristic information; and (d) generating a morpheme connection graph by using a Korean connection table to delete, from the filtered morpheme candidates, those that cannot be connected.

To achieve the other object, the method of morphologically analyzing an input string containing an unregistered word according to the present invention comprises: (a) dividing the input string into phonemes; (b) receiving the phonemes of the divided input string in order, extracting candidate morphemes of registered words using a morpheme dictionary, and extracting candidate morphemes of unregistered words using a syllable normalized expression dictionary; (c) removing, by using a Korean connection table, candidate morphemes that cannot be connected; (d) obtaining the lexical probability of each connectable candidate morpheme; and (e) repeating steps (b) to (d) until the last phoneme of the input string has been processed.

The present invention is described in detail below with reference to the accompanying drawings.

FIGS. 1A and 1B show syllable normalized expressions. A syllable normalized expression is introduced as a way of representing the characteristic syllables that appear in the morphemes of each Korean part of speech. In FIG. 1A, 'Z' denotes a consonant, 'V' denotes a vowel, and 'z' denotes any final consonant (받침) other than 'ㄹ'. '*' denotes any phoneme, consonant or vowel. In the examples of FIG. 1B, 'ZV*갈' represents morphemes of any part of speech that end in '갈', 'Zv*Vz' represents morphemes that end in a consonant other than 'ㄹ', and 'ZV*Vㄹ' represents morphemes whose final sound is 'ㄹ'.
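The metasymbols can be read as a small pattern language over phonemes. The following is a minimal sketch, not the patent's implementation: it assumes that a word is first decomposed into a flat string of jamo with the standard Unicode Hangul arithmetic, that '*' stands for a run of zero or more phonemes, and that an expression must match the whole word. The function names are illustrative.

```python
import re

# Jamo tables indexed by the standard Unicode Hangul syllable arithmetic.
LEADS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
VOWELS = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"
TAILS = ["", "ㄱ", "ㄲ", "ㄳ", "ㄴ", "ㄵ", "ㄶ", "ㄷ", "ㄹ", "ㄺ", "ㄻ", "ㄼ", "ㄽ", "ㄾ",
         "ㄿ", "ㅀ", "ㅁ", "ㅂ", "ㅄ", "ㅅ", "ㅆ", "ㅇ", "ㅈ", "ㅊ", "ㅋ", "ㅌ", "ㅍ", "ㅎ"]

# Regex fragment for each metasymbol of FIG. 1A (over compatibility jamo).
METASYMBOLS = {
    "Z": "[ㄱ-ㅎ]",        # any consonant
    "V": "[ㅏ-ㅣ]",        # any vowel
    "z": "(?!ㄹ)[ㄱ-ㅎ]",  # any consonant except ㄹ
    "*": "[ㄱ-ㅣ]*",       # assumption: zero or more phonemes of any kind
}


def to_phonemes(text):
    """Decompose precomposed Hangul syllables into jamo; keep other characters."""
    out = []
    for ch in text:
        code = ord(ch) - 0xAC00
        if 0 <= code < 11172:
            out.append(LEADS[code // 588] + VOWELS[(code % 588) // 28] + TAILS[code % 28])
        else:
            out.append(ch)
    return "".join(out)


def compile_expression(expr):
    """Translate a syllable normalized expression into a compiled regex."""
    parts = [METASYMBOLS.get(ch) or re.escape(to_phonemes(ch)) for ch in expr]
    return re.compile("".join(parts))


def matches(expr, word):
    """True if the whole word, as a phoneme string, fits the expression."""
    return compile_expression(expr).fullmatch(to_phonemes(word)) is not None
```

With this reading, `matches("ZV*워", "고마워")` holds because '고마워' decomposes to ㄱㅗㅁㅏㅇㅝ. The flat jamo string does not distinguish an initial consonant from a final one, which is a simplification of this sketch.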

FIG. 2 shows part of an actual dictionary written with syllable normalized expressions. The first column of the syllable normalized expression dictionary gives the part of speech and the base form. In the illustrated entries, the part-of-speech tag HI denotes an irregular adjective and DI an irregular verb. The second column gives the variant form (allomorph) and serves as the key when the dictionary is searched during morphological analysis. The third column gives the connection information used in the connection check. Because unregistered words thus receive connection information of the same level as registered words, the connection check is effective. Connection information is used to check whether one morpheme is grammatically connected to another; for example, a noun can be followed by a particle, but a noun cannot be followed by an ending. The connection information illustrated in FIG. 2 is the same as that for registered words, so it is not described further here.
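The three-column layout can be pictured as a record type indexed by its second column. This is a sketch under assumptions: the class name, field names, and the connection values `conn-A` and `conn-B` are placeholders, since the actual contents of FIG. 2 are not reproduced in the text; only the expression pairs come from the description.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    pos: str           # part-of-speech tag, e.g. "HI" (irregular adjective)
    base_expr: str     # column 1: syllable normalized expression of the base form
    variant_expr: str  # column 2: expression of the variant form (search key)
    connection: str    # column 3: connection information (placeholder values)


ENTRIES = [
    Entry("HI", "ZV*ZVㅂ", "ZV*워", "conn-A"),  # variant differs from base
    Entry("HI", "ZV*갈", "ZV*갈", "conn-B"),    # variant equals base
]


def build_index(entries):
    """Group entries by variant expression, the key used during lookup."""
    index = defaultdict(list)
    for entry in entries:
        index[entry.variant_expr].append(entry)
    return index
```

A list per key is used because several parts of speech may share one variant expression, which is what later produces multiple morpheme candidates.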

According to FIG. 2, a morpheme ending in '갈' can be proposed as a candidate for the unconjugated form of a 'ㄹ-irregular' adjective whose base form also ends in '갈'. If the dictionary is searched with the form '고마워', it matches 'ZV*워' in the second column; the corresponding base form is 'ZV*ZVㅂ', so the word is restored to '고맙'.

FIG. 3 shows the base-form restoration process that uses the syllable normalized expression dictionary, explained here with '고마워' as an example. Searching the variant forms of the dictionary with '고마워' matches 'ZV*워'. It is then checked whether the syllable normalized expression of the retrieved variant form is the same as that of the base form (step 300). If they are the same, the input string is used as the base form unchanged; if they differ, the restoration below is performed. Here the variant expression 'ZV*워' and the base expression 'ZV*ZVㅂ' differ, so processing follows the 'No' branch. Subtracting the variant expression 'ZV*워' from the input string '고마워' removes the common part and leaves '고마' (step 310). Adding 'ㅂ' to '고마' then restores '고맙', the base form of the input string (step 320). Copying this into the base-form part of the dictionary entry yields 'HIㅂ<고맙>(고마워)[축약>]', which has the same shape as a registered morpheme (step 330).
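Steps 300 to 320 can be sketched as follows. The sketch assumes that the non-metasymbol part of the variant expression is a suffix of the input word and that a bare consonant in the base expression is attached as the final consonant of the last remaining syllable; `restore_base` and its helpers are illustrative names, not taken from the patent.

```python
# Final-consonant table of the Unicode Hangul syllable block (index 0 = none).
TAILS = ["", "ㄱ", "ㄲ", "ㄳ", "ㄴ", "ㄵ", "ㄶ", "ㄷ", "ㄹ", "ㄺ", "ㄻ", "ㄼ", "ㄽ", "ㄾ",
         "ㄿ", "ㅀ", "ㅁ", "ㅂ", "ㅄ", "ㅅ", "ㅆ", "ㅇ", "ㅈ", "ㅊ", "ㅋ", "ㅌ", "ㅍ", "ㅎ"]
METASYMBOLS = set("ZVz*")


def literal_part(expr):
    """Return the expression with its metasymbols removed, e.g. 'ZV*워' -> '워'."""
    return "".join(ch for ch in expr if ch not in METASYMBOLS)


def is_open_syllable(ch):
    """True for a precomposed Hangul syllable that has no final consonant."""
    code = ord(ch) - 0xAC00
    return 0 <= code < 11172 and code % 28 == 0


def restore_base(word, variant_expr, base_expr):
    # Step 300: identical expressions mean the input already is the base form.
    if variant_expr == base_expr:
        return word
    # Step 310: remove the part the input shares with the variant expression.
    shared = literal_part(variant_expr)
    if not word.endswith(shared):
        raise ValueError("word does not end with the variant's literal part")
    stem = word[:len(word) - len(shared)]
    # Step 320: add the literal part of the base expression.
    for ch in literal_part(base_expr):
        if stem and ch in TAILS[1:] and is_open_syllable(stem[-1]):
            # Attach the consonant as the final consonant of the last syllable.
            stem = stem[:-1] + chr(ord(stem[-1]) + TAILS.index(ch))
        else:
            stem += ch
    return stem
```

For the example in the text, `restore_base("고마워", "ZV*워", "ZV*ZVㅂ")` strips '워' to get '고마' and then folds 'ㅂ' into '마', giving '고맙'.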

The lexical probability of an unregistered word cannot be obtained directly from a corpus, so a special method of calculating it is needed. The present invention uses syllable trigrams for this purpose. A syllable can be a meaningful unit by itself, so it can give information about morphemes that begin with a particular syllable. For example, when '최' appears at the beginning of a morpheme, the morpheme is likely to be a person's name, so a three-syllable unregistered word beginning with '최' receives a high lexical probability as a personal proper noun. The specific formula is shown in FIG. 4. Morpheme boundaries are first added before and after the morpheme to be evaluated (for example, '최형도' becomes '##최형도#'), and the string is then decomposed three syllables at a time and evaluated according to the formula of FIG. 4. For example, the probability that '최형도' is a proper noun is calculated as follows (the worked equation is not reproduced in this text).
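Because the formula of FIG. 4 is not reproduced in the text, the following sketch should be read as an assumption rather than the patent's equation: it uses a conventional unsmoothed trigram estimate, multiplying for each position the frequency of the syllable trigram by part of speech divided by the frequency of its two-syllable context. The padding '##…#' follows the description; the tag names `NNP` and `NNG` and the toy training data are hypothetical.

```python
from collections import Counter


def pad(morpheme):
    """Add morpheme boundaries as in the description: '최형도' -> '##최형도#'."""
    return "##" + morpheme + "#"


def train(tagged_morphemes):
    """Count syllable trigrams and their two-syllable contexts per tag."""
    tri, bi = Counter(), Counter()
    for morpheme, tag in tagged_morphemes:
        s = pad(morpheme)
        for i in range(2, len(s)):
            tri[(tag, s[i - 2:i + 1])] += 1
            bi[(tag, s[i - 2:i])] += 1
    return tri, bi


def lexical_probability(morpheme, tag, tri, bi):
    """Assumed estimate: product of f_t(trigram) / f_t(context), no smoothing."""
    s = pad(morpheme)
    p = 1.0
    for i in range(2, len(s)):
        context = bi[(tag, s[i - 2:i])]
        if context == 0:
            return 0.0  # unseen context for this tag
        p *= tri[(tag, s[i - 2:i + 1])] / context
    return p
```

Trained on a toy list in which two proper nouns and one common noun begin with '최', the estimate for '최형도' is higher under the proper-noun tag, which mirrors the intuition given for '최'. A practical system would need smoothing for unseen trigrams, which this sketch omits.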

In this case, '최' occurs at the beginning of a morpheme more often in proper nouns than in other parts of speech, so the lexical probability for the proper noun is relatively high. The syllable trigrams for all parts of speech are computed in advance during training and used during analysis.

FIG. 5 shows the morphological analysis process that estimates unregistered words in this way. An input string is divided into phonemes (step 500), and the divided string is looked up simultaneously in the morpheme dictionary (50), which holds registered words, and in the syllable normalized expression dictionary (51), which serves unregistered words (steps 510 and 520). Both are searched together to compensate for the case in which an analysis is actually possible but an entry was left out of the morpheme dictionary (50) through an oversight of its builder. For example, the word phrase '나는' can be analyzed as '나/pronoun + 는/particle', '날/irregular verb + 는/adnominal ending', or '나/regular verb + 는/adnominal ending'. If '나/regular verb' is not registered in the morpheme dictionary (50), the correct analysis can never be produced for '시골에서 여름을 나는 사람들은…' ('people who spend the summer in the countryside…'). The unregistered word estimation method of the present invention therefore performs unregistered word estimation even when a registered word is found, which overcomes this shortcoming and is an important factor in building a reliable system.
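The point of steps 510 and 520 is that the estimator runs even when the registered dictionary returns a hit. A minimal sketch of that union follows; the dictionary contents and the `estimate_unregistered` callback are hypothetical stand-ins for the morpheme dictionary (50) and the lookup in the syllable normalized expression dictionary (51).

```python
def candidates(surface, morpheme_dict, estimate_unregistered):
    """Union of registered and estimated analyses for one surface string."""
    found = list(morpheme_dict.get(surface, []))      # step 510: registered words
    for candidate in estimate_unregistered(surface):  # step 520: always executed
        if candidate not in found:
            found.append(candidate)
    return found


# Hypothetical registered dictionary that lacks '나' as a regular verb.
MORPHEME_DICT = {"나": [("나", "pronoun")]}


def toy_estimator(surface):
    """Stand-in estimator that proposes a regular-verb reading."""
    return [(surface, "regular verb")]
```

Here `candidates("나", MORPHEME_DICT, toy_estimator)` returns both the pronoun and the regular-verb reading, so the reading needed for '여름을 나는' is not lost.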

Results retrieved from the general morpheme dictionary (50) go directly to the connection check (step 550), whereas morphemes retrieved from the syllable normalized expression dictionary (51) first go through the base-form restoration process of FIG. 3 (step 530) and the overgeneration-prevention filtering process (step 540).

Using the syllable normalized expression dictionary for unregistered word estimation has the advantage that every morpheme candidate can be generated, but the disadvantage that morpheme candidates are overgenerated. The overgeneration-prevention filtering process (step 540) overcomes this by removing unnecessary morpheme candidates with the following pieces of heuristic information.

First, conjugated forms of tense pre-final endings. In Korean, a lexical morpheme cannot contain a pre-final ending; that is, morphemes such as nouns, verbs, and adjectives do not include tense pre-final endings (았, 었, 셨, 였, 겠, ...). For example, a form containing '았' is not a lexical morpheme and must be decomposed further: '보았다' is decomposed into '보+았+다', and only '보' (from '보다') is a lexical morpheme. This information (97 tense pre-final endings) is used to remove overgenerated unregistered words.

Second, in Korean no substantive ends in '는', '를', or '을', except for compounds ending in '가을', '고을', or '마을' and the words '갑을', '넨장맞을', '노갑이을', '노을', '리을', '빌어먹을', '을', '제장맞을', and '태을'. Substantives estimated as unregistered words that violate this information are removed.

Third, Viterbi search is an algorithm that finds the most likely result using morpheme probabilities and contextual probabilities. It computes the cumulative probability from the beginning of the input up to the current position, and the path with the largest value is taken as the best result. The probabilities obtained during this computation are used for filtering: results whose probability falls below a given threshold are removed, because the less likely an analysis is to occur, the smaller its probability.
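The three heuristics can be combined into one filter, sketched below under assumptions. Only the five pre-final endings quoted in the text are listed (the full set of 97 is not given), the cumulative log probability is assumed to be supplied by a Viterbi pass that is not implemented here, and the part-of-speech label `substantive` and the threshold value are illustrative.

```python
import math

# Five of the 97 tense pre-final endings; the rest are not listed in the text.
TENSE_PREFINAL = ("았", "었", "셨", "였", "겠")

# Exceptions to "no substantive ends in 을", as enumerated in the description.
EUL_COMPOUND_TAILS = ("가을", "고을", "마을")
EUL_WORDS = {"갑을", "넨장맞을", "노갑이을", "노을", "리을",
             "빌어먹을", "을", "제장맞을", "태을"}


def has_tense_prefinal(form):
    """Heuristic 1: a lexical morpheme contains no tense pre-final ending."""
    return any(ending in form for ending in TENSE_PREFINAL)


def implausible_substantive(form):
    """Heuristic 2: substantives do not end in 는, 를, or (with exceptions) 을."""
    if form.endswith(("는", "를")):
        return True
    if form.endswith("을"):
        return form not in EUL_WORDS and not form.endswith(EUL_COMPOUND_TAILS)
    return False


def keep(candidate, min_log_prob):
    """Apply heuristics 1 to 3 to one (form, pos, cumulative log prob) triple."""
    form, pos, log_prob = candidate
    if has_tense_prefinal(form):
        return False
    if pos == "substantive" and implausible_substantive(form):
        return False
    return log_prob >= min_log_prob  # heuristic 3: probability threshold
```

Working in log space is a design choice of this sketch to avoid underflow when cumulative probabilities become very small; the description itself speaks only of probability values.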

The connection check encodes general knowledge about Korean morpheme connectivity in a table and uses it to remove impossible connections. This process is repeated to the end of the sentence. The resulting morpheme graph, together with the unregistered word lexical probabilities calculated as in FIG. 4 and the trained probabilities of registered words, is used as the input to part-of-speech tagging.
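A connection table can be sketched as a set of permitted part-of-speech pairs. The table below is illustrative only: the description confirms that a noun may be followed by a particle and not by an ending, while the remaining rows and all names are assumptions.

```python
# Permitted (left part of speech, right part of speech) pairs; illustrative.
CONNECTION_TABLE = {
    ("noun", "particle"),
    ("pronoun", "particle"),
    ("verb", "ending"),
    ("adjective", "ending"),
}


def connectable(left_pos, right_pos):
    """True if a morpheme of left_pos may be followed by one of right_pos."""
    return (left_pos, right_pos) in CONNECTION_TABLE


def prune_edges(edges):
    """Keep only adjacent (form, pos) pairs that the table allows."""
    return [(left, right) for left, right in edges
            if connectable(left[1], right[1])]
```

Applied to the candidate pairs for '나는', the table keeps pronoun + particle and verb + ending and drops pronoun + ending, which leaves the edges of the morpheme connection graph.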

The embodiments described above can be written as a program executable on a computer and can be implemented on a general-purpose digital computer that runs the program from a computer-usable medium. Such media include magnetic storage media (for example, ROM, floppy disks, and hard disks), optically readable media (for example, CD-ROM and DVD), and carrier waves (for example, transmission over the Internet).

The recording medium contains program code that causes a computer to execute: (a) a module that restores the base form of an unregistered word and obtains the corresponding connection information by using a syllable normalized expression dictionary database, in which all syllables appearing in Korean substantives, predicates, and adverbs are represented as syllable normalized expressions, thereby generating one or more morpheme candidates; (b) a module that filters overgenerated unregistered words out of the morpheme candidates by using predetermined heuristics; and (c) a module that uses a Korean connection table to delete, from the filtered morpheme candidates, those that cannot be connected.

The syllable normalized expression dictionary database comprises a base-form field storing the part of speech of a morpheme and the syllable normalized expression of its base form; a variant-form field storing the syllable normalized expression of a variant form of the morpheme; and a connection-information field storing the connection information corresponding to the variant form in the variant-form field. The syllable normalized expression of an unregistered word is accessed with the variant-form field as the key.

Module (a) contains program code that causes a computer to execute: (a1) a module that searches the variant-form field of the dictionary for a syllable normalized expression corresponding to the unregistered word; (a2) a module that, when the retrieved variant expression and the base-form expression are the same, takes the unregistered word itself as the base form of a morpheme candidate and the corresponding connection information as the candidate's connection information, and generates the morpheme candidate; (a3) a module that, when they differ, restores the base form of the unregistered word by deleting from it the part shared with the variant expression and appending the part of the base-form expression that remains after its metasymbols are excluded, takes the corresponding connection information as the candidate's connection information, and generates the morpheme candidate; and (a4) a module that repeats modules (a1) to (a3) to generate all morpheme candidates for the unregistered word.

The recording medium stores program code that causes a computer to execute: (a) a module that divides an input string into phonemes; (b) a module that receives the phonemes of the divided input string in order, extracts candidate morphemes of registered words using a morpheme dictionary, and extracts candidate morphemes of unregistered words using a syllable normalized expression dictionary; (c) a module that removes, by using a Korean connection table, candidate morphemes that cannot be connected; (d) a module that obtains the lexical probability of each connectable candidate morpheme; and (e) a module that repeats modules (b) to (d) until the last phoneme of the input string has been processed. In module (d), the lexical probability of an unregistered-word candidate morpheme is estimated by an equation (not reproduced in this text), where m = e₀e₁e₂…e_n is a morpheme, e is a syllable, # is a morpheme boundary, and f_t is the frequency when the part of speech is t.

Programmers skilled in the art to which the present invention pertains can readily implement the functional modules for realizing the invention as described above.

According to the present invention, the syllables of Korean open-class words are analyzed and the syllables of nouns, adverbs, verbs, and adjectives are converted into normalized expressions, so that every morpheme is represented by a syllable normalized expression and unregistered words can be entered in a dictionary. In particular, when irregular conjugations of verbs and adjectives are estimated, using the syllable normalized expression dictionary makes it possible to restore accurately the base forms of verbs and adjectives that are classified as unregistered words because they are absent from the morpheme dictionary. Connection information of the same level as that of morphemes registered in the morpheme dictionary can therefore be provided for these irregular conjugations as well, so later stages that process them for various purposes (spell checking and the like) can treat them like registered words. No separate device, rule, or method is needed to handle irregular conjugations of verbs and adjectives, so the object can be achieved with a simple system while errors in handling those conjugations are also eliminated.

Claims

1. An unregistered word analysis method using a syllable normalized expression dictionary, the method comprising:

(a) receiving a string that contains an unregistered word to be restored;

(b) restoring the base form of the unregistered word by using a syllable normalized expression dictionary, in which all syllables appearing in Korean substantives, predicates, and adverbs are represented as syllable normalized expressions, and generating one or more morpheme candidates with their corresponding connection information;

(c) filtering overgenerated unregistered words out of the one or more morpheme candidates by using predetermined heuristic information; and

(d) generating a morpheme connection graph by using a Korean connection table to delete, from the filtered morpheme candidates, those that cannot be connected.

2. The method of claim 1, wherein the syllable normalized expression dictionary comprises:

a base-form field storing the part of speech of a morpheme and the syllable normalized expression of its base form;

a variant-form field storing the syllable normalized expression of a variant form of the morpheme; and

a connection-information field storing the connection information corresponding to the variant form in the variant-form field,

wherein the syllable normalized expression of an unregistered word is accessed using the variant-form field as a key.

3. The method of claim 1, wherein step (b) comprises:

(b1) searching the variant-form field of the syllable normalized expression dictionary for a syllable normalized expression corresponding to the unregistered word;

(b2) when the retrieved variant expression and the base-form expression are the same, taking the unregistered word itself as the base form of a morpheme candidate and the corresponding connection information as the connection information of the candidate, and generating the morpheme candidate;

(b3) when the retrieved variant expression and the base-form expression differ, restoring the base form of the unregistered word by deleting from the unregistered word the part it shares with the variant expression and appending the part of the base-form expression that remains after the metasymbols are excluded, taking the corresponding connection information as the connection information of the candidate, and generating the morpheme candidate; and

(b4) repeating steps (b1) to (b3) to generate all morpheme candidates for the unregistered word.

4. A method of morphologically analyzing an input string containing an unregistered word, the method comprising:

(a) dividing the input string into phonemes;

(b) receiving the phonemes of the divided input string in order, extracting candidate morphemes of registered words using a morpheme dictionary, and extracting candidate morphemes of unregistered words using a syllable normalized expression dictionary;

(c) removing, by using a Korean connection table, candidate morphemes that cannot be connected;

(d) obtaining the lexical probability of each connectable candidate morpheme; and

(e) repeating steps (b) to (d) until the last phoneme of the input string has been processed.

5. The method of claim 4, wherein in step (d) the lexical probability of an unregistered-word candidate morpheme is estimated by an equation (not reproduced in this text), where m = e₀e₁e₂…e_n is a morpheme, e is a syllable, # is a morpheme boundary, and f_t is the frequency when the part of speech is t.

6. A computer-readable recording medium on which an unregistered word analysis program using a syllable normalized expression dictionary database is recorded, the program comprising:

(a) a module that restores the base form of an unregistered word and obtains the corresponding connection information by using a syllable normalized expression dictionary database, in which all syllables appearing in Korean substantives, predicates, and adverbs are represented as syllable normalized expressions, thereby generating one or more morpheme candidates;

(b) a module that filters overgenerated unregistered words out of the one or more morpheme candidates by using predetermined heuristics; and

(c) a module that uses a Korean connection table to delete, from the filtered morpheme candidates, those that cannot be connected.

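7. The recording medium of claim 6, wherein the syllable normalized expression dictionary database comprises a base-form field storing the part of speech of a morpheme and the syllable normalized expression of its base form, a variant-form field storing the syllable normalized expression of a variant form of the morpheme, and a connection-information field storing the connection information corresponding to the variant form in the variant-form field,

and wherein the syllable normalized expression of an unregistered word is accessed using the variant-form field as a key.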

8. The recording medium of claim 6, wherein module (a) comprises:

(a1) a module that searches the variant-form field of the syllable normalized expression dictionary for a syllable normalized expression corresponding to the unregistered word;

(a2) a module that, when the retrieved variant expression and the base-form expression are the same, takes the unregistered word itself as the base form of a morpheme candidate and the corresponding connection information as the connection information of the candidate, and generates the morpheme candidate;

(a3) a module that, when the retrieved variant expression and the base-form expression differ, restores the base form of the unregistered word by deleting from the unregistered word the part it shares with the variant expression and appending the part of the base-form expression that remains after the metasymbols are excluded, takes the corresponding connection information as the connection information of the candidate, and generates the morpheme candidate; and

(a4) a module that repeats modules (a1) to (a3) to generate all morpheme candidates for the unregistered word.

9. A computer-readable recording medium on which a program for morphologically analyzing an input string containing an unregistered word is recorded, the program comprising:

(a) a module that divides the input string into phonemes;

(b) a module that receives the phonemes of the divided input string in order, extracts candidate morphemes of registered words using a morpheme dictionary, and extracts candidate morphemes of unregistered words using a syllable normalized expression dictionary;

(c) a module that removes, by using a Korean connection table, candidate morphemes that cannot be connected;

(d) a module that obtains the lexical probability of each connectable candidate morpheme; and

(e) a module that repeats modules (b) to (d) until the last phoneme of the input string has been processed.

10. The recording medium of claim 9, wherein in module (d) the lexical probability of an unregistered-word candidate morpheme is estimated by an equation (not reproduced in this text), where m = e₀e₁e₂…e_n is a morpheme, e is a syllable, # is a morpheme boundary, and f_t is the frequency when the part of speech is t.