KR101309839B1

KR101309839B1 - Rule-based parsing apparatus and method using statistical information

Info

Publication number: KR101309839B1
Application number: KR1020090118298A
Authority: KR
Inventors: 노윤형; 최승권; 이기영; 권오욱; 김영길; 김창현; 양성일; 서영애; 김운; 황금하; 오영순; 박은진; 박상규
Original assignee: 한국전자통신연구원
Priority date: 2009-12-02
Filing date: 2009-12-02
Publication date: 2013-09-23
Also published as: KR20110061788A

Abstract

The present invention relates to a rule-based parsing apparatus and method using statistical information. A rule-based parsing method using statistical information according to an embodiment of the present invention includes: performing parsing by applying syntax rules to an input sentence; calculating a rule weight for each rule applied to the input sentence, using the rule probability given to the rule and a lexical dependency weight calculated from lexical statistics information; calculating the weight of each syntax tree by multiplying the rule weights calculated for the rules used in that tree, and selecting the syntax tree with the highest weight; and outputting the selected syntax tree.

As described above, the present invention enables parsing that combines the efficiency of the rule-based approach with the strong ambiguity resolution of the statistics-based approach.

Language processing, parsing, statistical information, rule-based

Description

Rule-based parsing apparatus and method using statistical information

The present invention relates to a rule-based parsing apparatus and method, and more particularly to a rule-based parsing apparatus and method using statistical information.

The present invention is derived from research conducted as part of the Korean-Chinese-English dialogue and corporate document automatic translation technology development project of the Ministry of Knowledge Economy [Project management number: 2009-S-034-01, Project title: Development of automatic translation technology for Korean-Chinese-English dialogue and corporate documents].

Parsing (syntactic analysis), one of the general language processing techniques, refers collectively to the task of checking whether a given sentence can legitimately be used as a sentence according to a defined grammatical structure.

Parsing analyzes data that has already undergone morphological analysis and tagging into syntactic units, and is broadly divided into rule-based parsing and statistics-based parsing.

Rule-based parsing parses a sentence by repeatedly applying a relatively small number of rules, and is therefore limited in its ability to resolve ambiguity.

By contrast, statistics-based parsing can overcome this limitation by using lexicalized rules. However, because it relies on a large amount of statistical information for each word, it is less efficient, and it is not easy to add new knowledge or for a person to manage and tune the parsing knowledge.

Accordingly, there is a pressing need for a parsing scheme that offers strong ambiguity resolution while keeping the statistical information easy to manage and tune.

Accordingly, an object of the present invention is to provide a rule-based parsing apparatus and method using statistical information that combines the efficiency of the rule-based approach with the strong ambiguity resolution of the statistics-based approach.

Another object of the present invention is to provide a rule-based parsing apparatus and method using statistical information that takes a rule-based parsing method as its basis and resolves ambiguity by applying lexical dependency information extracted from a treebank (a corpus annotated with syntax trees).

Other objects of the present invention can be understood from the following description and the embodiments of the present invention.

To this end, a rule-based parsing method using statistical information according to an embodiment of the present invention includes: performing parsing by applying syntax rules to an input sentence; calculating a rule weight for each rule applied to the input sentence, using the rule probability given to the rule and a lexical dependency weight calculated from lexical statistics information; calculating the weight of each syntax tree by multiplying the rule weights calculated for the rules used in that tree, and selecting the syntax tree with the highest weight; and outputting the selected syntax tree.

Meanwhile, a rule-based parsing apparatus using statistical information according to another embodiment of the present invention includes: a lexical statistics information DB; a rule-based parsing module that performs parsing by applying syntax rules to an input sentence and selects an optimal syntax tree; a rule weight calculation module that calculates a rule weight for each rule applied to the input sentence, using the rule probability given to the rule and a lexical dependency weight calculated from the lexical statistics information DB, and provides the rule weight to the rule-based parsing module; and a syntax tree output module that outputs the syntax tree selected by the rule-based parsing module. The rule-based parsing module calculates the weight of each syntax tree by multiplying the rule weights calculated for the rules used in that tree, and selects the syntax tree with the highest weight as the optimal syntax tree.

As described above, the present invention takes a rule-based parsing method as its basis and resolves ambiguity by applying lexical dependency information extracted from a treebank, thereby enabling parsing that combines the efficiency of the rule-based approach with the strong ambiguity resolution of the statistics-based approach.

In addition, because rules serve as the basic parsing knowledge, it is easy for a person to manage and tune the parsing knowledge and to add new knowledge such as lexical patterns or semantic patterns.

Various other effects will be disclosed directly or implicitly in the detailed description of the embodiments of the present invention given below.

In describing the present invention below, detailed descriptions of well-known functions or configurations are omitted where they would unnecessarily obscure the gist of the invention. The terms used below are defined in consideration of their functions in the present invention and may vary according to the intention or practice of users and operators. Their definitions should therefore be based on the content of this specification as a whole.

The embodiments of the present invention described below are based on a rule-based parsing method and provide a scheme for selecting a syntax tree by applying lexical dependency information extracted from a treebank to that method. In other words, the following describes in detail a rule-based parsing scheme using statistical information that combines the efficiency of the rule-based approach with the strong ambiguity resolution of the statistics-based approach.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings.

FIG. 1 shows the configuration of a rule-based parsing apparatus using statistical information according to an embodiment of the present invention. As shown, the parsing apparatus includes an input module 102, a rule-based parsing module 103, a syntax tree output module 104, a rule weight calculation module 105, a parsing rule DB 106, and a lexical statistics information DB 107.

The input module 102 receives the source text 101, that is, the sentence to be analyzed and for which a syntax tree is to be selected, and passes the input source text 101 to the rule-based parsing module 103.

The rule-based parsing module 103 performs parsing by applying the syntax rules stored in the parsing rule DB 106 to the input source text 101. The rule-based parsing module 103 generates a new chart by applying an externally provided syntax rule to the source text 101, and calls the rule weight calculation module 105 each time a new chart is generated.

The rule weight calculation module 105 uses the lexical statistics information database 107 to calculate the lexical dependency weights between the head word of the applied syntax rule and the head word of each child node, and calculates the rule weight from the syntax rule probability and the lexical dependency weights. The lexical statistics information database stores dependency weights between pairs of words. The lexical dependency weight of a rule is calculated as the product of the lexical dependency weights between the head word of the chart and the head words of its other children.

In one embodiment, the rule weight calculation module 105 also takes a subcategorization type weight for verbs into account when calculating the rule weight, to compensate for the lexical data sparseness that arises when statistics are looked up in the lexical statistics information database 107. Here, the subcategorization type of a verb refers to the syntactic elements that a particular verb takes as obligatory arguments. The subcategorization type weight is applied differently depending on whether the head word of the new chart is a verb. If the head word of the new chart is not a verb, the subcategorization type weight is 1. If the head word of the new chart is a verb, the subcategorization type weight is calculated as P(h_c | h_t, h_l) / P(h_c | h_t), where h_c is the subcategorization type of the chart head word, h_t is the part of speech of the chart head word, and h_l is the base form of the chart head word.

Meanwhile, because a lexical dependency probability is strongly affected by the distance between the two words, a distance attribute between the words can be reflected when calculating the lexical dependency weight. Specifically, the lexical dependency weight between the head word of the chart and the head word of each other child is calculated as P(c_it, c_il | h_t, h_l, h_c, dist_i) / P(c_it, c_il | h_t, h_c, dist_i), where c_it is the part of speech of the head word of the i-th child, c_il is the base form of the head word of the i-th child, and dist_i is the distance attribute between the chart head word and the head word of the i-th child. The distance attribute dist_i can be determined by considering whether any word lies between the chart head word and the head word of the i-th child, whether a comma is included, whether a verb is present, whether a preposition is present, and so on.
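For reference, the two weights described in the preceding two paragraphs can be written in display form. This is only a typeset restatement of the inline expressions using the same symbols; the names W_sub and W(c_i | h) are labels introduced here for readability and do not appear in the original text.

```latex
\[
W_{\mathrm{sub}}(h) =
\begin{cases}
1, & \text{if the chart head word is not a verb},\\[4pt]
\dfrac{P(h_c \mid h_t, h_l)}{P(h_c \mid h_t)}, & \text{if the chart head word is a verb},
\end{cases}
\]
\[
W(c_i \mid h) =
\frac{P(c_{it}, c_{il} \mid h_t, h_l, h_c, dist_i)}
     {P(c_{it}, c_{il} \mid h_t, h_c, dist_i)}
\]
```

Both weights are ratios of a lexicalized estimate to a less specific one, so a value above 1 rewards the rule and a value below 1 penalizes it.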

In addition, P(c_it, c_il | h_t, h_l, h_c, dist_i), which is used to calculate the lexical dependency weight between the head word of the chart and the head words of its other children, can be smoothed as λ1 * P(c_it, c_il | h_t, h_l, h_c, dist_i) + (1 - λ1) * P(c_it, c_il | h_t, h_c, dist_i). Likewise, P(c_it, c_il | h_t, h_c, dist_i) is smoothed as λ2 * P(c_it, c_il | h_t, h_c, dist_i) + (1 - λ2) * P(c_it, c_il | h_t, dist_i). Here C(x) is the frequency of x, u is a constant greater than 0, λ1 is calculated as C(h_t, h_l, h_c, dist_i) / (C(h_t, h_l, h_c, dist_i) + u), and λ2 is calculated as C(h_t, h_c, dist_i) / (C(h_t, h_c, dist_i) + u).
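The smoothing described above can be sketched in Python as follows. This is a minimal illustration and not the patented implementation: the function names are hypothetical, the probabilities and counts are supplied by the caller, and chaining the two levels (using the λ2-smoothed estimate as the back-off term of the λ1 interpolation) is an assumption about how the two formulas combine.

```python
def interpolate(p_specific, p_backoff, context_count, u=1.0):
    """Return lam * p_specific + (1 - lam) * p_backoff, with lam = C / (C + u)."""
    if u <= 0:
        raise ValueError("u must be greater than 0")
    lam = context_count / (context_count + u)
    return lam * p_specific + (1.0 - lam) * p_backoff


def smoothed_dependency_probability(p_full, p_no_lemma, p_no_subcat,
                                    count_full, count_no_lemma, u=1.0):
    """Two-level smoothing of P(c_it, c_il | h_t, h_l, h_c, dist_i).

    p_full      : estimate conditioned on (h_t, h_l, h_c, dist_i); pass 0.0 if unseen
    p_no_lemma  : estimate conditioned on (h_t, h_c, dist_i)
    p_no_subcat : estimate conditioned on (h_t, dist_i)
    count_full, count_no_lemma : C(h_t, h_l, h_c, dist_i) and C(h_t, h_c, dist_i)
    """
    # lambda2 level: smooth the estimate that drops the head base form.
    p_no_lemma_smoothed = interpolate(p_no_lemma, p_no_subcat, count_no_lemma, u)
    # lambda1 level (assumed chaining): back off to the smoothed estimate above.
    return interpolate(p_full, p_no_lemma_smoothed, count_full, u)
```

With a context count of 0 the result is entirely the back-off estimate, and as the count grows the result approaches the more specific estimate.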

The rule-based parsing module 103 receives the rule weights calculated by the rule weight calculation module 105 and selects the syntax tree composed of the rules having the highest rule weights.

The syntax tree output module 104 outputs the syntax tree selected by the rule-based parsing module 103 as the final parsing result.

FIG. 2 shows an example of the PCFG (Probabilistic Context Free Grammar) rules used in a rule-based parser. Specifically, FIG. 2 presents three different syntax rules. As shown, the first syntax rule "S -> NP VP", the second syntax rule "NP -> DT NN", and the third syntax rule "VP -> VB NP" are used to construct the syntax tree. Here, S denotes a sentence, NP a noun phrase, VP a verb phrase, DT an article, NN a noun, and VB an active-voice verb.

The probability of each syntax rule constituting the syntax tree is calculated by Equation 1 below.

P(S -> NP VP) = C(S -> NP VP) / C(S)

P(NP -> DT NN) = C(NP -> DT NN) / C(NP)

P(VP -> VB NP) = C(VP -> VB NP) / C(VP)

Here, C(A) denotes the number of times the construct A occurs in the corpus.
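The relative-frequency estimate of Equation 1 can be sketched as follows. The function name and the toy list of rule occurrences are hypothetical and only illustrate the counting; they are not taken from any real treebank.

```python
from collections import Counter


def rule_probabilities(rule_occurrences):
    """Estimate P(lhs -> rhs) = C(lhs -> rhs) / C(lhs) from observed rule uses."""
    rule_counts = Counter(rule_occurrences)
    lhs_counts = Counter(lhs for lhs, _ in rule_occurrences)
    return {rule: n / lhs_counts[rule[0]] for rule, n in rule_counts.items()}


# Hypothetical rule occurrences, one entry per use of a rule in a parsed corpus.
toy_rules = [
    ("S", ("NP", "VP")),
    ("NP", ("DT", "NN")),
    ("NP", ("DT", "NN")),
    ("NP", ("NN",)),
    ("VP", ("VB", "NP")),
]
probs = rule_probabilities(toy_rules)
```

In this toy data NP is expanded three times, twice as "DT NN", so P(NP -> DT NN) is 2/3.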

The overall probability of a syntax tree is then obtained as the product of the syntax rule probabilities calculated for each of the rules used to construct that tree. Among the overall weights obtained in this way for each candidate syntax tree, the syntax tree whose value is largest is selected. Selecting the syntax tree with the largest overall weight in this way yields strong ambiguity resolution.
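The selection step can be sketched as follows, following the method summarized in the abstract, in which the weights of the rules used in a tree are multiplied and the tree with the largest result is kept. The function names, tree labels, and numbers are hypothetical.

```python
import math


def tree_weight(rule_weights):
    """Weight of one syntax tree: the product of the weights of its rules."""
    return math.prod(rule_weights)


def select_best_tree(candidates):
    """Return the key of the candidate tree with the highest weight."""
    return max(candidates, key=lambda name: tree_weight(candidates[name]))


# Two hypothetical analyses of the same sentence, each given by its rule weights.
candidates = {
    "tree_a": [0.5, 0.4, 0.9],
    "tree_b": [0.5, 0.6, 0.7],
}
best = select_best_tree(candidates)
```

Here tree_a scores 0.18 and tree_b scores 0.21, so tree_b is selected. A practical parser would usually work with log weights to avoid underflow on long sentences; that is a common engineering choice, not something the text specifies.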

Meanwhile, consider an example of the lexicalized PCFG rules used in a statistics-based parser. Lexicalization means that each node of a syntax rule includes the head word of that node. When the syntax rule "VP -> VB NP" is applied as in "(VP (VB like) (NP flowers))", the lexicalized rule can be expressed as Equation 2 below.

VP(like) -> VB(like) NP(flowers)

The probability of the lexicalized syntax rule expressed by Equation 2 is given by Equation 3 below.

P(VP(like) -> VB(like) NP(flowers)) = C(VP(like) -> VB(like) NP(flowers)) / C(VP(like))

Here, C(A) denotes the number of times the construct A occurs in the corpus.

However, to address the data sparseness that arises when the syntax rule of Equation 2 is lexicalized, a statistics-based parser makes independence assumptions and obtains the probability of the rule by decomposing it as in Equation 4 below.

In the present invention, by contrast, a rule weight is calculated as in Equation 5 below instead of the lexicalized rule probability described above. Equation 5 shows an example in which the rule weight is calculated by applying the general syntax rule probability and the lexical dependency weights separately.

W(VP(like) -> VB(like) NP(flowers)) = P(VP -> VB NP) * W(VB(like)|VP(like), VB) * W(NP(flowers)|VP(like), VB)

According to Equation 5, when W(VB(like)|VP(like), VB) is 1 and the value of P(NP(flowers)|VP(like), VB) / P(NP(flowers)|VP, VB) is greater than 1, W(VP(like) -> VB(like) NP(flowers)) receives a weight greater than P(VP -> VB NP). When the value of P(NP(flowers)|VP(like), VB) / P(NP(flowers)|VP, VB) is less than 1, the rule is penalized instead. Intuitively, therefore, the rule weight of Equation 5 reflects the lexicalization of the rule.
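The boost and penalty behaviour of Equation 5 can be checked with a small numeric sketch. All probabilities below are made-up values chosen for easy arithmetic, and the function name is hypothetical.

```python
def lexical_dependency_weight(p_lexicalized, p_unlexicalized):
    """Ratio such as P(NP(flowers) | VP(like), VB) / P(NP(flowers) | VP, VB)."""
    return p_lexicalized / p_unlexicalized


p_rule = 0.3   # hypothetical P(VP -> VB NP)
w_head = 1.0   # W(VB(like) | VP(like), VB), taken as 1 as in the text

# The object is more likely with this particular verb than with verbs in general.
boosted = p_rule * w_head * lexical_dependency_weight(0.02, 0.01)

# The object is less likely with this particular verb than with verbs in general.
penalized = p_rule * w_head * lexical_dependency_weight(0.005, 0.01)
```

A ratio of 2 doubles the weight relative to the plain rule probability, while a ratio of 0.5 halves it.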

As a more concrete example, suppose that "rob the traveler of his money" is parsed with the rule "VP -> VP PP". For verbs in general, the value of "W(PP(of)|VP(v), VB) = P(PP(of)|VP(v), VB) / P(PP(of)|VP, VB)" is low, so in most cases the rule "VP -> VP PP" will not be applied. For the verb "rob", however, the value of "W(PP(of)|VP(rob), VB) = P(PP(of)|VP(rob), VB) / P(PP(of)|VP, VB)" is high, so in most cases the syntax tree to which the rule "VP -> VP PP" has been applied will be selected. This confirms that ambiguity can be resolved.

However, even when ambiguity is resolved in this way, data sparseness can still arise when statistics such as "P(PP(of)|VP(rob), VB)" are looked up.

To address this, the subcategorization type of the verb is used. Here, the subcategorization type of a verb refers to the syntactic elements that a particular verb takes as obligatory arguments.

For example, in "I saw the man", "see" takes a noun phrase as an obligatory argument. In "I want you to do it", "want" takes a noun phrase and a to-infinitive phrase as obligatory arguments. The rule weight can therefore be defined as in Equation 6 below.

W(VP(like) -> VB(like,t1) NP(flowers)) = P(VP -> VB(t1) NP) * W(VB(like,t1)|VP(like)) * W(VB(like,t1)|VP(like,t1), VB) * W(NP(flowers)|VP(like,t1), VB)

Here, t1 denotes the subcategorization type that takes a noun phrase as an obligatory argument.

The lexical dependency probability is then smoothed by Equation 7 below.

Here, λ1 = f1 / (f1 + u1), f1 = C(VP(like,t1), VB), and u1 is a constant greater than 0. Equation 7 shows an example of smoothing the lexical dependency probability using deleted interpolation.

Looking at how λ1 in Equation 7 varies: when f1 = C(VP(like,t1), VB) is sufficiently large, λ1 approaches 1, so the smoothed P(NP(flowers)|VP(like,t1), VB) is approximately equal to the unsmoothed P(NP(flowers)|VP(like,t1), VB). As f1 = C(VP(like,t1), VB) becomes smaller, the probability estimated under the relaxed condition carries more weight.

That is, in "buy his wife a mink coat", if no probability P(wife|buy) is available, the probability P(wife|VB(d1)) over VB(d1), the set of all verbs used as ditransitive verbs, is used instead.

Meanwhile, such lexical dependency probabilities are strongly affected by the distance between the two words. In English, for example, when a verb lies between two words, the probability that a dependency holds between them drops markedly. To reflect this characteristic, a distance attribute is used.

In one embodiment of the present invention, the lexical distance attribute is represented by three fields. The first field is 0 if the two words are adjacent and 1 otherwise. The second field is 1 if a comma lies between the two words and 0 otherwise. The third field is 1 if a verb lies between the two words and 0 otherwise.
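The three-field distance attribute can be sketched as follows. The function name, the (word, tag) token representation, and the test for verbs (any tag starting with "VB", as in Penn Treebank-style tag sets) are assumptions made for this illustration.

```python
def distance_attribute(tagged_tokens, head_index, child_index):
    """Return the three-field distance attribute as a string such as "100".

    Field 1: 0 if the two words are adjacent, otherwise 1.
    Field 2: 1 if a comma lies between the two words, otherwise 0.
    Field 3: 1 if a verb lies between the two words, otherwise 0.
    """
    lo, hi = sorted((head_index, child_index))
    between = tagged_tokens[lo + 1:hi]  # tokens strictly between the two words
    not_adjacent = 1 if between else 0
    has_comma = 1 if any(word == "," for word, _ in between) else 0
    # Assumption: verb tags start with "VB".
    has_verb = 1 if any(tag.startswith("VB") for _, tag in between) else 0
    return f"{not_adjacent}{has_comma}{has_verb}"


sentence = [("rob", "VB"), ("the", "DT"), ("traveler", "NN"),
            ("of", "IN"), ("his", "PRP$"), ("money", "NN")]
adjacent = distance_attribute(sentence, 0, 1)       # "rob" and "the"
rob_traveler = distance_attribute(sentence, 0, 2)   # "rob" and "traveler"
```

For "rob" and "traveler" the result is "100": one word intervenes, with no comma and no verb in between.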

The lexical dependency probability discussed above is therefore modified as in Equation 8 below.

P(NP(flowers)|VP(like,t1), VB, 000)

The rule weight finally obtained through the above process is applied when an inactive chart is created during chart parsing. An inactive chart is a chart for which rule application has been completed in chart parsing.

An example of applying the rule weight to a chart when an inactive chart is created is as follows.

When rule R is applied to a chart of the form h -> c_1 c_2 ... c_n (where h is the parent and c_i is the i-th child), the rule weight is defined by Equation 9 below.

W(R) = P(R) * W(h_c | h_t, h_l) * ∏_i W(c_it, c_il | h_t, h_l, h_c, dist_i)

λ1 = C(h_t, h_l, h_c, dist_i) / (C(h_t, h_l, h_c, dist_i) + u)

λ2 = C(h_t, h_c, dist_i) / (C(h_t, h_c, dist_i) + u)

Here, x_c is the subcategorization type of node x when its part of speech is a verb, x_t is the part of speech of node x, and x_l is the word of node x.
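The product in Equation 9 can be sketched as follows. The function names and the numeric values are hypothetical; the probabilities passed in would come from the rule set and from lexical statistics, which are not modelled here.

```python
import math


def subcat_weight(head_is_verb, p_subcat_given_tag_and_word, p_subcat_given_tag):
    """W(h_c | h_t, h_l): 1 for non-verb heads, otherwise a probability ratio."""
    if not head_is_verb:
        return 1.0
    return p_subcat_given_tag_and_word / p_subcat_given_tag


def rule_weight(rule_prob, head_subcat_weight, child_dependency_weights):
    """W(R) = P(R) * W(h_c | h_t, h_l) * product over children of W(c_i | ...)."""
    return rule_prob * head_subcat_weight * math.prod(child_dependency_weights)


# Hypothetical numbers for a verb-headed chart with two non-head children.
w_sub = subcat_weight(True, 0.3, 0.2)      # ratio 1.5
w_rule = rule_weight(0.1, w_sub, [2.0, 4.0])
```

With these values the weight is 0.1 * 1.5 * 2.0 * 4.0 = 1.2, which illustrates that a rule weight is a score rather than a probability and may exceed 1.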

For example, when the chart "(VP (VB rob) (NP (DT the) (NN traveler)) (PP (IN of) (PRP$ his) (NN money)))" is created, the rule weight according to Equation 9 can be obtained by Equation 10 below.

In Equation 10, the rule weight is thus obtained by separating the part due to the conventional syntax rule, "P(VP -> VB(t1) NP PP)", from the part due to lexical probabilities, "W(t1 | rob, VB, 100) * W(traveler, NN | rob, VB, t1, 100) * W(of, IN | rob, VB, t1, 100)".
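Putting the two quoted parts together, the rule weight for this chart would read as follows. This is a sketch assembled from the expressions quoted in the text, with the left-hand side W(R) borrowed from Equation 9; the exact layout of the original Equation 10 is an assumption.

```latex
\[
W(R) = P(\mathrm{VP} \rightarrow \mathrm{VB}(t1)\ \mathrm{NP}\ \mathrm{PP})
\times W(t1 \mid \mathit{rob}, \mathrm{VB}, 100)
\times W(\mathit{traveler}, \mathrm{NN} \mid \mathit{rob}, \mathrm{VB}, t1, 100)
\times W(\mathit{of}, \mathrm{IN} \mid \mathit{rob}, \mathrm{VB}, t1, 100)
\]
```

The first factor depends only on the unlexicalized rule, while the remaining factors carry all of the word-specific information.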

Although preferred embodiments of the present invention have been shown and described above, the present invention is not limited to the specific embodiments described. Various modifications can of course be made by those of ordinary skill in the art to which the invention pertains without departing from the gist of the invention as claimed in the claims, and such modifications should not be understood separately from the technical idea or scope of the present invention.

FIG. 1 is a diagram showing the configuration of a rule-based parsing apparatus using statistical information according to an embodiment of the present invention.

FIG. 2 is a diagram showing the general form of PCFG (Probabilistic Context Free Grammar) rules in a rule-based parser.

Claims

A rule-based parsing method using statistical information, comprising:

performing parsing by applying syntax rules to an input sentence, using a rule-based parsing module;

calculating a rule weight for each rule applied to the input sentence, using a rule probability given to the rule and a lexical dependency weight calculated based on lexical statistics information, using the rule-based parsing module;

calculating the weight of each syntax tree by multiplying the rule weights calculated for the rules used in that syntax tree, and selecting the syntax tree having the highest weight, using the rule-based parsing module; and

outputting the selected syntax tree using a syntax tree output module,

wherein, in the step of calculating the rule weight, the rule weight is calculated using the rule probability, the lexical dependency weight, and a subcategorization type weight, and

the subcategorization type weight is 1 when the head word of the chart is not a verb, and is P(h_c | h_t, h_l) / P(h_c | h_t) when the head word is a verb, where h_c is the subcategorization type of the chart head word, h_t is the part of speech of the chart head word, and h_l is the base form of the chart head word.

(Deleted)

The method of claim 1,

wherein the lexical dependency weight is calculated as the product of the lexical dependency weights between the chart head word and the head words of the other children, and the lexical dependency weight between the chart head word and the head word of another child is calculated as P(c_it, c_il | h_t, h_l, h_c, dist_i) / P(c_it, c_il | h_t, h_c, dist_i), where c_it is the part of speech of the head word of the i-th child, c_il is the base form of the head word of the i-th child, and dist_i is the distance attribute between the chart head word and the head word of the i-th child.

The method of claim 4,

wherein the distance attribute between the chart head word and the head word of the i-th child is determined by considering whether any word is present between the chart head word and the head word of the i-th child, whether a comma is included, whether a verb is present, and whether a preposition is present.

The method of claim 4,

Wherein $P(c_{it}, c_{il} \mid h_t, h_l, h_c, \mathrm{dist}_i)$ is smoothed as $\lambda_1 \cdot P(c_{it}, c_{il} \mid h_t, h_l, h_c, \mathrm{dist}_i) + (1 - \lambda_1) \cdot P(c_{it}, c_{il} \mid h_t, h_c, \mathrm{dist}_i)$, with $\lambda_1 = C(h_t, h_l, h_c, \mathrm{dist}_i) / (C(h_t, h_l, h_c, \mathrm{dist}_i) + u)$, where $C(x)$ is the frequency of $x$ and $u$ is a constant greater than 0.
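The interpolation in this claim can be sketched as a single helper. The function name and the default value `u=1.0` are illustrative assumptions; the claim only requires that u be a constant greater than 0.

```python
def smoothed_probability(p_specific, p_backoff, context_count, u=1.0):
    """Interpolate a specific estimate with its backoff estimate.

    lambda = C(context) / (C(context) + u), so a frequently observed
    conditioning context trusts the specific estimate, while a rare or
    unseen context falls back to the less specific one.
    """
    if u <= 0:
        raise ValueError("u must be a constant greater than 0")
    lam = context_count / (context_count + u)
    return lam * p_specific + (1.0 - lam) * p_backoff
```

When the context (h_t, h_l, h_c, dist_i) was never observed, the count is 0, lambda is 0, and the result equals the backoff estimate, so the lexical dependency weight ratio becomes 1.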

The method according to claim 6,

Wherein $P(c_{it}, c_{il} \mid h_t, h_c, \mathrm{dist}_i)$ is smoothed as $\lambda_2 \cdot P(c_{it}, c_{il} \mid h_t, h_c, \mathrm{dist}_i) + (1 - \lambda_2) \cdot P(c_{it}, c_{il} \mid h_t, \mathrm{dist}_i)$, with $\lambda_2 = C(h_t, h_c, \mathrm{dist}_i) / (C(h_t, h_c, \mathrm{dist}_i) + u)$, where $C(x)$ is the frequency of $x$ and $u$ is a constant greater than 0.
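Read together with claim 6, the two smoothing steps form a backoff chain. The sketch below assumes that the already smoothed middle-level estimate is used as the backoff term of the first interpolation, which is one reasonable reading of the dependency between the two claims. All names are hypothetical.

```python
def interpolate(p_specific, p_backoff, count, u):
    # lambda = C(context) / (C(context) + u)
    lam = count / (count + u)
    return lam * p_specific + (1.0 - lam) * p_backoff

def backed_off_estimates(p_full, p_mid, p_base, count_full, count_mid, u=1.0):
    """Return (smoothed_full, smoothed_mid).

    p_full: raw P(c_t, c_l | h_t, h_l, h_c, dist)
    p_mid:  raw P(c_t, c_l | h_t, h_c, dist)
    p_base: raw P(c_t, c_l | h_t, dist)
    count_full: C(h_t, h_l, h_c, dist); count_mid: C(h_t, h_c, dist)
    """
    # Second-level smoothing (lambda_2): back off from (h_t, h_c, dist)
    # to (h_t, dist).
    smoothed_mid = interpolate(p_mid, p_base, count_mid, u)
    # First-level smoothing (lambda_1): back off from the fully
    # lexicalized context to the smoothed middle-level estimate.
    smoothed_full = interpolate(p_full, smoothed_mid, count_full, u)
    return smoothed_full, smoothed_mid
```

The ratio `smoothed_full / smoothed_mid` then gives the lexical dependency weight for one child, provided the denominator is nonzero.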

A lexical statistics information DB;

A rule-based parsing module configured to select an optimal syntax tree by parsing by applying a syntax rule to an input sentence;

A rule weight calculation module configured to calculate a rule weight for each rule applied to the input sentence, using the rule probability given to that rule and a lexical dependency weight calculated based on the lexical statistics information DB, and to provide the rule weight to the rule-based parsing module; and

A syntax tree output module for outputting the optimal syntax tree selected by the rule-based parsing module,

The rule-based parsing module calculates the weight of each syntax tree by multiplying the rule weights calculated for the rules used in each syntax tree, and selects the syntax tree having the highest weight as the optimal syntax tree,

Wherein the rule weight calculation module calculates the rule weight using a subcategorization type weight in addition to the rule probability and the lexical dependency weight,

(Deleted)

The apparatus of claim 8, wherein the rule weight calculation module calculates the lexical dependency weight as the product of the lexical dependency weights between the chart center word and each of the other child center words, and the lexical dependency weight between the chart center word and the i-th child center word is $P(c_{it}, c_{il} \mid h_t, h_l, h_c, \mathrm{dist}_i) / P(c_{it}, c_{il} \mid h_t, h_c, \mathrm{dist}_i)$, where $c_{it}$ is the part of speech of the center word of the i-th child, $c_{il}$ is the base form of the center word of the i-th child, and $\mathrm{dist}_i$ is the distance property between the chart center word and the i-th child center word.

The apparatus of claim 11, wherein the rule weight calculation module

Determines the distance property between the chart center word and the i-th child center word by considering whether any word exists between the chart center word and the i-th child center word, whether a comma is present, whether a verb is present, and whether a preposition is present.

The apparatus of claim 11, wherein the rule weight calculation module

Smooths $P(c_{it}, c_{il} \mid h_t, h_l, h_c, \mathrm{dist}_i)$ as $\lambda_1 \cdot P(c_{it}, c_{il} \mid h_t, h_l, h_c, \mathrm{dist}_i) + (1 - \lambda_1) \cdot P(c_{it}, c_{il} \mid h_t, h_c, \mathrm{dist}_i)$, with $\lambda_1 = C(h_t, h_l, h_c, \mathrm{dist}_i) / (C(h_t, h_l, h_c, \mathrm{dist}_i) + u)$, where $C(x)$ is the frequency of $x$ and $u$ is a constant greater than 0.

The apparatus of claim 13, wherein the rule weight calculation module smooths $P(c_{it}, c_{il} \mid h_t, h_c, \mathrm{dist}_i)$ as $\lambda_2 \cdot P(c_{it}, c_{il} \mid h_t, h_c, \mathrm{dist}_i) + (1 - \lambda_2) \cdot P(c_{it}, c_{il} \mid h_t, \mathrm{dist}_i)$, with $\lambda_2 = C(h_t, h_c, \mathrm{dist}_i) / (C(h_t, h_c, \mathrm{dist}_i) + u)$, where $C(x)$ is the frequency of $x$ and $u$ is a constant greater than 0.