KR100641053B1

KR100641053B1 - Apparatus and method for restoration of ellipsis statement constituent

Info

Publication number: KR100641053B1
Application number: KR1020050093880A
Authority: KR
Inventors: 임수종; 이창기; 장명길
Original assignee: 한국전자통신연구원
Priority date: 2005-10-06
Filing date: 2005-10-06
Publication date: 2006-11-02

Abstract

A device and a method for restoring an omitted sentence component are provided. They prevent errors caused by the ellipsis of sentence components, provide more accurate sentence structure analysis information, and recognize and restore the omitted components of a Korean sentence by combining rules with statistical information. A sentence structure analyzer (10) analyzes the structure of an input sentence based on a predefined grammar. An ellipsis candidate recognizer (20) identifies the essential components of each predicate (inflecting word) appearing in the analyzed sentence and, if a sentence component is judged to be omitted, detects restoration candidates for it within the input sentence. An ellipsis restorer (30) restores the omitted component from the detected restoration candidates by using predefined rule and statistical information (32).

Description

Apparatus and method for restoration of ellipsis statement constituent

FIG. 1 is a diagram showing a sentence component restoration apparatus according to the present invention.

FIG. 2 is a flowchart showing a sentence component restoration method according to the present invention.

FIG. 3 is a flowchart showing the sentence structure analysis method of FIG. 2 in detail.

FIG. 4 is a flowchart showing the omitted-component candidate setting method of FIG. 2 in detail.

FIG. 5 is a flowchart showing the omitted-component restoration method of FIG. 2 in detail.

* Explanation of reference numerals for the main parts of the drawings *

10: sentence structure analysis unit    20: omission candidate recognition unit

22: intransitive/transitive verb dictionary    30: omitted component restoration unit

32: rule/statistical information

The present invention relates to an apparatus and method for restoring omitted sentence components, and more particularly to an apparatus and method that use rules and statistical information to recognize and restore sentence components omitted in Korean.

In natural language processing, ellipsis generally arises to avoid repetition, so an omitted element can usually be restored by inference from the same sentence or from a preceding sentence.

In English, ellipsis typically takes the form of an omitted verb phrase, as in the following sentence.

"Helen saw the movie and Mary did too."

In Korean, however, omission of a verb phrase is extremely rare; what is usually omitted is a noun phrase serving as a subject or object. Typical cases are omission of a subject within a single sentence and omission of the headword in special documents such as encyclopedias.

[Example 1]

When a subject is omitted within the same sentence:

“학자 프라사스타파다가 <구의법강요(句義法綱要)>를 저술하여 (학자 프라사스타파다가) 이 파의 학설을 정립하였다” ("The scholar Prasastapada wrote <구의법강요(句義法綱要)> and (the scholar Prasastapada) established the doctrine of this school.")

Here the second “학자 프라사스타파다가” ("the scholar Prasastapada", with the subject particle) is omitted from the actual sentence.

This phenomenon also arises in question answering systems, which return only the exact answer to a question.

Question answering systems use various techniques to produce an answer. One of them uses the predicate-argument relations of the question and of the answer sentence, as follows.

[Example 2]

When the headword is omitted in a special document such as an encyclopedia:

“Who received the Nobel Peace Prize in 2000?”

receive(subj: person, obj: Nobel Peace Prize, adj: 2000)

Title: 김대중 (Kim Dae-jung)

“… received the 2000 Nobel Peace Prize for merit.”

receive(subj: Kim Dae-jung, obj: Nobel Peace Prize, adv: 2000, merit) -- ①

receive(subj: NULL, obj: Nobel Peace Prize, adj: 2000, merit) -- ②

In a Korean encyclopedia, where such omission is frequent, every occurrence of the headword is dropped, so the intended sentence (①) appears in the text as the elided sentence (②). If predicate-argument relations are built from the text as written, they take the form of ②, and the desired answer cannot be found.
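The effect of the missing headword on predicate-argument matching can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the function name `find_answer`, the dictionary layout, and the `time` role (standing in for the adj/adv slot above) are assumptions made for the example.

```python
from typing import Optional

# A predicate-argument structure as a plain dict (hypothetical layout).
Frame = dict[str, Optional[str]]


def find_answer(question: Frame, sentence: Frame, wanted: str = "subj") -> Optional[str]:
    """Return the sentence's filler for `wanted` if all other question roles match."""
    for role, value in question.items():
        if role == wanted:
            continue  # the role being asked about is allowed to differ
        if sentence.get(role) != value:
            return None  # the predicate or some argument disagrees
    return sentence.get(wanted)  # None when that argument was elided


question = {"pred": "receive", "subj": "person", "obj": "Nobel Peace Prize", "time": "2000"}
restored = {"pred": "receive", "subj": "Kim Dae-jung", "obj": "Nobel Peace Prize", "time": "2000"}
elided = {**restored, "subj": None}  # structure ②: the headword was dropped

print(find_answer(question, restored))
print(find_answer(question, elided))
```

With the restored structure the subject can be returned as the answer; with the elided structure the other arguments still match, but there is no subject to return.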

Conventional restoration of omitted sentence components mostly applies only hand-written rules, which inevitably limits the forms of sentence that can be restored. Other approaches use statistical information, but because they rely on statistics alone and do not exploit the advantages of rules, their performance suffers.

The present invention has been devised to solve these problems. Its object is to provide a sentence component restoration apparatus and method that prevent errors caused by the omission of sentence components, which occurs frequently in Korean, and that provide more accurate sentence structure analysis information.

Another object of the present invention is to provide a sentence component restoration apparatus and method that recognize and restore sentence components omitted from Korean sentences by appropriately combining rules and statistical information.

To achieve these objects, a sentence component restoration apparatus according to the present invention comprises: a sentence structure analysis unit that analyzes the structure of an input sentence based on a predefined grammar; an omission candidate recognition unit that identifies the essential components of each predicate appearing in the analyzed sentence and, when a sentence component is judged to be omitted, detects restoration candidates for the omitted component within the input sentence; and an omitted component restoration unit that restores the omitted component from the detected restoration candidates using predefined rule information and statistical information.

Preferably, when the essential component of a predicate is identified as an object, it is set as an omission candidate for that predicate only if the predicate is judged to be a transitive verb by consulting an intransitive/transitive verb dictionary.

Preferably, the sentence structure consists of sentence components, each being one of a subject, an object, and a complement; clause types, each being one of a noun clause and an adjective clause; and linking components that indicate the connection between sentences, between clauses, or between a sentence and a clause.

Preferably, the candidate used in the omission candidate recognition unit is at least one of the sentence components.

To achieve these objects, a sentence component restoration method according to the present invention comprises: a first step of analyzing the structure of a sentence when a sentence containing an omitted component is input; a second step of receiving the structure-analyzed sentence, setting the essential components of each predicate, and generating restoration candidates for each predicate appearing in the sentence; and a third step of completing a sentence in which the omitted component is restored by determining the actual restoration candidate from among the per-predicate omitted-component candidates, using rule-based case frame information and statistical information.

Preferably, the first step comprises: when a sentence containing an omitted sentence component is input, sequentially performing at least one of part-of-speech tagging, named entity recognition, and word sense tagging on the input sentence; and analyzing the structure of the sentence, in terms of sentence components, clause types, and linking components, by checking the processed sentence against a conventionally defined grammar.

Preferably, the second step comprises: when the structure-analyzed sentence is input, identifying the essential components of each predicate appearing in the sentence; setting all subjects of predicates appearing in the same sentence as subject restoration candidates for any predicate that has no subject; and setting all objects of predicates appearing in the same sentence as object restoration candidates for any predicate that has no object.

Preferably, an object restoration candidate is set for a predicate only when that predicate is judged to be a transitive verb by consulting a pre-stored intransitive/transitive verb dictionary.

Preferably, the third step comprises: when the generated per-predicate omitted-component candidates are input, determining whether each candidate is a subject or an object; if the candidate is a subject, using the rule-based semantic information of the candidate and, if a sentence component having that semantic information is a subject omission candidate for a particular predicate, restoring the candidate as the subject of the elided sentence; if the candidate is a subject whose restoration cannot be decided by the rule-based semantic information, or if the candidate is an object, matching against predefined rule-based case frames and detecting the matched case frames; restoring the component when the detection yields a case frame that matches except for the omitted subject or object; and re-verifying the sentence component already selected for restoration by using numerical statistics calculated by applying predefined statistical information to the restored sentence.

Preferably, the statistical information expresses each of three cases (the omitted component is not restored, it is restored as a subject, or it is restored as an object) as a numerical score between 0 and 1.

Preferably, the verifying step comprises: determining whether the restoration obtained through sentence structure analysis and case frame matching agrees with the result of verification by the statistical information; if they agree, keeping the sentence restored through sentence structure analysis and case frame matching as it is; and if they do not agree, replacing the restored candidate with the candidate indicated by the statistical information and restoring accordingly.

Preferably, the verifying step comprises: having the user define in advance, for verification by the statistical information, a threshold for the value calculated by the statistical information in each of three cases, namely 1) no restoration, 2) restoration as a subject, and 3) restoration as an object; if the numerical score calculated for the restored candidate does not exceed the threshold, selecting the candidate recommended by the statistical information as the final restoration target and restoring it; and if the numerical score calculated for the restored candidate does not fall at or below the threshold, keeping the previously restored candidate as it is.

Preferably, the statistical information uses one or more of lexical features, part-of-speech (POS) features, semantic features, structural analysis features, and composite features.

Other objects, features, and advantages of the present invention will become apparent from the detailed description of the embodiments with reference to the accompanying drawings.

A preferred embodiment of the sentence component restoration apparatus and method according to the present invention is described below with reference to the accompanying drawings.

FIG. 1 is a diagram showing a sentence component restoration apparatus according to the present invention.

As shown in FIG. 1, the apparatus comprises: a sentence structure analysis unit (10) that analyzes the structure of an input sentence based on a predefined grammar; an omission candidate recognition unit (20) that identifies the essential components of each predicate appearing in the analyzed sentence and, when a sentence component is judged to be omitted, detects restoration candidates for the omitted component within the input sentence; and an omitted component restoration unit (30) that restores the omitted component from the detected restoration candidates using predefined rule information (31) and statistical information (32).

The sentence structure analysis unit (10) analyzes the structure of the input sentence through at least one of morphological analysis, named entity recognition, and word sense tagging.

The sentence structure consists of sentence components such as subject, object, and complement; clause types such as noun clause and adjective clause; and linking components that indicate the connection between sentences, between clauses, or between a sentence and a clause. The candidates used in the omission candidate recognition unit (20) are omitted sentence components, that is, at least one of an omitted subject, object, complement, and the like.

The omission candidate recognition unit (20) treats the subject and the object as the essential components of each predicate. When the essential component is an object, it is set as an omission candidate for the predicate only if the predicate is judged to be a transitive verb by consulting the intransitive/transitive verb dictionary.

A method using the sentence component restoration apparatus configured as above is described in detail below with reference to the accompanying drawings.

FIG. 2 is a flowchart showing a sentence component restoration method according to the present invention. FIG. 3 is a flowchart showing the sentence structure analysis method of FIG. 2 in detail, FIG. 4 is a flowchart showing the omitted-component candidate setting method of FIG. 2 in detail, and FIG. 5 is a flowchart showing the omitted-component restoration method of FIG. 2 in detail.

As shown in FIG. 2, in the first step, when a sentence (11) containing an omitted component is input, the structure of the sentence is analyzed (S100).

In more detail, referring to FIG. 3, when the user inputs a sentence containing an omitted sentence component (S110), part-of-speech tagging (S120), named entity recognition (S130), and word sense tagging (S140) are performed on the input sentence in sequence.

The sentence that has passed through all of these processes is then checked against a conventionally defined grammar to analyze its structure in terms of sentence components, clause types, and linking components (S150).

The sentence components are subject, object, complement, and so on, and the clause types are noun clause, adjective clause, and so on. The linking components indicate the connection between sentences, between clauses, or between a sentence and a clause.

An example of an input to the sentence structure analysis and the result obtained is as follows.

[Embodiment 1]

“학자 프라사스타파다가 <구의법강요(句義法綱要)>를 저술하여 이 파의 학설을 정립하였다” ("The scholar Prasastapada wrote <구의법강요(句義法綱要)> and established the doctrine of this school.")

정립하다 (establish)

<obj : 이 파의 학설 (the doctrine of this school) : 를>

저술하다 (write)

<obj : <구의법강요(句義法綱要)> : 를>

In the second step, the structure-analyzed sentence is received, the essential components of each predicate are set, and restoration candidates are generated for each predicate appearing in the sentence. For simplicity of explanation, the essential components of a predicate are limited here to the subject and the object (S200).

In more detail, referring to FIG. 4, when the sentence analyzed in the first step is input (S210), the essential components of each predicate appearing in the sentence are identified from the analysis (S220).

Because this specification limits the essential components to the subject and the object, subject restoration candidates are generated first for each predicate appearing in the sentence (S230). All subjects of predicates appearing in the same sentence are set as subject candidates for any predicate that has no subject.

Next, object restoration candidates are generated for each predicate appearing in the sentence (S240). Unlike a subject, an object is taken only by a transitive verb, so the pre-stored intransitive/transitive verb dictionary is first consulted to determine whether the predicate is intransitive or transitive. Sentence components used as objects of predicates in the same sentence are set as object candidates for a predicate only when that predicate is judged to be transitive.
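The candidate generation of steps S230 and S240 can be sketched as below. This is an illustrative sketch under stated assumptions: the `Predicate` class, the function `generate_candidates`, and the representation of the intransitive/transitive verb dictionary as a plain set of transitive lemmas are all hypothetical, and the parser output is written by hand.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Predicate:
    lemma: str
    subj: Optional[str] = None  # None means the parser found no subject
    obj: Optional[str] = None   # None means the parser found no object


def generate_candidates(
    predicates: list[Predicate], transitive: set[str]
) -> dict[int, dict[str, list[str]]]:
    """Map predicate index -> role -> restoration candidates (S230, S240)."""
    subjects = [p.subj for p in predicates if p.subj is not None]
    objects = [p.obj for p in predicates if p.obj is not None]
    result: dict[int, dict[str, list[str]]] = {}
    for i, p in enumerate(predicates):
        roles: dict[str, list[str]] = {}
        if p.subj is None and subjects:
            roles["subj"] = list(subjects)
        # Objects are proposed only for predicates listed as transitive.
        if p.obj is None and p.lemma in transitive and objects:
            roles["obj"] = list(objects)
        if roles:
            result[i] = roles
    return result


# Hand-written analysis of the sentence in Example 1 (assumed parser output).
sentence = [
    Predicate("저술하다", subj="학자 프라사스타파다", obj="구의법강요"),
    Predicate("정립하다", obj="이 파의 학설"),
]
candidates = generate_candidates(sentence, transitive={"저술하다", "정립하다"})
print(candidates)
```

Keying the result by predicate position rather than by lemma avoids collisions when the same verb occurs twice in one sentence.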

In the third and final step, the actual restoration candidate is determined from the per-predicate omitted-component candidates generated in the candidate-setting step, completing a sentence in which the omitted component is restored (S300).

In more detail, referring to FIG. 5, when the generated per-predicate omitted-component candidates are input (S310), it is first determined whether the candidate is a subject or an object (S320).

If the determination (S320) finds that the candidate is a subject, the rule-based semantic information of the omitted-component candidate is used (S330).

Certain predefined semantic classes tend to be restored in 50% or more of cases. Therefore, if a sentence component carrying such semantic information is judged to be a subject omission candidate for a particular predicate (S340), the candidate is restored as the subject of the elided sentence (S350).

If the candidate is a subject but the rule-based semantic information cannot decide whether to restore it (S340), or if the candidate is an object (S320), the rule-based case frame information is used (S360).

When the determination (S320) finds that the omitted-component candidate is an object, it is preferable to skip the rule-based semantic information and go directly to the rule-based case frame information, because for object candidates the rule-based semantic information almost never produces a match.
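The branching in steps S320 to S360 amounts to a small dispatch, sketched here. The function name `choose_resource`, the return labels, and the set `restorable_senses` (standing in for the predefined semantic classes that tend to be restored) are assumptions for illustration only.

```python
def choose_resource(role: str, sense: str, restorable_senses: set[str]) -> str:
    """Pick the rule resource that decides restoration for one candidate."""
    if role == "subj" and sense in restorable_senses:
        return "semantic_rule"  # S330 -> S340 -> S350
    # Objects skip the semantic rules; undecided subjects fall through (S360).
    return "case_frame"


print(choose_resource("subj", "사람", {"사람"}))
print(choose_resource("obj", "사람", {"사람"}))
```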

The rule-based case frame information has the form shown in Embodiment 2 below.

[Embodiment 2]

A=semantic code!case particle predicate!다 > example sentence

A=사람!가 B=인공적장소!로 가!다 > [그[A]가 바다[B]로 가다] (A=person!가 B=artificial place!로 go!다 > [He[A] goes to the sea[B]])
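A case frame written in this notation can be read into a simple structure as sketched below. The parser is a hypothetical illustration: the names `Slot`, `CaseFrame`, and `parse_case_frame` do not come from the source, and the sketch assumes the example sentence after ">" has already been stripped off.

```python
from typing import NamedTuple


class Slot(NamedTuple):
    label: str     # slot label, e.g. "A"
    sense: str     # semantic code, e.g. "사람" (person)
    particle: str  # case particle, e.g. "가"


class CaseFrame(NamedTuple):
    predicate: str
    slots: tuple[Slot, ...]


def parse_case_frame(rule: str) -> CaseFrame:
    """Parse 'A=sense!particle ... stem!ending' into a CaseFrame."""
    *slot_tokens, predicate_token = rule.split()
    slots = []
    for token in slot_tokens:
        label, rest = token.split("=", 1)
        sense, particle = rest.split("!", 1)
        slots.append(Slot(label, sense, particle))
    stem, ending = predicate_token.split("!", 1)
    return CaseFrame(stem + ending, tuple(slots))


frame = parse_case_frame("A=사람!가 B=인공적장소!로 가!다")
print(frame)
```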

A method using these rule-based case frames is illustrated in Table 1.

Table 1

| Stage | Content |
| --- | --- |
| Input | Headword: 알롱만 (Along Bay) [sense: 그곳 (place)]; sentence: "하이퐁 동쪽에 위치하며" ("located east of Haiphong, and") |
| Sentence structure analysis (parsing) | 위치하다 (be located)(subj:NULL, obj:NULL, ADV:동쪽:에); case frame: 방향!에 위치하다 (direction!에 be located) |
| Case frame matching | 24265-2 A=곳!가 B=곳!에서 C=방향!에; 24265-4 A=곳!가 B=방향!에; 24265-8 A=기상!가 B=방향!에; 24265-12 A=방향!에; 24265-17 A=부위!가 B=방향!에 (곳: place, 방향: direction, 기상: weather, 부위: part) |
| Restoration decision | Restore as subject |

As shown in Table 1, the method using the rule-based case frames proceeds as follows.

First, when the input "하이퐁 동쪽에 위치하며" ("located east of Haiphong, and") is received under the headword 알롱만 (Along Bay), the input sentence is analyzed by the sentence structure analysis of the first step (S362).

Using the result of this structural analysis, matching against the predefined rule-based case frames is performed (S364), and the matched case frames are detected (S366).

If this detection yields a case frame that matches except for the omitted subject or object, the component is restored (S368). That is, when a component is absent from the predicate in the sentence structure analysis but is registered in the detected case frame as an essential component, the detected rule-based case frame is consulted to decide whether to restore and which sentence component to restore. The elided sentence is then restored accordingly.
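One way to read the matching in S364 to S368 is sketched below using the frames listed in Table 1. The matching criterion is an assumption, since the source does not spell it out: a frame qualifies when it covers every observed argument and leaves exactly one slot unfilled, and that slot carries the subject particle 가 or the object particle 를 with the same semantic code as the candidate. The names `Slot`, `PARTICLE_ROLE`, and `restoration_role`, and the mapping of the headword sense to 곳, are likewise hypothetical.

```python
from typing import NamedTuple, Optional


class Slot(NamedTuple):
    sense: str     # semantic code, e.g. "방향" (direction)
    particle: str  # case particle, e.g. "에"


# Particles taken to mark the two restorable roles (assumption).
PARTICLE_ROLE = {"가": "subj", "를": "obj"}

# Case frames for 위치하다 as listed in Table 1.
FRAMES = {
    "24265-2": [Slot("곳", "가"), Slot("곳", "에서"), Slot("방향", "에")],
    "24265-4": [Slot("곳", "가"), Slot("방향", "에")],
    "24265-8": [Slot("기상", "가"), Slot("방향", "에")],
    "24265-12": [Slot("방향", "에")],
    "24265-17": [Slot("부위", "가"), Slot("방향", "에")],
}


def restoration_role(
    observed: list[Slot], frames: dict[str, list[Slot]], candidate_sense: str
) -> Optional[tuple[str, str]]:
    """Return (frame id, role) for the first frame that licenses restoration."""
    observed_set = set(observed)
    for frame_id, slots in frames.items():
        slot_set = set(slots)
        if not observed_set <= slot_set:
            continue  # the frame does not cover what the sentence contains
        missing = slot_set - observed_set
        if len(missing) != 1:
            continue  # nothing to restore, or too much is missing
        (slot,) = missing
        role = PARTICLE_ROLE.get(slot.particle)
        if role is not None and slot.sense == candidate_sense:
            return frame_id, role
    return None


observed = [Slot("방향", "에")]  # from ADV:동쪽:에 in the parse
print(restoration_role(observed, FRAMES, "곳"))
```

Under these assumptions only frame 24265-4 qualifies for a headword of sense 곳, which agrees with the "restore as subject" decision in Table 1.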

However, because rules have inherent limits, the rule-based semantic information and rule-based case frame information cannot decide restoration correctly for every omitted-component candidate.

Therefore, predefined statistical information is applied to the restored sentence (S370).

The statistical information is information defined by computing numerical statistics, from the input sentence structure, for three cases: first, the omitted component is not restored; second, it is restored as a subject; and third, it is restored as an object.

Each numerical statistic is expressed as a score between 0 and 1; the larger the score, the better the corresponding case fits the omitted sentence component candidate.

Accordingly, the case with the largest score is selected and applied together with the rule-based semantic information, so that the sentence component already selected for restoration is effectively verified once more against the rule and statistical information.

Applying the statistical information defined in this way to the restored sentence allows each omitted-component candidate to be restored more accurately.

It is then determined whether the restoration through sentence structure analysis (subject) (S350) and the restoration through case frame matching (subject, object) (S360) agree with the verification by the statistical information (S380).

If they agree (S380), the sentence restored through sentence structure analysis (subject) (S350) or case frame matching (subject, object) (S360) is kept as it is (S390).

If they do not agree (S380), the candidate restored through sentence structure analysis (subject) (S350) or case frame matching (subject, object) (S360) is replaced with the candidate indicated by the statistical information, and restoration is performed with that candidate (S400).

For verification by the statistical information, the user defines in advance a threshold for each of the three cases.

That is, each threshold sets the reference level for the score calculated by the statistical information for the corresponding case (no restoration, restoration as a subject, or restoration as an object); a case whose score is at or above its threshold becomes a candidate for the elided sentence.

Accordingly, if the candidate restored through sentence structure analysis (subject) (S350) and case frame matching (subject, object) (S360) fails to exceed the user-defined threshold when the statistical information is applied, the candidate determined by the rule-based semantic information and case frame matching is ignored, and the candidate recommended by the statistical information is selected as the final restoration target and restored (S400).

If the candidate restored through sentence structure analysis (subject) (S350) and case frame matching (subject, object) (S360) does not fall below the user-defined threshold when the statistical information is applied, the previously restored candidate is kept as is (S390).
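The threshold-based decision described above can be sketched as follows. The names are hypothetical, and two points are assumptions rather than statements of the description: a score equal to the threshold is treated as passing, and the candidate "recommended by the statistical information" is taken to be the highest-scoring case.

```python
from typing import Dict, Tuple


def reverify_with_threshold(
    rule_choice: str,
    scores: Dict[str, float],
    thresholds: Dict[str, float],
) -> Tuple[str, str]:
    """Keep the rule-based candidate only if its score reaches its threshold.

    scores and thresholds are keyed by case: "none", "subject", "object".
    """
    if scores[rule_choice] >= thresholds[rule_choice]:
        return rule_choice, "S390"  # not below the threshold: keep as restored
    # Assumption: the statistical recommendation is the highest-scoring case.
    recommended = max(scores, key=scores.get)
    return recommended, "S400"      # below the threshold: use the recommendation


# User-defined thresholds, one per case (values chosen only for illustration).
thresholds = {"none": 0.5, "subject": 0.5, "object": 0.5}

kept = reverify_with_threshold(
    "object", {"none": 0.2, "subject": 0.2, "object": 0.6}, thresholds
)
replaced = reverify_with_threshold(
    "object", {"none": 0.1, "subject": 0.7, "object": 0.2}, thresholds
)
```

Keeping a separate threshold per case lets the user tune how readily each kind of restoration is overridden by the statistical information.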

Through this process, for a sentence containing an omitted component, the presence of the omitted component and the omitted component candidates are recognized, whether each recognized candidate is actually restored is decided using rule and statistical information, the restoration is performed, and the restored sentence is output (S410).

The important features used for the statistical information in the present invention include lexical features, part-of-speech (POS) features, semantic features, structural analysis features, and compound features.
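As an illustration of how the five feature types might be represented, the following minimal sketch builds a binary feature map for one candidate and one predicate. The `Token` fields, the feature string format, and the particular combination used for the compound feature are all hypothetical; the description names the feature types but does not specify their encoding.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Token:
    word: str  # surface form (lexical information)
    pos: str   # part-of-speech tag
    sem: str   # semantic class
    role: str  # role assigned by sentence structure analysis


def extract_features(candidate: Token, predicate: Token) -> Dict[str, int]:
    """Return binary features for a (restoration candidate, predicate) pair."""
    feats = {
        f"lex={candidate.word}|{predicate.word}": 1,  # lexical feature
        f"pos={candidate.pos}|{predicate.pos}": 1,    # POS feature
        f"sem={candidate.sem}": 1,                    # semantic feature
        f"struct={candidate.role}": 1,                # structural analysis feature
    }
    # Compound feature: here, an assumed combination of semantic class and role.
    feats[f"compound={candidate.sem}+{candidate.role}"] = 1
    return feats


feats = extract_features(
    Token("Cheolsu", "NNP", "PERSON", "subject"),
    Token("eat", "VV", "ACTION", "predicate"),
)
```

A feature map of this kind could then be passed to whatever statistical model produces the scores between 0 and 1 for the three cases.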

As above, the best-mode embodiment of the present invention has been disclosed through the detailed description and the drawings. The terms are used only for the purpose of describing the present invention, and not to limit their meaning or to restrict the scope of the invention set forth in the claims. Therefore, a person of ordinary skill in the art will understand that various modifications and other equivalent embodiments are possible. Accordingly, the true scope of technical protection of the present invention should be determined by the technical spirit of the appended claims.

The sentence component restoring apparatus and method according to the present invention, as described above, have the following effects.

First, the performance of automatic translation, retrieval, information extraction, and question answering systems that rely on sentence structure analysis can be improved.

Second, the structure of a sentence containing an omission is analyzed so that the candidates that may have been omitted are first recognized accurately, and rules and statistics are used appropriately in deciding which of the omitted candidates to restore, so that the capability of sentence structure analysis can be improved.

Claims

A sentence component restoring apparatus comprising: a sentence structure analysis unit that analyzes the structure of an input sentence based on a predefined grammar;

an omission candidate recognition unit that checks the essential components of each predicate appearing in the analyzed sentence and, if a sentence component is determined to be omitted, detects restoration candidates for the component omitted from the input sentence;

and an omitted component restoration unit that restores the omitted component from among the detected restoration candidates by using predefined rule information and statistical information.

A method of restoring omitted sentence components, comprising: a first step of analyzing the structure of a sentence that includes an omitted sentence component;

a second step of checking the essential components of each predicate appearing in the analyzed sentence and, if a sentence component is determined to be omitted, detecting restoration candidates for the component omitted from the input sentence;

and a third step of determining the actual restoration candidate from among the omitted component candidates generated by the omission candidate setting, by using rule-based semantic information and statistical information, and completing a sentence in which the omitted component is restored.

The method of claim 2, wherein the first step comprises:

sequentially performing at least one of part-of-speech tagging, named entity recognition, and vocabulary tagging on the sentence including the omitted sentence component;

and analyzing the structure of the sentence that has undergone the at least one process by comparing it against the predefined grammar.

The method of claim 2, wherein the structure of the sentence comprises at least one of:

a sentence component consisting of any one of an object, a complement, a noun clause, and an adjective clause; and a connection component indicating a connection relationship between sentences, between clauses, or between a sentence and a clause.

The method of claim 2, wherein the second step comprises:

identifying the essential components of each predicate appearing in the structurally analyzed sentence;

setting, for a verb that lacks a subject among the identified essential components, the subjects of all verbs appearing in the sentence as restoration candidates for the subject of that verb;

and setting, for a verb that lacks an object among the identified essential components, the objects of all verbs appearing in the sentence as restoration candidates for the object of that verb.

The method of claim 5,

wherein a restoration candidate for the object of a verb is set only when the verb is determined to be a transitive verb by using a pre-stored intransitive/transitive verb dictionary.

The method of claim 2, wherein the third step comprises:

a determination step of determining, when the generated candidates for each omitted component are input, whether the restoration candidate for the omitted component is a subject or an object;

a first restoration step of, if the candidate is determined to be a subject, using the rule-based semantic information of the omitted component candidate to restore the candidate into the omitted sentence when a sentence component having that semantic information is a candidate component of a specific verb;

a second restoration step of, if the candidate is determined to be a subject but its restoration is not decided by the rule-based semantic information, or if the candidate is determined to be an object, performing matching against predefined case frames, and detecting and restoring the case frame that matches;

and a re-verification step of re-verifying the sentence component already selected for restoration by using a numerical statistic calculated by applying predefined statistical information to the restored sentence.

The method of claim 7,

wherein, in the second restoration step, a case frame that is identical except for the omitted subject or object is detected.

The method of claim 7,

wherein the statistical information expresses each of the cases in which the omitted sentence component is not restored, is restored as a subject, and is restored as an object as a numerical value (score) between 0 and 1.

The method of claim 7, wherein the re-verification step comprises:

determining whether the restoration through sentence structure analysis and the restoration through case frame matching agree with the verification through the statistical information;

if they agree, keeping the sentence restored through sentence structure analysis and case frame matching as is;

and, if they do not agree, replacing the candidate restored through sentence structure analysis and case frame matching with the candidate defined by the statistical information and restoring the sentence component.

The method of claim 7, wherein the re-verification step comprises:

predefining, arbitrarily by the user and for verification through the statistical information, a threshold for the values calculated by applying the statistical information to each of the cases of 1) no restoration, 2) restoration as a subject, and 3) restoration as an object;

if the numerical statistic calculated for the restored candidate by applying the statistical information does not exceed the threshold, selecting and restoring the candidate recommended by the statistical information as the final restoration target;

and, if the numerical value calculated for the restored candidate by applying the statistical information does not fall below the threshold, keeping the previously restored candidate as is.

The method of claim 7,

wherein the statistical information uses at least one of a lexical feature, a part-of-speech (POS) feature, a semantic feature, a structural analysis feature, and a compound feature.