KR101409413B1

KR101409413B1 - Method for natural language processing using unification grammar

Info

Publication number: KR101409413B1
Application number: KR1020120079435A
Authority: KR
Inventors: 김한우; 설용수
Original assignee: 한양대학교 에리카산학협력단
Priority date: 2012-07-20
Filing date: 2012-07-20
Publication date: 2014-06-20
Also published as: KR20140012469A

Abstract

A natural language processing method using unification grammar is disclosed. According to one embodiment, a method of processing natural language receives natural language text; preprocesses the input text to remove stop words, inflectional affixes, and unregistered words; and generates a structured feature tree by associating features, layered according to predefined unification rules, with their feature values. In the feature tree, the preprocessed natural language text is set as the root node; phrase features are created as child nodes of the root node and assigned corresponding phrase feature values; and morpheme features, a morpheme being the smallest meaningful unit of language, are created as child nodes of the phrase features and assigned corresponding morpheme feature values. The morpheme features are thereby structured into phrases, and the phrase features into a sentence.

Description

[0001] Method for natural language processing using unification grammar

The present invention relates to natural language processing technology, and more particularly to a method of processing natural language text using unification grammar in a dialogue system capable of recognizing natural language and responding to it appropriately, and to a recording medium on which the method is recorded.

Natural language processing refers to artificial intelligence technology that uses a computer to understand human language, or to generate and analyze it. Natural language understanding converts everyday language, through morphological analysis, semantic analysis, dialogue analysis, and the like, into a form that a computer can process. Natural language generation produces text, speech, graphics, and the like from the computer's results in a form convenient for humans. Natural language processing can be used in many computer-based applications, such as document processing, indexing, language translation, and question answering. Recently, dialogue systems that recognize a user's natural language commands and provide appropriate responses have appeared even in mobile terminals and home appliances. The non-patent literature cited below gives an overview of a dialogue management system for such spoken natural language processing.

However, the language people use every day does not fully conform to standard grammar and is varied in many ways depending on the situation. Computer-based natural language processing is therefore constrained in some application areas, and at the level of practical use in particular it can fail to grasp meaning or context accurately. For a dialogue system to provide an appropriate response to the user, it must above all be able to analyze and recognize the user's initial command or utterance accurately.

Non-patent literature: "음성 자연어 처리를 위한 대화 관리 시스템" (A dialogue management system for spoken natural language processing), 정민우, 은지현, 이청재, 정상근, 이근배, 정보과학회논문지, Vol. 24, No. 1 (serial No. 200), January 2006, pp. 19-26, 한국정보과학회, 2006.

The technical problem addressed by the embodiments of the present invention is as follows. In a conventional natural language dialogue system, the information obtainable from natural language text stays at the level of word-level semantic information, part-of-speech information, and syntactic information, which is not enough to cope with ambiguity. The embodiments aim to overcome this limitation, to resolve the lexical, boundary, morphological, and structural ambiguity that continually arises as a result, and to solve the problem that an inadequate response to such ambiguity lowers the dialogue accuracy of the dialogue system as a whole.

To solve the above technical problem, a method of processing natural language using at least one processor according to an embodiment of the present invention comprises: receiving natural language text; preprocessing the input natural language text to remove stop words, inflectional affixes, and unregistered words; and generating a structured feature tree by associating features, layered according to predefined unification rules, with their feature values. In the feature tree, the preprocessed natural language text is set as the root node; phrase features are created as child nodes of the root node and assigned corresponding phrase feature values; and morpheme features, a morpheme being the smallest meaningful unit of language, are created as child nodes of the phrase features and assigned corresponding morpheme feature values, whereby the morpheme features are structured into phrases and the phrase features are structured into a sentence.

The method according to an embodiment may further comprise, when the preprocessed natural language text contains an agglutinated form in which a plurality of morphemes are combined into one, restoring the agglutinated form to its base forms prior to combination.

The method according to an embodiment may further comprise dividing the preprocessed natural language text into morphemes using the morpheme classification stored in a morpheme feature database, and attaching a part of speech to each resulting morpheme.

The method according to an embodiment may further comprise dividing the sentences of the preprocessed natural language text into phrases using the phrase tags and phrase rules stored in a phrase feature database.

In the method according to an embodiment, each morpheme feature is one of: a noun (N) feature having LEX (morpheme), POS (part of speech), and SEM (semantic category); a verb (V) feature having LEX, POS, SEM, and SUBCAT (subcategory); an adjective (ADJ) feature or adverb (ADV) feature having LEX, POS, SEM, and QUALIFIER (modification category); a particle (PARTICLE) feature having LEX, POS, and CASE (case category); an ending (END) feature having LEX, POS, and TYPE (ending category); and a pre-final ending (PEND) feature having LEX, POS, and MOD (mood category). Each phrase feature is one of: a sentence (S) feature having SUBJ (subject), OBJ (object), and PRED (predicate); a subject (SUBJ) feature or object (OBJ) feature having HEAD (head word), CASE (case category), and COMP (complement); a compound noun (NN) feature or adnominal phrase (ADJP) feature having HEAD and SEM; and a predicate (PRED) feature having HEAD, COMP, and CONJ (ending category).
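The feature inventory in the preceding paragraph can be written down as a small schema table. The following Python sketch is illustrative only: the attribute names come from the text, while the dictionary layout, the helper `make_feature`, and the sample values (such as the SEM value "place") are assumptions made for this example.

```python
# Attribute names per feature type, transcribed from the paragraph above.
MORPHEME_FEATURES = {
    "N": ("LEX", "POS", "SEM"),
    "V": ("LEX", "POS", "SEM", "SUBCAT"),
    "ADJ": ("LEX", "POS", "SEM", "QUALIFIER"),
    "ADV": ("LEX", "POS", "SEM", "QUALIFIER"),
    "PARTICLE": ("LEX", "POS", "CASE"),
    "END": ("LEX", "POS", "TYPE"),
    "PEND": ("LEX", "POS", "MOD"),
}
PHRASE_FEATURES = {
    "S": ("SUBJ", "OBJ", "PRED"),
    "SUBJ": ("HEAD", "CASE", "COMP"),
    "OBJ": ("HEAD", "CASE", "COMP"),
    "NN": ("HEAD", "SEM"),
    "ADJP": ("HEAD", "SEM"),
    "PRED": ("HEAD", "COMP", "CONJ"),
}


def make_feature(kind, **values):
    """Build a feature as a dict (hypothetical helper, not from the patent).

    Attributes that the schema does not list for this feature type are rejected;
    attributes that are not supplied stay None, meaning "unspecified".
    """
    schema = MORPHEME_FEATURES.get(kind) or PHRASE_FEATURES.get(kind)
    if schema is None:
        raise KeyError(f"unknown feature type: {kind}")
    unknown = set(values) - set(schema)
    if unknown:
        raise ValueError(f"{kind} has no attribute(s): {sorted(unknown)}")
    return {"type": kind, **{name: values.get(name) for name in schema}}


# Sample values are assumptions for illustration.
noun = make_feature("N", LEX="학교", POS="noun", SEM="place")
```

Keeping the schema in data rather than in code makes it straightforward to swap in a different feature dictionary for a different application domain.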

The present invention also provides a computer-readable recording medium on which a program for executing the above natural language processing method on a computer is recorded.

By generating a structured feature tree that associates features layered according to unification rules with their feature values, the embodiments of the present invention can obtain hierarchically structured morpheme, phrase, and feature information from natural language text in an integrated manner. Because word-level semantic information, part-of-speech information, and syntactic information are then considered together, the ambiguity problem of dialogue systems based on conventional natural language processing can be resolved. As a result, the user's language is interpreted accurately and appropriately, which improves the dialogue accuracy of the dialogue system as a whole.

FIG. 1 is a block diagram of a dialogue system using natural language processing in which embodiments of the present invention can be used.
FIG. 2 is a flowchart of a natural language processing method using unification grammar according to an embodiment of the present invention.
FIG. 3 is a block diagram of a dialogue system that includes a natural language processing apparatus using unification grammar according to an embodiment of the present invention.
FIG. 4 illustrates the morpheme features adopted by embodiments of the present invention.
FIG. 5 illustrates the process of analyzing a sentence into morphemes and attaching parts of speech in a natural language processing method according to an embodiment of the present invention.
FIG. 6 illustrates the phrase features adopted by embodiments of the present invention.
FIG. 7 illustrates the process of analyzing a sentence into phrases in a natural language processing method according to an embodiment of the present invention.
FIG. 8 illustrates a feature tree structured by unification rules in a natural language processing method according to an embodiment of the present invention.

Before the embodiments of the present invention are described, the characteristics of natural language processing and of dialogue processing based on it, and the problems that follow from them, are briefly introduced. The technical means adopted by the embodiments to solve these problems are then presented in order.

In the field of human-computer interaction, artificial intelligence technology, including natural language processing, has long been a research topic of interest to scholars. Recently, dialogue systems have come into practical use that, when a person converses in language with a machine such as a computer or robot, understand the person's utterances, identify the intent, and perform an appropriate action such as a response. Such dialogue systems can be used in voice dialogue agents for smartphones, voice dialogue agents for cars, voice dialogue robots, home appliances capable of voice dialogue, text dialogue computer applications, and the like. The general structure of such a dialogue system is as follows.

FIG. 1 is a block diagram of a dialogue system 100 using natural language processing in which embodiments of the present invention can be used. The system may broadly include a natural language processing unit 10, a dialogue management unit 30, and a natural language generation unit 50.

The natural language processing unit 10 receives dialogue input from the user and understands it. To this end, the natural language processing unit 10 analyzes the input natural language by referring to a dictionary that stores the various words making up human language and to a database 20 of grammar rules and constraints, and finds the meaning of the input.

The dialogue management unit 30 decides what response to give to the language whose meaning has been identified by the natural language processing unit 10. To this end, it may refer to response templates, which suggest appropriate ways of responding according to the situation in a human conversation, and to a domain knowledge database 40 that can be used in a specific field. From these, the dialogue management unit 30 finds the basic elements that make up an appropriate response to the user's natural language input.

The natural language generation unit 50 refers to language information 60 that makes up human language, generates a natural language response based on the basic response elements produced by the dialogue management unit 30, and provides it to the user.

This sequence of operations shows that the most important role is that of the natural language processing unit 10, which is the first to receive the natural language after the user speaks and which determines its meaning. In conventional natural language dialogue systems in particular, various processes for determining the meaning of raw input text (for example, morphological analysis, part-of-speech tagging, syntactic analysis, and semantic analysis) were performed selectively to extract language information from the input text, and that information was used as the basis for dialogue management.

A recognized problem with such conventional natural language dialogue systems is that, because the information obtainable from natural language text stays at the level of word-level semantic information, part-of-speech information, or syntactic information, they cannot resolve the lexical, boundary, morphological, and structural ambiguity that continually arises in natural language processing. If the natural language processing unit 10 fails to resolve the ambiguity of the input text at the outset, the results of the subsequent processing by the dialogue management unit 30 and the natural language generation unit 50 inevitably become inappropriate. A processing failure in the natural language processing unit 10 is thus a cause of lower dialogue accuracy for the dialogue system as a whole.

The embodiments of the present invention described below therefore propose a method in which a dialogue system that takes natural language text as its input medium processes the input using features (morpheme features, phrase features, and sentence features) as the basic unit, structures the language by unifying these features according to the combination rules that the features of each morpheme, phrase, and sentence carry, and uses the resulting feature-based structured language in dialogue.

In particular, the embodiments of the present invention present technical means for defining morpheme features, phrase features, sentence features, and the like suited to the application field in which the dialogue system will be used, building a corresponding feature dictionary, unifying morpheme features to structure them into phrase features, and further unifying phrase features to structure them into sentence features. Using the sentence structured in this way, the dialogue system can obtain any of the predefined feature information and use it as key information for conducting intelligent dialogue.

Embodiments of the present invention for solving the above technical problem are described in detail below with reference to the drawings. In the following description and the accompanying drawings, detailed descriptions of well-known functions or configurations that could obscure the gist of the present invention are omitted. Note also that, throughout the drawings, the same components are denoted by the same names and reference numerals wherever possible.

FIG. 2 is a flowchart of a natural language processing method using unification grammar according to an embodiment of the present invention. The steps below may be implemented in a dialogue system or natural language processing apparatus that processes natural language using at least one processor. The steps that must be performed (210, 220, 260) are drawn with solid lines, and the steps that may optionally be performed (230, 240, 250) are drawn with dashed lines. The optional steps (230, 240, 250) can provide basic information for generating the feature tree in step 260, but they are used selectively as needed. All of the steps are described below, but it should be understood that the optional steps (230, 240, 250) need not all be performed.
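The split between required and optional steps can be sketched as a pipeline in which steps 230, 240, and 250 are passed in only when wanted. This is a minimal sketch under assumptions: the function names, the `state` dictionary, and the placeholder steps that only record their label are all hypothetical and stand in for the real processing.

```python
def run_pipeline(text, preprocess, build_tree, optional_steps=()):
    """Run 210 -> 220 -> (any of 230, 240, 250) -> 260 and return the final state."""
    state = {"text": text, "trace": ["210"]}  # step 210: receive the text
    state = preprocess(state)                 # step 220: required
    for step in optional_steps:               # steps 230/240/250: optional
        state = step(state)
    return build_tree(state)                  # step 260: required


def make_step(label):
    """Placeholder step that only records that it ran (stands in for real logic)."""
    def step(state):
        return {**state, "trace": state["trace"] + [label]}
    return step


full = run_pipeline(
    "...",
    preprocess=make_step("220"),
    build_tree=make_step("260"),
    optional_steps=[make_step("230"), make_step("240"), make_step("250")],
)
minimal = run_pipeline("...", preprocess=make_step("220"), build_tree=make_step("260"))
```

Here `full["trace"]` lists all six steps in order, while `minimal["trace"]` contains only the three required ones.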

In step 210, the dialogue system receives natural language text from the user. The user's utterance is of course originally analog data, but it is preferably converted into digital data that the dialogue system can readily process before being input.

In step 220, the dialogue system preprocesses the natural language text input in step 210 to remove stop words, inflectional affixes, and unregistered words. Stop words are onomatopoeic words, adverbs, prepositions, infinitives, and the like that are regarded as unnecessary for language processing; filtering them out yields cleaned natural language text. For example, "아야" ("ouch"), "헉" ("gasp"), "the", "a", and "hmm" may be stop words. Inflectional affixes are distinguished by the role a morpheme plays in forming a word: among the morphemes that form the periphery of a word, as opposed to the morpheme that forms its core (called the base), an inflectional affix is one that is responsible only for the word's inflection. For example, the "-어라", "-고", "-으면", "-으니", "-지", and "-는다" in "웃어라", "웃고", "웃으면", "웃으니", "웃지", and "웃는다" (all forms of "웃다", "to laugh") are inflectional affixes, specifically inflectional suffixes (commonly called endings). Unregistered words are terms or words that cannot be identified because they are not registered in the dialogue system's natural language processing means or dictionary. They are preferably deleted so that they do not interfere with understanding the meaning of the dialogue, although they may instead be stored and processed separately if necessary. Newly coined Internet words and rarely seen slang fall into this category. For example, when "오나전 좋다", Internet slang for "완전 좋다" ("totally good"), is input, the unregistered word "오나전" may be deleted. Furthermore, the preprocessing of step 220 may also correct typographical and spacing errors, or delete them where correction is not possible.
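The three removals of step 220 can be illustrated with a toy preprocessor. The sketch below is an assumption-laden simplification: it splits on whitespace, uses a tiny hypothetical lexicon, strips at most one ending per token, and sets unregistered words aside rather than discarding them (the text allows either). The stop words and endings are the examples given above.

```python
STOP_WORDS = {"아야", "헉", "the", "a", "hmm"}
# Endings from the examples above; longer endings are tried first.
ENDINGS = sorted(["어라", "고", "으면", "으니", "지", "는다"], key=len, reverse=True)
# Hypothetical registered words and stems; a real system would use a dictionary.
LEXICON = {"웃", "완전", "좋다"}


def strip_ending(token):
    """Remove one inflectional ending if what remains is a registered stem."""
    for ending in ENDINGS:
        stem = token[: -len(ending)]
        if token.endswith(ending) and stem in LEXICON:
            return stem
    return token


def preprocess(text):
    """Return (kept tokens, unregistered tokens) for whitespace-separated text."""
    kept, unregistered = [], []
    for token in text.split():
        if token in STOP_WORDS:
            continue                    # stop word: drop
        token = strip_ending(token)     # inflectional affix: strip
        if token in LEXICON:
            kept.append(token)
        else:
            unregistered.append(token)  # unregistered word: set aside
    return kept, unregistered


kept, unknown = preprocess("헉 오나전 좋다")
```

With this lexicon, "헉" is dropped as a stop word, "오나전" is set aside as unregistered, and only "좋다" is kept.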

In step 230, if the natural language text preprocessed in step 220 contains an agglutinated form in which a plurality of morphemes are combined into one, the dialogue system may restore that form to its base forms prior to combination. Agglutinative languages, also called affixing languages, are intermediate in character between isolating and inflecting languages: affixes are attached to a root to indicate the function of each word within the sentence. When such an agglutinated form is detected in the natural language text, step 230 can restore it to its original, uncombined form to make it easier to determine meaning and separate morphemes. For example, "갑시다" ("let's go") and "해주세요" ("please do") can be restored to "가다 + ㅂ시다" and "하 + 아 + 주 + 시 + 어요", respectively.
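A minimal sketch of step 230, assuming restoration is done by table lookup. The two table entries are the examples from the text; the table itself and the function name `restore_base_forms` are hypothetical, and a practical system would more likely apply morphological rules than enumerate surface forms.

```python
# Surface form -> base forms before combination (entries taken from the text).
RESTORATION_TABLE = {
    "갑시다": ["가다", "ㅂ시다"],
    "해주세요": ["하", "아", "주", "시", "어요"],
}


def restore_base_forms(tokens):
    """Replace each known agglutinated token with its base forms; keep others as-is."""
    restored = []
    for token in tokens:
        restored.extend(RESTORATION_TABLE.get(token, [token]))
    return restored


result = restore_base_forms(["학교", "갑시다"])
```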

In step 240, the dialogue system divides the natural language text preprocessed in step 220 into morphemes using the morpheme classification stored in the morpheme feature database, and attaches a part of speech to each resulting morpheme. A morpheme is the smallest meaningful unit of language, and in step 240 the input natural language text is divided into such units so that the meaning of each can be determined. For example, the expression "나는" can be divided into morphemes with parts of speech attached as "나 (pronoun) + 는 (particle)", meaning "I" plus a topic particle. The same expression "나는" could also be analyzed as "날다 (verb) + ㄴ (adnominal ending)", meaning "flying". Both analyses can be retained, and a tagger can later select the one appropriate to the sentence.
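The point that both analyses of "나는" are kept until a tagger chooses between them can be sketched as a lookup that returns every candidate. The analysis dictionary, its entry for "새" ("bird"), and the function names are assumptions for illustration; the two analyses of "나는" are those given in the text.

```python
# Hypothetical analysis dictionary: surface form -> list of candidate analyses,
# each candidate being a list of (morpheme, part of speech) pairs.
ANALYSES = {
    "나는": [
        [("나", "pronoun"), ("는", "particle")],
        [("날다", "verb"), ("ㄴ", "adnominal ending")],
    ],
    "새": [[("새", "noun")]],
}


def analyze(token):
    """Return every candidate analysis; an unknown token yields an empty list."""
    return ANALYSES.get(token, [])


def analyze_sentence(tokens):
    """Keep all candidates per token so that a tagger can choose among them later."""
    return [(token, analyze(token)) for token in tokens]


lattice = analyze_sentence(["나는", "새"])
```

The result is a small lattice: two candidates for the first token and one for the second, with disambiguation deferred to the tagging stage.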

Typical sources of trouble in morphological analysis are unregistered words, typographical errors, and spacing errors; the preprocessing of step 220 is performed to guard against these. Another problem is compound noun decomposition. A compound noun can be decomposed in several ways, which can cause confusion, so the most suitable decomposition must be selected from the alternatives. To select a suitable result from the various possible decompositions, embodiments of the present invention may use probability-based table parsing.
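The text does not specify the table parsing algorithm, so the following is a simplified stand-in rather than the patent's method: a dynamic-programming table over split points that keeps, for each prefix, the most probable decomposition under hypothetical unigram probabilities. The example compound "대학생선교회" is a commonly cited ambiguous case ("대학생 + 선교회", student mission society, versus "대학 + 생선 + 교회", university fish church) and is not taken from the patent.

```python
import math

# Hypothetical unigram probabilities for noun units; a real system would
# estimate these from a corpus.
NOUN_PROB = {"대학": 0.05, "대학생": 0.04, "생선": 0.02, "교회": 0.03, "선교회": 0.01}


def decompose(compound):
    """Return the most probable split of `compound` into known nouns, or None."""
    n = len(compound)
    # best[i] holds (log probability, segmentation) for the prefix compound[:i].
    best = [None] * (n + 1)
    best[0] = (0.0, [])
    for end in range(1, n + 1):
        for start in range(end):
            piece = compound[start:end]
            if best[start] is None or piece not in NOUN_PROB:
                continue
            score = best[start][0] + math.log(NOUN_PROB[piece])
            if best[end] is None or score > best[end][0]:
                best[end] = (score, best[start][1] + [piece])
    return best[n][1] if best[n] is not None else None


split = decompose("대학생선교회")
```

With these numbers the two-part split scores 0.04 × 0.01 = 4e-4 against 0.05 × 0.02 × 0.03 = 3e-5 for the three-part split, so the table selects "대학생 + 선교회".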

Part-of-speech attachment may be performed by a tagger, which selects, from the various analyses output by the morphological analysis means, the one that fits the context. The tagger selects a suitable analysis using information to the left and right in the context that serves as a hint for resolving the ambiguity, and it can be built from a large part-of-speech-tagged corpus. A hidden Markov model (HMM) may be used for the implementation.
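As an illustration of HMM-based selection, the sketch below runs the Viterbi algorithm over per-word candidate tags. All probabilities, the tag labels, and the example words "새" ("bird") and "간다" ("goes") are made up for this example; in practice the transition and emission probabilities would be estimated from a tagged corpus.

```python
import math


def viterbi(words, candidates, trans, emit, start="<s>", floor=1e-6):
    """Return the most probable tag sequence under a first-order HMM.

    Probabilities missing from `trans` or `emit` fall back to `floor`.
    """
    # scores[tag] = (log probability, best tag path ending in `tag`)
    scores = {start: (0.0, [])}
    for word in words:
        new_scores = {}
        for tag in candidates[word]:
            e = math.log(emit.get((word, tag), floor))
            new_scores[tag] = max(
                (score + math.log(trans.get((prev, tag), floor)) + e, path + [tag])
                for prev, (score, path) in scores.items()
            )
        scores = new_scores
    return max(scores.values())[1]


# Hypothetical model: "나는" is ambiguous, and its neighbour decides the reading.
candidates = {"나는": ["PRON+PARTICLE", "VERB+ADNOMINAL"], "새": ["NOUN"], "간다": ["VERB"]}
emit = {
    ("나는", "PRON+PARTICLE"): 0.6, ("나는", "VERB+ADNOMINAL"): 0.4,
    ("새", "NOUN"): 1.0, ("간다", "VERB"): 1.0,
}
trans = {
    ("<s>", "PRON+PARTICLE"): 0.5, ("<s>", "VERB+ADNOMINAL"): 0.5,
    ("PRON+PARTICLE", "NOUN"): 0.2, ("VERB+ADNOMINAL", "NOUN"): 0.9,
    ("PRON+PARTICLE", "VERB"): 0.7, ("VERB+ADNOMINAL", "VERB"): 0.05,
}

flying_bird = viterbi(["나는", "새"], candidates, trans, emit)
i_go = viterbi(["나는", "간다"], candidates, trans, emit)
```

Under this toy model, a following noun favours the verb-plus-adnominal reading ("flying bird"), while a following verb favours the pronoun-plus-particle reading ("I go").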

In step 250, the dialogue system divides the sentences of the natural language text preprocessed in step 220 into phrases using the phrase tags and phrase rules stored in the phrase feature database. Where necessary, to guard against ambiguity in syntactic analysis, phrase-level and clause-level analysis may be performed first, and syntactic analysis may then be performed using the results of these smaller-unit analyses. In step 250, the dialogue system decomposes the constituents that make up a sentence and analyzes the hierarchical relationships among them to determine the structure of the sentence.
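A rule-based grouping of tagged morphemes into phrases might look like the sketch below. The two rules (noun plus case particle, verb plus ending), the triple format, and the example sentence "철수가 밥을 먹는다" ("Cheolsu eats rice") are assumptions for illustration; the patent's actual phrase tags and rules reside in its phrase feature database and are not reproduced here.

```python
# Hypothetical rule: the case of the particle decides the phrase tag.
CASE_TO_PHRASE = {"nominative": "SUBJ", "accusative": "OBJ"}


def chunk(morphemes):
    """Group (lex, pos, case) triples into (phrase tag, [lex, ...]) chunks."""
    phrases, i = [], 0
    while i < len(morphemes):
        lex, pos, _case = morphemes[i]
        nxt = morphemes[i + 1] if i + 1 < len(morphemes) else None
        if pos == "N" and nxt and nxt[1] == "PARTICLE" and nxt[2] in CASE_TO_PHRASE:
            # Noun followed by a case particle forms a subject or object phrase.
            phrases.append((CASE_TO_PHRASE[nxt[2]], [lex, nxt[0]]))
            i += 2
        elif pos == "V" and nxt and nxt[1] == "END":
            # Verb followed by an ending forms the predicate.
            phrases.append(("PRED", [lex, nxt[0]]))
            i += 2
        else:
            i += 1  # no rule applies; skip this morpheme
    return phrases


sentence = [
    ("철수", "N", None), ("가", "PARTICLE", "nominative"),
    ("밥", "N", None), ("을", "PARTICLE", "accusative"),
    ("먹", "V", None), ("는다", "END", None),
]
phrases = chunk(sentence)
```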

In step 260, the dialogue system generates a structured feature tree by associating features, layered according to predefined unification rules, with their feature values. In the feature tree, the natural language text preprocessed in step 220 is set as the root node; phrase features are created as child nodes of the root node and assigned corresponding phrase feature values; and morpheme features, a morpheme being the smallest meaningful unit of language, are created as child nodes of the phrase features and assigned corresponding morpheme feature values. The morpheme features are thereby structured into phrases, and the phrase features into a sentence. In other words, through the structured feature tree of step 260, the dialogue system unifies morpheme features to structure them into phrase features, and further unifies phrase features to structure them into sentence features, which has the advantage that the predefined feature information can be obtained in an integrated manner.
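The three-level tree and the bottom-up unification can be sketched as follows. The `unify` function is textbook feature-structure unification over nested dictionaries, offered as an assumption about what the unification rules do rather than as the patent's own rule set; the example sentence, the node layout, and values such as SEM "person" are likewise hypothetical.

```python
class UnificationError(ValueError):
    """Raised when two feature structures carry conflicting values."""


def unify(a, b):
    """Merge two feature structures (nested dicts); None means 'unspecified'."""
    if a is None:
        return b
    if b is None:
        return a
    if isinstance(a, dict) and isinstance(b, dict):
        merged = dict(a)
        for key, value in b.items():
            merged[key] = unify(merged.get(key), value)
        return merged
    if a == b:
        return a
    raise UnificationError(f"{a!r} does not unify with {b!r}")


# Morpheme features (leaf nodes); the values are illustrative.
noun = {"LEX": "철수", "POS": "N", "SEM": "person"}
particle = {"LEX": "가", "POS": "PARTICLE", "CASE": "nominative"}
verb = {"LEX": "먹", "POS": "V", "SEM": "eat", "SUBCAT": "transitive"}

# Morpheme features unify into phrase feature values.
subj_value = unify({"HEAD": noun["LEX"]}, {"CASE": particle["CASE"]})
pred_value = {"HEAD": verb["LEX"]}

tree = {
    "root": "철수가 먹는다",  # hypothetical text standing in for the root node
    "children": [
        {"feature": "SUBJ", "value": subj_value, "children": [noun, particle]},
        {"feature": "PRED", "value": pred_value, "children": [verb]},
    ],
}

# Phrase features unify into the sentence (S) feature.
sentence_feature = unify({"SUBJ": subj_value}, {"PRED": pred_value})
```

Raising an error on a value clash, for instance when a nominative CASE meets an accusative one, is one way to let the caller reject an analysis that the unification rules do not license.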

FIG. 3 is a block diagram of a dialogue system that includes a natural language processing apparatus using unification grammar according to an embodiment of the present invention; it implements the natural language processing method of FIG. 2 in the dialogue system of FIG. 1. FIG. 3 concentrates on the natural language processing stage and omits secondary components. The components that are essential to natural language processing (11, 17) are drawn with solid lines, and the optional components (12, 13, 14, 15, 16) are drawn with dashed lines. Because each component corresponds to a step of FIG. 2 already described, the description here is brief and concentrates on the apparatus features and the connections between components.

전처리부(11)는 자연어 텍스트를 입력받아 전처리하여 제외어, 굴절 접사 및 미등록어를 제거한다.The preprocessing unit 11 receives the natural language text, preprocesses it, and removes the negatives, the refraction affixes, and the unrecognized words.

원형 복원부(12)는 전처리부(11)의 출력값을 전달받아 교착어가 포함되어 있는지 여부를 검사하고, 만약 교착어가 포함되어 있다면 교착어를 결합 전의 원형으로 복원한다.The circular restoration unit 12 receives the output value of the preprocessing unit 11 to check whether or not a prefixed word is included. If the prefixed word is included, the circular restoration unit 12 restores the prefixed word to its original form.

The morphological analysis unit 13 receives the output of the preprocessing unit 11 or the base-form restoration unit 12, divides the natural language text into morphemes using the morpheme classification stored in the morpheme feature database 14, and attaches a part-of-speech tag to each divided morpheme.

The syntactic analysis unit 15 receives the output of the preprocessing unit 11, the base-form restoration unit 12, or the morphological analysis unit 13, and divides the sentences of the natural language text into phrases using the syntactic tags and syntactic rules stored in the syntactic/sentence feature database 16.

The feature tree generation unit 17 receives the output of the preprocessing unit 11, the base-form restoration unit 12, the morphological analysis unit 13, or the syntactic analysis unit 15, and generates a structured feature tree by linking features, layered according to predefined unification rules, with their corresponding feature values.

Finally, the dialogue management unit 30 outputs a natural language dialogue or response using the feature structure of the feature tree generated by the feature tree generation unit 17. Strictly speaking, the dialogue management unit 30 is part of the dialogue system 300 but is not part of the natural language processing means on which the present invention focuses.

Three embodiments that can be derived from the dialogue system 300 of FIG. 3 are presented below. The embodiments differ in which of the optional components (12, 13, 14, 15, 16) shown in FIG. 3 are included.

Embodiment 1)

The dialogue system 300 according to Embodiment 1 may be implemented with the preprocessing unit 11, the morphological analysis unit 13, the morpheme feature database 14, the syntactic analysis unit 15, the syntactic feature database 16, the feature tree generation unit 17, and the dialogue management unit 30. The dialogue system 300 according to Embodiment 1 performs basic preprocessing on the input natural language text, performs morphological analysis using the morpheme feature database 14, and then performs syntactic analysis using the morphological analysis result and the syntactic feature database 16. The feature tree generation unit 17 turns the result into a structured feature tree using unification grammar, and the dialogue management unit 30 uses the morpheme feature information and syntactic feature information generated in this way.

Embodiment 2)

The dialogue system 300 according to Embodiment 2 illustrates the simplest implementation and may include only the preprocessing unit 11, the morphological analysis unit 13, the morpheme feature database 14, the feature tree generation unit 17, and the dialogue management unit 30. After basic natural language preprocessing, the dialogue system 300 according to Embodiment 2 performs only morphological analysis using the morpheme feature database 14. Because no syntactic information is available in this case, the output of the feature tree generation unit 17 may simply be structured as a list of morpheme features. Finally, the dialogue management unit 30 outputs a response using the feature list structured in this way.

Embodiment 3)

The dialogue system 300 according to Embodiment 3 illustrates the most complex implementation and may include all of the preprocessing unit 11, the base-form restoration unit 12, the morphological analysis unit 13, the morpheme feature database 14, the syntactic analysis unit 15, the syntactic feature database 16, the feature tree generation unit 17, and the dialogue management unit 30. After basic natural language preprocessing, the dialogue system 300 according to Embodiment 3 performs base-form restoration, then performs morphological analysis using the morpheme feature database 14, and then performs syntactic analysis using the morphological analysis result and the syntactic feature database 16. The feature tree generation unit 17 turns the result into a structured feature tree using unification grammar, and the dialogue management unit 30 uses the morpheme feature information and syntactic feature information generated in this way.
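The three embodiments differ only in which optional processing stages are enabled, which can be summarized in a small configuration sketch. The function `make_pipeline` and its two flags are hypothetical names introduced for illustration; the stage labels follow the reference numerals of FIG. 3, and only the three combinations listed above are covered.

```python
def make_pipeline(use_restoration: bool, use_syntactic_analysis: bool) -> list[str]:
    """List the processing stages, in order, for a given choice of optional units."""
    stages = ["preprocessing unit (11)"]
    if use_restoration:
        stages.append("base-form restoration unit (12)")
    # All three embodiments described in the text include morphological analysis.
    stages.append("morphological analysis unit (13)")
    if use_syntactic_analysis:
        stages.append("syntactic analysis unit (15)")
    stages.append("feature tree generation unit (17)")
    return stages


EMBODIMENTS = {
    1: make_pipeline(use_restoration=False, use_syntactic_analysis=True),
    2: make_pipeline(use_restoration=False, use_syntactic_analysis=False),
    3: make_pipeline(use_restoration=True, use_syntactic_analysis=True),
}

for number, stages in EMBODIMENTS.items():
    print(number, " -> ".join(stages))
```

In every case the output of the last stage is handed to the dialogue management unit 30, which lies outside the natural language processing means itself.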

In summary, between the input of natural language text and its processing by the dialogue management unit 30, the embodiments of the present invention perform basic natural language preprocessing and then perform morphological analysis and syntactic analysis using the morpheme feature database 14 and the syntactic/sentence feature database 16, respectively. The units thus divided into morpheme feature units, syntactic feature units, and sentence feature units are unified and structured by the feature tree generation unit 17, with combinations of morphemes forming phrases and combinations of phrases forming a sentence. The structured features are preferably expressed as a tree structure that a computer system can process easily, and the completed structured feature tree provides the dialogue management unit 30 with the various feature information needed for dialogue. Because the feature tree yields additional dialogue-related information from the features, much more natural and intelligent dialogue management becomes possible.

FIG. 4 illustrates the morpheme features adopted by the embodiments of the present invention, which include noun (N) features, verb (V) features, adjective (ADJ) and adverb (ADV) features, particle (PARTICLE) features, ending (END) features, and pre-final ending (PEND) features. Each morpheme feature has the following elements.

A noun (N) feature has LEX (morpheme), POS (part of speech), and SEM (semantic category); a verb (V) feature has LEX (morpheme), POS (part of speech), SEM (semantic category), and SUBCAT (subcategory); adjective (ADJ) and adverb (ADV) features have LEX (morpheme), POS (part of speech), SEM (semantic category), and QUALIFIER (modification category); a particle (PARTICLE, PP) feature has LEX (morpheme), POS (part of speech), and CASE (case category); an ending (END) feature has LEX (morpheme), POS (part of speech), and TYPE (ending category); and a pre-final ending (PEND) feature may have LEX (morpheme), POS (part of speech), and MOD (mood category).
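The element lists above map naturally onto a schema. The following is a minimal sketch, assuming each morpheme feature is stored as a dictionary keyed by the element names given in the text; the helper `make_morpheme_feature` and the sample value `"locative"` are hypothetical.

```python
from typing import Optional

# Element names per morpheme feature type, as enumerated in the text.
MORPHEME_FEATURE_SCHEMA = {
    "N": ("LEX", "POS", "SEM"),
    "V": ("LEX", "POS", "SEM", "SUBCAT"),
    "ADJ": ("LEX", "POS", "SEM", "QUALIFIER"),
    "ADV": ("LEX", "POS", "SEM", "QUALIFIER"),
    "PARTICLE": ("LEX", "POS", "CASE"),
    "END": ("LEX", "POS", "TYPE"),
    "PEND": ("LEX", "POS", "MOD"),
}


def make_morpheme_feature(kind: str, **values: str) -> dict[str, Optional[str]]:
    """Build a feature dict for `kind`, rejecting elements the schema does not list."""
    allowed = MORPHEME_FEATURE_SCHEMA[kind]
    unknown = set(values) - set(allowed)
    if unknown:
        raise ValueError(f"{kind} feature has no element(s): {sorted(unknown)}")
    # Elements that were not supplied stay None until a later stage fills them.
    return {name: values.get(name) for name in allowed}


# "locative" is an illustrative case value, not one taken from the source.
print(make_morpheme_feature("PARTICLE", LEX="에", POS="PARTICLE", CASE="locative"))
```

Keeping unset elements as `None` makes it explicit which information a later analysis stage still has to supply.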

FIG. 5 illustrates the process of analyzing a sentence into morphemes and attaching part-of-speech tags in the natural language processing method according to an embodiment of the present invention, assuming that the sentence "나는 학교에 갑니다" ("I go to school") is given as input.

The figure shows the input sentence being split and stored as tokens, each token being further divided into specific morphemes, and a part-of-speech tag being attached to each divided morpheme. As FIG. 5 shows, even a single token can admit several morphological analyses; the most suitable analysis among these candidates is selected by taking the context into account.

FIG. 6 illustrates the syntactic features adopted by the embodiments of the present invention, which include sentence (S) features, subject (SUBJ) and object (OBJ) features, compound noun (NN) and adnominal phrase (ADJP) features, and predicate (PRED) features. Each syntactic feature has the following elements.

A sentence (S) feature has SUBJ (subject), OBJ (object), and PRED (predicate); subject (SUBJ) and object (OBJ) features have HEAD (head word), CASE (case category), and COMP (complement); compound noun (NN) and adnominal phrase (ADJP) features have HEAD (head word) and SEM (semantic category); and a predicate (PRED) feature may have HEAD (head word), COMP (complement), and CONJ (ending category).
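One use of these element lists is to check which slots of a partially built syntactic feature are still empty. The sketch below assumes syntactic features are nested dictionaries; the function `unfilled_slots` and the sample values are hypothetical and do not reproduce the analysis shown in the figures.

```python
# Element names per syntactic feature type, as enumerated in the text.
SYNTACTIC_FEATURE_SCHEMA = {
    "S": ("SUBJ", "OBJ", "PRED"),
    "SUBJ": ("HEAD", "CASE", "COMP"),
    "OBJ": ("HEAD", "CASE", "COMP"),
    "NN": ("HEAD", "SEM"),
    "ADJP": ("HEAD", "SEM"),
    "PRED": ("HEAD", "COMP", "CONJ"),
}


def unfilled_slots(kind: str, structure: dict) -> list[str]:
    """Return the schema elements of `kind` that `structure` has not filled yet."""
    return [slot for slot in SYNTACTIC_FEATURE_SCHEMA[kind] if structure.get(slot) is None]


# Illustrative, partially filled sentence feature (values are assumptions).
sentence = {
    "SUBJ": {"HEAD": "나", "CASE": "는"},
    "PRED": {"HEAD": "가"},
}
print(unfilled_slots("S", sentence))
print(unfilled_slots("SUBJ", sentence["SUBJ"]))
```

An empty slot is not necessarily an error, since a sentence without an object simply leaves OBJ unfilled.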

FIG. 7 illustrates the process of analyzing sentences into phrases in the natural language processing method according to an embodiment of the present invention, assuming that the sentences [A] "I worked very hard." and [B] "He could not very well quit his job." are given as input.

FIG. 7 shows the original input sentence at the root node and, below it, the result of analyzing each phrase top-down. That is, the constituents are extracted from the original sentence, a tree is built according to their structure, and finally the feature values corresponding to the syntactic features are assigned.

FIG. 8 illustrates a feature tree structured by unification rules in the natural language processing method according to an embodiment of the present invention, assuming that the sentence "나는 학교에 갑니다" ("I go to school") has been input.

As described in detail above with reference to FIG. 2, the embodiments of the present invention present a structured feature tree that hierarchically structures syntactic features and morpheme features according to unification rules and assigns feature values to each layer. Accordingly, in the feature tree shown in FIG. 8, the original sentence is located at the root node, and syntactic features are generated as its child nodes and assigned syntactic feature values. In addition, morpheme features are generated as child nodes of the syntactic features and assigned morpheme feature values. Through this structured feature tree, the originally input natural language can be analyzed comprehensively, and information about each constituent can be provided directly.
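The three-level layout described above (sentence at the root, syntactic features below it, morpheme features at the leaves) can be sketched as a small tree type. The class `FeatureNode`, the helper `render`, and the particular segmentation of the example sentence are assumptions made for illustration; they do not reproduce the contents of FIG. 8.

```python
from dataclasses import dataclass, field


@dataclass
class FeatureNode:
    name: str   # feature name, e.g. "S", "SUBJ", "N"
    value: str  # feature value assigned to this node
    children: list["FeatureNode"] = field(default_factory=list)


def render(node: FeatureNode, depth: int = 0) -> list[str]:
    """Return one indented line per node, depth-first."""
    lines = [f"{'  ' * depth}{node.name}: {node.value}"]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines


# Assumed analysis of the example sentence, for illustration only.
tree = FeatureNode("S", "나는 학교에 갑니다", [
    FeatureNode("SUBJ", "나는", [
        FeatureNode("N", "나"),
        FeatureNode("PARTICLE", "는"),
    ]),
    FeatureNode("PRED", "학교에 갑니다", [
        FeatureNode("N", "학교"),
        FeatureNode("PARTICLE", "에"),
        FeatureNode("V", "가"),
        FeatureNode("END", "ㅂ니다"),
    ]),
])

print("\n".join(render(tree)))
```

A consumer such as a dialogue manager could walk this tree to read either phrase-level values (the second level) or individual morpheme values (the leaves).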

According to the embodiments of the present invention described above, a structured feature tree is generated by linking features, layered according to unification rules, with their corresponding feature values, so that hierarchically structured morpheme, phrase, and feature information can be obtained comprehensively from natural language text. Because word-level semantic information, part-of-speech information, and syntactic information are thereby considered together, the ambiguity problem of conventional dialogue systems based on natural language processing can be resolved; as a result, the user's language is interpreted accurately and appropriately, and the dialogue accuracy of the dialogue system as a whole improves.

In addition, various information needed for dialogue that cannot be obtained from the raw natural language text alone can be obtained from the morpheme features and syntactic features of the feature tree adopted by the embodiments of the present invention and used in the dialogue, enabling more natural and more intelligent dialogue. Furthermore, the embodiments of the present invention can be usefully applied to dialogue management that uses the natural language text produced by speech recognition, for example in voice dialogue agents for smartphones and automobiles, voice dialogue robots, and home appliances capable of voice dialogue.

Meanwhile, the embodiments of the present invention can be implemented as computer-readable code on a computer-readable recording medium. The computer-readable recording medium includes every kind of recording device that stores data readable by a computer system.

Examples of the computer-readable recording medium include ROM, RAM, CD-ROM, magnetic tape, floppy disks, and optical data storage devices, and also include implementations in the form of a carrier wave (for example, transmission over the Internet). The computer-readable recording medium may also be distributed over computer systems connected by a network, so that the computer-readable code is stored and executed in a distributed manner. Functional programs, code, and code segments for implementing the present invention can be readily inferred by programmers in the technical field to which the present invention belongs.

The present invention has been described above with a focus on its various embodiments. Those of ordinary skill in the technical field to which the present invention belongs will understand that the present invention can be implemented in modified forms without departing from its essential characteristics. The disclosed embodiments should therefore be considered in a descriptive rather than a limiting sense. The scope of the present invention is set forth in the claims rather than in the foregoing description, and all differences within the equivalent scope should be construed as being included in the present invention.

100: dialogue system
10: natural language processing unit
20: dictionary/grammar rule/constraint database
30: dialogue management unit
40: response template/domain knowledge database
50: natural language generation unit
60: language information database
300: dialogue system using a structured feature tree
11: preprocessing unit 12: base-form restoration unit
13: morphological analysis unit 14: morpheme feature database
15: syntactic analysis unit 16: syntactic/sentence feature database
17: feature tree generation unit

Claims

A method for processing natural language using at least one processor, the method comprising:
receiving natural language text;
preprocessing the input natural language text to remove stop words, which are deemed unnecessary for language processing, inflectional affixes, which are those morphemes, among the morphemes forming the peripheral part of a word other than its root, that serve only the inflection of the word, and unregistered words;
when the preprocessed natural language text contains an agglutinated word, in which a plurality of morphemes are combined and an affix attached to the root indicates the function of the word in the sentence, restoring the agglutinated word to its base form before combination; and
generating a structured feature tree by linking features layered according to a predefined unification rule with corresponding feature values,
wherein the feature tree sets the preprocessed natural language text as a root node, generates a syntactic feature as a child node of the root node and assigns a corresponding syntactic feature value, and generates a morpheme feature, a morpheme being the smallest meaningful unit of language, as a child node of the syntactic feature and assigns a corresponding morpheme feature value, such that the morpheme features are structured into phrases and the syntactic features are structured into a sentence.

delete

The method according to claim 1,
further comprising dividing the preprocessed natural language text into morphemes using a morpheme classification stored in a morpheme feature database, and attaching a part-of-speech tag to each divided morpheme.

The method according to claim 1,
further comprising dividing the sentences of the preprocessed natural language text into phrases using syntactic tags and syntactic rules stored in a syntactic feature database.

The method according to claim 1,
wherein the morpheme features include:
a noun (N) feature having LEX (morpheme), POS (part of speech), and SEM (semantic category);
a verb (V) feature having LEX (morpheme), POS (part of speech), SEM (semantic category), and SUBCAT (subcategory);
an adjective (ADJ) feature and an adverb (ADV) feature having LEX (morpheme), POS (part of speech), SEM (semantic category), and QUALIFIER (modification category);
a particle (PARTICLE) feature having LEX (morpheme), POS (part of speech), and CASE (case category);
an ending (END) feature having LEX (morpheme), POS (part of speech), and TYPE (ending category); and
a pre-final ending (PEND) feature having LEX (morpheme), POS (part of speech), and MOD (mood category), and
wherein the syntactic features include:
a sentence (S) feature having SUBJ (subject), OBJ (object), and PRED (predicate);
a subject (SUBJ) feature and an object (OBJ) feature having HEAD (head word), CASE (case category), and COMP (complement);
a compound noun (NN) feature and an adnominal phrase (ADJP) feature having HEAD (head word) and SEM (semantic category); and
a predicate (PRED) feature having HEAD (head word), COMP (complement), and CONJ (ending category).