KR102143157B1 - System and method for generating paraphrase sentence based on ontology

Info

Publication number: KR102143157B1
Application number: KR1020180147694A
Authority: KR
Inventors: 양승원; 김성만
Original assignee: 주식회사 솔트룩스
Priority date: 2018-11-26
Filing date: 2018-11-26
Publication date: 2020-08-11
Also published as: KR20200061877A

Abstract

According to an exemplary embodiment of the present invention, a system for generating a paraphrase sentence may include: a feature extraction unit that acquires semantic resources, each including an ontology component corresponding to a token extracted from a query sentence, and that generates a pattern, which is a combination of the semantic resources, and a frame, which has a structure in which the tokens of the query sentence are replaceable; a pattern indexing unit that associates the pattern and the frame with each other and stores the associated pattern and frame in a pattern storage unit; and a sentence generation unit that generates a paraphrase sentence of the query sentence based on the frames corresponding to the pattern stored in the pattern storage unit.

Description

System and method for generating paraphrase sentence based on ontology

The technical idea of the present invention relates to paraphrase sentence generation and, more particularly, to a system and method for ontology-based paraphrase sentence generation.

The present invention is derived from research led by IPL Co., Ltd. and jointly conducted with Saltlux Inc. as part of the Robot Industry Core Technology Development Program (Artificial Intelligence Convergence Robot System Technology) of the Ministry of Trade, Industry and Energy. [Research period: 2018.01.01 to 2018.12.31; research management agency: Korea Evaluation Institute of Industrial Technology; project title: Home social robot and service development system; project number: 10077633]

Recognizing and processing natural language so that a machine can understand human language may be referred to as natural language understanding, and natural language understanding can be used in various fields. For example, it can be used in a question answering system that recognizes a user's query and automatically provides an answer to it.

Because human language can express the same meaning in many different ways, natural language understanding may need to identify the same meaning in differently worded sentences. Accordingly, preparing a plurality of sentences that correspond to the same meaning, that is, paraphrase sentences, can be an important foundation for natural language understanding.

The technical idea of the present invention provides a system and method for automatically generating paraphrase sentences based on an ontology.

To achieve the above object, a system for generating a paraphrase sentence according to the technical idea of the present invention may include: a feature extraction unit that acquires semantic resources, each including an ontology component corresponding to a token extracted from a query sentence, and that generates a pattern, which is a combination of the semantic resources, and a frame, which has a structure in which the tokens of the query sentence are replaceable; a pattern indexing unit that associates the pattern and the frame with each other and stores the associated pattern and frame in a pattern storage unit; and a sentence generation unit that generates a paraphrase sentence of the query sentence based on the frames corresponding to the pattern stored in the pattern storage unit.

According to an exemplary embodiment of the present invention, the feature extraction unit may acquire a plurality of query sentences through a network.

According to an exemplary embodiment of the present invention, the feature extraction unit may include a natural language processing unit that extracts the tokens by performing natural language processing on the query sentence, and a pattern generation unit that receives the semantic resources corresponding to the tokens from a knowledge base and generates the pattern and the frame.

According to an exemplary embodiment of the present invention, the pattern indexing unit may store patterns whose semantic resources are arranged in different orders in the pattern storage unit as different patterns.

According to an exemplary embodiment of the present invention, the sentence generation unit may include a pattern search unit that searches the pattern storage unit for frames corresponding to the pattern generated by the feature extraction unit, a lexical substitution unit that generates at least one preliminary paraphrase sentence by applying the tokens to the retrieved frames, and a sentence post-processing unit that generates the paraphrase sentence by revising the at least one preliminary paraphrase sentence according to natural language rules.

According to an exemplary embodiment of the present invention, the lexical substitution unit may generate the at least one preliminary paraphrase sentence by applying synonyms of the tokens to the retrieved frames based on the semantic resources.

According to an exemplary embodiment of the present invention, the lexical substitution unit may generate the at least one preliminary paraphrase sentence by changing the structure of a retrieved frame.

According to an exemplary embodiment of the present invention, the lexical substitution unit may generate an additional preliminary paraphrase sentence by changing the position of an adverbial phrase in a retrieved frame.

According to an exemplary embodiment of the present invention, when a retrieved frame ends with a noun phrase, the lexical substitution unit may generate an additional preliminary paraphrase sentence by adding a question ending.

According to the system and method of the technical idea of the present invention, accurate and abundant paraphrase sentences can be generated automatically by analyzing sentences based on a verified ontology.

In addition, according to the system and method of the technical idea of the present invention, the abundant paraphrase sentences can significantly improve the performance of tasks based on natural language understanding, such as question answering systems and knowledge extraction.

In addition, according to the system and method of the technical idea of the present invention, the abundant paraphrase sentences make it possible to prepare high-quality data for machine learning, which can significantly broaden the use of machine learning.

The effects obtainable from the exemplary embodiments of the present invention are not limited to those mentioned above, and other effects not mentioned can be clearly derived and understood from the following description by a person of ordinary skill in the art to which the exemplary embodiments belong. That is, unintended effects of practicing the exemplary embodiments may also be derived from them by a person of ordinary skill in the art.

FIG. 1 is a block diagram showing a system and its input/output relationships according to an exemplary embodiment of the present invention.
FIG. 2 is a block diagram showing an example of the feature extraction unit of FIG. 1 according to an exemplary embodiment of the present invention.
FIGS. 3A and 3B show examples of query sentences, patterns, and frames according to exemplary embodiments of the present invention.
FIG. 4 shows an example of patterns and frames stored in the pattern storage unit of FIG. 1 according to an exemplary embodiment of the present invention.
FIG. 5 is a block diagram showing an example of the sentence generation unit of FIG. 1 according to an exemplary embodiment of the present invention.
FIGS. 6A and 6B are diagrams showing examples of the operation of the lexical substitution unit of FIG. 5 according to exemplary embodiments of the present invention.
FIG. 7 is a flowchart showing a method of generating a paraphrase sentence according to an exemplary embodiment of the present invention.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings. The embodiments are provided to describe the present invention more completely to those of average skill in the art. Because the present invention can be modified in various ways and can take various forms, specific embodiments are illustrated in the drawings and described in detail. This is not intended to limit the present invention to the specific forms disclosed, and the invention should be understood to cover all modifications, equivalents, and substitutes falling within its spirit and technical scope. In describing the drawings, similar reference numerals are used for similar elements. In the accompanying drawings, the dimensions of structures are shown enlarged or reduced relative to their actual size for clarity.

The terms used in the present application are used only to describe specific embodiments and are not intended to limit the present invention. A singular expression includes the plural unless the context clearly indicates otherwise. In the present application, terms such as "include" or "have" are intended to specify the presence of the features, numbers, steps, operations, elements, parts, or combinations thereof described in the specification, and should be understood not to exclude in advance the presence or possible addition of one or more other features, numbers, steps, operations, elements, parts, or combinations thereof.

Unless otherwise defined, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention belongs. Terms such as those defined in commonly used dictionaries should be interpreted as having meanings consistent with their meanings in the context of the related art, and are not to be interpreted in an idealized or excessively formal sense unless explicitly so defined in the present application.

In the drawings and description below, a component shown or described as a single block may be a hardware block or a software block. For example, each component may be an independent hardware block that exchanges signals with the others, or a software block executed on a single processor. In addition, in this specification a "system" or "database" may refer to a computing system that includes at least one processor and a memory accessed by the processor.

FIG. 1 is a block diagram showing a system and its input/output relationships according to an exemplary embodiment of the present invention. As shown in FIG. 1, the paraphrase generation system 100 may be communicatively connected to the knowledge base 300 and the pattern storage unit 500. The paraphrase generation system 100 may receive a query sentence in natural language and, by communicating with the knowledge base 300 and the pattern storage unit 500, generate a paraphrase sentence having the same meaning as the query sentence. As described below, the paraphrase generation system 100 may extract a pattern from the query sentence by referring to the knowledge base 300, and may store a plurality of frames corresponding to the pattern, together with the pattern, in the pattern storage unit 500. The blocks 100, 300, and 500 shown in FIG. 1 may communicate with one another through a network or through dedicated channels for one-to-one communication. In addition, two or more of the blocks 100, 300, and 500 may be included in a single system (e.g., a computing system), and in some embodiments the pattern storage unit 500 may be included in the paraphrase generation system 100.

The paraphrase generation system 100 may receive a query sentence in various ways. In some embodiments, the paraphrase generation system 100 may include a user interface and may receive a query sentence from a person, that is, a user, through the user interface. In some embodiments, the paraphrase generation system 100 may connect to a network, such as the Internet, and actively collect a plurality of query sentences from the network. As shown in FIG. 1, the paraphrase generation system 100 may include a feature extraction unit 120, a pattern indexing unit 140, and a sentence generation unit 160.

The feature extraction unit 120 may acquire semantic resources, each including an ontology component corresponding to a token extracted from the query sentence. An ontology represents things that exist or that people can perceive in a form that a computer can handle, and ontology components may include an entity (E) (or instance), a class (C), a property (P), and a value (V). Ontology components may further include a relation (a property between entities or between classes), a function term, a restriction, a rule, an event, and the like. The knowledge base 300 may store a vast amount of knowledge data based on the ontology. For example, the knowledge base 300 may include knowledge data expressed using RDF (Resource Description Framework), with the triple as the unit of knowledge data. The knowledge base 300 may return triples in response to a query, for example a SPARQL (SPARQL Protocol and RDF Query Language) query. In some embodiments, the feature extraction unit 120 may pass the query sentence to a natural language recognition system (not shown) outside the paraphrase generation system 100 and receive from it the tokens extracted from the query sentence. In some embodiments, as described below with reference to FIG. 2, the feature extraction unit 120 may instead generate the semantic resources by processing the query sentence itself.
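As a rough illustration of the triple-based knowledge base described above, the following Python sketch keeps a few triples in memory and answers a single triple pattern, much as one pattern in a SPARQL query would. The predicate names (`instanceOf`, `labelOf`), the `match` helper, and the sample triples are assumptions made for this sketch and do not appear in the specification; an actual knowledge base 300 would be queried through SPARQL rather than a Python list.

```python
from typing import List, NamedTuple, Optional


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# Hypothetical sample data: which surface words denote which ontology components.
TRIPLES = [
    Triple("서울", "instanceOf", "City"),
    Triple("LA", "instanceOf", "City"),
    Triple("인구", "labelOf", "Population"),
    Triple("인구수", "labelOf", "Population"),
]


def match(triples: List[Triple],
          subject: Optional[str] = None,
          predicate: Optional[str] = None,
          obj: Optional[str] = None) -> List[Triple]:
    """Return the triples matching a pattern; None acts like a query variable."""
    return [
        t for t in triples
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]


# Roughly comparable to: SELECT ?s WHERE { ?s :instanceOf :City }
cities = [t.subject for t in match(TRIPLES, predicate="instanceOf", obj="City")]
```

Leaving a position as `None` plays the role of a variable, so the same helper can look up either the instances of a class or the component that a given word denotes.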

The feature extraction unit 120 may generate a pattern and a frame from the semantic resources acquired from the query sentence based on the ontology. As described below with reference to FIGS. 3A and 3B, a pattern is a combination of semantic resources and may correspond to the meaning expressed by the query sentence. That is, sentences corresponding to the same pattern may be referred to as paraphrase sentences. Also as described below with reference to FIGS. 3A and 3B, a frame may have a structure in which the tokens extracted from the query sentence are replaceable within the query sentence. As described below with reference to FIG. 4, one pattern may correspond to at least one frame, and the associated pattern and frame may be stored in the pattern storage unit 500 by the pattern indexing unit 140. The feature extraction unit 120 may provide the pattern and the frame to the pattern indexing unit 140, and may provide the pattern and a token list containing the extracted tokens to the sentence generation unit 160.

The pattern indexing unit 140 may receive, from the feature extraction unit 120, the pattern and the frame generated from the query sentence. The pattern indexing unit 140 may associate the pattern and the frame with each other and store the associated pattern and frame in the pattern storage unit 500. Accordingly, when a pattern stored in the pattern storage unit 500 is searched for (e.g., by the sentence generation unit 160), the corresponding frames can be retrieved along with it. In some embodiments, the pattern indexing unit 140 may search the pattern storage unit 500 for the pattern received from the feature extraction unit 120, and if both that pattern and the received frame are already stored in the pattern storage unit 500, it may skip storing them. The structure of the patterns and frames stored in the pattern storage unit 500 by the pattern indexing unit 140 is described below with reference to FIG. 4.
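The indexing behavior described above can be sketched as a mapping from a pattern to its list of frames, where an already stored pattern/frame pair is not stored again. The `PatternStore` class and its method names are hypothetical, and a Python dictionary merely stands in for the pattern storage unit 500; the frame strings are the ones shown in FIGS. 3A and 3B.

```python
from typing import Dict, List, Tuple

Pattern = Tuple[str, ...]  # ordered semantic resources


class PatternStore:
    """In-memory stand-in for the pattern storage unit 500 (an assumption)."""

    def __init__(self) -> None:
        self._frames: Dict[Pattern, List[str]] = {}

    def index(self, pattern: Pattern, frame: str) -> bool:
        """Associate frame with pattern; return False if the pair already exists."""
        frames = self._frames.setdefault(pattern, [])
        if frame in frames:
            return False  # storing is skipped for an existing pattern/frame pair
        frames.append(frame)
        return True

    def search(self, pattern: Pattern) -> List[str]:
        """Return every frame stored for the pattern (empty if the pattern is unknown)."""
        return list(self._frames.get(pattern, []))


store = PatternStore()
p = ("<Entity_City>", "<Property_Population>")
first = store.index(p, "<Entity_City>의 <Property_Population>는?")
second = store.index(p, "<Entity_City>의 <Property_Population>를 알려주세요.")
repeat = store.index(p, "<Entity_City>의 <Property_Population>는?")
```

Searching for the pattern then returns both frames at once, which is what lets a later lookup by the sentence generation unit 160 retrieve all frames for a pattern together.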

The sentence generation unit 160 may receive the pattern and the token list corresponding to the query sentence from the feature extraction unit 120 and may generate a paraphrase sentence of the query sentence. The sentence generation unit 160 may search the pattern storage unit 500 for the pattern and obtain the retrieved pattern and its corresponding frames from the pattern storage unit 500. For example, the sentence generation unit 160 may search the pattern storage unit 500 for the pattern received from the feature extraction unit 120 and, based on the token list and the frames corresponding to the retrieved pattern, generate a sentence having the same meaning as the query sentence, that is, a paraphrase sentence. An example of the sentence generation unit 160 is described below with reference to FIG. 5.

As described below with reference to the drawings, the paraphrase generation system 100 may generate a paraphrase sentence of a query sentence using the ontology-based knowledge base 300. The query sentence can thus be analyzed based on a verified ontology, and accurate and abundant paraphrase sentences can be generated automatically. In addition, the paraphrase sentences generated by the paraphrase generation system 100 can significantly improve the performance of tasks based on natural language understanding, such as question answering systems and knowledge extraction, and can provide high-quality data for machine learning.

FIG. 2 is a block diagram showing an example of the feature extraction unit 120 of FIG. 1 according to an exemplary embodiment of the present invention, and FIGS. 3A and 3B show examples of query sentences, patterns, and frames according to exemplary embodiments of the present invention. As described above with reference to FIG. 1, the feature extraction unit 120' of FIG. 2 may receive a query sentence and generate a pattern and a frame corresponding to the query sentence by referring to the knowledge base 300. As shown in FIG. 2, the feature extraction unit 120' may include a natural language processing unit 122 and a pattern generation unit 124. FIGS. 2, 3A, and 3B are described below with reference to FIG. 1.

In some embodiments, the feature extraction unit 120' may extract tokens by applying natural language processing to the query sentence and may acquire the semantic resources corresponding to the tokens from the knowledge base 300. As described above with reference to FIG. 1, in some embodiments, unlike the example of FIG. 2, the feature extraction unit 120 of FIG. 1 may, instead of processing the query sentence itself, pass the query sentence to a natural language processing system outside the paraphrase generation system 100 and receive the tokens from it.

The natural language processing unit 122 may extract tokens from the query sentence. For example, as shown in FIG. 3A, for the first query sentence Q1, "서울의 인구는?" ("What is the population of Seoul?"), the natural language processing unit 122 may generate a token list containing "서울" (Seoul), "의" (possessive particle), "인구" (population), "는" (topic particle), and "?". In some embodiments, a token extracted by the natural language processing unit 122 may carry features in addition to the word itself. For example, a token may include a part of speech (e.g., noun, adjective, adverb) or the sentence constituent it forms in the query sentence (e.g., subject, object, adverbial). The natural language processing unit 122 may extract tokens in any manner. For example, it may extract tokens by referring to the knowledge base 300, determining the part of speech of a word in the query sentence based on the entities, properties, and so on of the knowledge base 300. In addition, in some embodiments, as described below with reference to FIGS. 6A and 6B, the extracted tokens may have a tree structure.
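To make the token list concrete, here is a minimal sketch that splits the first query sentence into tokens and attaches a part of speech to each. The specification leaves the extraction method open, so the greedy longest-match strategy, the `Token` class, and the small `LEXICON` table are all assumptions chosen for brevity; a real system would rely on a Korean morphological analyzer.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Token:
    word: str
    pos: str  # part of speech, one of the features a token may carry


# Hypothetical lexicon; a real analyzer would not depend on a fixed word list.
LEXICON: Dict[str, str] = {
    "서울": "noun", "LA": "noun", "인구": "noun", "인구수": "noun",
    "의": "particle", "는": "particle", "를": "particle", "?": "punctuation",
}


def tokenize(sentence: str, lexicon: Dict[str, str] = LEXICON) -> List[Token]:
    """Greedy longest-match tokenizer over the lexicon; whitespace is skipped."""
    tokens: List[Token] = []
    max_len = max(len(word) for word in lexicon)
    i = 0
    while i < len(sentence):
        if sentence[i].isspace():
            i += 1
            continue
        # Try the longest candidate first, then progressively shorter ones.
        for length in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + length]
            if piece in lexicon:
                tokens.append(Token(piece, lexicon[piece]))
                i += length
                break
        else:
            # No lexicon entry starts here: emit a single unknown character.
            tokens.append(Token(sentence[i], "unknown"))
            i += 1
    return tokens


tokens = tokenize("서울의 인구는?")
```

Trying the longest candidate first keeps a word such as "인구수" from being split into "인구" plus an unknown character.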

The pattern generation unit 124 may receive the tokens from the natural language processing unit 122 and acquire the semantic resources corresponding to the received tokens from the knowledge base 300. For example, as shown in FIG. 3A, for the first query sentence Q1, "서울의 인구는?" ("What is the population of Seoul?"), the entity "<Entity_City>" may be acquired as the semantic resource of "서울" (Seoul), and the property "<Property_Population>" may be acquired as the semantic resource of "인구" (population). Accordingly, the first pattern P1 may include "<Entity_City>" and "<Property_Population>" as its semantic resources. Likewise, as shown in FIG. 3B, for the second query sentence Q2, "LA의 인구수를 알려주세요." ("Please tell me the population of LA."), the entity "<Entity_City>" may be acquired as the semantic resource of "LA", and the property "<Property_Population>" may be acquired as the semantic resource of "인구수" (population count). The second pattern P2 may therefore include "<Entity_City>" and "<Property_Population>" and be identical to the first pattern P1.
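The mapping from tokens to a pattern can be sketched as follows. The `RESOURCE_OF` table is a hypothetical stand-in for lookups against the knowledge base 300, and `build_pattern` is an illustrative name; tokens with no semantic resource, such as particles and punctuation, simply contribute nothing to the pattern.

```python
from typing import Dict, List, Tuple

# Hypothetical lookup table standing in for queries to the knowledge base 300.
RESOURCE_OF: Dict[str, str] = {
    "서울": "<Entity_City>",
    "LA": "<Entity_City>",
    "인구": "<Property_Population>",
    "인구수": "<Property_Population>",
}


def build_pattern(tokens: List[str],
                  resource_of: Dict[str, str] = RESOURCE_OF) -> Tuple[str, ...]:
    """Keep, in sentence order, the semantic resource of each token that has one."""
    return tuple(resource_of[t] for t in tokens if t in resource_of)


p1 = build_pattern(["서울", "의", "인구", "는", "?"])
p2 = build_pattern(["LA", "의", "인구수", "를", "알려주세요", "."])
```

Both sentences reduce to the same tuple, which is the sense in which the first and second patterns are identical even though the sentences use different words.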

In some embodiments, a pattern may define not only the combination of semantic resources but also their order. That is, even when patterns contain the same semantic resources, if the resources appear in different orders, the pattern indexing unit 140 may store them in the pattern storage unit 500 as different patterns.
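One simple way to obtain this order sensitivity, offered only as an assumption about a possible implementation, is to use an ordered tuple as the lookup key. The sketch below contrasts that with an unordered key, which would merge the two patterns.

```python
city_then_population = ("<Entity_City>", "<Property_Population>")
population_then_city = ("<Property_Population>", "<Entity_City>")

# Tuples compare element by element, so the two orders are distinct dictionary keys.
frames_by_pattern = {
    city_then_population: ["<Entity_City>의 <Property_Population>는?"],
    population_then_city: [],  # frames for the reversed order are kept separately
}

# An unordered key such as a frozenset would treat both orders as one pattern.
same_if_unordered = frozenset(city_then_population) == frozenset(population_then_city)
```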

The pattern generation unit 124 may also generate a frame based on the query sentence and the pattern. For example, as shown in FIG. 3A, based on the first query sentence Q1 and the first pattern P1, the pattern generation unit 124 may generate the first frame F11, that is, "<Entity_City>의 <Property_Population>는?". In the first frame F11, an instance such as "서울" (Seoul) or "LA" may be applied (or substituted) for the semantic resource corresponding to the entity, "<Entity_City>", and one of the synonyms such as "인구" (population) or "인구수" (population count) may be applied for the semantic resource corresponding to the property, "<Property_Population>". Accordingly, when the tokens of a query sentence that differs from the first query sentence Q1 but is analyzed as the first pattern P1 are applied to the first frame F11, a paraphrase sentence of that query sentence can be generated. Similarly, as shown in FIG. 3B, based on the second query sentence Q2 and the second pattern P2, the pattern generation unit 124 may generate the second frame F12, that is, "<Entity_City>의 <Property_Population>를 알려주세요.". As with the first frame F11, in the second frame F12 an instance such as "서울" or "LA" may be applied for "<Entity_City>", and one of the synonyms such as "인구" or "인구수" may be applied for "<Property_Population>". Accordingly, when the tokens of a query sentence that differs from the second query sentence Q2 but is analyzed as the first pattern P1 are applied to the second frame F12, a paraphrase sentence of that query sentence can be generated.

In the examples of FIGS. 3A and 3B, the first query sentence Q1 and the second query sentence Q2 correspond to the same pattern (that is, the first pattern P1 and the second pattern P2 are identical). Therefore, applying the tokens of the first query sentence Q1 to the second frame F12 may generate a paraphrase sentence of the first query sentence Q1, namely "서울의 인구를 알려주세요." ("Please tell me the population of Seoul."). Similarly, applying the tokens of the second query sentence Q2 to the first frame F11 may generate a paraphrase sentence of the second query sentence Q2, namely "LA의 인구는?" ("What is the population of LA?").
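As an illustration only (not part of the disclosed embodiments), the relationship between patterns, frames, and tokens described above can be sketched in Python. The representation of a pattern as an ordered tuple, the helper name `fill_frame`, and the token bindings (including the assumption that both queries carry the token "인구") are hypothetical choices made for this sketch.

```python
import re

# Assumption: a pattern is an ordered tuple of semantic-resource labels,
# so the same labels in a different order form a different pattern.
P1 = ("Entity_City", "Property_Population")
P2 = ("Entity_City", "Property_Population")

# Frames keep the sentence structure and mark replaceable tokens as <Label> slots.
F11 = "<Entity_City>의 <Property_Population>는?"
F12 = "<Entity_City>의 <Property_Population>를 알려주세요."


def fill_frame(frame: str, tokens: dict[str, str]) -> str:
    """Replace every <Label> slot with the token bound to that label."""
    return re.sub(r"<(\w+)>", lambda m: tokens[m.group(1)], frame)


# Hypothetical token bindings for the two query sentences.
q1_tokens = {"Entity_City": "서울", "Property_Population": "인구"}
q2_tokens = {"Entity_City": "LA", "Property_Population": "인구"}

if P1 == P2:  # identical patterns, so the frames are interchangeable
    print(fill_frame(F12, q1_tokens))  # 서울의 인구를 알려주세요.
    print(fill_frame(F11, q2_tokens))  # LA의 인구는?
```

Because tuples compare element by element, a reordered tuple such as `("Property_Population", "Entity_City")` would not equal `P1`, which mirrors the order-sensitive treatment of patterns described earlier.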

FIG. 4 shows an example of patterns and frames stored in the pattern storage unit 500 of FIG. 1 according to an exemplary embodiment of the present invention. As described above with reference to FIG. 1, the pattern indexing unit 140 may associate a pattern and a frame with each other and may store the mutually corresponding pattern and frame in the pattern storage unit 500. Hereinafter, FIG. 4 will be described with reference to FIG. 1.

As described above with reference to FIGS. 3A and 3B, a plurality of frames may correspond to one pattern. After the pattern corresponding to a given query sentence is analyzed, a plurality of paraphrase sentences may be generated by applying the tokens of the query sentence to the plurality of frames corresponding to that pattern. Accordingly, as shown in FIG. 4, the pattern storage unit 500 may include a plurality of groups G1 to Gn respectively corresponding to a plurality of patterns P1 to Pn (n is an integer greater than 1). Each of the groups G1 to Gn may include one pattern and a plurality of frames corresponding to that pattern. For example, the first group G1 may include the first pattern P1 and a plurality of frames F11 to F1x corresponding thereto (x is an integer greater than 1), the second group G2 may include the second pattern P2 and a plurality of frames F21 to F2y corresponding thereto (y is an integer greater than 1), and the n-th group Gn may include the n-th pattern Pn and a plurality of frames Fn1 to Fnz corresponding thereto (z is an integer greater than 1).

The pattern indexing unit 140 may store patterns and frames in the pattern storage unit 500 so that they have the structure shown in FIG. 4, and, owing to that structure, the sentence generation unit 160 may obtain the plurality of frames corresponding to a retrieved pattern. For example, when the pattern that the sentence generation unit 160 receives from the feature extraction unit 120 corresponds to the second pattern P2, the sentence generation unit 160 may search the pattern storage unit 500 for the second pattern P2 and may obtain the plurality of frames F21 to F2y corresponding to the second pattern P2. It will be understood that the example shown in FIG. 4 merely represents the correspondence between a pattern and a plurality of frames, and that any data structure may be used to implement the structure shown in FIG. 4. In addition, in some embodiments, a frame may have a tree structure, as described later with reference to FIG. 6A.
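Since any data structure may be used, one possible realization of the grouping of FIG. 4 is a mapping from a pattern to its list of frames. The following Python sketch is an assumption for illustration; the class name `PatternStore` and its methods `index` and `lookup` are hypothetical and do not appear in the disclosure.

```python
from collections import defaultdict


class PatternStore:
    """Hypothetical in-memory store: one group per pattern, many frames per group."""

    def __init__(self) -> None:
        self._groups: dict[tuple[str, ...], list[str]] = defaultdict(list)

    def index(self, pattern: tuple[str, ...], frame: str) -> None:
        """Associate a frame with a pattern (role of the pattern indexing unit)."""
        frames = self._groups[tuple(pattern)]
        if frame not in frames:  # avoid storing the same frame twice
            frames.append(frame)

    def lookup(self, pattern: tuple[str, ...]) -> list[str]:
        """Return all frames stored for a pattern (role of the pattern search)."""
        return list(self._groups.get(tuple(pattern), []))


store = PatternStore()
p = ("Entity_City", "Property_Population")
store.index(p, "<Entity_City>의 <Property_Population>는?")
store.index(p, "<Entity_City>의 <Property_Population>를 알려주세요.")

print(store.lookup(p))                                        # both frames
print(store.lookup(("Property_Population", "Entity_City")))   # [] (different order)
```

Keying the mapping on an ordered tuple keeps patterns with the same semantic resources in different orders in separate groups.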

FIG. 5 is a block diagram illustrating an example of the sentence generation unit 160 of FIG. 1 according to an exemplary embodiment of the present invention. As described above with reference to FIG. 1, the sentence generation unit 160' of FIG. 5 may receive a pattern and a frame corresponding to a query sentence from the feature extraction unit 120, and may generate a paraphrase sentence of the query sentence by referring to the patterns and frames stored in the pattern storage unit 500. As shown in FIG. 5, the sentence generation unit 160' may include a pattern search unit 162, a vocabulary replacement unit 164, and a sentence post-processing unit 166. Hereinafter, FIG. 5 will be described with reference to FIG. 1.

The pattern search unit 162 may receive a pattern from the feature extraction unit 120 and may search the pattern storage unit 500 for frames corresponding to the received pattern. As described above with reference to FIG. 4, the pattern storage unit 500 may be structured to include mutually corresponding patterns and frames, so the pattern search unit 162 may receive the frames corresponding to the pattern from the pattern storage unit 500 by searching the pattern storage unit 500 for the received pattern.

The vocabulary replacement unit 164 may receive a token list from the feature extraction unit 120 and may receive frames from the pattern search unit 162. As described above with reference to FIG. 1, the token list may include the tokens extracted from the query sentence, and the vocabulary replacement unit 164 may generate preliminary paraphrase sentences by applying the tokens in the token list to the frames received from the pattern search unit 162. For example, when the vocabulary replacement unit 164 receives the token list extracted from the first query sentence Q1 of FIG. 3A and receives the second frame F12 of FIG. 3B from the pattern search unit 162, it may generate the preliminary paraphrase sentence "서울의 인구수를 알려주세요." ("Please tell me the population count of Seoul.") by applying "서울" and "인구" in the token list to the second frame F12. Similarly, when the vocabulary replacement unit 164 receives the token list extracted from the second query sentence Q2 of FIG. 3B and receives the first frame F11 of FIG. 3A from the pattern search unit 162, it may generate the preliminary paraphrase sentence "LA의 인구수는?" ("What is the population count of LA?") by applying "LA" and "인구수" in the token list to the first frame F11.

In some embodiments, the vocabulary replacement unit 164 may generate a preliminary paraphrase sentence by applying a synonym of the word corresponding to a token to the frame. For example, when the vocabulary replacement unit 164 receives the token list extracted from the second query sentence Q2 of FIG. 3B and receives the first frame F11 of FIG. 3A from the pattern search unit 162, it may generate as preliminary paraphrase sentences not only "LA의 인구수는?" but also "로스앤젤레스의 인구수는?" ("What is the population count of Los Angeles?"), and, based on the second frame F12, it may also generate "로스앤젤레스의 인구수를 알려주세요." ("Please tell me the population count of Los Angeles."). In addition, in some embodiments, the vocabulary replacement unit 164 may generate a preliminary paraphrase sentence by changing the structure of the frame. Examples of the vocabulary replacement unit 164 generating preliminary paraphrase sentences by changing the structure of a frame will be described later with reference to FIGS. 6A and 6B.
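The synonym-based substitution described above can be sketched as follows. This is a minimal illustration under stated assumptions: the synonym table `SYNONYMS` and the function `expand_with_synonyms` are hypothetical, and in the described system the synonyms would instead come from the semantic resources of the knowledge base.

```python
import re
from itertools import product

# Hypothetical synonym table; each token maps to itself plus its synonyms.
SYNONYMS = {"LA": ["LA", "로스앤젤레스"]}


def expand_with_synonyms(frame: str, tokens: dict[str, str]) -> list[str]:
    """Fill the frame once for every combination of token synonyms."""
    # Unique slot labels in order of first appearance.
    labels = list(dict.fromkeys(re.findall(r"<(\w+)>", frame)))
    # For each slot, the candidate words (the token itself if no synonyms are known).
    choices = [SYNONYMS.get(tokens[label], [tokens[label]]) for label in labels]
    sentences = []
    for combo in product(*choices):
        binding = dict(zip(labels, combo))
        sentences.append(re.sub(r"<(\w+)>", lambda m: binding[m.group(1)], frame))
    return sentences


q2_tokens = {"Entity_City": "LA", "Property_Population": "인구수"}
print(expand_with_synonyms("<Entity_City>의 <Property_Population>는?", q2_tokens))
# ['LA의 인구수는?', '로스앤젤레스의 인구수는?']
```

Using the Cartesian product of the per-slot candidates means that adding synonyms for a second slot multiplies the number of preliminary sentences rather than requiring extra code.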

The sentence post-processing unit 166 may receive a preliminary paraphrase sentence from the vocabulary replacement unit 164 and may generate a paraphrase sentence by correcting the preliminary paraphrase sentence according to natural language rules. A preliminary paraphrase sentence, in which the vocabulary replacement unit 164 has applied tokens to a frame, may include a portion that violates a natural language rule. For example, the preliminary paraphrase sentence may include an inappropriate postpositional particle and/or word ending, and the sentence post-processing unit 166 may generate the paraphrase sentence as a natural sentence by correcting the particle and/or ending based on natural language rules.
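The disclosure does not specify which natural language rules are applied. As one hedged illustration, a well-known Korean rule is that the particle pairs 은/는, 이/가, and 을/를 are selected by whether the preceding syllable ends in a final consonant, which can be read from the Unicode code point of a precomposed Hangul syllable. The function names below are hypothetical, and the sketch only inspects the single character immediately following each token, so it can misfire when that character begins a new word rather than a particle.

```python
# (form after a final consonant, form after a vowel)
PARTICLE_PAIRS = [("은", "는"), ("이", "가"), ("을", "를")]


def has_final_consonant(ch: str) -> bool:
    """True if ch is a precomposed Hangul syllable (U+AC00..U+D7A3) with a final consonant."""
    code = ord(ch) - 0xAC00
    return 0 <= code <= 11171 and code % 28 != 0


def fix_particles(sentence: str, tokens: list[str]) -> str:
    """Correct the particle directly attached to each substituted token."""
    for tok in tokens:
        start = sentence.find(tok)
        if start < 0:
            continue
        end = start + len(tok)
        # Skip tokens at the end of the sentence or not ending in Hangul (e.g. "LA").
        if end >= len(sentence) or not ("가" <= tok[-1] <= "힣"):
            continue
        for closed_form, open_form in PARTICLE_PAIRS:
            if sentence[end] in (closed_form, open_form):
                right = closed_form if has_final_consonant(tok[-1]) else open_form
                sentence = sentence[:end] + right + sentence[end + 1:]
                break
    return sentence


print(fix_particles("서울의 면적는?", ["서울", "면적"]))            # 서울의 면적은?
print(fix_particles("서울의 인구수을 알려주세요.", ["서울", "인구수"]))  # 서울의 인구수를 알려주세요.
```

Other corrections, such as replacing a topic particle with a subject particle or adjusting verb endings, would need grammatical context and are outside this sketch.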

FIGS. 6A and 6B are diagrams illustrating examples of the operation of the vocabulary replacement unit 164 of FIG. 5 according to exemplary embodiments of the present invention. Specifically, FIGS. 6A and 6B show examples in which the vocabulary replacement unit 164 generates preliminary paraphrase sentences by changing the structure of a frame. The vocabulary replacement unit 164 may change the structure of the frame itself, or may change the structure of a preliminary paraphrase sentence obtained by applying tokens to the frame. Hereinafter, FIGS. 6A and 6B will be described with reference to FIG. 5, and descriptions that overlap between FIGS. 6A and 6B will be omitted. For convenience of explanation, FIGS. 6A and 6B show examples of changing the structure of a preliminary paraphrase sentence in which tokens have been applied to a frame.

Referring to FIG. 6A, the vocabulary replacement unit 164 may generate an additional preliminary paraphrase sentence by changing the position of an adverbial phrase in the frame. For example, the preliminary paraphrase sentence "스미스소니언 미술관은 미국의 어느 도시에 있어?" ("In which city in the United States is the Smithsonian Art Museum?") may be represented as a tree structure, as shown in the upper part of FIG. 6A. In FIG. 6A, the labels shown below the tokens are properties of the tokens and may indicate parts of speech, sentence constituents, and the like. For example, "VNP" may denote a positive copula phrase of the form "noun + 이다", "NP_SBJ" may denote a substantive (e.g., a noun, pronoun, or numeral) used as a subject, and "NP_AJT" may denote a substantive used as an adverbial (a modifier of a predicate). In addition, "NP_MOD" may denote a substantive used as an adnominal (a modifier of a substantive), and "DP" may denote a determiner phrase.

The vocabulary replacement unit 164 may change the position of the sub-tree corresponding to the adverbial phrase. For example, as shown in the lower part of FIG. 6A, the vocabulary replacement unit 164 may swap the positions of the sub-tree corresponding to "미국의 어느 도시에" ("in which city in the United States") and "스미스소니언 미술관은" ("the Smithsonian Art Museum", with a topic particle). Accordingly, the preliminary paraphrase sentence "미국의 어느 도시에 스미스소니언 미술관은 있어?" may be generated. The sentence post-processing unit 166 of FIG. 5 may then correct the particle (은 to 이) and finally generate the paraphrase sentence "미국의 어느 도시에 스미스소니언 미술관이 있어?" ("In which city in the United States is there a Smithsonian Art Museum?").
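The sub-tree relocation can be sketched with a small tree type. This is an illustrative assumption rather than the disclosed implementation: the `Node` class, the function `front_adverbial`, the exact shape of the tree, and the label "VP" for the predicate are hypothetical, since the tree of FIG. 6A is not reproduced here.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A constituent: a leaf carries text, an inner node carries children."""
    label: str
    text: str = ""
    children: list["Node"] = field(default_factory=list)

    def surface(self) -> str:
        """Read the sentence back from the tree, left to right."""
        if not self.children:
            return self.text
        return " ".join(child.surface() for child in self.children)


def front_adverbial(root: Node) -> Node:
    """Return a copy of root whose top-level adverbial sub-trees (NP_AJT) come first."""
    adverbials = [c for c in root.children if c.label == "NP_AJT"]
    others = [c for c in root.children if c.label != "NP_AJT"]
    return Node(root.label, root.text, adverbials + others)


tree = Node("S", children=[
    Node("NP_SBJ", "스미스소니언 미술관은"),
    Node("NP_AJT", children=[
        Node("NP_MOD", "미국의"),
        Node("DP", "어느"),
        Node("NP", "도시에"),
    ]),
    Node("VP", "있어?"),  # hypothetical label for the predicate
])

print(tree.surface())                   # 스미스소니언 미술관은 미국의 어느 도시에 있어?
print(front_adverbial(tree).surface())  # 미국의 어느 도시에 스미스소니언 미술관은 있어?
```

The reordered output still carries the topic particle 은, which is the kind of residue the sentence post-processing step is described as correcting.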

Referring to FIG. 6B, the vocabulary replacement unit 164 may generate an additional preliminary paraphrase sentence by adding one or more phrases. For example, the preliminary paraphrase sentence "스미스소니언 미술관이 있는 도시는?" ("The city where the Smithsonian Art Museum is located?") may be structured as shown on the left side of FIG. 6B. When the retrieved frame ends with a noun phrase, the vocabulary replacement unit 164 may generate an additional preliminary paraphrase sentence by adding a query-ending phrase. For example, as shown on the right side of FIG. 6B, the vocabulary replacement unit 164 may add "어디야" ("where is it") as a query-ending phrase to the frame ending with the noun phrase "도시는" ("the city"), thereby generating the preliminary paraphrase sentence "스미스소니언 미술관이 있는 도시는 어디야?" ("Where is the city where the Smithsonian Art Museum is located?").
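A minimal sketch of this rule follows, assuming the sentence is available as a flat list of labeled constituents. The function `add_closing_phrase`, the label "VP_MOD", and the convention that the question mark is added when the text is rendered are hypothetical.

```python
from typing import Optional


def add_closing_phrase(constituents: list[tuple[str, str]],
                       closing: str = "어디야") -> Optional[str]:
    """Append a query-ending phrase if the last constituent is a noun phrase.

    constituents: (label, text) pairs without final punctuation.
    Returns None when the rule does not apply.
    """
    if not constituents or not constituents[-1][0].startswith("NP"):
        return None
    words = [text for _, text in constituents]
    return " ".join(words) + " " + closing + "?"


sentence = [("VP_MOD", "스미스소니언 미술관이 있는"), ("NP", "도시는")]
print(add_closing_phrase(sentence))  # 스미스소니언 미술관이 있는 도시는 어디야?
```

Returning `None` for sentences that end in a predicate keeps the caller from emitting a duplicate of the original preliminary sentence.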

FIG. 7 is a flowchart illustrating a method of generating a paraphrase sentence according to an exemplary embodiment of the present invention. For example, the method of FIG. 7 may be performed by the paraphrase generation system 100 of FIG. 1. Specifically, steps S10, S20, and S30 may be performed by the feature extraction unit 120 of FIG. 1, and steps S40, S50, and S60 may be performed by the sentence generation unit 160. Hereinafter, FIG. 7 will be described with reference to FIG. 1.

Referring to FIG. 7, an operation of receiving a query sentence may be performed in step S10. For example, the feature extraction unit 120 may receive a query sentence from a user through a user interface, or may collect query sentences from a network.

In step S20, an operation of acquiring semantic resources may be performed. For example, the feature extraction unit 120 may acquire, from the knowledge base 300, semantic resources corresponding to the tokens extracted by natural language processing of the query sentence. As described above with reference to the drawings, the knowledge base 300 may be built on an ontology and may include verified knowledge data, and thus may be used to generate a paraphrase sentence of the query sentence. The natural language processing of the query sentence may be performed outside the paraphrase generation system 100 in some embodiments, and may be performed by the feature extraction unit 120 in other embodiments.

In step S30, an operation of generating a pattern and a frame may be performed. For example, the feature extraction unit 120 may generate a pattern as a combination of the semantic resources acquired based on the query sentence, and may generate a frame having a structure in which the tokens in the query sentence are replaceable. In addition, the feature extraction unit 120 may generate a token list including the extracted tokens.

In step S40, an operation of searching for a pattern may be performed. For example, the pattern storage unit 500 may store mutually corresponding patterns and frames, and one pattern may correspond to a plurality of frames. The sentence generation unit 160 may search the pattern storage unit 500 for the pattern generated in step S30 and may obtain the frames corresponding to the retrieved pattern from the pattern storage unit 500.

In step S50, an operation of applying tokens to a frame may be performed. For example, the sentence generation unit 160 may generate preliminary paraphrase sentences by applying the tokens in the token list to the frames obtained from the pattern storage unit 500. In some embodiments, the sentence generation unit 160 may generate an additional preliminary paraphrase sentence by using a synonym of a token, or by changing the structure of a frame or of a preliminary paraphrase sentence.

In step S60, post-processing of the sentence may be performed. For example, a preliminary paraphrase sentence generated in step S50 may include a portion that violates a natural language rule, and the sentence generation unit 160 may finally generate the paraphrase sentence by correcting the postpositional particles and/or endings of the preliminary paraphrase sentence according to natural language rules.
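The flow of steps S10 to S60 can be summarized in a short end-to-end sketch. Everything in it is an assumption for illustration: the toy dictionary `KNOWLEDGE_BASE` stands in for the ontology-based knowledge base 300, `PATTERN_STORE` stands in for a pre-populated pattern storage unit 500, token extraction is reduced to dictionary matching instead of real natural language processing, the frame generation of step S30 and the post-processing of step S60 are omitted, and the filtering of sentences identical to the query is an added convenience not stated in the description.

```python
import re

# Hypothetical knowledge base: token -> semantic resource label.
KNOWLEDGE_BASE = {
    "서울": "Entity_City",
    "LA": "Entity_City",
    "인구": "Property_Population",
    "인구수": "Property_Population",
}

# Hypothetical pre-populated pattern store: pattern -> frames.
PATTERN_STORE = {
    ("Entity_City", "Property_Population"): [
        "<Entity_City>의 <Property_Population>는?",
        "<Entity_City>의 <Property_Population>를 알려주세요.",
    ],
}

# Longest-first alternation so that "인구수" is preferred over "인구".
_TOKEN_RE = re.compile(
    "|".join(re.escape(t) for t in sorted(KNOWLEDGE_BASE, key=len, reverse=True))
)


def paraphrase(query: str) -> list[str]:
    tokens = _TOKEN_RE.findall(query)                    # S10-S20: tokens
    pattern = tuple(KNOWLEDGE_BASE[t] for t in tokens)   # S30: ordered pattern
    binding = dict(zip(pattern, tokens))                 # token list keyed by label
    frames = PATTERN_STORE.get(pattern, [])              # S40: pattern search
    results = []
    for frame in frames:                                 # S50: apply tokens
        sentence = re.sub(r"<(\w+)>", lambda m: binding[m.group(1)], frame)
        if sentence != query:                            # drop the original wording
            results.append(sentence)
    return results                                       # S60 (post-processing) omitted


print(paraphrase("서울의 인구는?"))            # ['서울의 인구를 알려주세요.']
print(paraphrase("LA의 인구수를 알려주세요."))   # ['LA의 인구수는?']
```

A query whose pattern is not in the store simply yields an empty list, which corresponds to the case where no frames are retrieved in step S40.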

As described above, exemplary embodiments have been disclosed in the drawings and the specification. Although the embodiments have been described herein using specific terms, these terms are used only to explain the technical idea of the present invention and are not used to limit its meaning or the scope of the present invention set forth in the claims. Therefore, those of ordinary skill in the art will understand that various modifications and other equivalent embodiments are possible. Accordingly, the true technical scope of protection of the present invention should be determined by the technical spirit of the appended claims.

Claims

A system for generating a paraphrase sentence, the system comprising:
a feature extraction unit configured to acquire semantic resources each including an ontology component corresponding to one of the tokens extracted from a query sentence, and to generate a pattern that is a combination of the semantic resources and a frame having a structure in which the tokens in the query sentence are replaceable;
a pattern storage unit storing a plurality of patterns and a plurality of frames including at least one frame corresponding to each of the plurality of patterns;
a pattern indexing unit configured to associate patterns and frames with each other and to store the mutually corresponding patterns and frames in the pattern storage unit; and
a sentence generation unit configured to generate a paraphrase sentence of the query sentence based on the frames corresponding to the pattern stored in the pattern storage unit,
wherein the feature extraction unit comprises a pattern generation unit configured to receive the semantic resources corresponding to the tokens from a knowledge base and to generate the pattern and the frame.

The system according to claim 1,
wherein the feature extraction unit is configured to obtain a plurality of query sentences through a network.

The system according to claim 1,
wherein the feature extraction unit further comprises
a natural language processing unit configured to extract the tokens by performing natural language processing on the query sentence.

The system according to claim 1,
wherein the pattern indexing unit is configured to store patterns whose semantic resources are in different orders in the pattern storage unit as different patterns.

The system according to claim 1,
wherein the sentence generation unit comprises:
a pattern search unit configured to search the pattern storage unit for frames corresponding to the pattern generated by the feature extraction unit;
a vocabulary replacement unit configured to generate at least one preliminary paraphrase sentence by applying the tokens to the retrieved frames; and
a sentence post-processing unit configured to generate the paraphrase sentence by correcting the at least one preliminary paraphrase sentence according to a natural language rule.

The system according to claim 5,
wherein the vocabulary replacement unit is configured to generate the at least one preliminary paraphrase sentence by applying synonyms of the tokens to the retrieved frames based on the semantic resources.

The system according to claim 5,
wherein the vocabulary replacement unit is configured to generate the at least one preliminary paraphrase sentence by changing a structure of the retrieved frame.

The system according to claim 7,
wherein the vocabulary replacement unit is configured to generate an additional preliminary paraphrase sentence by changing a position of an adverbial phrase in the retrieved frame.

The system according to claim 7,
wherein the vocabulary replacement unit is configured to generate an additional preliminary paraphrase sentence by adding a query-ending phrase when the retrieved frame ends with a noun phrase.