KR20200086586A

KR20200086586A - Knowledge extraction system using frame based on ontology

Info

Publication number: KR20200086586A
Application number: KR1020190002991A
Authority: KR
Inventors: 양승원; 곽수정
Original assignee: 주식회사 솔트룩스
Priority date: 2019-01-09
Filing date: 2019-01-09
Publication date: 2020-07-17
Also published as: KR102182619B1

Abstract

Provided is a knowledge extraction system capable of automatically extracting knowledge. According to the present invention, the knowledge extraction system receives or collects unstructured documents through a network and generates triple knowledge, and comprises: a frame generation unit generating a sentence frame from an unstructured document; a partial frame generation unit generating partial frames by dividing the sentence frame; a knowledge extraction rule generation unit generating a knowledge extraction rule from the partial frames generated from a rule target unstructured document among the unstructured documents; and a knowledge extraction unit generating, by using the knowledge extraction rule, triple knowledge from the partial frames generated from a knowledge target unstructured document among the unstructured documents.

Description

Knowledge extraction system using frame based on ontology

The present invention relates to a knowledge extraction system and, more particularly, to a knowledge extraction system using ontology-based frames.

The present invention was derived from research conducted as part of the SW Computing Industry Source Technology Development Project (SW) of the Ministry of Science, ICT and Future Planning, led by Bit Computer Co., Ltd. with Saltlux Co., Ltd. as joint researcher. [Research period: 2018.04.01~2019.03.31; research management agency: Information and Communication Technology Promotion Center; research project name: Development of a machine learning-based medical HER cloud service; project identification number: 2017-0-03355]

An ontology is a formal, explicit specification of a shared conceptualization. It is a set that formally defines the terms related to specific knowledge and the relationships among those terms, in a form that stores knowledge logically so that a computer can process it.

An ontology is traditionally built by having experts in the target field read sentences directly and enter data in the required format. This traditional approach, however, takes too much time and cost to extract and store knowledge.

Accordingly, methods have been proposed that extract knowledge from unstructured documents (natural language documents) by using machine learning, or that define patterns in advance and extract knowledge based on those patterns. However, the machine learning approach requires a great deal of time and effort to build new training data for each field (domain), and the pattern-based approach likewise incurs a high cost to build the patterns.

The technical problem addressed by the present invention is to provide a knowledge extraction system capable of automatically extracting knowledge by using ontology-based frames.

A knowledge extraction system according to one aspect of the technical idea of the present invention for achieving the above technical problem receives or collects unstructured documents through a network and generates triple knowledge, and includes: a frame generation unit that generates a sentence frame from an unstructured document; a partial frame generation unit that divides the sentence frame to generate partial frames; a knowledge extraction rule generation unit that generates a knowledge extraction rule from the partial frames generated from a rule target unstructured document among the unstructured documents; and a knowledge extraction unit that uses the knowledge extraction rule to generate the triple knowledge from the partial frames generated from a knowledge target unstructured document among the unstructured documents.

The knowledge extraction rule generation unit may include: a rule extraction target sentence selection unit that divides the rule target unstructured document into sentences and then selects sentences containing both a subject entity and an object entity as rule extraction target sentences; a rule generation unit that generates a plurality of knowledge extraction rule candidates from the partial frames generated from the rule extraction target sentences through the frame generation unit and the partial frame generation unit; and a rule verification unit that verifies the plurality of knowledge extraction rule candidates to select the knowledge extraction rule.

The rule verification unit may include: a reliability analysis unit that analyzes the reliability of the plurality of knowledge extraction rule candidates; and a rule selection unit that selects the knowledge extraction rule from among the plurality of knowledge extraction rule candidates based on the reliability analyzed by the reliability analysis unit.

The rule selection unit may select, as the knowledge extraction rule, the candidates among the plurality of knowledge extraction rule candidates whose reliability for the predicate under reliability analysis is greater than a set value between 0 and 1, or the candidate with the greatest reliability.
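The selection criterion above can be illustrated with a minimal sketch. The function name `select_rules`, the rule identifiers, and the reliability scores are hypothetical; the source only states that candidates above a set value between 0 and 1, or the single most reliable candidate, may be selected.

```python
from typing import Dict, List, Optional


def select_rules(candidates: Dict[str, float],
                 threshold: Optional[float] = None) -> List[str]:
    """Select rule candidates for one predicate by reliability.

    `candidates` maps a (hypothetical) rule identifier to a reliability
    score in [0, 1]. With a threshold, every candidate strictly above it
    is kept; without one, only the most reliable candidate is kept.
    """
    if not candidates:
        return []
    if threshold is not None:
        if not 0.0 < threshold < 1.0:
            raise ValueError("threshold must lie between 0 and 1")
        return [rule for rule, score in candidates.items() if score > threshold]
    # No threshold given: fall back to the candidate with the greatest reliability.
    return [max(candidates, key=candidates.get)]


# Illustrative scores only; they are not taken from the patent.
candidates = {
    "Human-SBJ create Artifact-OBJ": 0.9,
    "Place-SBJ create Artifact-OBJ": 0.2,
}
print(select_rules(candidates, threshold=0.5))
print(select_rules(candidates))
```

How the reliability itself is computed is not specified in this passage, so the sketch treats the scores as given inputs.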

The frame generation unit may include: an entity linking unit that links sentence entities to ontology instances of ontology data by referring to the morphological analysis result and the syntactic parsing result for the unstructured document; and a sentence frame generation unit that generates a sentence frame by using a frame generation rule.

The partial frame generation unit may divide the sentence frame to generate a number of partial frames equal to or smaller than the number of predicates in the sentence frame.

The partial frame generation unit may generate a partial frame for each predicate in the sentence frame that has both a subject entity and an object entity.

The knowledge extraction rule is a rule defined for each predicate, and each partial frame may be generated so as to include the sentence entities that are related through one of the predicates in the unstructured document.

With the knowledge extraction system according to the present invention, experts in the field do not need to read sentences and enter data in a required format, build training data, or build patterns. Instead, the system can automatically extract triple knowledge in triple form from unstructured sentences by using semantically abstracted knowledge extraction rules that are generated and verified automatically. The knowledge extraction system according to the present invention can therefore extract triple knowledge automatically and build a knowledge base without consuming a great deal of time, effort, and cost.

FIG. 1 is a block diagram showing a knowledge extraction system according to an exemplary embodiment of the present invention.
FIG. 2 is a block diagram showing a knowledge extraction system according to an exemplary embodiment of the present invention.
FIG. 3 is a block diagram showing the configuration of a frame generation unit of a knowledge extraction system according to an exemplary embodiment of the present invention.
FIG. 4 is a block diagram showing the configuration of a rule generation unit and a rule verification unit of a knowledge extraction system according to an exemplary embodiment of the present invention.
FIG. 5 is a block diagram showing the configuration of a natural language understanding unit of a knowledge extraction system according to an exemplary embodiment of the present invention.
FIG. 6 is a conceptual diagram illustrating a sentence frame generated by a frame generation unit of a knowledge extraction system according to an exemplary embodiment of the present invention.
FIG. 7 is a conceptual diagram illustrating a partial frame generated by a partial frame generation unit of a knowledge extraction system according to an exemplary embodiment of the present invention.
FIG. 8 is a block diagram showing a knowledge extraction system according to an exemplary embodiment of the present invention.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings. The embodiments are provided to explain the present invention more completely to a person having average knowledge in the art. Because the present invention may be modified in various ways and may take various forms, specific embodiments are illustrated in the drawings and described in detail. This is not intended to limit the present invention to the specific forms disclosed, and the invention should be understood to include all modifications, equivalents, and substitutes falling within its spirit and technical scope. In describing the drawings, similar reference numerals are used for similar components. In the accompanying drawings, the dimensions of structures are shown enlarged or reduced relative to their actual size for clarity.

The terms used in this specification are used only to describe specific embodiments and are not intended to limit the present invention. A singular expression includes the plural unless the context clearly indicates otherwise. In this specification, terms such as "include" or "have" are intended to indicate that a feature, number, step, operation, component, part, or combination thereof described in the specification exists, and should be understood not to exclude in advance the existence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless otherwise defined, all terms used herein, including technical or scientific terms, have the same meaning as commonly understood by a person having ordinary skill in the art to which the present invention pertains. Terms such as those defined in commonly used dictionaries should be interpreted as having meanings consistent with their meanings in the context of the related art, and are not to be interpreted in an idealized or excessively formal sense unless explicitly so defined herein.

In the following drawings and description, a component shown or described as one block, for example a '~ unit' or '~ module', may be a hardware block or a software block. For example, each component may be an independent hardware block that exchanges signals with the others, or a software block executed on a single processor.

FIG. 1 is a block diagram showing a knowledge extraction system according to an exemplary embodiment of the present invention.

Referring to FIG. 1, the knowledge extraction system 1 includes a knowledge extraction rule generation unit 100 and a knowledge extraction unit 200. The knowledge extraction rule generation unit 100 generates a knowledge extraction rule 150 from a plurality of unstructured documents UD, and the knowledge extraction unit 200 may use the generated knowledge extraction rule 150 to extract triple knowledge 600 from the unstructured documents UD.

The triple knowledge 600 may be, for example, in RDF (Resource Description Framework) form, and may represent knowledge in a structure of the form '<subject>', '<predicate>', '<object>', in which the characteristic of the relationship between the resources '<subject>' and '<object>' is expressed by the '<predicate>'. This relationship is directed from the '<subject>' to the '<object>' and may be called a property.
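As a minimal sketch of this subject, predicate, object structure, a triple can be modeled as a small immutable record. The class name `Triple`, the `http://example.org/` base IRI, and the identifier spellings are assumptions made for illustration; the serialization follows the general N-Triples shape of three IRIs followed by a period.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One unit of triple knowledge: a directed relation from subject to object."""
    subject: str
    predicate: str
    obj: str

    def to_ntriples(self, base: str = "http://example.org/") -> str:
        # `base` is a placeholder namespace, not one defined by the patent.
        return (f"<{base}{self.subject}> "
                f"<{base}{self.predicate}> "
                f"<{base}{self.obj}> .")


# Example sentence used elsewhere in the description: Yi Sun-sin made the turtle ship.
t = Triple("Yi_Sun-sin", "create", "Turtle_ship")
print(t.to_ntriples())
```

The direction of the relation is carried by field order: swapping `subject` and `obj` would state a different fact.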

In some embodiments, rule extraction target sentences (120 in FIG. 2) are selected from some of the plurality of unstructured documents UD, the knowledge extraction rule 150 is generated using the selected rule extraction target sentences 120, and the triple knowledge 600 may then be extracted from the plurality of unstructured documents UD using the generated knowledge extraction rule 150.

In some other embodiments, rule extraction target sentences (120 in FIG. 2) are selected from some of the plurality of unstructured documents UD, the knowledge extraction rule 150 is generated using the selected rule extraction target sentences 120, and the triple knowledge 600 may then be extracted from the remainder of the plurality of unstructured documents UD using the generated knowledge extraction rule 150.

In still other embodiments, the unstructured documents used to select the rule extraction target sentences 120 (UD-R in FIG. 8) and the unstructured documents used to extract the triple knowledge 600 (UD-K in FIG. 8) may each be prepared separately. For example, rule target unstructured documents UD-R suited to generating the knowledge extraction rule 150 may be prepared separately and used to generate the knowledge extraction rule 150, after which the triple knowledge 600 may be extracted from separate, arbitrarily collected knowledge target unstructured documents UD-K.

In some embodiments, the knowledge extraction system 1 may receive the unstructured documents UD through the network 50. In some other embodiments, the knowledge extraction system 1 may further include a document collection robot for collecting the unstructured documents UD through the network 50.

The network 50 may include anything capable of exchanging data over wired or wireless links, such as a wired Internet service, a local area network (LAN), a wide area network (WAN), an intranet, a wireless Internet service, a mobile computing service, a wireless data communication service, a wireless Internet access service, a satellite communication service, a wireless LAN, or Bluetooth. When the network 50 is connected to a smartphone, tablet, or the like, the network 50 may be a wireless data communication service such as 3G, 4G, or 5G, a wireless LAN such as Wi-Fi, Bluetooth, or the like.

The knowledge extraction rule generation unit 100 may generate the knowledge extraction rule 150 from the unstructured documents UD by referring to the ontology data 500. The knowledge extraction rule generation unit 100 may use the natural language understanding unit 300 to perform natural language analysis on the unstructured documents UD and then generate the knowledge extraction rule from the results of that analysis.

An ontology refers to a technique that identifies and conceptualizes the characteristics of the things and events in the world and puts them into a database-like form. By clearly defining and describing in detail the concepts of the resources that an information system deals with, it makes it possible to find accurate information about related things and events.

The ontology data 500 may be data in which verbs, nouns (entities), and the like are defined. The ontology data 500 has connectivity among its data. It may be an organic data set in which this connectivity makes it possible to determine the synonyms, antonyms, and characteristics of each verb and to perform intent analysis through verbs, and in which the same connectivity makes it possible to use synonyms, antonyms, and all of the attribute information held by each entity.

The ontology data 500 may consist of, for example, four elements: Class, Instance, Relation, and Property. A Class is a name that we generally give to a thing or concept, such as "computer", "keyboard", or "monitor". An Instance is a concrete form of a Class; items such as "Samsung Electronics SM-500 computer", "LG Electronics X keyboard", and "Samsung Electronics SM-520 monitor" can generally be regarded as Instances. A Property holds, as a Value, a specific characteristic of or piece of information about a Class or Instance. For a Samsung Electronics 12-inch SM-520 monitor, the Instance "Samsung Electronics SM-520 monitor" has the value 12 inches for the Property Size. A Relation can express a relationship among Classes and Instances. As in "a person is an animal", the Class "person" and the Class "animal" can be expressed through a Relation such as 'isA' in the form '"person" isA "animal"'.
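The four elements can be sketched as a small in-memory structure. This is only an illustration under assumed names (`Ontology`, `Instance`, `add_instance`, `relate`); the source does not describe how the ontology data 500 is actually stored.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class Instance:
    """A concrete member of a class, with property values."""
    name: str
    cls: str
    properties: Dict[str, str] = field(default_factory=dict)


@dataclass
class Ontology:
    classes: Set[str] = field(default_factory=set)
    instances: Dict[str, Instance] = field(default_factory=dict)
    # Each relation is stored as (source, relation name, target).
    relations: List[Tuple[str, str, str]] = field(default_factory=list)

    def add_instance(self, name: str, cls: str, **properties: str) -> None:
        self.classes.add(cls)
        self.instances[name] = Instance(name, cls, dict(properties))

    def relate(self, source: str, relation: str, target: str) -> None:
        self.relations.append((source, relation, target))


onto = Ontology()
# Instance with a Size property, as in the monitor example.
onto.add_instance("Samsung Electronics SM-520 monitor", "Monitor", Size="12 inches")
# Relation between two classes, as in the "person isA animal" example.
onto.classes.update({"Person", "Animal"})
onto.relate("Person", "isA", "Animal")
print(onto.instances["Samsung Electronics SM-520 monitor"].properties["Size"])
```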

The knowledge extraction rule generation unit 100 refers to the ontology data 500, an organic data set with connectivity among its data, and to the natural language analysis results for the unstructured documents UD, for example the morphological analysis and syntactic parsing results. It links the sentence entities in the unstructured documents UD to the connectivity among the data in the ontology data 500 to generate sentence frames and partial frames, and may then generate and verify the knowledge extraction rule 150 from the partial frames. This is described in detail with reference to FIGS. 2 to 7.

The knowledge extraction unit 200 refers to the natural language analysis results for the unstructured documents UD, for example the morphological analysis and syntactic parsing results, and links the sentence entities in the unstructured documents UD to the connectivity among the data in the ontology data 500 to generate sentence frames and partial frames. It may then map the partial frames to the knowledge extraction rule 150 and generate the triple knowledge 600.
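The mapping step can be sketched as a lookup from an abstracted partial frame to a rule. The rule table keyed by (predicate, subject class, object class), the property name `created`, and the dictionary layout of a partial frame are all assumptions for illustration; the source does not fix these representations.

```python
from typing import Dict, Optional, Tuple

RuleKey = Tuple[str, str, str]  # (predicate, subject class, object class)


def extract_triple(partial_frame: Dict[str, str],
                   instance_class: Dict[str, str],
                   rules: Dict[RuleKey, str]) -> Optional[Tuple[str, str, str]]:
    """Map one partial frame onto a rule and emit a triple if a rule matches."""
    sbj, obj = partial_frame["SBJ"], partial_frame["OBJ"]
    key = (partial_frame["predicate"],
           instance_class.get(sbj, ""),
           instance_class.get(obj, ""))
    relation = rules.get(key)
    if relation is None:
        return None  # No rule covers this predicate and class combination.
    return (sbj, relation, obj)


# Hypothetical rule: a Human who "create"s an Artifact yields the property "created".
rules = {("create", "Human", "Artifact"): "created"}
instance_class = {"Sejong": "Human", "Hangul": "Artifact", "Seoul": "Place"}

frame = {"predicate": "create", "SBJ": "Sejong", "OBJ": "Hangul"}
print(extract_triple(frame, instance_class, rules))
```

A frame whose entity classes do not match any rule simply yields no triple, which keeps unmatched sentences out of the knowledge base.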

The triple knowledge 600 generated by the knowledge extraction system 1 may be stored in a knowledge base 700. In some embodiments, the knowledge base 700 may be connected to the knowledge extraction system 1 through the network 50.

The knowledge base 700 may accumulate usable knowledge, or knowledge data, that is, data holding the entities needed to realize knowledge and the relationships among them; in other words, the triple knowledge 600. In some embodiments, the triple knowledge 600 generated by the knowledge extraction system 1 may also be accumulated in the ontology data 500.

Triple knowledge is also accumulated in the ontology data 500 and/or the knowledge base 700. In this specification, however, "the triple knowledge 600" or "the generated triple knowledge" means the triple knowledge generated by the knowledge extraction system 1, whereas the triple knowledge accumulated in the ontology data 500 and in the knowledge base 700 is referred to, respectively, as the triple knowledge of the ontology data 500 and the triple knowledge of the knowledge base 700.

In some embodiments, the knowledge extraction system 1 may not include the ontology data 500 and may instead extract the triple knowledge 600 by using the knowledge base 700 as ontology data.

FIG. 2 is a block diagram showing a knowledge extraction system according to an exemplary embodiment of the present invention.

Referring to FIG. 2, the knowledge extraction system 1a includes a knowledge extraction rule generation unit 100 and a knowledge extraction unit 200. The knowledge extraction system 1a may receive the unstructured documents UD through the network 50. The knowledge extraction rule generation unit 100 generates a knowledge extraction rule 150 from a plurality of unstructured documents UD, and the knowledge extraction unit 200 may use the generated knowledge extraction rule 150 to extract triple knowledge 600 from the unstructured documents UD.

The knowledge extraction rule generation unit 100 may include a rule extraction target sentence selection unit 110, a frame generation unit 412, a partial frame generation unit 452, and a rule generation unit 130. In some embodiments, the knowledge extraction rule generation unit 100 may further include a rule verification unit 140.

The rule extraction target sentence selection unit 110 divides the unstructured documents UD into sentences and then selects sentences containing both a subject entity and an object entity as rule extraction target sentences 120. The rule extraction target sentence selection unit 110 may use the natural language understanding unit 300 to select the rule extraction target sentences 120.
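The selection criterion can be sketched as a filter over already-analyzed sentences. The input layout (a `text` field plus `entities` tagged with `SBJ` or `OBJ` roles) is a hypothetical stand-in for the output of the natural language understanding unit, and the example sentences are illustrative only.

```python
from typing import Dict, List


def select_rule_sentences(analyzed: List[Dict]) -> List[str]:
    """Keep only sentences that contain both a subject entity and an object entity."""
    selected = []
    for sentence in analyzed:
        roles = {entity["role"] for entity in sentence["entities"]}
        if {"SBJ", "OBJ"} <= roles:
            selected.append(sentence["text"])
    return selected


analyzed = [
    {"text": "Yi Sun-sin made the turtle ship.",
     "entities": [{"surface": "Yi Sun-sin", "role": "SBJ"},
                  {"surface": "turtle ship", "role": "OBJ"}]},
    # Subject only, so this sentence is not a rule extraction target.
    {"text": "The fleet sailed.",
     "entities": [{"surface": "fleet", "role": "SBJ"}]},
]
print(select_rule_sentences(analyzed))
```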

In some embodiments, when the rule extraction target sentence selection unit 110 selects the rule extraction target sentences 120, it does not perform every natural language analysis that the natural language understanding unit 300 can perform on the sentences in the unstructured documents UD. It may perform only the part of the analysis, such as morphological analysis and syntactic parsing, that is sufficient to distinguish subject entities from object entities.

In some other embodiments, when the rule extraction target sentence selection unit 110 selects the rule extraction target sentences 120, every natural language analysis that the natural language understanding unit 300 can perform is carried out on the sentences in the unstructured documents UD, after which only the analysis results sufficient to distinguish subject entities from object entities are consulted.

The frame generation unit 412 refers to the morphological analysis and syntactic parsing results that the natural language understanding unit 300 produces for the rule extraction target sentences 120, links the sentence entities to ontology instances in the ontology data 500, and then generates a sentence frame by using a frame generation rule. The sentence frame is described concretely with an example in FIG. 6.
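The entity linking step can be sketched, under strong simplifying assumptions, as an exact-match lookup of each entity's surface form in an index of ontology instances. The `onto:` identifiers, the field names, and the exact-match strategy are hypothetical; the source does not describe how ambiguous mentions are resolved.

```python
from typing import Dict, List


def link_entities(entities: List[Dict[str, str]],
                  instance_index: Dict[str, str]) -> List[Dict[str, str]]:
    """Attach an ontology instance id to each entity found in the index.

    Entities without a matching instance are dropped in this sketch.
    """
    linked = []
    for entity in entities:
        instance = instance_index.get(entity["surface"])
        if instance is not None:
            linked.append({**entity, "instance": instance})
    return linked


# Hypothetical index from surface forms to ontology instance identifiers.
instance_index = {
    "Yi Sun-sin": "onto:Yi_Sun-sin",
    "turtle ship": "onto:Turtle_ship",
}
entities = [
    {"surface": "Yi Sun-sin", "role": "SBJ"},
    {"surface": "turtle ship", "role": "OBJ"},
    {"surface": "yesterday", "role": "ADV"},
]
linked = link_entities(entities, instance_index)
print(linked)
```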

The partial frame generation unit 452 generates partial frames from the sentence frame generated by the frame generation unit 412. The partial frame generation unit 452 may generate each partial frame so that, among the sentence entities in the sentence frame, it includes only those related through one particular predicate. The partial frame generation unit 452 may generate a number of partial frames equal to or smaller than the number of predicates in the sentence frame. The partial frame is described concretely with an example in FIG. 7.
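A minimal sketch of this division is shown below, assuming a hypothetical sentence frame layout in which each predicate carries its own `SBJ` and `OBJ` arguments. Keeping only predicates that have both arguments follows one of the embodiments in the summary, and it is why the number of partial frames never exceeds the number of predicates.

```python
from typing import Dict, List


def split_into_partial_frames(sentence_frame: Dict) -> List[Dict[str, str]]:
    """Produce at most one partial frame per predicate in the sentence frame."""
    partial_frames = []
    for predicate in sentence_frame["predicates"]:
        args = predicate["arguments"]
        # Skip predicates lacking either a subject entity or an object entity.
        if "SBJ" in args and "OBJ" in args:
            partial_frames.append({
                "predicate": predicate["lemma"],
                "SBJ": args["SBJ"],
                "OBJ": args["OBJ"],
            })
    return partial_frames


# Illustrative frame for an invented sentence with two predicates.
sentence_frame = {
    "predicates": [
        {"lemma": "create", "arguments": {"SBJ": "Yi Sun-sin", "OBJ": "turtle ship"}},
        {"lemma": "win", "arguments": {"SBJ": "Yi Sun-sin"}},
    ]
}
partial_frames = split_into_partial_frames(sentence_frame)
print(partial_frames)
```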

The rule generation unit 130 may generate the knowledge extraction rule 150 by referring to the triple knowledge that the ontology data 500 holds for the predicates of the partial frames generated by the partial frame generation unit 452. In some embodiments, the rule generation unit 130 generates knowledge extraction rule candidates (135 in FIG. 4), and the rule verification unit 140 verifies the knowledge extraction rule candidates 135 to select the knowledge extraction rule 150.

The knowledge extraction system 1a according to an exemplary embodiment of the present invention refers to the ontology data 500 and uses ontology-based frames, such as sentence frames or partial frames, and can thereby generate rules that abstract sentences semantically. For example, a frame representing the sentence 'Yi Sun-sin makes the turtle ship' is semantically abstracted to 'a person (Human-SBJ) makes an artifact (Artifact-OBJ)' and can be defined as a rule for the predicate 'make (create)'. The knowledge extraction rule 150 may be generated by collecting the rules defined for each predicate from the rule extraction target sentences 120.
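The abstraction in this example can be sketched by replacing each linked entity with its ontology class. The function name `abstract_rule`, the dictionary layout of a partial frame, and the instance-to-class table are assumptions for illustration; only the Human-SBJ, Artifact-OBJ, and create labels come from the example above.

```python
from typing import Dict, Tuple


def abstract_rule(partial_frame: Dict[str, str],
                  instance_class: Dict[str, str]) -> Tuple[str, str, str]:
    """Abstract a partial frame into a per-predicate rule over ontology classes."""
    subject_class = instance_class[partial_frame["SBJ"]]
    object_class = instance_class[partial_frame["OBJ"]]
    return (f"{subject_class}-SBJ", partial_frame["predicate"], f"{object_class}-OBJ")


# Hypothetical class assignments taken from ontology data.
instance_class = {"Yi Sun-sin": "Human", "turtle ship": "Artifact"}
partial_frame = {"predicate": "create", "SBJ": "Yi Sun-sin", "OBJ": "turtle ship"}

rule = abstract_rule(partial_frame, instance_class)
print(rule)
```

Because the rule mentions classes rather than specific entities, it can match other sentences in which any person creates any artifact.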

The knowledge extraction unit 200 may include a frame generation unit 414, a partial frame generation unit 454, a knowledge extraction rule mapping unit 210, and a triple generation unit 220.

The frame generation unit 414 refers to the morphological analysis results and parsing results that the natural language understanding unit 300 produces for the unstructured document UD, links the sentence entities to ontology instances in the ontology data 500, and then generates a sentence frame using a frame generation rule. The partial frame generation unit 454 divides the sentence frame generated by the frame generation unit 414 into partial frames.

The frame generation unit 412 and the partial frame generation unit 452 of the knowledge extraction rule generation unit 100 operate in substantially the same way as the frame generation unit 414 and the partial frame generation unit 454 of the knowledge extraction unit 200, respectively, except that the former analyze the rule extraction target sentences 120 and the latter analyze the unstructured document UD. A detailed description is therefore omitted.

In some embodiments, the frame generation unit 412 and the partial frame generation unit 452 of the knowledge extraction rule generation unit 100 may share the same components as the frame generation unit 414 and the partial frame generation unit 454 of the knowledge extraction unit 200, respectively. This is described in detail with reference to FIG. 8.

The knowledge extraction rule mapping unit 210 may map each partial frame generated by the partial frame generation unit 454 against the knowledge extraction rule 150 to determine whether knowledge is to be extracted from that partial frame.

The triple generation unit 220 may generate the triple knowledge 600 by converting the entities of the partial frame, that is, the sentence entities, together with the property information of the knowledge extraction rule, into knowledge in triple form.

The knowledge extraction unit 200 receives the sentences of a plurality of unstructured documents UD as input and uses the knowledge extraction rule 150 generated by the knowledge extraction rule generation unit 100 to extract the triple knowledge 600, that is, knowledge in triple form. First, each input sentence is converted into one or more partial frames by the frame generation unit 414 and the partial frame generation unit 454. Each partial frame is then mapped against the rules contained in the knowledge extraction rule 150, and when an applicable rule is found, knowledge is extracted from the sentence to generate the triple knowledge 600.
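
The mapping and triple generation steps can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the `PartialFrame` and `Rule` classes, the exact-match comparison on predicate and ontology classes, and the property name `created` are all hypothetical and only illustrate how a matched rule could supply the property of an extracted triple.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartialFrame:
    predicate: str
    subject: tuple  # (surface text, ontology class)
    obj: tuple      # (surface text, ontology class)


@dataclass(frozen=True)
class Rule:
    predicate: str
    subject_class: str
    object_class: str
    prop: str  # property emitted in the triple (hypothetical name)


def extract_triples(frames, rules):
    """Emit (subject, property, object) for every frame that matches a rule."""
    triples = []
    for frame in frames:
        for rule in rules:
            signature = (frame.predicate, frame.subject[1], frame.obj[1])
            if signature == (rule.predicate, rule.subject_class, rule.object_class):
                triples.append((frame.subject[0], rule.prop, frame.obj[0]))
    return triples


frames = [
    PartialFrame("Make", ("Yi Sun-sin", "Human"), ("turtle ship", "Artifact")),
    PartialFrame("Be", ("Yi Sun-sin", "Human"), ("general", "Designation")),
]
rules = [Rule("Make", "Human", "Artifact", "created")]
triples = extract_triples(frames, rules)
print(triples)
```

In this toy input only the 'Make' frame has an applicable rule, so the 'Be' frame yields no triple, mirroring the case where no applicable rule is found.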

The triple knowledge 600 generated by the triple generation unit 220 may be stored in the knowledge base 700. In some embodiments, the knowledge base 700 may be connected to the knowledge extraction system 1a through the network 50.

In some embodiments, the triple knowledge 600 generated by the knowledge extraction system 1a may also be accumulated in the ontology data 500.

As the ontology data 500 is expanded in this way and the volume of unstructured documents UD processed by the knowledge extraction system 1a grows, the knowledge extraction rule generation unit 100 may again generate and verify rules to update the knowledge extraction rule 150. Further triple knowledge 600 can be extracted with the updated knowledge extraction rule 150, and this virtuous cycle allows the ontology data 500 and/or the knowledge base 700 to be expanded continuously.

In some embodiments, the knowledge extraction system 1a does not include the ontology data 500 and may instead extract the triple knowledge 600 by using the knowledge base 700 as ontology data.

FIG. 3 is a block diagram showing the configuration of the frame generation unit of a knowledge extraction system according to an exemplary embodiment of the present invention. FIG. 3 is described on the assumption that the frame generation unit 412 and the partial frame generation unit 452 of the knowledge extraction rule generation unit 100 are identical to the frame generation unit 414 and the partial frame generation unit 454 of the knowledge extraction unit 200, respectively. The description covers generating the sentence frame 440 from the unstructured document UD, but the sentence frame 440 may be generated in the same way from the rule extraction target sentences 120 shown in FIG. 2.

Referring to FIG. 3, the frame generation unit 412/414 includes an entity linking unit 420 and a sentence frame generation unit 430.

The frame generation unit 412/414 may receive an unstructured document UD consisting of natural language sentences and generate a sentence frame 440 for each sentence of the unstructured document UD. When a sentence of the unstructured document UD is input, morphological analysis is first performed with reference to the natural language understanding unit 300, and parsing is then performed on the morphological analysis results. Based on the morphological analysis and parsing results, tokenization proceeds through various processing of nouns and verbs (compound noun processing, verb combination, and so on), and the entity linking unit 420 may link each token to an instance stored in the ontology data 500. The sentence frame generation unit 430 generates the sentence frame 400 using the linked entity instance information, grammatical morpheme information (particles, endings, and so on), and the frame generation rule 435. The generated sentence frame 400 may be used by the knowledge extraction rule generation unit (100 in FIGS. 1 and 2) and the knowledge extraction unit (200 in FIGS. 1 and 2). An example of the generated sentence frame 400 is described with reference to FIG. 6.
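
The entity linking step can be illustrated with a small Python sketch. It assumes that morphological analysis has already split the sentence '이순신은 거북선을 만들다' into noun and particle pairs. The `ONTOLOGY` dictionary is a toy stand-in for the ontology data 500, and the `PARTICLE_ROLE` table is a deliberate simplification of Korean case marking; both are illustrative assumptions rather than the specification's actual data.

```python
# Toy stand-in for the ontology data 500: surface form -> ontology class.
ONTOLOGY = {"이순신": "Human", "거북선": "Artifact"}

# Simplified particle-to-role table (assumption; real analysis is richer).
PARTICLE_ROLE = {
    "은": "SBJ", "는": "SBJ", "이": "SBJ", "가": "SBJ",
    "을": "OBJ", "를": "OBJ",
}


def link_tokens(tokens):
    """Attach an ontology class and a role tag to each (noun, particle) token.

    Tokens with no ontology entry or no recognized particle are skipped.
    """
    linked = []
    for noun, particle in tokens:
        onto_class = ONTOLOGY.get(noun)
        role = PARTICLE_ROLE.get(particle)
        if onto_class is not None and role is not None:
            linked.append((noun, onto_class, role))
    return linked


linked = link_tokens([("이순신", "은"), ("거북선", "을")])
print(linked)
```

A dictionary lookup is the simplest possible linker; an actual system would need disambiguation when one surface form maps to several ontology instances.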

The generated sentence frame 400 may be divided into partial frames by the partial frame generation unit 452/454. Among the sentence entities of the sentence frame 400 generated by the frame generation unit 412/414, the partial frame generation unit 452/454 may form each partial frame so that it includes only the sentence entities related to one predicate. The number of partial frames generated by the partial frame generation unit 452/454 may be equal to or less than the number of predicates in the sentence frame 400. For example, when the sentence frame 400 has one predicate, the partial frame generation unit 452/454 passes the sentence frame 400 on unchanged as a partial frame. When the sentence frame has two or more predicates, the partial frame generation unit 452/454 may divide the sentence frame 400 into two or more partial frames corresponding to those predicates.

In some embodiments, the partial frame generation unit 452/454 may not generate a partial frame for a predicate of the sentence frame 400 that lacks at least one of a subject entity and an object entity. For example, when the sentence frame 400 has three predicates but only two of them have both a subject entity and an object entity, the partial frame generation unit 452/454 may generate two partial frames from the sentence frame 400.
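
The splitting and filtering behavior can be sketched as below. The dictionary layout of the sentence frame, the role keys `SBJ` and `OBJ`, and the third predicate `Sail` are hypothetical choices for illustration only.

```python
def split_into_partial_frames(sentence_frame):
    """Create one partial frame per predicate that has both a subject and an object.

    `sentence_frame` maps each predicate to a dict of role -> entity.
    Predicates missing either role produce no partial frame.
    """
    partial_frames = []
    for predicate, args in sentence_frame.items():
        if "SBJ" in args and "OBJ" in args:
            partial_frames.append(
                {"predicate": predicate, "SBJ": args["SBJ"], "OBJ": args["OBJ"]}
            )
    return partial_frames


# Three predicates, but only two have both a subject and an object entity.
sentence_frame = {
    "Make": {"SBJ": "Yi Sun-sin", "OBJ": "turtle ship"},
    "Be": {"SBJ": "Yi Sun-sin", "OBJ": "general"},
    "Sail": {"SBJ": "turtle ship"},  # hypothetical predicate with no object
}
partial_frames = split_into_partial_frames(sentence_frame)
print(len(partial_frames))
```

Because incomplete predicates are dropped, the number of partial frames never exceeds the number of predicates, consistent with the "equal to or less than" condition stated above.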

FIG. 4 is a block diagram showing the configuration of the rule generation unit and the rule verification unit of a knowledge extraction system according to an exemplary embodiment of the present invention.

Referring to FIG. 4, the rule verification unit 140 receives the knowledge extraction rule candidates 135 generated by the rule generation unit 130, verifies them, and may select the knowledge extraction rule 150 from among the knowledge extraction rule candidates 135.

For example, from a sentence such as 'Yi Sun-sin makes a turtle ship', the rule generation unit 130 may define the semantically abstracted form 'a person (Human-SBJ) makes an artifact (Artifact-OBJ)' as a rule for the predicate 'make' (create). The knowledge extraction rule candidates 135 may be generated by collecting the rules defined for each predicate from the rule extraction target sentences 120.

The rule verification unit 140 may include a reliability analysis unit 142 and a rule selection unit 144.

The reliability analysis unit 142 calculates the reliability of each knowledge extraction rule candidate 135. The reliability of a knowledge extraction rule candidate 135 may be obtained by Equation 1.

Here, r is a knowledge extraction rule candidate (hereinafter, a rule), P is the set of predicates, and p and q are predicates (p, q ∈ P), where q is any predicate in P other than p. h_rp is the appearance ratio of rule r for predicate p, and likewise h_rq is the appearance ratio of rule r for predicate q. C(r, p) is the reliability of rule r for predicate p and takes a value between 0 and 1. max h_rq is the appearance ratio of rule r for the predicate, among the predicates in P other than p, for which the ratio of rule r is highest.

Accordingly, when the reliability is computed for each of a plurality of knowledge extraction rule candidates with respect to one predicate (for example, p), only a rule whose appearance ratio for that predicate p is higher than for every other predicate q can take a nonzero value (greater than 0 and at most 1). A rule whose reliability for a predicate is greater than 0 can therefore be regarded as suitable as a rule for that predicate.
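
The following Python sketch shows one way to compute a score with the properties described above. The equation itself does not appear in this text, so the form used here, C(r, p) = max(0, h_rp − max h_rq), is an assumption chosen only because it is zero unless rule r appears at a higher ratio for p than for any other predicate and never exceeds 1. It should not be read as the patent's Equation 1. Defining the appearance ratio as the share of a predicate's partial frames that yield the rule is likewise an assumption, as are the function names and the sample counts.

```python
def appearance_ratios(counts):
    """Turn counts[p][r] (frames with predicate p yielding rule r) into ratios.

    Assumption: h_rp is the share of predicate p's frames that yield rule r.
    """
    return {
        p: {r: n / sum(rule_counts.values()) for r, n in rule_counts.items()}
        for p, rule_counts in counts.items()
    }


def confidence(ratios, r, p):
    """Assumed form: C(r, p) = max(0, h_rp - max over q != p of h_rq)."""
    h_rp = ratios[p].get(r, 0.0)
    others = [ratios[q].get(r, 0.0) for q in ratios if q != p]
    return max(0.0, h_rp - max(others, default=0.0))


# Hypothetical counts for two predicates and two rule patterns.
counts = {
    "create": {"Human-SBJ|Artifact-OBJ": 8, "Human-SBJ|Human-OBJ": 2},
    "meet": {"Human-SBJ|Human-OBJ": 9, "Human-SBJ|Artifact-OBJ": 1},
}
ratios = appearance_ratios(counts)
c_create = confidence(ratios, "Human-SBJ|Artifact-OBJ", "create")
c_meet = confidence(ratios, "Human-SBJ|Artifact-OBJ", "meet")
print(c_create, c_meet)
```

With these counts the pattern 'Human-SBJ|Artifact-OBJ' scores above zero for 'create' and exactly zero for 'meet', which is the qualitative behavior the paragraph above describes.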

In some embodiments, the rule selection unit 144 may select, as the knowledge extraction rule 150, each rule whose reliability for a predicate is greater than 0. In some other embodiments, the rule selection unit 144 may select, as the knowledge extraction rule 150, each rule whose reliability for a predicate is greater than a set value between 0 and 1. In still other embodiments, the rule selection unit 144 may select, as the knowledge extraction rule 150, the rule with the highest reliability for each predicate.
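
The three selection policies can be compared side by side in a short Python sketch. The function names, the dictionary keyed by (rule, predicate) pairs, and the sample reliability values are hypothetical.

```python
def select_positive(conf):
    """Policy 1: keep every (rule, predicate) pair with reliability above 0."""
    return {key for key, c in conf.items() if c > 0}


def select_above(conf, threshold):
    """Policy 2: keep pairs whose reliability exceeds a set value in (0, 1)."""
    return {key for key, c in conf.items() if c > threshold}


def select_best_per_predicate(conf):
    """Policy 3: keep only the most reliable rule for each predicate."""
    best = {}
    for (rule, predicate), c in conf.items():
        if predicate not in best or c > best[predicate][1]:
            best[predicate] = (rule, c)
    return {(rule, predicate) for predicate, (rule, _) in best.items()}


# Hypothetical reliabilities keyed by (rule, predicate).
conf = {
    ("r1", "create"): 0.7,
    ("r2", "create"): 0.1,
    ("r3", "meet"): 0.0,
    ("r2", "meet"): 0.8,
}
print(select_positive(conf))
print(select_above(conf, 0.5))
print(select_best_per_predicate(conf))
```

The first policy is the most permissive and favors recall, while the threshold and best-per-predicate policies trade coverage for fewer spurious rules.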

FIG. 5 is a block diagram showing the configuration of the natural language understanding unit of a knowledge extraction system according to an exemplary embodiment of the present invention.

Referring to FIG. 5, the natural language understanding unit 300 may include a morphological analysis unit 310, a parsing unit 320, a named entity analysis unit 330, a filtering analysis unit 340, an intention classification unit 350, a domain analysis unit 360, and a semantic role labeling (SRL) unit 370. In some embodiments, the natural language understanding unit 300 may omit at least one of these components. For example, the natural language understanding unit 300 may include the morphological analysis unit 310, the parsing unit 320, the named entity analysis unit 330, and the filtering analysis unit 340, and may not include the intention classification unit 350, the domain analysis unit 360, and the semantic role labeling unit 370.

The morphological analysis unit 310 may split the sentences of the unstructured document (UD in FIG. 1) into morphemes. The parsing unit 320 and the named entity analysis unit 330 may perform parsing and named entity analysis, respectively, on the sentence entities split into morphemes. The filtering analysis unit 340 may generate a simplified sentence by removing unnecessary features from the sentence entities. The intention classification unit 350 and the domain analysis unit 360 may perform intention classification and domain analysis of a query to which semantic roles have been assigned, based on the simplified sentence generated by the filtering analysis unit 340. The semantic role labeling unit 370 may assign (label) semantic roles to the sentence entities.

FIG. 6 is a conceptual diagram for explaining a sentence frame generated by the frame generation unit of a knowledge extraction system according to an exemplary embodiment of the present invention.

Referring to FIG. 6, a sentence frame SF is generated from the sentence ST 'Yi Sun-sin, a general of Joseon, made a turtle ship'. The sentence ST may be one of the rule extraction target sentences 120 selected by the rule extraction target sentence selection unit (110 in FIG. 2), or a sentence of the unstructured document UD input to the frame generation unit 414 of the knowledge extraction unit 200.

Referring to the ontology data (500 in FIG. 3), the sentence frame generation unit (430 in FIG. 3) takes the final predicate of the sentence ST, 'made', as 'Make' (만들다) and assigns 'turtle ship', an 'Artifact', and 'Yi Sun-sin', a 'Human', as the instances of its 'What' and 'Agent' slots, respectively. Because 'Yi Sun-sin, a general of Joseon' means 'Yi Sun-sin is a general of Joseon', the predicate 'Be' (이다) is attached to the noun (Noun_Mod) 'Yi Sun-sin'; 'general', a 'Designation', is assigned as the instance of its 'Patient' slot, and 'Joseon', a 'Country', is assigned as the instance of the corresponding 'Owner' slot. The sentence frame SF may be generated in this way.

FIG. 7 is a conceptual diagram for explaining partial frames generated by the partial frame generation unit of a knowledge extraction system according to an exemplary embodiment of the present invention.

Referring to FIG. 7, the sentence frame SF generated in FIG. 6 from the sentence ST 'Yi Sun-sin, a general of Joseon, made a turtle ship' may have two predicates, 'Make' (만들다) and 'Be' (이다). The predicate 'Make' has 'Yi Sun-sin' (Human) as its subject entity and 'turtle ship' (Artifact) as its object entity. The predicate 'Be' has 'Yi Sun-sin' (Human) as its subject entity and 'general' (Designation) as its object entity.

The partial frame generation unit (452 in FIG. 1 / 454 in FIG. 2) generates the partial frames PF1 and PF2, one for each predicate of the sentence frame SF that has both a subject entity and an object entity. That is, the partial frame generation unit (452 in FIG. 1 / 454 in FIG. 2) may divide the sentence frame SF into the partial frame PF1, which contains the subject entity and object entity of the predicate 'Make', and the partial frame PF2, which contains the subject entity and object entity of the predicate 'Be'.

FIG. 8 is a block diagram showing a knowledge extraction system according to an exemplary embodiment of the present invention. In describing FIG. 8, descriptions that overlap with FIGS. 1 to 7 may be omitted.

Referring to FIG. 8, the knowledge extraction system 2 includes a knowledge extraction rule generation unit 102, a knowledge extraction unit 202, and a frame management unit 402. Unlike the knowledge extraction systems 1 and 1a shown in FIGS. 1 and 2, the knowledge extraction system 2 may have a separate frame management unit 402.

The knowledge extraction system 2 may receive the unstructured document UD through the network 50. In some other embodiments, the knowledge extraction system 1 may further include a document collection robot for collecting unstructured documents UD through the network 50.

The unstructured documents UD may include unstructured documents used to select the rule extraction target sentences 120 (UD-R, hereinafter rule target unstructured documents) and unstructured documents used to extract the triple knowledge 600 (UD-K, hereinafter knowledge target unstructured documents). In some embodiments, the rule target unstructured documents UD-R and the knowledge target unstructured documents UD-K may be arbitrary documents selected from among the unstructured documents UD. In some other embodiments, the rule target unstructured documents UD-R and the knowledge target unstructured documents UD-K may be different documents among the unstructured documents UD.

The knowledge extraction rule generation unit 102 may include a rule extraction target sentence selection unit 110, a rule generation unit 130, and a rule verification unit 140. The rule extraction target sentence selection unit 110 splits the rule target unstructured documents UD-R into sentences and selects the sentences that contain both a subject entity and an object entity as the rule extraction target sentences 120. The rule extraction target sentence selection unit 110 may select the rule extraction target sentences 120 with reference to the natural language understanding unit 300 and the ontology data 500.
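
The sentence selection criterion can be sketched in Python as follows. The sketch assumes each sentence has already been analyzed into a list of role tags for its linked entities; the input layout, the tag names, and the function name are hypothetical.

```python
def select_rule_target_sentences(analyzed_sentences):
    """Keep the sentences whose entity role tags include both SBJ and OBJ.

    `analyzed_sentences` is a list of (sentence text, list of role tags).
    """
    return [
        text
        for text, roles in analyzed_sentences
        if {"SBJ", "OBJ"} <= set(roles)
    ]


analyzed = [
    ("Yi Sun-sin made a turtle ship.", ["SBJ", "OBJ"]),
    ("The turtle ship sailed.", ["SBJ"]),
]
targets = select_rule_target_sentences(analyzed)
print(targets)
```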

The frame management unit 402 includes a frame generation unit 410 and a partial frame generation unit 450. The frame generation unit 410 refers to the morphological analysis results and parsing results that the natural language understanding unit 300 produces for the rule extraction target sentences 120, links the sentence entities to ontology instances in the ontology data 500, and then generates a sentence frame using a frame generation rule. The partial frame generation unit 450 divides the sentence frame generated by the frame generation unit 410 into partial frames. Among the sentence entities of the sentence frame, the partial frame generation unit 450 may form each partial frame so that it includes only the sentence entities related to one predicate. The number of partial frames generated by the partial frame generation unit 450 may be equal to or less than the number of predicates in the sentence frame. The frame generation unit 410 and the partial frame generation unit 450 are largely the same as the frame generation unit 412/414 and the partial frame generation unit 452/454 described with reference to FIG. 3, so overlapping descriptions are omitted.

However, in the knowledge extraction system 1a shown in FIG. 2, the knowledge extraction rule generation unit 100 and the knowledge extraction unit 200 each have and use their own frame generation unit and partial frame generation unit (412 and 452, and 414 and 454, respectively). In the knowledge extraction system 2 shown in FIG. 8, by contrast, the knowledge extraction rule generation unit 102 and the knowledge extraction unit 202 both use the frame generation unit 410 and the partial frame generation unit 450 of the frame management unit 402.

The rule generation unit 130 may generate the knowledge extraction rule 150 by referring to the triple knowledge that the ontology data 500 holds for the predicate of each partial frame generated by the partial frame generation unit 450. In some embodiments, the rule generation unit 130 generates knowledge extraction rule candidates (135 in FIG. 4), and the rule verification unit 140 verifies the knowledge extraction rule candidates 135 to select the knowledge extraction rule 150.

The knowledge extraction unit 202 may include a knowledge extraction rule mapping unit 210 and a triple generation unit 220. For the knowledge target unstructured documents UD-K, the frame generation unit 410 refers to the morphological analysis results and parsing results produced by the natural language understanding unit 300, links the sentence entities to ontology instances in the ontology data 500, and then generates sentence frames using a frame generation rule. The partial frame generation unit 450 divides the sentence frame generated by the frame generation unit 410 into partial frames. Among the sentence entities of the sentence frame, the partial frame generation unit 450 may form each partial frame so that it includes only the sentence entities related to one predicate.

The knowledge extraction rule mapping unit 210 may map each partial frame generated by the partial frame generation unit 450 against the knowledge extraction rule 150 to determine whether knowledge is to be extracted from that partial frame.

The triple knowledge 600 generated by the knowledge extraction system 2 may be stored in the knowledge base 700. In some embodiments, the knowledge base 700 may be connected to the knowledge extraction system 1 through the network 50. The triple knowledge 600 may be accumulated in the knowledge base 700. In some embodiments, the triple knowledge 600 generated by the knowledge extraction system 2 may also be accumulated in the ontology data 500. In some embodiments, the knowledge extraction system 2 does not include the ontology data 500 and may instead extract the triple knowledge 600 by using the knowledge base 700 as ontology data.

The knowledge extraction system according to the present invention can automatically extract knowledge in triple form from unstructured sentences by using semantically abstracted knowledge extraction rules that are generated and verified automatically, without an expert in the field having to read sentences and enter data in a prescribed format, build training data, or build patterns.

The present invention has been described in detail above with reference to preferred embodiments, but it is not limited to those embodiments, and various modifications and changes may be made by those of ordinary skill in the art within the technical spirit and scope of the present invention.

1, 1a, 2: knowledge extraction system, 50: network, 100, 102: knowledge extraction rule generation unit, 110: rule extraction target sentence selection unit, 120: rule extraction target sentence, 130: rule generation unit, 140: rule verification unit, 150: knowledge extraction rule, 200, 202: knowledge extraction unit, 210: knowledge extraction rule mapping unit, 220: triple generation unit, 300: natural language understanding unit, 400: sentence frame, 402: frame management unit, 410, 412, 414: frame generation unit, 420, 422, 424: partial frame generation unit, 500: ontology data, 600: triple knowledge, 700: knowledge base, UD: unstructured document

Claims

A knowledge extraction system that receives or collects unstructured documents through a network and generates triple knowledge, the system comprising:
a frame generation unit that generates a sentence frame from the unstructured documents;
a partial frame generation unit that divides the sentence frame to generate partial frames;
a knowledge extraction rule generation unit that generates a knowledge extraction rule from the partial frames generated from rule target unstructured documents among the unstructured documents; and
a knowledge extraction unit that uses the knowledge extraction rule to generate the triple knowledge from the partial frames generated from knowledge target unstructured documents among the unstructured documents.

The knowledge extraction system according to claim 1,
wherein the knowledge extraction rule generation unit comprises:
a rule extraction target sentence selection unit that divides the rule-target unstructured document into sentences and selects, as a rule extraction target sentence, a sentence that includes both a subject object and a target object;
a rule generation unit that generates a plurality of knowledge extraction rule candidates from the partial frame generated from the rule extraction target sentence through the frame generation unit and the partial frame generation unit; and
a rule verification unit that selects the knowledge extraction rule by verifying the plurality of knowledge extraction rule candidates.
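
For illustration only, the sentence selection step recited above can be sketched as follows. This is a minimal sketch under loud assumptions: the function name `select_rule_target_sentences`, the regular-expression sentence splitter, and the use of plain substring matching to detect the subject object and target object are all hypothetical and are not taken from the specification, which relies on morpheme analysis and ontology-linked entities.

```python
import re


def select_rule_target_sentences(document, subject, target):
    """Return the sentences that mention both the subject and the target object.

    Assumption: objects are detected by case-sensitive substring match, a
    stand-in for the entity linking performed by the actual system.
    """
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    return [s for s in sentences if subject in s and target in s]


# Hypothetical example document and objects.
doc = (
    "Acme Corp is headquartered in Springfield. "
    "Acme Corp makes widgets. "
    "Springfield is a small town."
)
selected = select_rule_target_sentences(doc, "Acme Corp", "Springfield")
print(selected)
```

Only the first sentence contains both objects, so it is the only one passed on to rule generation in this sketch.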

The knowledge extraction system according to claim 2,
wherein the rule verification unit comprises: a reliability analysis unit that analyzes the reliability of the plurality of knowledge extraction rule candidates by Equation 1; and

(Equation 1)
(where r is the knowledge extraction rule candidate under reliability analysis among the plurality of knowledge extraction rule candidates; P is a set of predicates; p and q are predicates included in P, p being the predicate under reliability analysis and q being any predicate in P other than p; C(r, p) is the reliability of rule r for predicate p; h_rp is the appearance ratio of rule r for predicate p; and h_rq is the appearance ratio of rule r for predicate q)
a rule selection unit that selects the knowledge extraction rule from among the plurality of knowledge extraction rule candidates based on the reliability analyzed by the reliability analysis unit.

The knowledge extraction system according to claim 3,
wherein the rule selection unit
selects, as the knowledge extraction rule, a candidate among the plurality of knowledge extraction rule candidates whose reliability for the predicate under reliability analysis is greater than or equal to a set value between 0 and 1.
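
For illustration only, the threshold-based selection recited above can be sketched as follows. This is a minimal sketch: the function name `select_rules`, the rule identifiers, and the reliability values are hypothetical. The reliability scores C(r, p) are assumed to be already computed and are supplied as inputs; the sketch does not attempt to compute them.

```python
def select_rules(reliability, threshold):
    """Keep the rule candidates whose reliability is at least the threshold.

    reliability: mapping from a rule candidate identifier to its reliability
                 for the predicate under analysis (assumed precomputed).
    threshold:   the set value, which must lie between 0 and 1.
    """
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0 and 1")
    return [rule for rule, score in reliability.items() if score >= threshold]


# Hypothetical candidates and scores for a single predicate.
candidates = {"rule_a": 0.92, "rule_b": 0.80, "rule_c": 0.41}
chosen = select_rules(candidates, threshold=0.8)
print(chosen)
```

Because the comparison is "greater than or equal to", a candidate whose reliability exactly equals the set value (here `rule_b`) is kept.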

The knowledge extraction system according to claim 2,
wherein the frame generation unit comprises:
an entity connection unit that connects sentence entities to ontology instances of ontology data with reference to morpheme analysis results and parsing results for the unstructured document; and
a sentence frame generation unit that generates the sentence frame using a frame generation rule.

The knowledge extraction system according to claim 1,
wherein the partial frame generation unit divides the sentence frame to generate the same number of partial frames as the number of predicates in the sentence frame.

The knowledge extraction system according to claim 6,
wherein the partial frame generation unit generates the partial frame based on each of the predicates of the sentence frame that has both the subject object and the target object.

The knowledge extraction system according to claim 1,
wherein the knowledge extraction rule is a rule defined for each predicate, and
the partial frame is generated so as to include the sentence entities that have a relationship based on each of the predicates in the unstructured document.
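
For illustration only, dividing a sentence frame into one partial frame per predicate can be sketched as follows. This is a minimal sketch under loud assumptions: the `PartialFrame` class, the dictionary layout of the sentence frame, and the role names `subject` and `target` are hypothetical and are not the data structures defined in the specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartialFrame:
    predicate: str
    entities: tuple  # (role, entity) pairs related to this predicate


def split_sentence_frame(sentence_frame):
    """Produce one partial frame per predicate in the sentence frame.

    Assumption: the sentence frame is a list of nodes, each holding a
    predicate and the sentence entities related to that predicate.
    """
    return [
        PartialFrame(node["predicate"], tuple(node["entities"].items()))
        for node in sentence_frame
    ]


# Hypothetical sentence frame with two predicates.
sentence_frame = [
    {"predicate": "found", "entities": {"subject": "Jane Doe", "target": "Acme Corp"}},
    {"predicate": "acquire", "entities": {"subject": "Acme Corp", "target": "Beta Ltd"}},
]
partial_frames = split_sentence_frame(sentence_frame)
print(len(partial_frames))
```

In this sketch the number of partial frames always equals the number of predicates, and each partial frame carries only the entities related to its own predicate.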