KR102543703B1

KR102543703B1 - Knowledge extraction system for scientific technology papers

Info

Publication number: KR102543703B1
Application number: KR1020200153089A
Authority: KR
Inventors: 이경일
Original assignee: 주식회사 솔트룩스
Priority date: 2020-11-16
Filing date: 2020-11-16
Publication date: 2023-06-16
Also published as: KR20220066737A

Abstract

A knowledge extraction system for scientific and technical papers according to the present invention includes: a document management unit that collects a plurality of scientific and technical papers through a network, extracts data from them, classifies them according to a technical classification system, and stores them in a classified document repository; a semi-structured knowledge extraction unit that, referring to a semi-structured knowledge extraction model, extracts semi-structured data as knowledge from the data extracted from the plurality of scientific and technical papers stored in the classified document repository; an unstructured knowledge extraction unit that, referring to an unstructured knowledge extraction model, extracts unstructured data as knowledge from the data extracted from the plurality of scientific and technical papers stored in the classified document repository; a structured knowledge extraction unit that, referring to a structured knowledge extraction model, extracts structured data as knowledge from the data extracted from the plurality of scientific and technical papers stored in the classified document repository; and a knowledge management unit that integrates the knowledge extracted by each of the semi-structured knowledge extraction unit, the unstructured knowledge extraction unit, and the structured knowledge extraction unit and stores it in a knowledge base.

Description

Knowledge extraction system for scientific technology papers

The present invention relates to a knowledge extraction system, and more particularly to a knowledge extraction system for extracting knowledge from scientific and technical papers.

The present invention is derived from research led and conducted by Saltlux Co., Ltd. as part of the SW Computing Industry Source Technology Development Program (SW) of the Ministry of Science and ICT. [Research period: 2020.01.01. to 2020.12.31., research management agency: Information and Communication Technology Promotion Center, research project title: WiseKB: development of self-learning knowledge base and inference technology based on big data understanding, project identification number 1711103335, detailed task number 2013-2-00109-008]

Scientific and technical papers record the academic achievements of science and technology, condense the effort and insight of researchers in those fields, and are an important means of disseminating knowledge and of academic exchange. All research achievements in science and technology build on research that came before them.

As the volume of published scientific and technical papers steadily grows and the information available over the Internet reaches a state of excess, the traditional method of information analysis, in which an individual obtains and analyzes information from papers based on personal knowledge and experience, has drawbacks: it takes too much time, and both the collection and the analysis of information are biased by the viewpoint of the individual performing the analysis.

The technical problem addressed by the present invention is to provide a knowledge extraction system for scientific and technical papers that can efficiently extract knowledge from such papers.

To achieve the above technical objective, a knowledge extraction system for scientific and technical papers according to one aspect of the technical idea of the present invention includes: a document management unit that collects a plurality of scientific and technical papers through a network, extracts data from them, classifies them according to a technical classification system, and stores them in a classified document repository; a semi-structured knowledge extraction unit that, referring to a semi-structured knowledge extraction model, extracts semi-structured data as knowledge from the data extracted from the plurality of scientific and technical papers stored in the classified document repository; an unstructured knowledge extraction unit that, referring to an unstructured knowledge extraction model, extracts unstructured data as knowledge from the data extracted from the plurality of scientific and technical papers stored in the classified document repository; a structured knowledge extraction unit that, referring to a structured knowledge extraction model, extracts structured data as knowledge from the data extracted from the plurality of scientific and technical papers stored in the classified document repository; and a knowledge management unit that integrates the knowledge extracted by each of the semi-structured knowledge extraction unit, the unstructured knowledge extraction unit, and the structured knowledge extraction unit and stores it in a knowledge base.

The document management unit may include: a document collection module that collects the plurality of scientific and technical papers; an OCR (Optical Character Recognition) module that extracts text data from the plurality of scientific and technical papers when the papers collected by the document collection module are PDF (Portable Document Format) or image files; a document classification module that classifies the plurality of scientific and technical papers using the text information extracted from them; and a classified document repository that stores the papers classified by the document classification module.

The OCR module may include: a text extractor that extracts the text information, which is the semi-structured data and unstructured data of the collected scientific and technical papers; and a table recognizer that recognizes and extracts table information, which is the structured data of the collected scientific and technical papers.

The structured knowledge extraction unit may extract, as knowledge, the table information, which is the structured data extracted by the table recognizer, from among the data stored in the classified document repository.

The structured knowledge extraction unit may include: a header/value discrimination module that distinguishes header cells from value cells in the table information; a semantic analysis module that performs semantic analysis so that the header cells can be linked to knowledge; and a value refinement module that refines the values of the value cells so that they can be linked to knowledge.

The document classification module may search the collected scientific and technical papers using a predetermined keyword, select the papers that correspond to the predetermined keyword, and store them in the classified document repository. Alternatively, it may classify the collected scientific and technical papers according to a technical classification and store the papers classified into a specific technical category in the classified document repository.

The semi-structured knowledge extraction unit extracts, as knowledge, bibliographic information and identification numbers, which are semi-structured data, from among the data stored in the classified document repository. The bibliographic information may be the title, author, abstract, or keywords of a document, and the identification number may be an International Standard Serial Number (ISSN), a digital object identifier (DOI), or an International Standard Book Number (ISBN).

The unstructured knowledge extraction unit may include: a named entity recognition module that recognizes named entity information in the unstructured data among the data stored in the classified document repository; a relation extraction module that extracts relations between the named entities recognized by the named entity recognition module; and a sentence classification module that classifies the sentences from which the relation extraction module extracted relations between named entities according to a preset sentence classification scheme.

The knowledge management unit may include: a knowledge integration module that links and integrates the knowledge extracted by each of the semi-structured knowledge extraction unit, the unstructured knowledge extraction unit, and the structured knowledge extraction unit; and a knowledge conversion module that stores the linked and integrated knowledge in the knowledge base as a knowledge graph.
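The integration step described above can be pictured with a small sketch. It is not the patented implementation: the function name `integrate_knowledge`, the `"mentions"` link, and the choice of an in-memory adjacency map are all assumptions made for illustration, since the text only states that the three kinds of knowledge are linked and stored as a knowledge graph.

```python
from collections import defaultdict


def integrate_knowledge(doc_id, bibliographic, text_triples, table_triples):
    """Merge the three kinds of extracted knowledge into one adjacency map.

    Hypothetical helper: node -> set of (predicate, object) edges.
    """
    graph = defaultdict(set)

    # Semi-structured knowledge: attach each field directly to the document node.
    for field, value in bibliographic.items():
        graph[doc_id].add((field, value))

    # Unstructured and structured knowledge: keep each triple and link its
    # subject back to the source document so the three sources are connected.
    for subject, predicate, obj in list(text_triples) + list(table_triples):
        graph[subject].add((predicate, obj))
        graph[doc_id].add(("mentions", subject))

    return dict(graph)


graph = integrate_knowledge(
    "doc:1",
    {"title": "T"},
    [("graphene", "grownOn", "copper")],
    [("graphene", "hasThickness", "12.5 nm")],
)
```

Because the subject `graphene` appears in both the text-derived and the table-derived triples, its edges end up on the same node, which is the sense in which the separately extracted knowledge becomes "linked".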

The system may further include: a collected document repository that stores the plurality of collected scientific and technical papers; and an extraction model training unit that trains each of the semi-structured knowledge extraction model, the unstructured knowledge extraction model, and the structured knowledge extraction model. The semi-structured knowledge extraction model may be trained in the extraction model training unit using the semi-structured data stored in the collected document repository, and the unstructured knowledge extraction model and the structured knowledge extraction model may each be trained in the extraction model training unit using, respectively, the unstructured data and the structured data stored in the classified document repository.

The knowledge extraction system for scientific and technical papers according to the present invention extracts knowledge, through knowledge extraction models trained to be both efficient and accurate, from papers retrieved for a specific keyword or classified into a specific technical field. It therefore makes it possible to find the knowledge contained in scientific and technical papers quickly and accurately.

FIG. 1 is a schematic block diagram of a knowledge extraction system for scientific and technical papers according to an exemplary embodiment of the present invention.
FIG. 2 is a schematic block diagram illustrating a method of training a semi-structured knowledge extraction model in a knowledge extraction system for scientific and technical papers according to an exemplary embodiment of the present invention.
FIG. 3 is a schematic block diagram illustrating a method of training an unstructured knowledge extraction model and a structured knowledge extraction model in a knowledge extraction system for scientific and technical papers according to an exemplary embodiment of the present invention.
FIG. 4 is a schematic block diagram illustrating a method of searching for knowledge using a knowledge base built by a knowledge extraction system for scientific and technical papers according to an exemplary embodiment of the present invention.
FIG. 5 is a schematic block diagram of a natural language understanding unit included in a knowledge extraction system for scientific and technical papers according to an exemplary embodiment of the present invention.
FIG. 6 is a conceptual diagram illustrating a process of extracting semi-structured knowledge in a knowledge extraction system for scientific and technical papers according to an exemplary embodiment of the present invention.
FIG. 7 is a conceptual diagram illustrating a process of extracting structured knowledge in a knowledge extraction system for scientific and technical papers according to an exemplary embodiment of the present invention.
FIG. 8 is a conceptual diagram illustrating a knowledge base built from knowledge extracted by a knowledge extraction system for scientific and technical papers according to an exemplary embodiment of the present invention.

Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings. The embodiments are provided to explain the present invention more completely to those of ordinary skill in the art. Since the present invention can be modified in various ways and can take various forms, specific embodiments are illustrated in the drawings and described in detail. This is not intended to limit the present invention to the specific disclosed forms, and it should be understood to include all modifications, equivalents, and substitutes within the spirit and technical scope of the present invention. In describing each figure, like reference numerals are used for like elements. In the accompanying drawings, the dimensions of structures are shown enlarged or reduced relative to their actual size for clarity.

The terms used in this application are used only to describe specific embodiments and are not intended to limit the present invention. Singular expressions include plural expressions unless the context clearly indicates otherwise. In this application, terms such as "comprise" or "have" are intended to specify the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and should be understood not to preclude the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless defined otherwise, all terms used herein, including technical or scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention belongs. Terms such as those defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the related art and, unless explicitly defined in this application, are not to be interpreted in an idealized or excessively formal sense.

In the drawings and description below, a component shown or described as a single block, for example a "unit" or a "module", may be a hardware block or a software block. For example, each component may be an independent hardware block that exchanges signals with the others, or a software block executed on a single processor.

To give a full understanding of the configuration and effects of the present invention, preferred embodiments are described below with reference to the accompanying drawings.

FIG. 1 is a schematic block diagram of a knowledge extraction system for scientific and technical papers according to an exemplary embodiment of the present invention.

Referring to FIG. 1, the knowledge extraction system for scientific and technical papers (1, hereinafter the knowledge extraction system) includes a document management unit 100, a semi-structured knowledge extraction unit 200, an unstructured knowledge extraction unit 300, a structured knowledge extraction unit 400, a knowledge management unit 500, and a knowledge base 600.

The document management unit 100 may include a document collection module 110, an OCR module 120, a collected document repository 130, a document classification module 140, and a classified document repository 150. The document collection module 110 may collect a plurality of scientific and technical papers (documents, 40). The OCR module 120 may extract data from the collected scientific and technical papers 40. The document classification module 140 may classify the collected scientific and technical papers 40 according to a pre-stored technical classification system and store them in the classified document repository 150.

The document collection module 110 may collect the plurality of scientific and technical papers 40 through a network 20. The network 20 may include anything capable of exchanging data over a wired or wireless link, such as a wired Internet service, a local area network (LAN), a wide area network (WAN), an intranet, a wireless Internet service, a mobile computing service, a wireless data communication service, a wireless Internet access service, a satellite communication service, a wireless LAN, or Bluetooth. When the network 20 is connected to a smartphone, tablet, or similar device, the network 20 may be a wireless data communication service such as 3G, 4G, or 5G, a wireless LAN such as Wi-Fi, or Bluetooth. A scientific and technical paper 40 may be a thesis, a conference paper, or a journal article in a field of science and technology, but it may also be another document in such a field, such as a short paper or a report. In this specification, the scientific and technical papers 40 collected by the knowledge extraction system 1 may be referred to as "documents". A scientific and technical paper 40 may include text information, table information, and image information. The image information may include a photograph, drawing, or graph together with its caption.

The OCR (Optical Character Recognition) module 120 may include a text extractor 122 and a table recognizer 124. When a collected scientific and technical paper 40 is a PDF (Portable Document Format) file or an image file, the OCR module 120 may convert it into text information and extract that text.

The text extractor 122 may extract the text information of a scientific and technical paper 40. The table recognizer 124 may recognize and extract the table information of a scientific and technical paper 40. In some embodiments, the text extractor 122 may additionally extract the text of the captions attached to the image information of a paper 40 and the link information indicating where each image is stored. The text extractor 122 may extract semi-structured data and unstructured data from a paper 40. The semi-structured data may be bibliographic information and identification numbers, and the unstructured data may be the body content. The table recognizer 124 may extract structured data from the table information of a paper 40.

The table recognizer 124 may recognize table information such as the text contained in each table of a scientific and technical paper 40 and the cell position of each piece of text within the table. When a collected paper 40 already consists of machine-readable text, the OCR module 120 may skip optical character recognition and extract and recognize the text information and the table information directly through the text extractor 122 and the table recognizer 124.

The OCR module 120 may store the text information extracted from the collected scientific and technical papers 40, together with the recognized table information, in the collected document repository 130. The collected document repository 130 may also store the images belonging to the image information of the collected papers 40.

The document classification module 140 may classify the collected scientific and technical papers 40 using the text information that the OCR module 120 extracted from them.

In some embodiments, the document classification module 140 may search the collected scientific and technical papers 40 using a predetermined keyword, select the papers 40 that correspond to the predetermined keyword, and store them in the classified document repository 150.
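A minimal sketch of this keyword-based selection follows. The specification does not say how a paper is judged to "correspond" to a keyword, so the case-insensitive substring test, the function name `select_by_keywords`, and the dictionary layout are assumptions chosen for illustration.

```python
def select_by_keywords(documents, keywords):
    """Return the documents whose extracted text contains any keyword.

    documents: mapping of document id -> extracted text (hypothetical layout).
    keywords:  the predetermined keywords to search for.
    """
    wanted = [keyword.lower() for keyword in keywords]
    selected = {}
    for doc_id, text in documents.items():
        lowered = text.lower()
        # Assumption: a plain case-insensitive substring match counts as a hit.
        if any(keyword in lowered for keyword in wanted):
            selected[doc_id] = text
    return selected


collected = {
    "paper-1": "Graphene was grown on copper foil.",
    "paper-2": "A survey of medieval trade routes.",
}
classified = select_by_keywords(collected, ["graphene"])
```

In this picture `collected` plays the role of the collected document repository 130 and `classified` the role of the classified document repository 150.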

In other embodiments, the document classification module 140 may classify the collected scientific and technical papers 40 according to a technical classification such as the National Science and Technology Standard Classification System and store them in the classified document repository 150 by category. For example, the document classification module 140 may classify the collected papers 40 into major, middle, and minor categories. The major categories may be nature, life, artifacts, human science and technology, and non-science and technology. When the major category is nature, the middle categories may be mathematics, physics, chemistry, and earth science (earth/atmosphere/ocean/astronomy); when the major category is life, they may be life science, agriculture/forestry/fisheries/food, and health and medicine; when the major category is artifacts, they may be machinery, materials, chemical engineering, electrical/electronics, information/communication, energy/resources, nuclear energy, environment, and construction/transportation; and when the major category is human science and technology, they may be brain science and cognitive/emotional science. When the major category is life and the middle category is life science, the minor categories may be molecular and cell biology, genetics/genetic engineering, developmental/neurobiology, immunology/physiology, taxonomy/ecology/environmental biology, biochemistry/structural biology, convergence bio, bioengineering, industrial bio, bioprocess/equipment, biohazard, and other life sciences. For papers 40 whose major category is non-science and technology, the document classification module 140 may omit the middle and minor categories.
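The major and middle categories listed above can be held in a small lookup structure, sketched below. The category names are taken from the paragraph; the English spellings, the constant `TAXONOMY`, and the helper `classification_path` are illustrative assumptions, and minor categories are left out to keep the sketch short.

```python
# Major category -> middle categories, as enumerated in the description.
TAXONOMY = {
    "Nature": ["Mathematics", "Physics", "Chemistry", "Earth science"],
    "Life": [
        "Life science",
        "Agriculture/forestry/fisheries/food",
        "Health and medicine",
    ],
    "Artifacts": [
        "Machinery", "Materials", "Chemical engineering",
        "Electrical/electronics", "Information/communication",
        "Energy/resources", "Nuclear energy", "Environment",
        "Construction/transportation",
    ],
    "Human science and technology": [
        "Brain science", "Cognitive/emotional science",
    ],
    "Non-science and technology": [],
}

NON_SCIENCE = "Non-science and technology"


def classification_path(major, middle=None):
    """Validate a (major, middle) pair and return it as a tuple.

    Papers in the non-science category get a major category only,
    mirroring the last sentence of the paragraph above.
    """
    if major not in TAXONOMY:
        raise ValueError(f"unknown major category: {major}")
    if major == NON_SCIENCE or middle is None:
        return (major,)
    if middle not in TAXONOMY[major]:
        raise ValueError(f"{middle!r} is not a middle category of {major!r}")
    return (major, middle)
```

For example, `classification_path("Life", "Life science")` returns the two-level path, while any middle category passed together with `NON_SCIENCE` is dropped.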

The classified document repository 150 may store the collected scientific and technical papers 40 separately by category. Unless otherwise noted in this specification, however, the classified document repository 150 refers to a repository holding only those collected papers 40 that were retrieved by keyword or classified into a specific technical category, whereas the collected document repository 130 refers to a repository holding all of the collected papers 40.

The semi-structured knowledge extraction unit 200 may extract, as knowledge, the bibliographic information and identification numbers, which are semi-structured data, from among the data stored in the classified document repository 150. It may do so by referring to a semi-structured knowledge extraction model 250. The semi-structured knowledge extraction unit 200 may include a bibliographic information extraction module 210 and an identification number extraction module 220. The bibliographic information extraction module 210 may extract the title, authors, abstract, keywords, and the like of a document from the semi-structured data extracted by the text extractor 122. The identification number extraction module 220 may extract identification numbers and the like. An identification number may be, for example, an International Standard Serial Number (ISSN), a digital object identifier (DOI), or an International Standard Book Number (ISBN).
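As a rough illustration of identification number extraction, the sketch below pulls DOI, ISSN, and ISBN strings out of extracted text with regular expressions. This is a rule-based stand-in, not the trained semi-structured knowledge extraction model 250 described above; the patterns are deliberately simplified, and the names `extract_identifiers` and `issn_is_valid` are hypothetical. The ISSN check follows the standard modulus-11 check digit rule.

```python
import re

# Simplified, illustrative patterns; real identifiers have more variants.
DOI_RE = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")
ISSN_RE = re.compile(r"\bISSN:?\s*(\d{4}-\d{3}[\dXx])\b")
ISBN_RE = re.compile(r"\bISBN(?:-1[03])?:?\s*([0-9][0-9\- ]{8,15}[0-9Xx])")


def issn_is_valid(issn):
    """Check the ISSN modulus-11 check digit (weights 8 down to 2)."""
    digits = issn.replace("-", "").upper()
    total = sum((8 - i) * int(ch) for i, ch in enumerate(digits[:7]))
    check = (11 - total % 11) % 11
    expected = "X" if check == 10 else str(check)
    return digits[7] == expected


def extract_identifiers(text):
    """Return the DOI, ISSN, and ISBN strings found in the text."""
    found = {"doi": [], "issn": [], "isbn": []}
    for match in DOI_RE.finditer(text):
        # Trailing sentence punctuation is not part of the DOI.
        found["doi"].append(match.group(0).rstrip(".,;)"))
    for match in ISSN_RE.finditer(text):
        if issn_is_valid(match.group(1)):
            found["issn"].append(match.group(1).upper())
    for match in ISBN_RE.finditer(text):
        found["isbn"].append(match.group(1))
    return found


sample = "doi:10.1000/xyz123. ISSN 0378-5955, ISBN 978-3-16-148410-0."
identifiers = extract_identifiers(sample)
```

Validating the check digit is one inexpensive way to discard OCR misreads before an identifier is stored as knowledge; the ISBN pattern here performs no such validation.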

The unstructured knowledge extraction unit 300 may extract, as knowledge, the unstructured data among the data stored in the classified literature repository 150, and may do so by referring to the unstructured knowledge extraction model 350. The unstructured knowledge extraction unit 300 may include a named entity recognition module 310, a relation extraction module 320, and a sentence classification module 330. The named entity recognition module 310 may recognize named entity information in the data stored in the classified literature repository 150 so that candidate sentences for knowledge extraction can be identified. The relation extraction module 320 may extract relationships between the pieces of named entity information recognized by the named entity recognition module 310, and the sentence classification module 330 may classify the sentences from which such relationships are extracted according to a preset sentence classification scheme. For example, the sentence classification module 330 may classify those sentences into preset classes such as experimental method and experimental result.
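The division of labor among the three modules can be sketched as a small pipeline. This is only a toy stand-in under stated assumptions: the specification uses a learned model 350, while the sketch below substitutes a fixed term list and cue words. Every name, entity type, and cue in it (`ENTITY_TYPES`, `RELATION_CUES`, `SECTION_CUES`, and the functions) is hypothetical.

```python
from dataclasses import dataclass

# Hypothetical term list and cue words standing in for the learned model 350.
ENTITY_TYPES = {"aspirin": "Drug", "headache": "Symptom"}
RELATION_CUES = {"reduced": "reduces", "reduces": "reduces", "caused": "causes"}
SECTION_CUES = {
    "Experiment Result": ("showed", "reduced", "increased"),
    "Experiment Method": ("was administered", "were treated", "we measured"),
}


@dataclass(frozen=True)
class Relation:
    subject: str
    predicate: str
    obj: str


def recognize_entities(sentence):
    """Module 310 stand-in: return (position, term, type) for known terms."""
    lowered = sentence.lower()
    found = [(lowered.find(term), term, etype)
             for term, etype in ENTITY_TYPES.items() if term in lowered]
    return sorted(found)


def extract_relation(sentence, entities):
    """Module 320 stand-in: link the first two entities if a cue word lies between them."""
    if len(entities) < 2:
        return None
    (start_a, a, _), (start_b, b, _) = entities[0], entities[1]
    between = sentence.lower()[start_a + len(a):start_b].split()
    for cue, predicate in RELATION_CUES.items():
        if cue in between:
            return Relation(a, predicate, b)
    return None


def classify_sentence(sentence):
    """Module 330 stand-in: assign a preset class by cue phrase."""
    lowered = sentence.lower()
    for label, cues in SECTION_CUES.items():
        if any(cue in lowered for cue in cues):
            return label
    return "Other"


def process(sentence):
    """Classify only sentences from which a relation was extracted."""
    relation = extract_relation(sentence, recognize_entities(sentence))
    if relation is None:
        return None
    return relation, classify_sentence(sentence)
```

The ordering mirrors the description: entities first, relations between recognized entities second, and classification applied only to sentences that yielded a relation.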

The structured knowledge extraction unit 400 may extract, as knowledge, the structured data, namely the table information recognized by the table recognizer 124 of the OCR module 120 and stored in the classified literature repository 150. The structured knowledge extraction unit 400 may extract the table information as knowledge by referring to the structured knowledge extraction model 450. The structured knowledge extraction unit 400 may include a header/value classification module 410, a semantic analysis module 420, and a value refinement module 430. The header/value classification module 410 may distinguish header cells from value cells in the table information. The semantic analysis module 420 may perform semantic analysis so that a header cell can be linked to knowledge. The value refinement module 430 may refine the value of a value cell so that it can be linked to knowledge.
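A minimal sketch of the three table-processing steps follows. It assumes, purely for illustration, that the first row of a recognized table holds the header cells and that header meaning can be looked up in a fixed mapping; the specification instead refers to a learned model 450. `HEADER_TO_PROPERTY` and all function names are hypothetical.

```python
import re

# Hypothetical mapping from header text to knowledge-base property names.
HEADER_TO_PROPERTY = {"group": "experimentGroup", "dose": "dose", "n": "sampleSize"}


def split_header_and_values(rows):
    """Module 410 stand-in: assume the first row holds the header cells."""
    return rows[0], rows[1:]


def analyze_header(cell):
    """Module 420 stand-in: map a header cell to (property, unit)."""
    match = re.fullmatch(r"\s*(.+?)\s*(?:\((.+)\))?\s*", cell)
    if match is None:  # empty or whitespace-only header cell
        return cell, None
    name, unit = match.group(1), match.group(2)
    return HEADER_TO_PROPERTY.get(name.lower(), name), unit


def refine_value(cell):
    """Module 430 stand-in: convert numeric strings to numbers, trim the rest."""
    text = cell.strip().replace(",", "")
    for convert in (int, float):
        try:
            return convert(text)
        except ValueError:
            continue
    return cell.strip()


def table_to_records(rows):
    """Return one record per value row, plus the units parsed from the headers."""
    header, values = split_header_and_values(rows)
    analyzed = [analyze_header(h) for h in header]
    units = {prop: unit for prop, unit in analyzed if unit}
    records = [{prop: refine_value(v) for (prop, _), v in zip(analyzed, row)}
               for row in values]
    return records, units
```

For example, a header cell such as `Dose (mg)` would be split into the property `dose` and the unit `mg`, and a value cell such as `1,200` would be refined to the integer 1200.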

The knowledge extracted by each of the semi-structured knowledge extraction unit 200, the unstructured knowledge extraction unit 300, and the structured knowledge extraction unit 400 may be structured data having a table structure or a tree structure. The structured data that each of these units can provide may be, for example, a relational database (RDB), comma-separated values (CSV), XML (eXtensible Markup Language), or JSON (JavaScript Object Notation), but is not limited thereto.
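To make the table-versus-tree distinction concrete, the sketch below serializes one extracted record as JSON (tree-shaped) and as CSV (table-shaped) using only the Python standard library. The record contents and the helper names `to_json` and `to_csv` are hypothetical and are not taken from the specification.

```python
import csv
import io
import json

# Hypothetical extracted record; the values are placeholders.
record = {"title": "Example paper", "doi": "10.1000/xyz123", "year": 2020}


def to_json(rec):
    """Tree-structured form: keys may nest, so JSON suits hierarchical knowledge."""
    return json.dumps(rec, ensure_ascii=False, sort_keys=True)


def to_csv(records):
    """Table-structured form: one row per record, one column per property."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=sorted(records[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()
```

Either form can be handed to a later integration step; which one is preferable depends on whether the extracted knowledge is flat or nested.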

The knowledge management unit 500 may integrate the knowledge extracted by each of the semi-structured knowledge extraction unit 200, the unstructured knowledge extraction unit 300, and the structured knowledge extraction unit 400. The knowledge management unit 500 may include a knowledge integration module 510 and a knowledge conversion module 520. The knowledge integration module 510 may link together the knowledge extracted by each of the three extraction units. The knowledge conversion module 520 may convert the knowledge linked and integrated by the knowledge integration module 510 into a form that can be stored in the knowledge base 600. In some embodiments, the knowledge conversion module 520 may store the knowledge in the knowledge base 600 as a knowledge graph.

A knowledge graph refers to a semantic graph, that is, a graph representing the semantic relationships between entities, built by accumulating semantic information from various sources so that search results can be improved. Semantic information is information that expresses a relationship between one resource and another. In other words, a knowledge graph is a collection of semantic information stored as data in graph form.

A knowledge graph may be built using, for example, the Resource Description Framework (RDF). RDF is a graph-based data model that serves as a model, language, and grammar for describing the attributes, characteristics, and relationships of any resource (web page, image, video, and so on) that has a Uniform Resource Identifier (URI). RDF expresses a relationship between two resources, a subject and an object, and a property, also called a predicate, describes the nature of that relationship. A statement with the subject-property-object structure is called a triple, and the relationship may be directional.

A knowledge graph may consist, for example, of multi-order relation triples formed by repetition: one entity is used as the subject and another entity as the object, so that the relationship between them is expressed as a triple with the subject-property-object structure; that other entity is then used as the subject and yet another entity as the object, so that their relationship is expressed as a further triple of the same structure; and so on.
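The chaining of triples described above can be sketched with plain Python tuples. This is a simplified in-memory stand-in, not an RDF store, and the identifiers (`paper:1`, `hasExperiment`, and so on) are hypothetical placeholders rather than terms from the specification.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    prop: str
    obj: str


# Hypothetical triples: the object of one triple is the subject of the next.
TRIPLES = [
    Triple("paper:1", "hasExperiment", "experiment:1"),
    Triple("experiment:1", "hasResult", "result:1"),
    Triple("experiment:1", "hasTreatment", "treatment:1"),
]


def objects_of(triples, subject, prop):
    """Return every object linked to the subject by the given property."""
    return [t.obj for t in triples if t.subject == subject and t.prop == prop]


def follow_path(triples, start, props):
    """Follow a chain of properties from a start node, one hop per property."""
    frontier = [start]
    for prop in props:
        frontier = [o for s in frontier for o in objects_of(triples, s, prop)]
    return frontier
```

Calling `follow_path(TRIPLES, "paper:1", ["hasExperiment", "hasResult"])` walks two hops, which is the sense in which the triples form a multi-order relation.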

The knowledge extraction system 1 may further include a question answering management unit 700 and a user interface (UI) 900. When it does, the knowledge extraction system 1 for scientific and technical papers may function as a science and technology question answering system.

The user interface 900 may receive a natural-language query from a user 10 through a network 20 and may transmit a response. The user interface 900 may provide an interface through which the user 10 accesses the knowledge extraction system 1, that is, the science and technology question answering system, from a terminal or similar device. Through the user interface 900, the user 10 may send a query to the system and receive the response that the system provides to that query.

The question answering management unit 700 may interpret a query submitted by the user 10 through the user interface 900 by referring to a natural language understanding unit 750, generate in natural language a response to the query obtained from the knowledge base 600, and provide the response to the user 10 through the user interface 900.

The knowledge extraction system 1 may further include an extraction model learning unit 800. The extraction model learning unit 800 may train the semi-structured knowledge extraction model 250, the unstructured knowledge extraction model 350, and the structured knowledge extraction model 450. The extraction model learning unit 800 may include an input module 810 and an output module 820. The input module 810 may receive data stored in the collected literature repository 130 or the classified literature repository 150. The output module 820 may use the result of deep learning performed on the data received by the input module 810 to train the semi-structured knowledge extraction model 250, the unstructured knowledge extraction model 350, or the structured knowledge extraction model 450, and may update the previously trained model accordingly.

In the extraction model learning unit 800, the input module 810 may receive data stored in the collected literature repository 130 so that deep learning can be performed on it, and the output module 820 may use the result to train and update the semi-structured knowledge extraction model 250. That is, the semi-structured knowledge extraction model 250 may be trained and updated through deep learning performed by the extraction model learning unit 800 on the semi-structured data stored in the collected literature repository 130.

In the extraction model learning unit 800, the input module 810 may receive data stored in the classified literature repository 150 so that deep learning can be performed on it, and the output module 820 may use the result to train and update the unstructured knowledge extraction model 350. That is, the unstructured knowledge extraction model 350 may be trained and updated through deep learning performed by the extraction model learning unit 800 on the unstructured data stored in the classified literature repository 150.

In the extraction model learning unit 800, the input module 810 may receive data stored in the classified literature repository 150 so that deep learning can be performed on it, and the output module 820 may use the result to train and update the structured knowledge extraction model 450. That is, the structured knowledge extraction model 450 may be trained and updated through deep learning performed by the extraction model learning unit 800 on the structured data extracted from the table information of the scientific and technical papers 40 stored in the classified literature repository 150.

The terms used in scientific and technical papers 40, that is, the entity names, may differ from one technology classification to another. The body text and table information of the papers 40 may therefore vary widely by technology classification. The bibliographic information and identification numbers, which are the semi-structured data of the papers 40, may nevertheless have a largely similar format regardless of technology classification.

Accordingly, the extraction model learning unit 800 of the knowledge extraction system 1 according to the present invention may select different input data for two purposes: training the semi-structured knowledge extraction model 250, which extracts knowledge from semi-structured data such as bibliographic information and identification numbers, and training the unstructured knowledge extraction model 350 and the structured knowledge extraction model 450, which extract knowledge from unstructured and structured data such as body text and table information.

That is, semi-structured data such as bibliographic information and identification numbers make up a relatively small portion of a scientific and technical paper 40 and have a largely similar format regardless of technology classification. The extraction model learning unit 800 may therefore receive such data from the collected literature repository 130, which stores all collected papers 40 regardless of technology classification, and use it to train the semi-structured knowledge extraction model 250. Unstructured and structured data such as body text and table information, by contrast, make up a relatively large portion of a paper 40 and vary widely by technology classification. The extraction model learning unit 800 may therefore receive such data from the classified literature repository 150, which stores only the papers 40 retrieved by a keyword search or classified into a single technology classification, and use it to train the unstructured knowledge extraction model 350 and the structured knowledge extraction model 450.
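The routing of training inputs can be summarized in a short sketch. The `Paper` record, its field names, and the dictionary keys are hypothetical; the sketch shows only which repository feeds which model and performs no training.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Paper:
    paper_id: str
    category: str
    bibliography: dict   # semi-structured data (title, DOI, ...)
    body_text: str       # unstructured data
    tables: tuple        # structured data (recognized tables)


def build_repositories(papers, target_category):
    """Repository 130 keeps everything; repository 150 keeps one classification."""
    collected = list(papers)
    classified = [p for p in collected if p.category == target_category]
    return collected, classified


def training_inputs(collected, classified):
    """Select the training data for each extraction model by repository."""
    return {
        # Format is similar across fields, so all collected papers are used.
        "semi_structured_model_250": [p.bibliography for p in collected],
        # Vocabulary and tables vary by field, so only classified papers are used.
        "unstructured_model_350": [p.body_text for p in classified],
        "structured_model_450": [t for p in classified for t in p.tables],
    }
```

Under these assumptions, the semi-structured model sees the largest and most varied set of papers, while the other two models see only field-specific material.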

As a result, the extraction model learning unit 800 may train the semi-structured knowledge extraction model 250, the unstructured knowledge extraction model 350, and the structured knowledge extraction model 450 both efficiently and accurately.

Because the knowledge extraction system 1 for scientific and technical papers according to the present invention extracts knowledge through knowledge extraction models trained both efficiently and accurately on papers 40 retrieved for a specific keyword or classified into a specific technical field, it allows the knowledge contained in the papers 40 to be found quickly and accurately.

FIG. 2 is a schematic block diagram illustrating a method of training the semi-structured knowledge extraction model in a knowledge extraction system for scientific and technical papers according to an exemplary embodiment of the present invention.

Referring to FIG. 2, the knowledge extraction system 1 may include a document management unit 100, the semi-structured knowledge extraction unit 200, the semi-structured knowledge extraction model 250, and the extraction model learning unit 800.

In the extraction model learning unit 800, the input module 810 may receive data stored in the collected literature repository 130 so that deep learning can be performed on it, and the output module 820 may use the result to train and update the semi-structured knowledge extraction model 250. The semi-structured knowledge extraction model 250 may thus be trained and updated through deep learning performed by the extraction model learning unit 800 on the semi-structured data stored in the collected literature repository 130, and the semi-structured knowledge extraction unit 200 may refer to the semi-structured knowledge extraction model 250 to extract, as knowledge, the bibliographic information and identification numbers that constitute the semi-structured data among the data stored in the classified literature repository 150.

FIG. 3 is a schematic block diagram illustrating a method of training the unstructured knowledge extraction model and the structured knowledge extraction model in a knowledge extraction system for scientific and technical papers according to an exemplary embodiment of the present invention.

Referring to FIG. 3, the knowledge extraction system 1 may include the document management unit 100, the unstructured knowledge extraction unit 300, the unstructured knowledge extraction model 350, the structured knowledge extraction unit 400, the structured knowledge extraction model 450, and the extraction model learning unit 800.

In the extraction model learning unit 800, the input module 810 may receive data stored in the classified literature repository 150 so that deep learning can be performed on it, and the output module 820 may use the result to train and update the unstructured knowledge extraction model 350. The unstructured knowledge extraction model 350 may thus be trained and updated through deep learning performed by the extraction model learning unit 800 on the unstructured data stored in the classified literature repository 150, and the unstructured knowledge extraction unit 300 may refer to the unstructured knowledge extraction model 350 to extract, as knowledge, the unstructured data among the data stored in the classified literature repository 150.

In addition, in the extraction model learning unit 800, the input module 810 may receive data stored in the classified literature repository 150 so that deep learning can be performed on it, and the output module 820 may use the result to train and update the structured knowledge extraction model 450. The structured knowledge extraction model 450 may thus be trained and updated through deep learning performed by the extraction model learning unit 800 on the structured data stored in the classified literature repository 150, and the structured knowledge extraction unit 400 may refer to the structured knowledge extraction model 450 to extract, as knowledge, the table information that constitutes the structured data among the data stored in the classified literature repository 150.

FIG. 4 is a schematic block diagram illustrating a method of searching for knowledge using a knowledge base built by a knowledge extraction system for scientific and technical papers according to an exemplary embodiment of the present invention.

Referring to FIG. 4, the knowledge extraction system 1 functioning as a science and technology question answering system may include the user interface 900, the question answering management unit 700, the natural language understanding unit 750, and the knowledge base 600.

The question answering management unit 700 may interpret a query submitted by the user 10 through the user interface 900 by referring to the natural language understanding unit 750, generate in natural language a response to the query obtained from the knowledge base 600, and provide the response to the user 10 through the user interface 900. The question answering management unit 700 may include a query reception module 710, a knowledge search module 720, and a response generation module 730. The query reception module 710 may receive the query submitted by the user 10 through the user interface 900 and interpret it by referring to the natural language understanding unit 750. The knowledge search module 720 may submit the query interpreted by the query reception module 710 to the knowledge base 600 and receive the result. The response generation module 730 may turn the result received by the knowledge search module 720 into a natural-language response to the query and provide it to the user 10 through the user interface 900.
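The three-module flow can be sketched end to end with a toy knowledge base. The query interpretation here is a bare keyword lookup that stands in for the natural language understanding unit 750; it is an assumption made for illustration, as are the contents of `KNOWLEDGE_BASE` and all function names.

```python
# Hypothetical knowledge base of (subject, property, object) triples.
KNOWLEDGE_BASE = [
    ("paper:1", "title", "Example paper"),
    ("paper:1", "doi", "10.1000/xyz123"),
]
PROPERTY_KEYWORDS = {"title": "title", "doi": "doi"}


def receive_query(question):
    """Module 710 stand-in: pick a known subject and property out of the question."""
    words = question.lower().rstrip("?").split()
    subject = next((w for w in words if w.startswith("paper:")), None)
    prop = next((PROPERTY_KEYWORDS[w] for w in words if w in PROPERTY_KEYWORDS), None)
    return subject, prop


def search_knowledge(subject, prop):
    """Module 720 stand-in: look up matching objects in the knowledge base."""
    return [o for s, p, o in KNOWLEDGE_BASE if s == subject and p == prop]


def generate_response(subject, prop, results):
    """Module 730 stand-in: phrase the search result as a sentence."""
    if not results:
        return "No answer was found in the knowledge base."
    return f"The {prop} of {subject} is {', '.join(results)}."


def answer(question):
    subject, prop = receive_query(question)
    return generate_response(subject, prop, search_knowledge(subject, prop))
```

Keeping interpretation, search, and response generation as separate functions mirrors the module boundaries in the description, so any one stage could be replaced without touching the others.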

FIG. 5 is a schematic block diagram of the natural language understanding unit included in a knowledge extraction system for scientific and technical papers according to an exemplary embodiment of the present invention.

Referring to FIGS. 4 and 5 together, the natural language understanding unit 750 may include a morpheme analysis unit 751, a syntax analysis unit 752, a named entity analysis unit 753, a filtering analysis unit 754, an intent classification unit 755, a domain analysis unit 756, and a semantic role labeling (SRL) unit 757. The morpheme analysis unit 751 may split the sentences of the natural-language query of the user 10 into morphemes. The syntax analysis unit 752 and the named entity analysis unit 753 may respectively perform syntax analysis and named entity analysis on the sentence elements split into morphemes. The filtering analysis unit 754 may remove unnecessary features from the sentence elements to produce a simplified sentence. The intent classification unit 755 and the domain analysis unit 756 may respectively classify the intent of, and perform domain analysis on, the query to which semantic roles have been assigned, based on the simplified sentence produced by the filtering analysis unit 754. The semantic role labeling unit 757 may assign (label) a semantic role to each sentence element.

FIG. 6 is a conceptual diagram illustrating a process of extracting semi-structured knowledge in a knowledge extraction system for scientific and technical papers according to an exemplary embodiment of the present invention.

Referring to FIGS. 1 and 6 together, the knowledge extraction system 1 may extract semi-structured knowledge from a document that is a scientific and technical paper 40. Semi-structured knowledge may be, for example, knowledge extracted from bibliographic information and identification numbers, which are semi-structured data. In FIG. 6, for convenience of illustration, the DOI and ISSN, which are identification information, are shown as part of the bibliographic information.

Specifically, when a scientific and technical paper 40 collected by the literature collection module 110 is a PDF or image file, the text extractor 122 of the OCR module 120 may extract the semi-structured data, namely the bibliographic information of the document, and the semi-structured knowledge extraction unit 200 may extract semi-structured knowledge by referring to the semi-structured knowledge extraction model 250. The semi-structured knowledge may have properties (Property), each with a corresponding value (Value), for bibliographic information such as the year of publication (Year), title (Title), summary description (Description), author (Creator), keyword (Keyword), and publication (Publicated), and for identification numbers such as the digital object identifier (DOI) and the International Standard Serial Number (ISSN).

FIG. 7 is a conceptual diagram illustrating a process of extracting structured knowledge in a knowledge extraction system for scientific and technical papers according to an exemplary embodiment of the present invention.

Referring to FIGS. 1 and 7 together, the knowledge extraction system 1 may extract structured knowledge from a document that is a scientific and technical paper 40. Structured knowledge may be, for example, table information, which is structured data.

Specifically, when a scientific and technical paper 40 collected by the literature collection module 110 contains a table, the table recognizer 124 of the OCR module 120 may recognize the table in the document and extract table information, and the structured knowledge extraction unit 400 may extract the table information as structured knowledge by referring to the structured knowledge extraction model 450.

A single table may contain various kinds of information. The header/value classification module 410 of the structured knowledge extraction unit 400 may distinguish header cells from value cells in the table information, the semantic analysis module 420 may perform semantic analysis so that a header cell can be linked to knowledge, and the value refinement module 430 may refine the value of a value cell so that it can be linked to knowledge.

In this way, the knowledge extraction system 1 may extract, for example, group information, treatment information, symptom information, experiment information, and the like from a single table as structured knowledge to which meaning has been assigned.

FIG. 8 is a conceptual diagram illustrating a knowledge base built from knowledge extracted by a knowledge extraction system for scientific and technical papers according to an exemplary embodiment of the present invention.

Referring to FIGS. 1 and 8 together, the knowledge extraction system 1 may take the semi-structured, unstructured, and structured knowledge extracted from the collected scientific and technical papers 40 by the semi-structured knowledge extraction unit 200, the unstructured knowledge extraction unit 300, and the structured knowledge extraction unit 400, link and integrate it in the knowledge integration module 510 of the knowledge management unit 500, and store the linked and integrated knowledge in the knowledge base 600 through the knowledge conversion module 520. The knowledge conversion module 520 may store the knowledge in the knowledge base 600 as a knowledge graph 602.

The knowledge graph 602 may be represented, for example, as triples having a subject-property-object structure, in which a research paper (Research paper) is used as the subject, the paper ID and the paper title are used as objects, and their respective values are used as properties. In addition, the research paper (Research paper) may be used as a subject with an experiment (Experiment) as its object, and the experiment (Experiment) may be used as a subject with an experiment result (Experiment Result), a treatment (Treatment), an experiment group (Experiment Group), and a control group (Control Group) as its objects.
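The triple structure described above can be sketched in Python as follows. The identifiers (paper:001, exp:001), the property names, and the sample values are invented for illustration, and the sketch follows the conventional reading in which the property names the relation and the object holds the value. The patent does not specify a serialization format or a storage engine.

```python
from collections import namedtuple

# One edge of the knowledge graph: subject - property - object.
Triple = namedtuple("Triple", ["subject", "prop", "obj"])

# Hypothetical triples for one paper and one experiment reported in it.
triples = [
    Triple("paper:001", "type", "ResearchPaper"),
    Triple("paper:001", "paperID", "P-0001"),
    Triple("paper:001", "paperTitle", "A hypothetical study"),
    Triple("paper:001", "hasExperiment", "exp:001"),
    Triple("exp:001", "type", "Experiment"),
    Triple("exp:001", "experimentResult", "symptom score reduced"),
    Triple("exp:001", "treatment", "drug A 10 mg"),
    Triple("exp:001", "experimentGroup", "n=20"),
    Triple("exp:001", "controlGroup", "n=20"),
]


def objects_of(subject, prop):
    """Return every object linked to `subject` through `prop`."""
    return [t.obj for t in triples if t.subject == subject and t.prop == prop]


def describe(subject):
    """Collect all properties of one subject into a dictionary."""
    return {t.prop: t.obj for t in triples if t.subject == subject}


experiments = objects_of("paper:001", "hasExperiment")
print(experiments)
print(describe(experiments[0]))
```

Because the experiment node is both the object of one triple and the subject of others, following hasExperiment from a paper reaches its result, treatment, and groups.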

In this way, the query response management unit 700 may generate a response to a query from the user 10 and provide it to the user 10 by referring to the knowledge base 600, which stores the knowledge graph 602 built for the scientific and technical papers 40 retrieved for a specific keyword or classified into a specific technical field.
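How a stored knowledge graph could be consulted to answer a query is sketched below in Python. This passage does not describe the internal operation of the query response management unit 700, so the pattern-matching lookup, the KB contents, and the function names are assumptions made purely for illustration.

```python
# Hypothetical knowledge base: a set of (subject, property, object) triples.
KB = {
    ("paper:001", "paperTitle", "A hypothetical study"),
    ("paper:001", "hasExperiment", "exp:001"),
    ("exp:001", "treatment", "drug A 10 mg"),
    ("exp:001", "experimentResult", "symptom score reduced"),
}


def match(s=None, p=None, o=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return sorted(
        t for t in KB
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


def answer(title, prop):
    """Two-hop lookup: title -> paper -> experiment -> requested property."""
    answers = []
    for paper, _, _ in match(p="paperTitle", o=title):
        for _, _, experiment in match(s=paper, p="hasExperiment"):
            answers.extend(obj for _, _, obj in match(s=experiment, p=prop))
    return answers


# Example query: which treatment was used in the named paper?
print(answer("A hypothetical study", "treatment"))
```

A production system would more likely translate a natural-language question into a graph query, but the lookup shows why storing knowledge as linked triples makes such answers retrievable.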

The present invention has been described in detail above with reference to preferred embodiments. However, the present invention is not limited to those embodiments, and various modifications and changes may be made by those of ordinary skill in the art within the technical spirit and scope of the present invention.

1: knowledge extraction system, 10: user, 20: network, 40: literature, scientific and technical papers, 100: literature management unit, 200: semi-structured knowledge extraction unit, 250: semi-structured knowledge extraction model, 300: unstructured knowledge extraction unit, 350: unstructured knowledge extraction model, 400: structured knowledge extraction unit, 450: structured knowledge extraction model, 500: knowledge management unit, 600: knowledge base, 700: query response management unit, 800: extraction model learning unit, 900: user interface

Claims

A knowledge extraction system comprising:
a literature collection module for collecting a plurality of scientific and technical papers through a network;
a collected literature repository for storing the plurality of scientific and technical papers collected by the literature collection module;
a literature classification module for extracting data from the plurality of scientific and technical papers collected by the literature collection module and classifying the papers according to a technology classification system;
a classified literature repository for storing, by classification, the plurality of scientific and technical papers classified by the literature classification module according to the technology classification system;
a semi-structured knowledge extraction unit for extracting, with reference to a semi-structured knowledge extraction model, semi-structured data as knowledge from among the data extracted from the plurality of scientific and technical papers stored in the classified literature repository;
an unstructured knowledge extraction unit for extracting, with reference to an unstructured knowledge extraction model, unstructured data as knowledge from among the data extracted from the plurality of scientific and technical papers stored in the classified literature repository;
a structured knowledge extraction unit for extracting, with reference to a structured knowledge extraction model, structured data as knowledge from among the data extracted from the plurality of scientific and technical papers stored in the classified literature repository;
a knowledge management unit for integrating the knowledge extracted by each of the semi-structured knowledge extraction unit, the unstructured knowledge extraction unit, and the structured knowledge extraction unit and storing the integrated knowledge in a knowledge base; and
an extraction model learning unit for training each of the semi-structured knowledge extraction model, the unstructured knowledge extraction model, and the structured knowledge extraction model,
wherein the semi-structured knowledge extraction model is trained in the extraction model learning unit using the semi-structured data of all of the plurality of scientific and technical papers stored in the collected literature repository, and
wherein each of the unstructured knowledge extraction model and the structured knowledge extraction model is trained in the extraction model learning unit using, respectively, the unstructured data and the structured data of only those scientific and technical papers, among the plurality of scientific and technical papers stored in the classified literature repository, that correspond to a predetermined keyword or that belong to one technical classification under the technology classification system.

The knowledge extraction system according to claim 1, further comprising:
an optical character recognition (OCR) module for extracting text data from the plurality of scientific and technical papers when the plurality of scientific and technical papers collected by the literature collection module are PDF (Portable Document Format) or image files;
a literature classification model for classifying the plurality of scientific and technical papers by using text information extracted from the plurality of scientific and technical papers; and
a classified literature repository for storing the plurality of scientific and technical papers classified by the literature classification model.

The knowledge extraction system according to claim 2,
wherein the OCR module comprises:
a text extractor for extracting semi-structured data and unstructured data from the collected plurality of scientific and technical papers; and
a table recognizer for recognizing and extracting table information, which is the structured data of the collected plurality of scientific and technical papers.

The knowledge extraction system according to claim 3,
wherein the structured knowledge extraction unit
extracts as knowledge, from among the data stored in the classified literature repository, the table information, which is structured data extracted by the table recognizer.

The knowledge extraction system according to claim 4,
wherein the structured knowledge extraction unit comprises:
a header/value classification module for distinguishing header cells from value cells in the table information;
a semantic analysis module for performing semantic analysis so that the header cells can be linked to knowledge; and
a value refinement module for refining the values of the value cells so that they can be linked to knowledge.

The knowledge extraction system according to claim 2,
wherein the literature classification model
performs a search of the collected plurality of scientific and technical papers using the predetermined keyword, classifies the scientific and technical papers corresponding to the predetermined keyword among the plurality of scientific and technical papers, and stores them in the classified literature repository;
classifies the collected plurality of scientific and technical papers according to the technical classification, and stores the scientific and technical papers classified into a specific technical classification among the plurality of scientific and technical papers in the classified literature repository.

The knowledge extraction system according to claim 1,
wherein the semi-structured knowledge extraction unit
extracts as knowledge, from the data stored in the classified literature repository, bibliographic information and an identification number, which are semi-structured data, and
wherein the bibliographic information is a title, author, abstract, or keyword of a document, and the identification number is an International Standard Serial Number (ISSN), a Digital Object Identifier (DOI), or an International Standard Book Number (ISBN).

The knowledge extraction system according to claim 1,
wherein the unstructured knowledge extraction unit comprises:
a named entity recognition module for recognizing named entity information from unstructured data among the data stored in the classified literature repository;
a relation extraction module for extracting relations between the named entity information recognized by the named entity recognition module; and
a sentence classification module for classifying, according to a preset sentence classification system, the sentences from which the relation extraction module has extracted relations between the named entity information.

The knowledge extraction system according to claim 1,
wherein the knowledge management unit comprises:
a knowledge integration module for linking and integrating the knowledge extracted by each of the semi-structured knowledge extraction unit, the unstructured knowledge extraction unit, and the structured knowledge extraction unit; and
a knowledge conversion module for storing the linked and integrated knowledge in the knowledge base as a knowledge graph.

The knowledge extraction system according to claim 3,
wherein the text extractor further extracts, for image information in the collected plurality of scientific and technical papers, the caption text and link information indicating where the image is stored, and
wherein the collected literature repository stores the images of the image information included in the collected plurality of scientific and technical papers.