KR101636477B1

KR101636477B1 - Human-like Knowledge Expansion and Verification System Using Structured Knowledge Information and Document Crawling, Method, Recording Medium

Info

Publication number: KR101636477B1
Application number: KR1020140169015A
Authority: KR
Inventors: 양중식; 곽용진
Original assignee: (주)아이와즈
Priority date: 2014-11-28
Filing date: 2014-11-28
Publication date: 2016-07-06
Also published as: KR20160065372A

Abstract

The present invention expands and verifies knowledge information, thereby addressing the reliability problems that have persistently arisen when knowledge information is expanded and reducing the cost of expansion and verification. Because pre-built knowledge information is used for collection, target search, and extraction, the value and reliability of that pre-built knowledge can be evaluated. Newly collected and extracted knowledge information is compared and contrasted with the pre-built knowledge information to verify its consistency and reliability, so that knowledge information is expanded effectively. To this end, the invention comprises a query generation unit, a document collection unit, a knowledge information extraction unit, a knowledge information verification unit, and a knowledge information update unit.

Description

The present invention relates to a knowledge expansion and verification system using structured knowledge information and document collection.

The present invention was made as part of the project "Development of Core Technology for Human-like Self-Learning Intelligence Based on a Symbolic Approach," supported by the Ministry of Science, ICT and Future Planning. It concerns the automatic or semi-automatic expansion and verification of knowledge using the knowledge information employed for semantic, knowledge-based, and intelligent processing in natural language processing and artificial intelligence, such as ontologies, semantic networks, the Semantic Web, and knowledge bases.

Ontologies, semantic networks, the Semantic Web, and knowledge bases are widely used as means for computers to understand human needs and intentions. However, building and verifying them depends heavily on expert knowledge, which makes large-scale construction and verification difficult.

Recently, efforts have been made, mainly around ontologies, to expand them by integrating ontologies built in different domains, and methods for verifying the ontologies built in this process have also been proposed. Ontologies and knowledge bases are also expanded by generating RDF triples from large-scale collections of web documents.

In the traditional approach, an ontology is created through a process of recognition, generation, evaluation, and recording. The traditional approach relies heavily on heuristics. Creating an ontology through heuristics has the advantage of producing an ontology with a rich vocabulary that is faithful to formal semantics.

However, because so much depends on heuristics, this approach consumes a great deal of time and money, and the heuristic-dependent parts stand in the way of reducing time and cost. Methods proposed to overcome this drawback include domain ontology generation using UML (XML), domain ontology generation using Formal Concept Analysis, ontology generation using data mining techniques (ID3, AOI, association rules mining, clustering), and ontology generation using tools (Protege 2000, etc.).

However, domain ontology generation using UML relies heavily on heuristics during analysis and design and therefore consumes much time and money. The method using Formal Concept Analysis relies heavily on heuristics when identifying valid keywords in documents, and data mining techniques likewise consume much time and money in selecting the data to which the techniques are applied. Tool-based methods use tools to reduce the time and cost of some heuristic-dependent parts, but they fundamentally only assist the traditional generation method. They still rely heavily on heuristics, consume much time and money, and leave the effectiveness of the resulting ontology unverified.

Semantic Web technology refers to a next-generation intelligent web in which computers can understand the meaning of information resources and even perform logical inference. It is not the web of today, in which a person uses a mouse or keyboard to find the desired information and reads and understands it by eye, but a web that computers can understand. In other words, instead of the current web, which is designed to be convenient for people to read and interpret, it is an intelligent web expressed in a new language that computers can understand, so that machines can communicate with one another.

The principle of the Semantic Web is to convert the meaning connecting information resources into a language that computers can understand, unlike current web documents, which are written mainly in natural language for people to understand. Computers can then interpret the meaning of information resources, and machines can exchange information with one another and carry out the necessary tasks on their own.

Research on the Semantic Web is dominated by ontology technology based on RDF (Resource Description Framework) and by Topic Map technology centered on the International Organization for Standardization (ISO). The former attaches metadata to the current web, that is, descriptions of resources in terms of resource (subject) / property (predicate) / property value (object), so that the meaning of information can be understood and processed. The latter supports the distributed management of information and knowledge using XTM (XML Topic Maps), ISO's standard description language based on XML (eXtensible Markup Language), and has a dual structure consisting of a knowledge layer and an information layer.

Once such a Semantic Web is realized, computers can process information automatically, maximizing the productivity and efficiency of information systems. Computers could conduct e-commerce on their own, and the technology can be applied to various fields such as enterprise system integration (SI), intelligent robot systems, and medical informatics.

Prior art documents include Korean Laid-Open Patent Publication No. 2010-0003084 (Apparatus and method for extracting an ontology subgraph, and apparatus and method for semantic matching between a search user's query and an ontology using the same) and Korean Laid-Open Patent Publication No. 10-2009-0010556 (Method and apparatus for generating an ontology from a database). However, in expanding and verifying knowledge information, these documents focus mainly on expansion and perform no verification, so the expanded knowledge information contains a great deal of unintended noise, which makes it difficult to use effectively.

Existing work on expanding and verifying knowledge information, including ontologies, has tended to concentrate on expansion. Moreover, because expansion and verification were not performed together, the expanded knowledge information contained much information that did not match the intent of its construction or expansion (noise). This made the expanded knowledge information difficult to use.

The object of the present invention is to combine pre-built knowledge information with a document collector (crawler) in order to verify and expand knowledge information for computer processing in a manner similar to the way humans verify and expand knowledge.

To achieve this object, the knowledge expansion and verification system using knowledge information and document collection according to the present invention comprises:

a query generation unit (100) that converts pre-built knowledge information into document-search queries capable of collecting the related documents needed to update and verify the knowledge;

a document collection unit (200) that collects and stores the result documents returned from the target domain for the queries generated by the query generation unit (100), and analyzes and evaluates the relevance of the documents collected from the target domain;

a knowledge information extraction unit (300) that extracts new knowledge information from the collected documents on the basis of the structural framework and rules of the pre-built knowledge information, the meaning of the pre-built knowledge information (the vocabulary, relations, structure, and data values it contains), and content related to the pre-built knowledge information (complementary or contradictory content);

a knowledge information verification unit (400) that compares the extracted knowledge information with the pre-built knowledge information and determines that the extracted knowledge information should be deleted when the two are identical, that the pre-built knowledge information should be updated when it contains less information than the extracted knowledge information, and that the extracted knowledge information should be supplemented when the pre-built knowledge information contains more information than the extracted knowledge information; and

a knowledge information update unit (500) that, according to the verification result of the knowledge information verification unit (400), deletes the extracted knowledge information, updates the pre-built knowledge information, or supplements the extracted knowledge information.

To achieve this object, the knowledge expansion and verification method using knowledge information and document collection according to the present invention comprises:

A. converting pre-built knowledge information into queries for the document-collection domain;

B. collecting documents from the target domain with the converted queries and analyzing and evaluating the collected documents;

C. extracting new knowledge information from the collected documents using the pre-built knowledge information, with reference to the structural framework and rules of the knowledge information, the meaning of the knowledge information (the vocabulary, relations, structure, and data values it contains), and content related to the knowledge information;

D. comparing the extracted knowledge information with the pre-built knowledge information to verify it;

E. deleting, updating, or supplementing the pre-built knowledge information and the newly extracted knowledge information; and

F. after the processing in step E, repeating steps A to E if there is further knowledge information to delete, update, or supplement, and terminating otherwise.
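The repeat-until-nothing-changes behavior of step F can be sketched as a fixed-point loop. This is only an illustrative sketch: the triple representation of a fact, the `run_cycle` callable that bundles steps A to E, and the `max_rounds` safety cap are all assumptions introduced here and are not specified in the text.

```python
from typing import Callable, Set, Tuple

# Assumed representation: a fact is a (subject, predicate, object) triple.
Fact = Tuple[str, str, str]


def expand_until_stable(
    knowledge: Set[Fact],
    run_cycle: Callable[[Set[Fact]], Set[Fact]],
    max_rounds: int = 10,
) -> Set[Fact]:
    """Repeat one A-E cycle until a round leaves the knowledge unchanged (step F)."""
    current = set(knowledge)
    for _ in range(max_rounds):          # safety cap, not part of the described method
        updated = run_cycle(current)     # hypothetical callable standing in for steps A-E
        if updated == current:           # nothing was deleted, updated, or supplemented
            break
        current = updated
    return current


def toy_cycle(facts: Set[Fact]) -> Set[Fact]:
    """Toy stand-in for steps A-E: adds the inverse of each 'president_of' fact."""
    out = set(facts)
    for subject, predicate, obj in facts:
        if predicate == "president_of":
            out.add((obj, "has_president", subject))
    return out
```

With `toy_cycle`, the first round adds one inverse fact and the second round changes nothing, so the loop stops after two rounds.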

The step of extracting new knowledge information according to the present invention comprises: (C-1) extracting, from the collected documents, items structurally similar to the pre-built knowledge information; (C-2) extracting items semantically similar to the pre-built knowledge information; (C-3) extracting items that contradict the pre-built knowledge information; and (C-4) extracting items complementary to the pre-built knowledge information.

The step of merging the extracted knowledge information with the pre-built knowledge information according to the present invention comprises: (D-1) deleting the extracted knowledge information when it is identical to the pre-built knowledge information; (D-2) updating the pre-built knowledge information when it contains less information than the extracted knowledge information; and (D-3) supplementing the extracted knowledge information when the pre-built knowledge information contains more information than the extracted knowledge information.
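The three-way decision in D-1 to D-3 can be illustrated with a small comparison function. This sketch assumes a knowledge item is a flat dictionary of attributes and reads "contains less or more information" as a strict subset or superset of attribute pairs. Both are assumptions made for illustration, and the `UNDECIDED` case (conflicting or unrelated items) is not covered by the text.

```python
from enum import Enum
from typing import Dict


class Action(Enum):
    DELETE_EXTRACTED = "D-1"
    UPDATE_EXISTING = "D-2"
    SUPPLEMENT_EXTRACTED = "D-3"
    UNDECIDED = "none"   # conflicting or unrelated items; outside the described cases


def verify(existing: Dict[str, str], extracted: Dict[str, str]) -> Action:
    """Compare a pre-built item with an extracted item and pick the follow-up action."""
    old, new = set(existing.items()), set(extracted.items())
    if old == new:
        return Action.DELETE_EXTRACTED       # identical: the extracted copy is redundant
    if old < new:
        return Action.UPDATE_EXISTING        # pre-built item has less information
    if old > new:
        return Action.SUPPLEMENT_EXTRACTED   # pre-built item has more information
    return Action.UNDECIDED
```

For instance, comparing `{"name": "Obama"}` with `{"name": "Obama", "office": "President"}` yields `Action.UPDATE_EXISTING`.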

The above method is further characterized in that it is implemented as a program and provided on a computer-readable recording medium.

By expanding and verifying knowledge information, the present invention addresses the reliability problems that have persistently arisen when knowledge information is expanded and reduces the cost of expansion and verification. Because pre-built knowledge information is used for collection, target search, and extraction, its value and reliability can be evaluated. Because newly collected and extracted knowledge information is compared and contrasted with the pre-built knowledge information to verify consistency and reliability, the expanded knowledge information can be used effectively.

FIG. 1 shows a knowledge expansion and verification system using structured knowledge information and document collection according to the present invention.
FIG. 2 shows a knowledge expansion and verification method using structured knowledge information and document collection according to the present invention.
FIG. 3 shows the process of extracting new knowledge information.

The present invention aims to make the expansion of knowledge information reliable by expanding pre-built knowledge information and verifying the expanded knowledge information. In outline, it works as follows. Pre-built knowledge information is converted into queries for document collection; the result documents returned for the queries are collected and stored; new knowledge information is extracted from the collected documents on the basis of the pre-built knowledge information; the extracted knowledge information is compared with the pre-built knowledge information to update the knowledge information as a whole; and this process continues until no further knowledge information is updated.

The present invention relates to a system and method for continuously accumulating, verifying, and expanding knowledge through the above procedure, following the principle by which humans acquire new knowledge on the basis of what they currently know and supplement and extend their existing knowledge.

First, a knowledge expansion and verification system that applies structured knowledge information and document collection to pre-built knowledge information, as one embodiment of the present invention, is described in detail with reference to FIG. 1.

As shown in FIG. 1, the knowledge expansion and verification system using structured knowledge information and document collection according to the present invention includes a query generation unit (100), a document collection unit (200), a knowledge information extraction unit (300), a knowledge information verification unit (400), and a knowledge information update unit (500).

The query generation unit (100) analyzes the pre-built knowledge information and generates document-search queries capable of collecting the related documents needed to update and verify the knowledge. That is, the query generation unit of the present invention converts the document content of the target domain into the form of a document-search query. This resembles the queries that search services such as Naver and Daum use to return search results for keywords submitted by a user. The difference is that the queries generated by the query generation unit of the present invention are used to collect target documents, whereas the search service of a general portal uses its queries to answer a user's request.

This collection method is a widely known one, as exemplified by Google's deep crawl.

For early document collectors, particularly collectors of web documents, the search space was the address space of the web pages in which the documents were stored. Collecting documents meant copying the documents stored in a given address space, and the content of the documents was not considered. As systems for exchanging and serving documents over the web developed, however, documents came to be stored in a separate database rather than in the web page address space, and to be reprocessed into web documents and served according to user requests or service policies. In such cases a web document can be obtained only when some request is sent to the interface that provides the web service. Google's deep crawl is a method of obtaining web documents that are served in this way without being stored in the address space.

The query generation unit of the present invention serves the same purpose as the query generation performed for document collection. In document collection, queries are generated according to the address space targeted for collection or the service from which documents are to be obtained.

For example, to obtain documents from a particular information service provider for used cars (the target domain), the car name, price, year of manufacture, and manufacturer are taken as keywords and matching queries are generated. The key technique is therefore to identify the service to be collected from and to determine the query structure and keywords appropriate to it.

In the present invention, the documents to be collected are those containing content with which the knowledge information can be expanded and verified. Because the query expression for document acquisition is generated from the pre-built knowledge information, the address of the target domain serving the documents (for example, a web domain name such as http://books.google.com/tech) has in itself nothing to do with how the query expression is composed.

Even a query with the same keywords and structure is valid only if it was obtained from the knowledge information. That is, the query expression is composed not from the service structure of the used-car service provider but from the pre-built knowledge information about used cars.

For example, even for a search on the keyword "GM car," different query expressions are generated depending on whether "GM car" is associated with the knowledge information "used car prices" or with "the movie Transformers," and the returned documents are evaluated differently as well.

Another example of query generation according to the present invention follows. From the knowledge information "Obama is the 21st President of the United States," knowledge expressions such as "Obama (person: name), United States (country: name), President (president: position) - 21st" can be extracted as knowledge-representation keywords, and from these the following search queries can be generated.

search keyword = (Obama & (type:name or type:person))

search keyword = (United States & (type:name or type:country))

search keyword = ((Obama & (type:name or type:person)) & (President) or (United States & President))

The format of the above search queries may vary somewhat depending on the target domain chosen for collection (the address space or service provider to be collected from). Collecting documents through query generation resembles the template-based query generation introduced by Madhavan et al. (2007, 2008). It differs in that query generation is based not on the data types of the collection target but on the pre-built knowledge information, that is, on the content of the data.
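As a rough illustration of how typed knowledge expressions might be turned into query strings of the shape shown in the example, the sketch below builds them with plain string formatting. The function names `term_query` and `combined_query` are hypothetical, and the exact query syntax accepted by any real target domain is an assumption.

```python
from typing import List


def term_query(word: str, types: List[str]) -> str:
    """Build '(word & (type:a or type:b))' from a keyword and its type tags."""
    type_clause = " or ".join(f"type:{t}" for t in types)
    return f"({word} & ({type_clause}))"


def combined_query(subject_query: str, role: str, country: str) -> str:
    """Combine a subject query with a role, or fall back to 'country & role'."""
    return f"({subject_query} & ({role}) or ({country} & {role}))"


# Knowledge expressions taken from the example in the text.
subject = term_query("Obama", ["name", "person"])
country = term_query("United States", ["name", "country"])
combined = combined_query(subject, "President", "United States")
```

Because the inputs are the typed expressions extracted from the knowledge information rather than the form fields of a particular site, the same builder can be reused across target domains, with only the final syntax adapted per domain.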

The document collection unit (200) collects and stores the result documents returned from each target domain for the queries generated by the query generation unit (100), and analyzes and evaluates the relevance of the documents collected from the target domain.

The results returned from each target domain for the queries generated by the query generation unit (100) are documents that have addresses on the network. Collecting documents therefore means collecting web documents in the typical sense.

Document collection generally means copying the documents obtained through the address space used for web collection, and storing the documents returned for the queries, in a designated repository. Each stored document includes the path by which it was acquired, the subject query (e.g., Obama), and metadata on the address space of the knowledge information used to generate that query (e.g., US President, first Black president).
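One way to picture a stored document together with its metadata is a small record type and an in-memory repository. The field names, the `Repository` class, and the choice to key documents by acquisition path are assumptions for illustration; the text does not prescribe a storage layout.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CollectedDocument:
    text: str                      # document body as fetched
    acquisition_path: str          # where the document was obtained
    subject_query: str             # subject of the query, e.g. "Obama"
    source_knowledge: List[str] = field(default_factory=list)  # knowledge used to build the query


class Repository:
    """Hypothetical in-memory store keyed by acquisition path."""

    def __init__(self) -> None:
        self._docs: Dict[str, CollectedDocument] = {}

    def save(self, doc: CollectedDocument) -> None:
        self._docs[doc.acquisition_path] = doc

    def by_subject(self, subject: str) -> List[CollectedDocument]:
        """Return all stored documents collected for the given subject query."""
        return [d for d in self._docs.values() if d.subject_query == subject]
```

Keeping the query and its source knowledge next to the text is what later allows a document to be judged against the knowledge that caused it to be collected.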

The document collection unit also evaluates the relevance of each acquired document, using the information stored together with the document.

Relevance evaluation filters out unwanted documents from those collected through the queries. For example, the many documents collected for the query "US President Obama" will include documents not directly related to Obama as US President; relevance evaluation assesses such documents and filters them out. In other words, it uses the information stored with the collected documents to exclude documents that are not needed, that is, unsuitable collected documents.

For example, for the knowledge information "US President Obama," the query generation unit generates queries by the method described above, and numerous documents are collected from the target domain for those queries. Among them may be a document that contains the knowledge information "US President Obama" but is mainly about Chinese science and technology; such a document is excluded. Although it mentions US President Obama, its subject is Chinese science and technology rather than Obama, so it is not a suitable collected document for the query "US President Obama."

The function of evaluating document content uses natural language processing technology, which is well known, so a detailed description is omitted.
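Since the natural language processing method is left unspecified, the following is only a deliberately crude stand-in to make the filtering idea concrete: it scores a document by the fraction of its sentences that mention a query term and drops documents below a threshold. The scoring rule, the `threshold` value, and the function names are all assumptions, not the evaluation method of the invention.

```python
import re
from typing import List


def relevance(text: str, terms: List[str]) -> float:
    """Fraction of sentences that mention at least one query term."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    hits = sum(any(t.lower() in s.lower() for t in terms) for s in sentences)
    return hits / len(sentences)


def filter_documents(docs: List[str], terms: List[str], threshold: float = 0.5) -> List[str]:
    """Keep only documents whose relevance score reaches the threshold."""
    return [d for d in docs if relevance(d, terms) >= threshold]


about_obama = "Obama took office in 2009. President Obama signed the bill. Obama spoke today."
about_china = ("China invests in science. New labs opened in Beijing. "
               "Obama was mentioned once. Funding grew again.")
kept = filter_documents([about_obama, about_china], ["Obama"])
```

Here the second document mentions Obama in only one of four sentences, so it is filtered out, mirroring the example of a document that is mainly about Chinese science and technology.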

The knowledge information extraction unit (300) extracts new knowledge information from the collected documents.

Knowledge information extraction uses the pre-built knowledge information and the query that were used to acquire the document, the metadata of the service from which the document was acquired, and the metadata contained in the document.

Existing ontologies, knowledge bases, and Semantic Web systems create new knowledge information instances from the knowledge contained in a document according to the structural framework and rules of each body of knowledge information.

In the present invention, by contrast, as shown in FIG. 3, new knowledge information is extracted not only on the basis of the structural framework and rules of the pre-built knowledge information used to acquire the document, but also on the basis of its meaning (the vocabulary, relations, and structure contained in the knowledge information and the data values used in it) and of the content related to the applied knowledge information, rather than the entire content of the document. That is, the present invention is characterized in that, in addition to the conventional structural framework and rules, it reflects the meaning and content of the knowledge information when extracting new knowledge information.

This extraction process is characterized in that it draws on two aspects of how humans acquire knowledge: humans do not turn all the data they encounter into knowledge, and the interpretation and the resulting data differ depending on which existing knowledge is applied when the data is interpreted.

The following describes how the knowledge information extraction unit of the present invention extracts new knowledge information based on the structure, meaning, and content of the pre-built knowledge information.

To extract new knowledge information, the knowledge information extraction unit (300) extracts from the collected documents items that are structurally similar to the pre-built knowledge information. For example, if the pre-built knowledge information is "President of the United States," its structure has the form "Y of X." A structurally similar item is "21st President of the United States," whose structure is "(ordinal number) Y of X." Accordingly, the knowledge information extraction unit (300) extracts knowledge information having the structure "(numeric expression) President of the United States."
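The structural matching described here can be illustrated with a small pattern matcher. This is only a sketch under stated assumptions: it works on English renderings of the examples, and the helper name `structural_variants` and the regular expression are hypothetical rather than taken from the patent.

```python
import re


def structural_variants(x: str, y: str, text: str) -> list[str]:
    """Return phrases shaped like 'Y of X' or '<ordinal> Y of X' (hypothetical helper)."""
    pattern = re.compile(
        r"(?:\d+(?:st|nd|rd|th)\s+)?"      # optional ordinal such as '21st'
        + re.escape(y) + r"\s+of\s+" + re.escape(x),
        re.IGNORECASE,
    )
    return [m.group(0) for m in pattern.finditer(text)]


sample = "Obama is the 21st President of the United States."
print(structural_variants("the United States", "President", sample))
# prints ['21st President of the United States']
```

A real extractor would presumably rely on the structure stored with the knowledge information instead of a hand-written pattern.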

The knowledge information extraction unit (300) also extracts from the collected documents items that are semantically similar to the pre-built knowledge information. For example, terms with a meaning similar to "president" include king, prime minister, and chairman. Accordingly, knowledge information expressed as "King of the United States," "Prime Minister of the United States," "Chairman of the United States," and the like is extracted. Likewise, using country names whose semantic classification is similar to that of the United States, knowledge information containing "President of Korea" or "President of France" is extracted.
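The semantic substitution in this paragraph can be sketched as follows. The similarity table is a hand-made assumption for illustration only; the patent does not specify where such relations come from, and `semantic_candidates` is a hypothetical name.

```python
# Hypothetical similarity table; a real system would derive this from its knowledge base.
SIMILAR_TERMS = {
    "President": ["King", "Prime Minister", "Chairman"],
    "the United States": ["Korea", "France"],
}


def semantic_candidates(x: str, y: str) -> list[str]:
    """Build 'Y of X' phrases by swapping in terms similar to Y, then terms similar to X."""
    candidates = [f"{alt} of {x}" for alt in SIMILAR_TERMS.get(y, [])]
    candidates += [f"{y} of {alt}" for alt in SIMILAR_TERMS.get(x, [])]
    return candidates


print(semantic_candidates("the United States", "President"))
```

The generated phrases would then serve as patterns to look for in the collected documents.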

In addition, the knowledge information extraction unit (300) extracts new knowledge information with respect to the content of the pre-built knowledge information. As one example, it extracts from the collected documents items that are opposite to the pre-built knowledge information and items that are complementary to it. For example, if the pre-built knowledge information is "the most expensive smartphone," it extracts "the cheapest smartphone" as an opposite item, and "the most premium smartphone" and "the highest-priced smartphone" as complementary items. As described above, the knowledge information extraction unit extracts new knowledge information from documents highly related to the pre-built knowledge information, taking into account the structure, meaning, and content of that pre-built knowledge information.
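A minimal sketch of the opposite/complementary lookup, using the smartphone example. The two tables and the function name `content_candidates` are assumptions made for illustration; the patent does not describe how these relations are stored.

```python
# Hypothetical relation tables keyed by the modifier of the pre-built item.
OPPOSITES = {"most expensive": ["cheapest"]}
COMPLEMENTS = {"most expensive": ["most premium", "highest-priced"]}


def content_candidates(modifier: str, noun: str) -> dict[str, list[str]]:
    """Return opposite and complementary phrasings for 'the <modifier> <noun>'."""
    return {
        "opposite": [f"the {m} {noun}" for m in OPPOSITES.get(modifier, [])],
        "complementary": [f"the {m} {noun}" for m in COMPLEMENTS.get(modifier, [])],
    }


print(content_candidates("most expensive", "smartphone"))
```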

Next, the knowledge information verification unit (400) verifies the pre-built knowledge information and the newly extracted knowledge information. Verification of knowledge information has long relied on review by domain experts or by experts in knowledge information. According to a survey by Hlomani Hlomani and Deborah Stacey (2014), the evaluation of knowledge information such as ontologies is carried out, methodologically, by gold standard-based evaluation, application or task-based evaluation, user-based evaluation, and data-driven evaluation.

The method proposed in the present invention is also based on collected documents and can therefore be regarded as data-driven. Hlomani Hlomani and Deborah Stacey (2014), citing I. Nonaka and R. Toyama (2005), raised problems with data-driven methods that stem from the collected data and its interpretation: because domain knowledge is dynamic, it is not always consistent and is sometimes not explicit.

The present invention resolves these problems by the two methods presented above (the methods used in the query generation unit and the knowledge information extraction unit). The first is the function of the query generation unit (100), which increases the relevance between the collected documents (that is, the data) and the target knowledge by using pre-built knowledge as the query. The second is the method by which the knowledge information extraction unit (300) interprets the data from the viewpoint of the target knowledge, using the pre-built knowledge information and the metadata of the target documents.

In addition, the knowledge information update unit (500), described later, compares and contrasts the pre-built knowledge information with the extracted knowledge information to add new knowledge and supplement deficient knowledge, thereby accommodating the dynamic nature of domain knowledge and completing incomplete knowledge. This process is performed repeatedly, and reliability is evaluated by a weight assigned each time verification is performed. That is, knowledge information with a low update rate retains its reliability, and knowledge information with a high update rate loses reliability.
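The weighting idea can be sketched as a running score. The patent does not give a formula, so the starting value, the step size, and the outcome labels below are all assumptions chosen only to show the direction of change: confirmations raise the score, updates lower it.

```python
def reliability(outcomes: list[str], start: float = 0.5, step: float = 0.1) -> float:
    """Adjust a reliability score once per verification round (illustrative values only)."""
    score = start
    for outcome in outcomes:
        if outcome == "confirmed":    # extracted item matched the pre-built item
            score = min(1.0, score + step)
        elif outcome == "updated":    # pre-built item had to be updated
            score = max(0.0, score - step)
    return round(score, 2)


print(reliability(["confirmed", "confirmed"]))
print(reliability(["updated", "updated", "updated"]))
```

Under these assumptions an item that is rarely updated keeps or gains reliability, while a frequently updated item drifts toward zero.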

Verification of knowledge information by the knowledge information verification unit (400) is divided into the following cases according to the target and type.

<When the pre-built knowledge information and the newly extracted knowledge information are the same>

Newly extracted knowledge information that completely matches the pre-built knowledge information in structure and content, or is contained in it, requires no expansion or verification, so the newly extracted knowledge information is marked for deletion. As a result, the reliability of the pre-built knowledge information increases.

For example, suppose the pre-built knowledge information is "Obama is the 21st President of the United States" and the newly extracted knowledge information is "Obama is the President of the United States" and "Obama is the twenty-first President of the United States." Of these, "Obama is the twenty-first President of the United States" matches the pre-built knowledge information, and "Obama is the President of the United States" falls within the scope of the pre-built knowledge information. Both newly extracted items therefore need not be added as new knowledge information and are deleted.

<When the pre-built knowledge information contains less information than the newly extracted knowledge information>

When the pre-built knowledge information is less complete than the extracted knowledge information (that is, when the newly extracted knowledge information has more data (nodes), relations, and attributes than the pre-built knowledge information), the pre-built knowledge information must be updated with the newly extracted knowledge information. The pre-built knowledge information is marked for update and becomes an update target. In this case the reliability of the pre-built knowledge information is lowered, and the pre-built knowledge information is updated by the knowledge information update unit described later.

For example, if the pre-built knowledge information is "Obama is the President of the United States" and the newly extracted knowledge information is "Obama is a Black President of the United States" or "Obama, the 21st President of the United States, is the first Black President of the United States," then the pre-built item "Obama is the President of the United States" lacks information on which number president he is and on whether he is Black or white. The pre-built item therefore needs to be updated in structure, meaning, and content, and is updated through the knowledge information update unit described later. That is, the pre-built item "Obama is the President of the United States" is updated into new knowledge information that reflects which number president he is and whether he is Black or white.

<When the pre-built knowledge information contains more information than the newly extracted knowledge information>

That the pre-built knowledge information contains more information than the extracted knowledge information means either that the collected documents did not hold sufficient knowledge or that the extracted knowledge information is incomplete. For example, if the pre-built information is "Obama is a President of the United States from the Democratic Party" and the extracted information is "Lincoln is the President of the United States," the extracted information can be seen to lack knowledge of Lincoln's party. In such a case the extracted knowledge information is incomplete, so it is marked as needing supplementation and its reliability is set to 0.
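The three verification cases can be sketched as a single comparison function. The slot-dictionary representation of knowledge information, the function name `verify`, and the fallback mark "keep" are assumptions for illustration; the patent describes only the delete, update, and supplement marks.

```python
def verify(existing: dict, extracted: dict) -> str:
    """Classify an extracted item against a pre-built item (illustrative sketch)."""
    # Same or contained: every slot of the extracted item agrees with the pre-built item.
    if all(existing.get(slot) == value for slot, value in extracted.items()):
        return "delete"
    # Extracted item carries strictly more slots: the pre-built item needs updating.
    if set(extracted) > set(existing):
        return "update"
    # Extracted item carries strictly fewer slots: it is incomplete.
    if set(extracted) < set(existing):
        return "supplement"
    return "keep"  # assumption: a case the patent does not discuss


obama = {"person": "Obama", "country": "United States", "office": "President"}
print(verify(obama, {**obama, "ordinal": 21}))
```

With this representation, the Lincoln example yields "supplement" because the extracted item has no party slot while the pre-built item does.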

As described above, the knowledge information verification unit (400) marks newly extracted knowledge information for deletion or supplementation, or marks pre-built knowledge information for update, and additionally passes the related information to the query generation unit.

Next, the knowledge information update unit (500) processes the knowledge information according to the deletion, update, and supplementation marks assigned in the verification result of the knowledge information verification unit (400).

Knowledge information marked for deletion has the same structure, meaning, and content as the pre-built knowledge information, and is therefore deleted or moved to a separate storage space.

Knowledge information marked for update has the required information attached (the nodes, relations, and so on to be added or modified, and their values, which may be absent), its status mark is changed to "supplement," and it is passed to the query generation unit. Knowledge information that is an update target is not removed from the list of knowledge information used for document collection until its reliability increases in the verification process of the knowledge information verification unit.

For example, if the pre-built knowledge information is "Obama is the President of the United States" and the newly extracted knowledge information is "Obama is the 21st President of the United States" or "Obama, the 21st President, is a Black President of the United States," then the pre-built item "Obama is the President of the United States" is the knowledge information to be updated, and the information required for the update is which number president he is and whether he is Black or white. The knowledge information update unit changes the mark on the pre-built item to "supplement" and passes the related information to be supplemented (information indicating which number president he is and whether he is Black or white) to the query generation unit (100). The query generation unit (100) then generates a query whose structure and content include which number president he is and whether he is Black or white, the document collection unit (200) uses the generated query to perform the document collection and analysis/evaluation described above, and the extraction and verification processes shown in FIG. 1 are repeated.

The case of knowledge information marked for supplementation is described next.

If the pre-built knowledge information is "Obama is a President of the United States from the Democratic Party" and the newly extracted information is "Lincoln is the President of the United States," the newly extracted knowledge information is incomplete and needs supplementation, and the knowledge information update unit passes the information needed for supplementation (information on Lincoln's party affiliation) to the query generation unit. The query generation unit then generates a new query that includes party affiliation (for example, a query with the structure person-party-country-office). After the query is generated, the collection, extraction, and verification processes are repeated according to FIG. 1.

The following describes, with reference to FIG. 2, a knowledge expansion and verification method using knowledge information and document collection, which is another embodiment of the present invention.

As shown in FIG. 2, the knowledge expansion and verification method using knowledge information and document collection, which is an embodiment of the present invention, is characterized by comprising:

A. converting pre-built knowledge information into a query for the document collection domain;

B. collecting documents from the target domain with the converted query, and analyzing and evaluating the collected documents;

C. extracting new knowledge information;

D. comparing and verifying the extracted knowledge information against the pre-built knowledge information;

E. deleting, updating, or supplementing the pre-built knowledge information and the newly extracted knowledge information; and

F. after the deletion, update, or supplementation in step E, repeating steps A to E if there is still knowledge information to be deleted, updated, or supplemented, and terminating otherwise.

Step A: The step of converting pre-built knowledge information into a query for document collection is described.

The pre-built knowledge information "Obama is the President of the United States" is converted into a query for document collection. In this case the query may take the following form.

search keyword = ((Obama & (type:name or type:person)) & (President) or (United States & President))
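As a rough illustration, a query of the form shown above could be assembled from the slots of the pre-built knowledge information. The function name `build_query` and its three parameters are hypothetical; the operator syntax simply copies the example and is not a documented query language.

```python
def build_query(person: str, office: str, country: str) -> str:
    """Assemble a query string in the same shape as the example (illustrative only)."""
    return (
        f"(({person} & (type:name or type:person)) & ({office}) "
        f"or ({country} & {office}))"
    )


print(build_query("Obama", "President", "United States"))
```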

Step B: The step of collecting documents from the target domain with the converted query and analyzing and evaluating the collected documents is described.

Using the converted query, documents are collected from the target domain (any portal, service site, or the like from which documents can be collected). Collected documents whose content is not appropriate are excluded from the collection through analysis and evaluation, with known natural language processing techniques used to filter them out.

For example, suppose one of the collected documents contains the word "Obama" but concerns Chinese science and technology, a subject far removed from "Obama is the President of the United States." Such a document is excluded from the collection.

Known natural language processing techniques are used to exclude the document on Chinese science and technology from the collection.
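The patent leaves the filtering technique unspecified. As a stand-in, the sketch below uses a simple term-frequency ratio: a document is kept only if the query terms make up a large enough share of its words. The heuristic, the threshold, and the name `is_relevant` are all assumptions and not the patent's method.

```python
import re


def is_relevant(document: str, query_terms: set[str], min_ratio: float = 0.1) -> bool:
    """Keep a document only if query terms are a large enough share of its words."""
    words = re.findall(r"\w+", document.lower())
    if not words:
        return False
    hits = sum(1 for word in words if word in query_terms)
    return hits / len(words) >= min_ratio


terms = {"obama", "president"}
on_topic = "Obama is the president of the United States. President Obama took office."
off_topic = (
    "China's science and technology policy expanded research funding for "
    "universities and laboratories, a trend Obama once mentioned in passing "
    "during a long speech about trade."
)
print(is_relevant(on_topic, terms), is_relevant(off_topic, terms))
```

A production system would more plausibly use topic classification or similar, but the intent is the same: a passing mention of the query terms does not make a document relevant.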

Step C: The step of extracting new knowledge information is described.

When new knowledge information is extracted from the collected documents that remain after inappropriate documents are excluded, the meaning and related content of the pre-built knowledge information are used as extraction factors in addition to its structure.

Knowledge information structurally similar to "Obama is the President of the United States" may include "Lincoln is the first President of the United States" and "Obama is a Black President of the United States."

Next, knowledge information semantically similar to "Obama is the President of the United States" may include "Obama is the King of the United States" and "Obama is the Chairman of the United States."

Next, knowledge information whose content is opposite or complementary to "Obama is the President of the United States" may include "Obama is a popular President of the United States," "Obama is the President of Africa," and "Obama is a white President of the United States."

Step D: The step of comparing and verifying the extracted knowledge information against the pre-built knowledge information is described.

Newly extracted knowledge information that completely matches the pre-built knowledge information, or is contained in it, requires no expansion or verification, so the newly extracted knowledge information is marked for deletion. As a result, the reliability of the pre-built knowledge information increases.

For example, if the pre-built knowledge information is "Obama is the 21st President of the United States" and the newly extracted knowledge information is "Obama is the President of the United States" or "Obama is the twenty-first President of the United States," the newly extracted knowledge information matches or is contained in the pre-built knowledge information in structure (subject-country-ordinal-office), meaning, and content. The newly extracted knowledge information cannot extend the pre-built knowledge information "Obama is the 21st President of the United States," so it is marked for deletion.

When the pre-built knowledge information is less complete than the extracted knowledge information (that is, when the newly extracted knowledge information has more data (nodes), relations, and attributes than the pre-built knowledge information), the pre-built knowledge information must be updated with the newly extracted knowledge information. The pre-built knowledge information is marked for update and becomes an update target. In this case the reliability of the pre-built knowledge information is lowered.

For example, if the pre-built knowledge information is "Obama is the President of the United States" and the newly extracted knowledge information is "Obama is the 21st President of the United States" or "Obama, the 21st President, is a Black President of the United States," then the pre-built item "Obama is the President of the United States" lacks information on which number president he is and on whether he is Black or white. The pre-built item therefore needs to be updated in structure, meaning, and content, and it is marked for update as knowledge information that is to reflect which number president he is and whether he is Black or white.

That the pre-built knowledge information contains more information than the extracted knowledge information means either that the collected documents did not hold sufficient knowledge or that the extracted knowledge information is incomplete. For example, if the pre-built information is "Obama is a President of the United States from the Democratic Party" and the extracted information is "Lincoln is the President of the United States," the extracted information can be seen to lack knowledge of Lincoln's party. In such a case the extracted knowledge information is incomplete, so it is marked as needing supplementation and its reliability is set to 0.

As described above, the step of verifying the extracted knowledge information against the pre-built knowledge information marks newly extracted knowledge information for deletion or supplementation, or marks pre-built knowledge information for update.

Steps E and F: The following describes the step of deleting, updating, or supplementing the pre-built knowledge information and the newly extracted knowledge information, and the step of repeating steps A to E if, after that processing, there is still knowledge information to be deleted, updated, or supplemented, and terminating otherwise.

Knowledge information marked for deletion in step D has the same structure, meaning, and content as the pre-built knowledge information, and is therefore deleted or moved to a separate storage space.

Knowledge information marked for update in step D has the required information attached (the nodes, relations, and so on to be added or modified, and their values, which may be absent), its status mark is changed to "supplement," and steps A to E are repeated.

For example, if the pre-built knowledge information is "Obama is the President of the United States" and the newly extracted knowledge information is "Obama is the 21st President of the United States" or "Obama, the 21st President, is the first Black President of the United States," then the pre-built item "Obama is the President of the United States" is the update target, and the information required for the update is which number president he is and whether he is Black or white. The mark on the pre-built item is changed to "supplement," the related information to be supplemented (information indicating which number president he is and whether he is Black or white) is reflected, and steps A to E are repeated.

That is, for the pre-built knowledge information "Obama is the President of the United States," which is the update target, step A generates a new query that includes which number president he is and whether he is Black or white, and the collection, extraction, and verification steps are then repeated.

The case of knowledge information marked for supplementation in step D is described next.

If the pre-built knowledge information is "Obama is a President of the United States from the Democratic Party" and the newly extracted information is "Lincoln is the President of the United States," the newly extracted knowledge information is incomplete and needs supplementation, and the method returns to step A together with the information needed for supplementation (information on Lincoln's party affiliation). In step A a new query that includes party affiliation is generated (for example, a query with the structure person-party-country-office), and the collection, extraction, and verification processes are then repeated.

When no knowledge information remains to be deleted, updated, or supplemented, the knowledge expansion and verification method using knowledge information and document collection terminates.
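The overall loop of steps A to F can be sketched as below. Everything here is an assumption for illustration: knowledge information is modelled as a slot dictionary, `collect` and `extract` are caller-supplied stand-ins for steps A to C, only the update mark is acted on, and the `max_rounds` guard is added so the sketch always terminates.

```python
def verify(existing: dict, extracted: dict) -> str:
    """Step D: classify an extracted item against the pre-built item (sketch)."""
    if all(existing.get(slot) == value for slot, value in extracted.items()):
        return "delete"
    if set(extracted) > set(existing):
        return "update"
    if set(extracted) < set(existing):
        return "supplement"
    return "keep"  # assumption: case not covered by the patent


def expand(item: dict, collect, extract, max_rounds: int = 5) -> dict:
    """Repeat collection, extraction, and verification until nothing is marked for update."""
    for _ in range(max_rounds):
        changed = False
        documents = collect(item)                      # steps A and B
        for candidate in extract(documents, item):     # step C
            if verify(item, candidate) == "update":    # step D
                item = {**item, **candidate}           # step E: merge the new slots
                changed = True
        if not changed:                                # step F: nothing left to process
            break
    return item


seed = {"person": "Obama", "country": "United States", "office": "President"}
found = {**seed, "ordinal": 21}
print(expand(seed, lambda item: ["doc"], lambda docs, item: [found]))
```

In the usage line, the first round merges the ordinal slot into the seed; the second round finds only a contained item, so the loop stops.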

The knowledge expansion and verification method using knowledge information and document collection described above can be implemented as a computer program so that it is performed automatically, and the program is provided in the form of a computer-readable recording medium.

100: Query generation unit
200: Document collection unit
300: Knowledge information extraction unit
400: Knowledge information verification unit
500: Knowledge information update unit

Claims

A knowledge expansion and verification system using structured knowledge information and document collection, for expanding and verifying pre-built knowledge information, comprising:
a query generation unit (100) that converts the pre-built knowledge information into a query for a document search capable of collecting related documents for updating and verifying knowledge;
a document collection unit (200) that collects and stores the result documents returned from the target domain for the query generated by the query generation unit (100), and analyzes and evaluates the appropriateness of the documents collected from the target domain;
a knowledge information extraction unit (300) that extracts new knowledge information from the collected target documents based on the structural framework and rules of the pre-built knowledge information, the meaning of the pre-built knowledge information (the vocabulary, relations, and structure contained in the knowledge information and the values of the data used), and content related to the pre-built knowledge information (complementary or opposite content);
a knowledge information verification unit (400) that compares the extracted knowledge information with the pre-built knowledge information and verifies the need to delete the extracted knowledge information when the pre-built knowledge information is the same as the extracted knowledge information, the need to update the pre-built knowledge information when the pre-built knowledge information contains less information than the extracted knowledge information, and the need to supplement the extracted knowledge information when the pre-built knowledge information contains more information than the extracted knowledge information; and
a knowledge information update unit (500) that deletes the extracted knowledge information, updates the pre-built knowledge information, or supplements the extracted knowledge information according to the verification result of the knowledge information verification unit (400),
wherein the knowledge information extraction unit (300) extracts new knowledge information using at least one of: extracting from the collected documents items structurally similar to the pre-built knowledge information, extracting from the collected documents items semantically similar to the pre-built knowledge information, extracting from the collected documents items opposite to the pre-built knowledge information, and extracting from the collected documents items complementary to the pre-built knowledge information.

A. converting the pre-built knowledge information into a query for the domain from which documents are to be collected;
B. collecting documents from the target domain using the converted query, and analyzing and evaluating the collected documents;
C. extracting new knowledge information from the collected documents with reference to the structural framework of the pre-built knowledge information, the meaning of the knowledge information (the vocabulary, relations, structure, and data values contained in the knowledge information), and related content;
D. comparing the extracted knowledge information with the pre-built knowledge information to verify it;
E. deleting, updating, or supplementing the pre-built knowledge information and the newly extracted knowledge information;
F. repeating steps A to E if, after the deletion, update, or supplementation in step E, knowledge information remains to be deleted, updated, or supplemented,
wherein, in this knowledge extension and verification method using knowledge information and document collection, step C includes (C-1) extracting items structurally similar to the pre-built knowledge information from the collected documents, (C-2) extracting items semantically similar to the pre-built knowledge information from the collected documents, (C-3) extracting items that conflict with the pre-built knowledge information from the collected documents, and (C-4) extracting items complementary to the pre-built knowledge information from the collected documents.
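
As a rough sketch of how steps A to F fit together, the loop below repeats query generation, collection, and extraction until a round yields nothing new. It is a simplification under stated assumptions: knowledge items are hashable tuples, verification (step D) is reduced to discarding candidates that are already known, and the `max_rounds` cap is an added safeguard that the claims do not mention. The function name, the callables passed in, and the toy corpus are all hypothetical.

```python
def expand_knowledge(knowledge, to_queries, collect, extract, max_rounds=10):
    """Iteratively expand a set of knowledge items (sketch of steps A to F)."""
    knowledge = set(knowledge)
    for _ in range(max_rounds):                 # assumed safety cap
        queries = to_queries(knowledge)         # step A: knowledge -> queries
        documents = collect(queries)            # step B: collect documents
        candidates = set(extract(documents))    # step C: extract candidates
        new_items = candidates - knowledge      # step D (simplified): drop identical items
        if not new_items:                       # step F: stop when nothing is left to process
            break
        knowledge |= new_items                  # step E (simplified): add the new items
    return knowledge


# Toy corpus: each query term maps to one "document", here a list of facts.
corpus = {
    "a": [("a", "links_to", "b")],
    "b": [("b", "links_to", "c")],
    "c": [("c", "links_to", "a")],
}


def to_queries(kb):
    # Use every subject and object in the knowledge base as a query term.
    return {term for s, _, o in kb for term in (s, o)}


def collect(queries):
    return [corpus[q] for q in queries if q in corpus]


def extract(documents):
    return {fact for doc in documents for fact in doc}


result = expand_knowledge({("a", "links_to", "b")}, to_queries, collect, extract)
```

Starting from a single fact, each round uses the newly learned terms as queries, so the toy run reaches all three facts in the corpus and then stops on the first round that adds nothing.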

(Deleted)

The method of claim 2,
wherein step D includes: (D-1) deleting the extracted knowledge information when the pre-built knowledge information is identical to the extracted knowledge information; (D-2) updating and displaying the pre-built knowledge information when the pre-built knowledge information contains less information than the extracted knowledge information; and (D-3) supplementing and displaying the extracted knowledge information when the pre-built knowledge information contains more information than the extracted knowledge information.
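
The three outcomes of step D can be sketched as a small decision function. This is a minimal sketch, assuming each item of knowledge information is a flat dictionary of attribute-value pairs with hashable values, and that "contains less information" means "is a strict subset of the other item's pairs". The `Verdict` enum, the `verify` function, and the `UNRESOLVED` fallback for items that merely differ are hypothetical additions; the claim does not say what happens in that last case.

```python
from enum import Enum


class Verdict(Enum):
    DELETE_EXTRACTED = "D-1"
    UPDATE_EXISTING = "D-2"
    SUPPLEMENT_EXTRACTED = "D-3"
    UNRESOLVED = "unresolved"   # assumed fallback, not part of the claim


def verify(existing, extracted):
    """Compare a pre-built item with an extracted item and pick an action."""
    existing_pairs = set(existing.items())
    extracted_pairs = set(extracted.items())
    if existing_pairs == extracted_pairs:
        return Verdict.DELETE_EXTRACTED        # identical: the extracted copy is redundant
    if existing_pairs < extracted_pairs:
        return Verdict.UPDATE_EXISTING         # pre-built item holds less information
    if existing_pairs > extracted_pairs:
        return Verdict.SUPPLEMENT_EXTRACTED    # pre-built item holds more information
    return Verdict.UNRESOLVED                  # overlapping or contradictory items


same = verify({"name": "gear"}, {"name": "gear"})
richer_extract = verify({"name": "gear"}, {"name": "gear", "material": "steel"})
richer_existing = verify({"name": "gear", "material": "steel"}, {"name": "gear"})
clash = verify({"name": "gear", "material": "steel"},
               {"name": "gear", "material": "brass"})
```

Using strict subset comparison keeps the three claimed cases mutually exclusive; anything else, such as two items that disagree on a value, falls through to the assumed `UNRESOLVED` verdict for separate handling.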

A computer-readable recording medium having recorded thereon a program for executing a knowledge extension and verification method using knowledge information and document collection according to claim 2 or claim 4.