KR20100068929A

KR20100068929A - System and method for annotating semantic tags to electronic documents

Info

Publication number: KR20100068929A
Application number: KR1020080127447A
Authority: KR
Inventors: 최기선; 남세진; 박정원; 문일철; 안진현
Original assignee: 한국과학기술원
Priority date: 2008-12-15
Filing date: 2008-12-15
Publication date: 2010-06-24
Also published as: KR101069207B1

Abstract

PURPOSE: A system and a method for annotating electronic documents with semantic tags are provided that partially automate the annotation of natural language sentences in order to reduce annotation time. CONSTITUTION: An electronic document collecting device (5) collects, from the web, electronic documents that match a user's request. An electronic document set analysis device (6) selects an electronic document. A sentence extractor (7) extracts natural language sentences from the selected electronic document. A simple sentence divider (8) splits the natural language sentences into simple sentences. A triple extractor (9) extracts triples from the simple sentences. A triple matcher (11) matches each part of a triple with an existing machine-readable knowledge structure.

Description

System and method for attaching semantic information to electronic documents {System and method for annotating semantic tags to electronic documents}

The present invention relates to a system and method for attaching semantic information to electronic documents and, more particularly, to a system and method that convert existing natural language data into a machine-readable form by semi-automatically adding semantic information to it.

Technology for delivering information over the Internet is now so advanced that anyone can obtain information easily, anytime and anywhere. In the current flood of information, many researchers are developing and refining technologies for delivering meaningful information, each for their own purposes. Among these conventional technologies, a system has been proposed that adds semantic information to each word in an HTML (HyperText Markup Language) document (see J. Heflin, J. Hendler, and S. Luke. SHOE: A knowledge representation language for internet applications. Technical Report CS-TR-4078, UMIACS TR-99-7, Dept. of Computer Science, University of Maryland at College Park, 1999). A GUI (Graphical User Interface) based annotation tool has also been proposed that adds semantic information from a given machine-readable knowledge structure to an HTML page through a simple mouse drag-and-drop interface (see CREAM: S. Handschuh and S. Staab. Authoring and annotation of web pages in CREAM. In Proceedings of the 11th International Conference on World Wide Web, pages 462-473, New York, NY, USA, 2002. ACM). In addition, Annotizer, a system that lets several people annotate a single HTML page simultaneously over the web, has been proposed (see User-friendly WWW annotation system for collaboration in research and education environments. In The IASTED International Conference on Web Technologies, Applications and Services, WTAS 2006).

The conventional technologies described above can be time-consuming because the user must read and annotate the natural language sentences of a given electronic document one by one. To overcome this drawback, the system and method according to the present invention partially automate the annotation of natural language sentences, which reduces the number of sentences the user must inspect individually and thus shortens annotation time.

In addition, because the conventional technologies require the user to annotate each natural language sentence manually, the user's subjective judgment enters the process and the resulting annotations may be inconsistent. The system and method according to the present invention process part of the annotation automatically, which reduces the user's subjective intervention and yields relatively consistent annotation output.

In addition, the conventional technologies provide no means of recording the user's subjective judgments, so it is difficult to improve the tool itself or to write guidelines for handing annotation work over to another user. The system and method according to the present invention have the user record, classified by type, every subjective judgment made during annotation. These records can be used both to improve the internal workflow of the apparatus and as reference material for writing the guidelines that other users need to perform the same work consistently.

In addition, the conventional technologies provide no means of comparing an electronic document with its annotation output, so unnecessary annotation may be performed. The system and method according to the present invention map the annotation output onto the semantic information described by an existing machine-readable knowledge structure. This reveals which natural language sentences have already been annotated and recorded as semantic information in the knowledge structure and therefore need not be annotated again, which shortens annotation time.

To solve the conventional problems described above, an apparatus for attaching semantic information to an electronic document according to the present invention comprises:

an electronic document collection device that collects from the web electronic documents matching the user's information needs; an electronic document set analysis device that analyzes the electronic documents in the collected set, or in a set provided by the user, and selects electronic documents; a sentence extraction device that extracts natural language sentences from the selected electronic document; a sentence simplification device that splits each sentence into simple sentences; a triple extraction device that extracts triples from the simple sentences; a triple mapping device that maps each part of a triple to an existing machine-readable knowledge structure; a machine-readable knowledge structure document output device that outputs a machine-readable knowledge structure as an electronic document from the triples and the mapping information; an error log input device that lets the user record the judgments he or she has made; and an error log analysis device that analyzes the user's records and presents them as statistics.

The electronic document may be any machine-processable electronic document that contains natural language sentences, such as an HTML document, a Wikipedia article, a PDF (Portable Document Format) document, a Microsoft Word document, a Hangul word processor document, or an electronic document produced with an optical reader.

The natural language may be any language that can be recorded in an electronic document, such as English, Korean, Japanese, Chinese, German, or French.

The electronic document set may be any set of electronic documents, such as a set arbitrarily selected by the user, a set of Wikipedia articles, or a set of HTML documents collected from the web.

The electronic document set may also include link information between documents.

The language used to describe the machine-readable knowledge structure may be any language whose meaning a machine can interpret because an interpreter for it exists, such as OWL (Web Ontology Language) or KIF (Knowledge Interchange Format).

The triple may be an RDF (Resource Description Framework) triple or any other form that can be mapped to a machine-readable knowledge structure language so as to extend an existing machine-readable knowledge structure or create a new one.

The apparatus for attaching semantic information to an electronic document may provide a web interface and be used through a web browser, or it may be built and used as a standalone application.

The apparatus may store its data only on the user's computer, for use on that computer alone, or on a server, for use by multiple people over a network.

The machine-readable knowledge structure document output device may display its output on screen as a graph, or write it to a file in a machine-readable language such as XML (Extensible Markup Language).

The error log input device presents a list of error log types to the user, who selects one of them and may enter an additional comment.

The error log analysis device provides statistics on the error logs entered in this way, thereby quantifying the user's subjective judgments so that they can be used to improve the other devices.
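The error log input and analysis described above can be illustrated with a minimal sketch. The error type names, the `ErrorLogEntry` fields, and the `record` and `summarize` helpers are all hypothetical; the source states only that the user picks a type, may add a comment, and that statistics are produced.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical error types; the source does not enumerate them.
ERROR_TYPES = ("wrong_sentence_split", "wrong_triple", "wrong_mapping", "other")


@dataclass
class ErrorLogEntry:
    stage: str         # device in which the judgment was made
    error_type: str    # one of ERROR_TYPES
    comment: str = ""  # optional free-text remark from the user


def record(log, stage, error_type, comment=""):
    """Append one user judgment to the log after validating its type."""
    if error_type not in ERROR_TYPES:
        raise ValueError(f"unknown error type: {error_type}")
    log.append(ErrorLogEntry(stage, error_type, comment))


def summarize(log):
    """Count log entries per (stage, error type) pair."""
    return Counter((entry.stage, entry.error_type) for entry in log)


log = []
record(log, "triple_extraction", "wrong_triple", "predicate missing")
record(log, "triple_extraction", "wrong_triple")
record(log, "triple_mapping", "wrong_mapping", "mapped to wrong class")
stats = summarize(log)
```

Counting per stage and type is one simple way to show which device most often needs manual correction.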

The error log input device and the error log analysis device are linked to every device that involves manual work by the user (the electronic document collection device, electronic document set analysis device, sentence extraction device, sentence simplification device, triple extraction device, triple mapping device, and machine-readable knowledge structure document output device), so that the user can enter and analyze error logs at every step.

To solve the conventional problems described above, a method for attaching semantic information to an electronic document according to the present invention comprises:

a first step of analyzing a given set of electronic documents and selecting the electronic documents that overlap most with a previously generated machine-readable knowledge structure document; a second step of selecting one of the sentences in the selected electronic document; a third step of splitting the selected sentence into simple sentences using the sentence simplification device; a fourth step of extracting triples from the simple sentences using the triple extraction device; a fifth step of mapping each part of the extracted triples to an existing machine-readable knowledge structure using the triple mapping device; and a sixth step of outputting a machine-readable knowledge structure document from the extracted triples and the mapping information using the machine-readable knowledge structure document output device.

In the electronic document set analysis device, the degree of overlap between a previously generated machine-readable knowledge structure document and a given electronic document is proportional to the frequency with which content related to the words in the knowledge structure appears in the electronic document. Alternatively, the degree of overlap may be defined by first clustering the given electronic documents into groups of similar documents and then measuring how much content related to each group's representative words appears in the knowledge structure.
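The first definition of overlap can be sketched as a simple frequency count. This is an illustration under strong assumptions: "content related to the words" is reduced to exact word matches, and the function names, sample documents, and term set are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def overlap_score(document, knowledge_terms):
    """Count occurrences in the document of words known to the knowledge structure."""
    counts = Counter(tokenize(document))
    return sum(counts[term] for term in knowledge_terms)


def rank_documents(documents, knowledge_terms):
    """Return document names ordered by descending overlap score."""
    scored = [(overlap_score(text, knowledge_terms), name)
              for name, text in documents.items()]
    return [name for _, name in sorted(scored, key=lambda pair: (-pair[0], pair[1]))]


knowledge_terms = {"professor", "lecture", "student"}
documents = {
    "a.html": "The professor gives a lecture. Every student attends the lecture.",
    "b.html": "The weather was fine and the student went home.",
}
ranking = rank_documents(documents, knowledge_terms)
```

A fuller version could expand each term with synonyms or hypernyms from an electronic dictionary before counting.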

The third step includes a process of automatically converting the sentence into a form suitable for the triple extraction device, and a process in which the user manually rewrites it as simple sentences through an interface that supports semi-automatic conversion.

To help the user rewrite a sentence as simple sentences manually, the third step includes presenting to the user, as text or graphics, any information that a natural language analysis device can produce, such as clause and phrase boundaries (noun clauses/phrases, adjective clauses/phrases, and so on), the words that form the minimal units of the knowledge structure, and the dependency relations between words. The natural language analysis device includes devices that perform pattern-based sentence analysis, morphological analysis, analysis by constructing a dependency grammar tree, and the like.

The third step also includes presenting to the user, as text, information that helps resolve ambiguity in the natural language sentence, such as candidate antecedents for pronouns, or existing sentences that describe related content together with the documents containing them.

The third step also includes providing information useful to the user in carrying out this step by connecting, with lines or other graphical shapes, the sentence being processed and its words to the documents, sentences, and words that contain related information, and labeling each connection with text that states the nature of the relation.

In the fourth step, triples may be extracted either by using patterns or by using a dependency grammar tree.

The pattern-based method of the fourth step uses a table of pairs, each consisting of a natural language sentence pattern and its corresponding triple. It finds the pattern that matches the sentence being processed and extracts the corresponding triple.
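A minimal sketch of such a pattern table follows, assuming the patterns are regular expressions and each row names the predicate to emit. The two rows, the lowercase normalization, and the `extract_triple` helper are hypothetical; the source does not specify the pattern format.

```python
import re

# Hypothetical pattern table: (sentence pattern, predicate of the resulting triple).
PATTERN_TABLE = [
    (re.compile(r"^(?:(?:an?|the)\s+)?(\w+) is an? (\w+)\.$", re.IGNORECASE), "is_a"),
    (re.compile(r"^(\w+) teaches (\w+)\.$", re.IGNORECASE), "teach"),
]


def extract_triple(sentence):
    """Return (subject, predicate, object) for the first matching pattern, else None."""
    for pattern, predicate in PATTERN_TABLE:
        match = pattern.match(sentence.strip())
        if match:
            return (match.group(1).lower(), predicate, match.group(2).lower())
    return None


triple = extract_triple("Dog is an animal.")
```

Sentences that match no row return `None`, which is where the user or the dependency-tree method would take over.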

The dependency-tree method of the fourth step obtains a dependency grammar tree with a syntactic parser and then extracts triples by applying a set of rules to the tree.
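One such rule can be sketched as follows. The parse is written by hand here rather than produced by a parser, and the `Token` structure, the `nsubj`/`obj` labels, and the single rule (root word as predicate, its subject and object dependents as the other two parts) are assumptions chosen for illustration.

```python
from typing import NamedTuple


class Token(NamedTuple):
    text: str
    head: int      # index of the governing token, -1 for the root
    relation: str  # dependency label such as "nsubj" or "obj"


def triple_from_dependencies(tokens):
    """Apply one rule: root is the predicate, its nsubj and obj are subject and object."""
    root = next((i for i, tok in enumerate(tokens) if tok.head == -1), None)
    if root is None:
        return None
    subject = next((t.text for t in tokens if t.head == root and t.relation == "nsubj"), None)
    obj = next((t.text for t in tokens if t.head == root and t.relation == "obj"), None)
    if subject is None or obj is None:
        return None
    return (subject, tokens[root].text, obj)


# Hand-written parse of "student attends lecture".
parse = [
    Token("student", 1, "nsubj"),
    Token("attends", -1, "root"),
    Token("lecture", 1, "obj"),
]
triple = triple_from_dependencies(parse)
```

In practice the token list would come from whichever syntactic parser the system uses, and more rules would be needed for copular or passive sentences.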

The fifth step includes mapping the subject and object of a triple to classes of the existing machine-readable knowledge structure, and mapping the predicate of the triple to a predicate of the existing machine-readable knowledge structure. It also includes finding hypernyms of the subject or object with a machine-readable electronic dictionary such as WordNet or a web search engine such as Google, using that information to compute the similarity between the subject or object and the classes of the existing knowledge structure, taking the structure surrounding each class into account, and presenting the candidates to the user in ranked order. Candidates for the predicate are ranked and presented to the user in the same way.
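The ranking of candidate classes can be sketched with a toy hypernym dictionary standing in for WordNet or a search engine. The dictionary contents, the scoring formula, and the `rank_classes` helper are hypothetical, and this sketch ignores the structure surrounding each class, which the description says should also be considered.

```python
# Hypothetical hypernym chains, nearest hypernym first.
HYPERNYMS = {
    "prof. kim": ["professor", "person"],
    "database": ["lecture", "course"],
}


def rank_classes(term, classes, hypernyms=HYPERNYMS):
    """Rank knowledge-structure classes by closeness to the term's hypernym chain."""
    term = term.lower()
    chain = hypernyms.get(term, [])

    def score(cls):
        if cls == term:
            return 1.0
        if cls in chain:
            # Nearer hypernyms score higher: 1/2, 1/3, ...
            return 1.0 / (2 + chain.index(cls))
        return 0.0

    scored = [(score(cls), cls) for cls in classes]
    ordered = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [cls for value, cls in ordered if value > 0]


classes = ["person", "professor", "student"]
candidates = rank_classes("Prof. Kim", classes)
```

The ranked list would be shown to the user, who makes the final choice.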

With the apparatus for attaching semantic information to an electronic document according to the present invention, and the method of using it to generate a machine-readable knowledge structure document from a set of electronic documents, natural language sentences can easily be converted into a machine-readable form. Computer programs can therefore read the many existing web documents written in natural language, which makes knowledge-based web applications easier to implement.

The system and method according to the present invention also partially automate the annotation of natural language sentences, which reduces the number of sentences the user must inspect individually and thus shortens annotation time.

Partial automation of the annotation process also reduces the user's subjective intervention, so relatively consistent annotation output can be obtained.

Furthermore, the system and method according to the present invention have the user record, classified by type, every subjective judgment made during annotation. These records are used both to improve the internal workflow of the apparatus and as material for writing the guidelines that other users need to perform the same work consistently.

Finally, the system and method according to the present invention map the annotation output onto the semantic information described by an existing machine-readable knowledge structure. This reveals which natural language sentences have already been annotated and recorded as semantic information in the knowledge structure and therefore need not be annotated again, which shortens annotation time.

Hereinafter, an embodiment of the present invention is described in detail with reference to the accompanying drawings. Note that identical components or parts are denoted by the same reference numerals throughout the drawings wherever possible. Detailed descriptions of related well-known functions or configurations are omitted so as not to obscure the gist of the present invention.

FIG. 1 is a block diagram conceptually illustrating an apparatus for attaching semantic information to an electronic document according to the present invention.

As shown in FIG. 1, the apparatus for attaching semantic information to an electronic document according to the present invention broadly comprises an electronic document collection device (5), an electronic document set analysis device (6), a sentence extraction device (7), a sentence simplification device (8), a triple extraction device (9), a triple mapping device (11), and a machine-readable knowledge structure output device (12).

Electronic documents are input to the electronic document collection device (5) in several ways. When input comes from the web (1), web documents are collected through a web search engine such as Google; the user may be asked to enter search keywords. Alternatively, the machine-readable knowledge structure database (14) may be analyzed to find content that is lacking, keywords related to that content generated automatically with a machine-readable electronic dictionary such as WordNet, and the corresponding documents collected through the web search engine. Input from the database (2) covers the case where the user has collected the target documents in advance and stored them in the database (2). Input from the electronic document input device (3) covers the case where the user writes an electronic document directly using an input device such as a desktop computer keyboard, a tablet input device, or the input device of a mobile device. Input from the optical reader (4) covers the case where a document that does not yet exist in electronic form is scanned with the optical reader (4) to produce an electronic document.

The electronic document set analysis device (6) compares the input or collected set of electronic documents with the existing machine-readable knowledge structure database (14) and marks documents or sentences whose content is already represented in the knowledge structure. It may analyze how often the words of a given electronic document appear in the database (14), or it may cluster the given electronic documents into groups with similar content and analyze how often each group's representative words appear in the database (14). When the electronic documents are web documents, link information between them can be used in the analysis. If an electronic dictionary covering the words of the documents exists, the hypernyms, hyponyms, synonyms, and similar information it provides can be used to judge the similarity between words. Combining this information makes it possible to distinguish sentences that state content already described in the database (14) from sentences that state new content. The device counts the sentences of each kind, ranks the electronic documents by each measure, and shows the rankings to the user. The user selects one of the electronic documents and proceeds to the next step.

The sentence extraction device (7) extracts sentences from the electronic document selected by the user. Because an electronic document may contain not only text but also other data such as images and music, only the natural language sentences targeted by the system need to be extracted. Sentences may be split at sentence-final punctuation (. ! ?) or by using a machine learning technique.
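The punctuation-based splitting can be sketched with a regular expression. The `split_sentences` name is hypothetical, and the rule is deliberately naive: it would wrongly split after an abbreviation such as "Prof.", which is one reason a learned splitter may be preferred.

```python
import re


def split_sentences(text):
    """Split text after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


sentences = split_sentences("Dog is an animal. Is it? Yes!")
```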

The sentence simplification device (8) splits a sentence into simple sentences and replaces pronouns with the words they refer to. A complex sentence is hard for the natural language sentence parsing device (10) to analyze, and patterns for it are hard to define in the pattern-triple mapping table (13), so it needs to be split into simple sentences. Possible rules include splitting a sentence at clause or phrase boundaries, or turning the words enclosed in special symbols into a separate sentence. The sentence simplification device (8) presents this information to the user, who refers to it when splitting the sentence into simple sentences.
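The rule about words enclosed in special symbols can be sketched for the case of parentheses. The `split_parentheticals` helper and the sample sentence are hypothetical; the extracted fragment is only a suggestion that the user would still turn into a complete simple sentence.

```python
import re


def split_parentheticals(sentence):
    """Return the sentence without parenthesized text, followed by each parenthesized fragment."""
    fragments = re.findall(r"\(([^()]*)\)", sentence)
    main = re.sub(r"\s*\([^()]*\)", "", sentence).strip()
    return [main] + [frag.strip() for frag in fragments if frag.strip()]


parts = split_parentheticals("KAIST (a university in Daejeon) runs the lab.")
```

Nested parentheses and other bracket types are not handled in this sketch.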

The triple extraction device (9) extracts triples from sentences. When RDF triples are used, the sentence "Dog is an animal." yields the triple <dog><is_a><animal>. The triple may be determined by analyzing the dependency relations between words with the natural language sentence parsing device, or by applying predefined patterns.

The triple mapping device (11) determines which terms in the machine-readable knowledge structure database (14) each part of a triple corresponds to. For example, if the knowledge <lecture><is_taught_by><professor> and <student><attend><lecture> is described in the database (14) and the triple is <Prof. Kim><teach><Database>, then <Prof. Kim> corresponds to <professor>, <teach> to <is_taught_by>, and <Database> to <lecture>. The correspondence may be judged from the similarity between words, or by attaching additional information to the triple and comparing the enriched triple with part of the structure in the database (14).

The machine-readable knowledge structure output device (12) adds the triples mapped through the above process to the existing machine-readable knowledge structure database (14) and outputs the result to a file or displays it on screen as a graph. Because the mapped triples are written to the database (14), when the same sentence occurs in another electronic document the electronic document set analysis device (6) can mark it as already covered, so the same content is not processed twice.
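The update-and-output behavior can be sketched with an in-memory set standing in for the database (14). The `add_triple` and `to_xml` helpers and the XML element names are hypothetical; the output is plain illustrative XML, not RDF/XML or OWL.

```python
import xml.etree.ElementTree as ET


def add_triple(store, triple):
    """Add a triple unless it is already present; return True if it was new."""
    if triple in store:
        return False
    store.add(triple)
    return True


def to_xml(store):
    """Serialize all stored triples as a simple XML string."""
    root = ET.Element("triples")
    for subject, predicate, obj in sorted(store):
        node = ET.SubElement(root, "triple")
        ET.SubElement(node, "subject").text = subject
        ET.SubElement(node, "predicate").text = predicate
        ET.SubElement(node, "object").text = obj
    return ET.tostring(root, encoding="unicode")


store = set()
first = add_triple(store, ("dog", "is_a", "animal"))
second = add_triple(store, ("dog", "is_a", "animal"))  # same content seen again
xml_text = to_xml(store)
```

The second call returns `False`, which mirrors how already-recorded content can be flagged instead of being annotated again.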

In every one of the steps described above, the user can enter subjective judgments in the form of a log. These steps are the electronic document input/collection step through the electronic document collection device (5), the electronic document selection step through the electronic document set analysis device (6), the sentence extraction and sentence selection step through the sentence extraction device (7), the simple sentence generation step through the simple sentence device (8), the triple extraction step through the triple extraction device (9), the triple mapping step through the triple mapping device (11), and the machine-readable knowledge structure output step through the machine-readable knowledge structure output device (12). The log is stored in a database, where it can be analyzed and used to improve the system.
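One possible way to store such a log is sketched below using Python's standard `sqlite3` module. The table name `judgment_log`, its columns, and the step identifiers in `STEPS` are hypothetical; the description does not specify a schema or a particular database.

```python
import sqlite3

# Assumed identifiers for the steps listed above.
STEPS = (
    "collection",
    "selection",
    "sentence_extraction",
    "simple_sentence",
    "triple_extraction",
    "triple_mapping",
    "output",
)


def open_log(path=":memory:"):
    """Open (or create) the log database; in-memory by default."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS judgment_log ("
        "id INTEGER PRIMARY KEY, step TEXT NOT NULL, comment TEXT NOT NULL)"
    )
    return conn


def record(conn, step, comment):
    """Store one subjective judgment entered by the user at a given step."""
    if step not in STEPS:
        raise ValueError(f"unknown step: {step}")
    conn.execute(
        "INSERT INTO judgment_log (step, comment) VALUES (?, ?)", (step, comment)
    )
    conn.commit()


def entries_for(conn, step):
    """Return the comments logged for one step, oldest first."""
    rows = conn.execute(
        "SELECT comment FROM judgment_log WHERE step = ? ORDER BY id", (step,)
    )
    return [row[0] for row in rows]
```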

Claims

A system for attaching semantic information to an electronic document, comprising:

an electronic document collecting device for collecting electronic documents on the web that meet a user's information needs;

an electronic document set analyzing device for selecting an electronic document by analyzing the electronic documents in the collected electronic document set or in an electronic document set provided by the user;

a sentence extracting device for extracting natural language sentences from the selected electronic document;

a simple sentence device for separating each sentence into simple sentences;

a triple extraction device for extracting triples from the simple sentences;

a triple mapping device for mapping each part of a triple to an existing machine-readable knowledge structure;

a machine-readable knowledge structure document output device for outputting a machine-readable knowledge structure as an electronic document from the triples and the mapping information;

an error log input device for allowing the user to leave a record of his or her judgments; and

an error log analysis device that analyzes the user's records and compiles statistics on them.

The system of claim 1,

wherein the electronic document includes an HTML document, a Wikipedia document, a Portable Document Format (PDF) document, a Microsoft Word document, a Hangul word processor document, an electronic document produced using an optical reader, and any electronic document composed of natural language sentences that a machine can process.

The system of claim 2,

wherein the natural language includes any language that can be recorded in an electronic document, including English, Korean, Japanese, Chinese, German, and French.

The system of claim 1,

wherein the electronic document set includes link information between the documents.

The system of claim 1,

wherein the language describing the machine-readable knowledge structure includes any language for which an interpretation processing device exists, so that a machine can interpret and process the meaning, such as OWL (Web Ontology Language) and KIF (Knowledge Interchange Format).

The system of claim 1,

wherein the triples are mapped to a machine-readable knowledge structure language, including RDF (Resource Description Framework) triples, so as to extend any existing machine-readable knowledge structure or to generate a new machine-readable knowledge structure.

The system of claim 1,

wherein the system provides a web interface so that it can be used through a web browser, and can also be built and used as a general application program.

The system of claim 1,

wherein the related information is either stored only on the user's computer and used only on that computer, or stored on a server so that several people can use it over a network.

The system of claim 1,

wherein the machine-readable knowledge structure document output device can display its output on a screen in graph form and can output it as a file in a machine-readable language such as XML (Extensible Markup Language).

A method of attaching semantic information to an electronic document, comprising:

a first step of analyzing a given set of electronic documents and selecting the electronic documents that overlap most with a previously generated machine-readable knowledge structure document;

a second step of selecting a sentence from among the sentences in the selected electronic document;

a third step of dividing the selected sentence into simple sentences using a simple sentence device;

a fourth step of extracting triples from the simple sentences using a triple extraction device;

a fifth step of mapping each part of the extracted triples to an existing machine-readable knowledge structure using a triple mapping device; and

a sixth step of outputting a machine-readable knowledge structure document from the extracted triples and the mapping information using a machine-readable knowledge structure document output device.

The method of claim 10,

wherein the degree of overlap between the previously generated machine-readable knowledge structure document and a given electronic document, as determined by the electronic document set analyzing device, is proportional to how frequently content related to the words in the previously generated knowledge structure appears in the given electronic document.
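The frequency-based overlap described in this claim could be computed along the following lines. This is a sketch under simplifying assumptions: "content related to the words" is reduced to exact, case-insensitive word matches, and the function names `overlap_degree` and `select_document` are hypothetical.

```python
import re
from collections import Counter


def overlap_degree(document_text, knowledge_words):
    """Count how often words of the knowledge structure occur in the document."""
    tokens = re.findall(r"\w+", document_text.lower())
    counts = Counter(tokens)
    return sum(counts[word.lower()] for word in knowledge_words)


def select_document(documents, knowledge_words):
    """Return the name of the document that overlaps most with the knowledge structure."""
    return max(
        sorted(documents),
        key=lambda name: overlap_degree(documents[name], knowledge_words),
    )
```

Because only exact matches are counted, inflected forms such as "students" do not match "student". An implementation closer to the claim would also count related content, for example through morphological analysis.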

The method of claim 10,

wherein the electronic document set analyzing device classifies the given electronic documents into groups of similar documents, and the degree of overlap between the previously generated machine-readable knowledge structure document and the given electronic documents is proportional to the extent to which content related to the representative words of a group appears in the previously generated knowledge structure.

The method of claim 10,

wherein the third step includes automatic conversion into a form suitable for the triple extraction device, and semi-automatic conversion in which the user manually rewrites the sentence into simple sentences.

The method of claim 10,

wherein, in the third step, to help the user manually convert a sentence into simple sentences, all information that can be obtained with a natural language interpretation processing device is presented to the user as text or pictures, including clause/phrase markings such as noun clauses/phrases and adjective clauses/phrases, markings of the words that are the minimum units of the knowledge structure, and dependency relations between words.

The method of claim 10,

wherein, in the third step, the natural language interpretation processing device analyzes natural language sentences using patterns, morphological analysis, or a dependency grammar tree.

The method of claim 10,

wherein the third step includes providing the user with information for resolving the ambiguity of a natural language sentence, such as suggesting candidate words that a pronoun refers to, and presenting existing sentences and related documents in which the related content is described.

The method of claim 10,

wherein the third step includes providing information useful for performing the third step by connecting, in a defined manner such as with a line, each word of the sentence being processed to the word, sentence, and document in which related information is found, and by indicating in text what relation each connection represents.

The method of claim 10,

wherein, in the fourth step, triples can be extracted either using patterns or using a dependency grammar tree; the pattern-based method keeps a table of natural language sentence patterns paired with the corresponding triples, finds the pattern that matches the sentence being processed, and extracts the corresponding triple; and the dependency-grammar-tree method obtains a dependency grammar tree from a parser and applies a set of rules to it to extract the triple.
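The pattern-based method in this claim can be illustrated with a small table of sentence patterns and triple templates. The three patterns below, written as Python regular expressions, are invented for illustration and are not taken from the specification. The dependency-grammar-tree method is not sketched here because it requires a parser.

```python
import re

# Hypothetical pattern table: each sentence pattern is paired with the
# predicate of the triple it yields. Named groups capture subject and object.
PATTERNS = [
    (re.compile(r"^(?P<s>.+?) is taught by (?P<o>.+?)\.?$"), "is_taught_by"),
    (re.compile(r"^(?P<s>.+?) teaches (?P<o>.+?)\.?$"), "teach"),
    (re.compile(r"^(?P<s>.+?) attends (?P<o>.+?)\.?$"), "attend"),
]


def extract_triple(sentence):
    """Return (subject, predicate, object) for the first matching pattern, else None."""
    text = sentence.strip()
    for pattern, predicate in PATTERNS:
        match = pattern.match(text)
        if match:
            return (match.group("s"), predicate, match.group("o"))
    return None
```

For the simple sentence "Prof. Kim teaches Database." this returns `("Prof. Kim", "teach", "Database")`, the triple used in the description. A sentence that matches no pattern returns `None` and would be left for the user or for the dependency-based method.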

The method of claim 10,

wherein, in the fifth step, the subject and object parts of the triple are mapped to classes of the existing machine-readable knowledge structure, and the predicate part of the triple is mapped to a predicate of the existing machine-readable knowledge structure.

The method of claim 10,

wherein a machine-readable electronic dictionary such as WordNet, or a web search engine such as the Google search engine, is used to find the parent term (hypernym) of the subject or object of a triple; that information is used to compute the similarity between the subject or object and the classes of the existing machine-readable knowledge structure, taking the surrounding structure of each class into account; and the candidate classes are ranked by similarity and presented to the user.
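The ranking in this claim might look like the sketch below. The `HYPERNYMS` table is a hand-written, hypothetical stand-in for a lookup in WordNet or a search engine, and the score (the reciprocal of the position of a class in the hypernym chain) is an assumed measure. The sketch does not take the surrounding structure of each class into account, which the claim also calls for.

```python
# Hypothetical hypernym chains, nearest parent first. A real implementation
# would obtain these from an electronic dictionary or a web search.
HYPERNYMS = {
    "Prof. Kim": ["professor", "teacher", "person"],
    "Database": ["course", "lecture"],
}


def rank_classes(term, classes, hypernyms=HYPERNYMS):
    """Rank candidate classes for a subject or object, best candidate first.

    Returns (class, score) pairs. The score is 1/(position in the chain),
    or 0.0 when the class is not among the term's hypernyms.
    """
    chain = hypernyms.get(term, [])
    scored = []
    for cls in classes:
        score = 1.0 / (chain.index(cls) + 1) if cls in chain else 0.0
        scored.append((cls, score))
    # Sort by descending score, then alphabetically for a stable order.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))
```

The ranked list is what would be shown to the user, who then confirms or corrects the top candidate.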

The method of claim 10,

further comprising an error log input step in which error log types are presented to the user, and the user selects one of them and enters a comment.

The method of claim 21,

further comprising an error log analysis step that provides statistics on the error logs entered as described above, thereby quantifying the user's subjective judgments so that they can be used as data for improving the other devices.
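As an illustration of the error log analysis step, the sketch below counts error log entries by step and by type. The entry format (a dictionary with `step`, `type`, and `comment` keys) and the function name `error_statistics` are assumptions; the claims do not define the log format or which statistics are reported.

```python
from collections import Counter


def error_statistics(log_entries):
    """Summarize error log entries by step and by error type.

    Each entry is assumed to be a dict with "step", "type", and "comment" keys.
    """
    by_step = Counter(entry["step"] for entry in log_entries)
    by_type = Counter(entry["type"] for entry in log_entries)
    total = len(log_entries)
    # Share of each error type among all entries (empty when nothing was logged).
    share = {t: n / total for t, n in by_type.items()} if total else {}
    return {"total": total, "by_step": by_step, "by_type": by_type, "share": share}
```

Counts per step indicate which device draws the most corrections from the user, which is the kind of quantified judgment the claim describes as input for improving the other devices.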

The method of claim 21,

wherein the error log input step and the error log analysis step are linked to every step in which the user intervenes (the electronic document collection step, electronic document set analysis step, sentence extraction step, simple sentence step, triple extraction step, triple mapping step, and machine-readable knowledge structure document output step), so that an error log is entered and analyzed at each step.