KR20200057207A

KR20200057207A - Assistive method and system for the comparative reading of multiple documents based on discourse analysis

Info

Publication number: KR20200057207A
Application number: KR1020180141266A
Authority: KR
Inventors: 박종철; 양원석; 김정호
Original assignee: 한국과학기술원
Priority date: 2018-11-16
Filing date: 2018-11-16
Publication date: 2020-05-26

Abstract

Techniques have been developed that, given multiple documents, analyze the common and differing features among them, classify and cluster the documents, and, where two documents differ, determine in which aspects they differ. However, no comparative reading assistance system has been proposed that lets a user read multiple documents seamlessly, in the same manner as reading one integrated document, in order to grasp the overall picture that can be identified only through an overall understanding of those documents. The present invention relates to a method and a system that analyze a given set of documents on the basis of discourse structure, contrast them in parallel, and output the result to the user. The method and the system enable the user to identify contrasting information across the documents at a glance while reading the document set, provide automatic assistance so that the user can read the set in a manner similar to reading a single document through seamless reading of a merged document, and provide automatic assistance that eases the cognitive burden of combining information reported across the documents.

Description

Assistive method and system for the comparative reading of multiple documents based on discourse analysis

The present invention relates to a natural language processing technique that analyzes a given document set on the basis of discourse structure, contrasts the documents in parallel, and outputs the result to the user, so that the user can read a set of multiple documents in a manner similar to reading a single document and can grasp the contrasting information in each document at a glance.

Given multiple documents, techniques have been developed to analyze the commonalities and differences among them and to classify and cluster the documents accordingly. Fine-grained analysis techniques have also been developed to determine whether the documents contain opposing opinions and, where two documents differ, in which aspects they differ, as have techniques that automatically detect whether a document shows a biased opinion relative to other documents or contains content plagiarized from other documents. Given multiple documents on the same topic, techniques have been developed to collect and combine the information that each document provides only in part so that the overall picture can be understood, as have techniques to detect whether the same content is described across documents using different words and expressions.

Ramos (2003) developed a technique that uses TF-IDF (Term Frequency - Inverse Document Frequency) to analyze word relevance during document information extraction and thereby model the relatedness between documents. Trstenjak et al. (2014) developed a technique that improves the efficiency of document categorization by combining the KNN (K Nearest Neighbor) method with the TF-IDF method.

Stab and Gurevych (2016) developed a technique for determining whether a document contains a counterargument to a specific claim. Allen et al. (2014) developed a technique that uses discourse structure analysis to detect whether mutual disagreement occurs in dialogue and two-way text exchanges and to locate where in the discourse the disagreement occurs. Kuang and Davison (2016) built a linguistic model that uses semantic and contextual analysis to identify the degree of favorability or bias that an author shows toward a particular view within a document. Hube and Fetahu (2018) developed a system that automatically identifies sentences in Wikipedia that are biased toward a particular view. Potthast et al. (2018) developed a system that analyzes the grammatical structure and linguistic features of a document to automatically determine whether the document shows an excessively biased view on a particular topic and whether it is fake news.

Wang et al. (2008) proposed a method for automatically identifying duplicates among the error reports filed for a given piece of software, and Alipour et al. (2013) proposed a method that uses contextual information to improve the performance of such automatic duplicate detection. Falessi et al. (2010) proposed a method for automatically identifying duplicate requirements among the development requirement documents written for a given piece of software. Yang and Tan (2012) proposed a method that compares contexts across software-related documents to automatically identify the similarity between the meanings of software-domain-specific words that share similar software characteristics.

Barrón-Cedeño (2010) proposed a method for automatically detecting plagiarism between documents written in two different languages. Korpal and Bose (2016) developed a method and system that automatically detect plagiarism with high accuracy by comparing documents through N-gram frequency comparison. US patent US20120060081A1 (granted, 2013-04-16), “Systems and methods for document analysis”, proposed a method and system, used in the plagiarism detection service turnitin, for automatically detecting possible plagiarism through document similarity analysis and semantic analysis. US patent US20170178528A1 (application, 2017-06-22), “Method and System for Providing Automated Localized Feedback for an Extracted Component of an Electronic Document File”, proposed a method and system, also used in the turnitin service, for providing partial feedback on a portion of a document in a manner that is convenient for the user to access.

Ku and Leroy (2013) proposed a method for automatically analyzing and categorizing crime-related reports for e-government systems and policy decision support systems. Jayaweera et al. (2015) proposed a method for analyzing crime-related information by cross-comparing the crime reports published in multiple news documents.

As described above, research to date has addressed document feature analysis and clustering through comparative analysis of multiple documents, and the combination and extraction of domain-specific information by integrating information distributed across multiple documents. However, no concrete method has been proposed for building a comparative reading assistance system that aligns the order of multiple documents so that the reading process needed to find the overall picture, which can be grasped only through an overall understanding of the documents, can proceed seamlessly in the same manner as reading a single integrated document.

To achieve the above object, a discourse-structure-analysis-based comparative reading assistance system according to an embodiment of the present invention includes:

a user input document processing unit (110) that receives from a user the set of multiple documents (hereinafter, the input document set) and information on the alignment reference document; extracts elementary discourse units through discourse analysis; analyzes the similarity between the sentences in the alignment reference document and the sentences in the alignment target documents to identify those sentences in the alignment target documents whose meaning is similar to that of sentences in the alignment reference document; uses the automatic sentence alignment and syntax matrix based consistency score to determine between which lines of the alignment reference document the sentences of the alignment target documents should be placed so that they sit side by side, in parallel, with the sentences of the alignment reference document; changes the order of the sentences in the alignment target documents according to that determination; defines the resulting sentence order of the alignment target documents as the sentence order for comparative reading; and identifies the discourse relations between the sentences of the alignment target documents and the sentences of the alignment reference document that are placed side by side;

a control unit (120) that receives the information on the alignment reference document from the user input document processing unit (110) and passes that information to the output unit (130); and

an output unit (130) that receives the information on the alignment reference document from the control unit (120) and receives the sentence order for comparative reading and the discourse relations from the user input document processing unit (110); outputs, in parallel, the alignment reference document and the alignment target documents reordered according to the sentence order for comparative reading, such that each document in the input document set corresponds to one column and the vertical position of each row within a column corresponds to the order recommended for reading the alignment reference document and the alignment target documents; outputs, for each alignment target document whose sentence order has been changed, the original sentence order, for example by displaying the original position beside each sentence; and, through click-based user interaction, lets the user select the "previous" and "next" information relative to the original order of the alignment target document when the original-order indicator beside a sentence is clicked.

The system analyzes the document set received from the user on the basis of discourse structure, contrasts the documents in parallel, and outputs the result, so that the user can read a set of multiple documents in a manner similar to reading a single document and can grasp the contrasting information in each document at a glance.

Preferably, the automatic sentence alignment and syntax matrix based consistency score is calculated by the following equation.

[Equation]

Here, the terms of the equation are defined as follows.

The score is the automatic sentence alignment and syntax matrix based consistency score for a given document.

The swap term is the minimum number of sentence swaps needed to produce the given document from the document obtained by ordering its sentences with an automatic sentence ordering model.

The constant is a fixed constant defined as a real number greater than 0.

The determinant term is the determinant of the matrix obtained from the discourse role matrix, which results from discourse structure analysis of the given document, by replacing each null entry with 0 and each non-null entry with the number of discourse roles it contains (0 if the inverse of the discourse role matrix does not exist).

The two weights are non-negative real numbers, and at least one of them is nonzero.
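The two quantities that the definitions above describe in words can be illustrated with a minimal Python sketch. It covers only the swap count and the determinant term; how they are weighted and combined is left to the equation and is not assumed here. The function names `min_swaps` and `role_count_determinant` are hypothetical, a swap is assumed to exchange any two sentence positions (not only adjacent ones), and a non-square matrix is treated as having no inverse.

```python
from fractions import Fraction
from typing import Collection, Hashable, List, Optional, Sequence


def min_swaps(model_order: Sequence[Hashable], actual_order: Sequence[Hashable]) -> int:
    """Minimum number of pairwise swaps that turn model_order into actual_order.

    Assumption: both sequences hold the same distinct sentence ids and a swap
    may exchange any two positions. The answer is n minus the number of cycles
    of the permutation that maps one order onto the other.
    """
    if len(set(model_order)) != len(model_order) or set(model_order) != set(actual_order) \
            or len(model_order) != len(actual_order):
        raise ValueError("both orders must contain the same distinct sentence ids")
    target_pos = {sid: i for i, sid in enumerate(actual_order)}
    perm = [target_pos[sid] for sid in model_order]
    seen = [False] * len(perm)
    cycles = 0
    for start in range(len(perm)):
        if seen[start]:
            continue
        cycles += 1
        j = start
        while not seen[j]:
            seen[j] = True
            j = perm[j]
    return len(perm) - cycles


def role_count_determinant(
    role_matrix: List[List[Optional[Collection[str]]]],
) -> Fraction:
    """Determinant of the role matrix after null -> 0 and non-null -> role count.

    Returns 0 when the matrix is empty, non-square, or singular, which this
    sketch takes as the cases where no inverse exists.
    """
    counts = [[0 if cell is None else len(cell) for cell in row] for row in role_matrix]
    n = len(counts)
    if n == 0 or any(len(row) != n for row in counts):
        return Fraction(0)
    m = [[Fraction(v) for v in row] for row in counts]
    det = Fraction(1)
    for col in range(n):
        # Find a row with a non-zero pivot in this column.
        pivot = next((r for r in range(col, n) if m[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det  # a row exchange flips the sign of the determinant
        det *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return det
```

Exact fractions are used so that small integer matrices give exact determinants; a numerical library would be the more practical choice for larger matrices.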

With the present invention, a user is provided with a merged document in which a given document set has been contrasted in parallel on the basis of discourse structure. Instead of reading each document in the set separately and then collecting the information obtained in order to understand what the set reports, the user receives automatic assistance to understand that information more efficiently by reading the merged document seamlessly, in a manner similar to reading a single document. The user can also grasp the contrasting information in each document at a glance and thereby receives automatic assistance that eases the cognitive burden of combining the specific information on a topic of interest from the document set.

FIG. 1 is a block diagram of a discourse-structure-analysis-based comparative reading assistance system according to an embodiment of the present invention.
FIG. 2 is a detailed block diagram of an embodiment of the user input document processing unit shown in FIG. 1.
FIG. 3 is a detailed block diagram of an embodiment of the output unit shown in FIG. 1.
FIG. 4 illustrates an example of the input and output of the discourse-structure-analysis-based comparative reading assistance system according to an embodiment of the present invention.
FIG. 5 is a flowchart detailing the part of the discourse-structure-analysis-based comparative reading assistance method that is performed by the user input document processing unit, according to an embodiment of the present invention.
FIG. 6 is a flowchart detailing the part of the discourse-structure-analysis-based comparative reading assistance method that is performed by the output unit, according to an embodiment of the present invention.

The advantages and features of the present invention, and the methods of achieving them, will become apparent from the embodiments described in detail below together with the accompanying drawings. However, the present invention is not limited to the embodiments disclosed below and may be implemented in various different forms. The embodiments are provided only to make the disclosure of the present invention complete and to fully convey the scope of the invention to those of ordinary skill in the art to which the invention pertains; the invention is defined only by the scope of the claims.

The terminology used in this specification is for describing the embodiments and is not intended to limit the present invention. In this specification, the singular form includes the plural form unless the wording specifically states otherwise. As used in this specification, "comprises" and/or "comprising" do not exclude the presence or addition of one or more components, steps, operations, and/or elements other than those mentioned.

Unless otherwise defined, all terms used in this specification (including technical and scientific terms) have the meaning commonly understood by those of ordinary skill in the art to which the present invention pertains. Terms defined in commonly used dictionaries are not to be interpreted ideally or excessively unless they are explicitly and specifically defined.

Hereinafter, preferred embodiments of the present invention are described in more detail with reference to the accompanying drawings. The same reference numerals are used for the same components in the drawings, and duplicate descriptions of the same components are omitted.

FIG. 1 is a block diagram of a discourse-structure-analysis-based comparative reading assistance system according to an embodiment of the present invention.

As shown in FIG. 1, the discourse-structure-analysis-based comparative reading assistance system (100) according to an embodiment of the present invention includes a user input document processing unit (110), a control unit (120), and an output unit (130).

FIG. 4 illustrates an example of the input and output of the discourse-structure-analysis-based comparative reading assistance system according to an embodiment of the present invention. As shown in FIG. 4, the system (100) receives from the user a set of multiple documents and information indicating which document in the set is the alignment reference document, and outputs to the user the sentences contained in the other documents of the set (hereinafter, the alignment target documents) contrasted in parallel against the alignment reference document. This parallel contrast includes (i) showing which sentences of the alignment target documents express information that matches the information expressed by the sentences of the alignment reference document, and (ii) determining which sentences of the alignment target documents are suitable to be placed side by side with which lines of the alignment reference document, changing the sentence order within the alignment target documents accordingly, and displaying them in parallel alongside the lines of the alignment reference document. FIG. 4 shows the case of three input documents. However, embodiments of the present invention are not limited to this case and may be practiced on input document sets of various sizes.

The user input document processing unit (110) receives from the user the set of multiple documents (hereinafter, the input document set) and information on the alignment reference document; extracts elementary discourse units through discourse analysis; analyzes the similarity between the sentences in the alignment reference document and the sentences in the alignment target documents to identify those sentences in the alignment target documents whose meaning is similar to that of sentences in the alignment reference document; determines between which lines of the alignment reference document the sentences of the alignment target documents should be placed so that they sit side by side, in parallel, with the sentences of the alignment reference document; changes the order of the sentences in the alignment target documents according to that determination; defines the resulting sentence order of the alignment target documents as the sentence order for comparative reading; and identifies the discourse relations between the sentences of the alignment target documents and the sentences of the alignment reference document that are placed side by side.

The control unit (120) receives the information on the alignment reference document from the user input document processing unit (110) and passes that information to the output unit (130).

The output unit (130) receives the information on the alignment reference document from the control unit (120) and receives the sentence order for comparative reading and the discourse relations from the user input document processing unit (110). It outputs, in parallel, the alignment reference document and the alignment target documents reordered according to the sentence order for comparative reading, such that each document in the input document set corresponds to one column and the vertical position of each row within a column corresponds to the order recommended for reading the alignment reference document and the alignment target documents. For each alignment target document whose sentence order has been changed, it outputs the original sentence order, for example by displaying the original position beside each sentence. Through click-based user interaction, it lets the user select the "previous" and "next" information relative to the original order of the alignment target document when the original-order indicator beside a sentence is clicked.
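The column-per-document layout and the "previous"/"next" lookup described for the output unit can be sketched in Python as follows. This is only an illustration under stated assumptions: `docs`, `layout`, `render_rows`, and `neighbours` are hypothetical names, each column is represented as a list that holds, for every row, the original index of the sentence shown there (or `None` for an empty row), and the original position is displayed as a bracketed 1-based number beside the sentence.

```python
from typing import Dict, List, Optional, Tuple

Column = List[Optional[int]]  # per row: original sentence index, or None if empty


def render_rows(docs: Dict[str, List[str]], layout: Dict[str, Column]) -> List[str]:
    """Return one text line per row, with each document occupying one column.

    Every non-empty cell shows the sentence's original position in its own
    document, so the original order stays visible after reordering.
    """
    names = list(layout)
    n_rows = max((len(col) for col in layout.values()), default=0)
    lines = []
    for row in range(n_rows):
        cells = []
        for name in names:
            col = layout[name]
            idx = col[row] if row < len(col) else None
            cells.append("" if idx is None else f"[{idx + 1}] {docs[name][idx]}")
        lines.append(" | ".join(cells))
    return lines


def neighbours(column: Column, row: int) -> Tuple[Optional[int], Optional[int]]:
    """Rows that hold the 'previous' and 'next' sentence in original order.

    Returns None on either side when the clicked sentence is the first or
    last sentence of its source document.
    """
    idx = column[row]
    if idx is None:
        raise ValueError("the selected row is empty in this column")
    row_of = {orig: r for r, orig in enumerate(column) if orig is not None}
    return row_of.get(idx - 1), row_of.get(idx + 1)
```

For example, with `layout = {"A": [0, 1], "B": [1, 0]}` the second document is shown in reversed order, and `neighbours(layout["B"], 0)` returns the row of the sentence that originally preceded the one displayed in the first row.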

The user input document processing unit (110) is described in detail below.

As shown in FIG. 2, the user input document processing unit (110) consists of an input unit (111), a preprocessing unit (112), a similar phrase determination unit (113), and a sentence order resetting unit (114).

The input unit (111) receives from the user the set of multiple documents (hereinafter, the input document set) and information on the alignment reference document.

The preprocessing unit (112) extracts elementary discourse units through discourse analysis of the input document set.

The similar phrase determination unit (113) analyzes the similarity between the sentences in the alignment reference document and the sentences in the alignment target documents to identify those sentences in the alignment target documents whose meaning is similar to that of sentences in the alignment reference document. Preferably, sentence similarity is defined as the cosine similarity between sentence meaning vectors, after each sentence has been converted into such a vector using a distributed representation technique such as the one proposed by Le and Mikolov (2014). Sentence similarity is measured for every sentence pair that can be formed from one sentence of the alignment reference document (hereinafter, the alignment reference sentence set) and one sentence of the alignment target documents, and the pairs in the top L% by similarity are defined as similar sentence pairs. Here, L is a real number between 0 and 100 defined as a system default value.
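The pair selection described above can be sketched in Python, assuming the sentence vectors have already been produced by some distributed representation model (that step is not shown). The names `cosine` and `similar_pairs` are hypothetical, and rounding the top L% up to a whole number of pairs is an assumption of this sketch, since the text does not say how a fractional count is handled.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two vectors; defined as 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similar_pairs(
    ref_vecs: List[Vector], tgt_vecs: List[Vector], top_percent: float
) -> List[Tuple[int, int, float]]:
    """Score every (reference, target) sentence pair and keep the top L percent.

    Returns (reference index, target index, similarity) tuples, most similar
    first. Assumption: the number of kept pairs is rounded up.
    """
    if not 0 <= top_percent <= 100:
        raise ValueError("top_percent must lie between 0 and 100")
    scored = [
        (i, j, cosine(r, t))
        for i, r in enumerate(ref_vecs)
        for j, t in enumerate(tgt_vecs)
    ]
    scored.sort(key=lambda pair: pair[2], reverse=True)
    keep = math.ceil(len(scored) * top_percent / 100)
    return scored[:keep]
```

Because the cut-off is a percentage of all pairs rather than a similarity threshold, the number of similar sentence pairs grows with document length regardless of how similar the documents actually are; a deployment might want to check the similarity of the last kept pair as well.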

The sentence order resetting unit 115 receives the information on the similar sentence pairs from the similar phrase determining unit 113 and changes the order of the sentences in the alignment target documents so that the sentences of each similar sentence pair are placed side by side in parallel. For the sentences in the alignment target documents that are not included in the set of similar sentence pairs (hereinafter, the non-similar alignment target sentence set), it determines which line position among the lines of the alignment reference document is appropriate for placing each sentence side by side in parallel, and changes the order of the sentences in the alignment target documents according to the result. The sentence order that the alignment target documents have after this change is defined as the sentence order for comparative reading. The unit also identifies the discourse relation between the sentences in the alignment target documents and the sentences in the alignment reference document that are placed side by side on the same line.

Preferably, the discourse relation between sentences is identified by concatenating the two sentences, using a model that automatically predicts the most appropriate conjunction for the gap between them, and taking the discourse relation signified by the predicted conjunction (for example, Contrast for "but", List for "and", and Cause for "because"). The automatic prediction of conjunctions may be carried out in the manner proposed by Rohde et al. (2018).
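The conjunction-to-relation step can be illustrated with a small lookup. This is only a sketch: the three mappings are the examples given in the text, while the function names, the "Unknown" fallback, and the `predict_connective` callable (standing in for a trained conjunction prediction model) are hypothetical.

```python
# Mappings taken from the examples in the text; a real table would be larger.
CONNECTIVE_TO_RELATION = {
    "but": "Contrast",
    "and": "List",
    "because": "Cause",
}


def discourse_relation(connective, default="Unknown"):
    """Map a predicted conjunction to the discourse relation it signals."""
    return CONNECTIVE_TO_RELATION.get(connective.strip().lower(), default)


def relation_between(sentence_a, sentence_b, predict_connective):
    """Identify the relation between two aligned sentences.

    `predict_connective` is a placeholder for any model that returns the most
    plausible conjunction for the gap between the two sentences.
    """
    return discourse_relation(predict_connective(sentence_a, sentence_b))
```

For instance, passing a stub such as `lambda a, b: "but"` as `predict_connective` yields `"Contrast"`.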

The sentence order resetting unit 115 identifies, among all document combinations that can be formed from the sentence set obtained by merging the alignment reference sentence set and the non-similar alignment target sentence set (hereinafter, the merged sentence set), the document combination that maximizes the automatic sentence alignment and syntax matrix-based coherence score, and takes the sentence order that the merged sentence set has in that document combination as the sentence order for comparative reading.

[Equation: image not reproduced]

In this equation, one symbol denotes the minimum number of sentence swaps needed to produce the document in question from the document obtained by ordering its sentences with an automatic sentence ordering model such as the one proposed by Logeswaran et al. (2018). Another symbol is a fixed constant defined as a real number greater than 0. A further symbol is defined as the determinant of the matrix obtained from the discourse role matrix, which results from discourse structure analysis of the document in question, by replacing each null entry with 0 and each non-null entry with the number of discourse roles in that entry (the value is 0 when the inverse of the discourse role matrix does not exist). Preferably, the discourse role matrix is computed through PDTB-style discourse parsing (Penn Discourse Treebank style discourse parsing), as in the method used by Feng et al. (2014), and the discourse role matrix itself is computed as a PDTB-style discourse role matrix.

The two remaining symbols are weights, each a non-negative real number, and at least one of them has a non-zero value.
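Because the equation itself is not available in this text, only the two quantities it is described as combining are sketched below; how they are weighted and combined is deliberately left out. The sketch rests on assumptions that are not in the source: a "swap" is taken to mean exchanging any two sentences (not only adjacent ones), a non-square matrix is given a determinant term of 0, and all function names and the role labels in the usage note are illustrative.

```python
from fractions import Fraction


def min_swaps(source_order, target_order):
    """Minimum number of pairwise swaps turning source_order into target_order.

    Both sequences must contain the same distinct items. The answer is the
    number of items minus the number of cycles in the induced permutation.
    """
    position = {item: idx for idx, item in enumerate(target_order)}
    perm = [position[item] for item in source_order]
    seen = [False] * len(perm)
    cycles = 0
    for start in range(len(perm)):
        if seen[start]:
            continue
        cycles += 1
        j = start
        while not seen[j]:
            seen[j] = True
            j = perm[j]
    return len(perm) - cycles


def role_count_matrix(role_matrix):
    """Replace null cells with 0 and other cells with their number of roles."""
    return [[0 if cell is None else len(cell) for cell in row] for row in role_matrix]


def determinant(matrix):
    """Exact determinant by Gaussian elimination; 0 for singular or non-square input."""
    n = len(matrix)
    if any(len(row) != n for row in matrix):
        return Fraction(0)  # assumption: non-square matrices contribute 0
    m = [[Fraction(x) for x in row] for row in matrix]
    det = Fraction(1)
    for col in range(n):
        pivot = next((r for r in range(col, n) if m[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)  # singular: no inverse exists
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det  # a row exchange flips the sign
        det *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return det
```

As a usage example, `min_swaps(["a", "b", "c"], ["c", "b", "a"])` needs a single swap, and a 2x2 role matrix such as `[[{"Comp.Arg1"}, None], [{"Cont.Arg2", "Exp.Arg1"}, {"Comp.Arg2"}]]` is first converted to `[[1, 0], [2, 1]]` before its determinant is taken.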

The output unit 130 is described in detail as follows.

As shown in FIG. 3, the output unit 130 consists of a parallel comparison document output unit 131, an additional information relative position output unit 132, and a user interaction unit 133.

The parallel comparison document output unit 131 receives the information on the alignment reference document and receives the sentence order for comparative reading and the discourse relations from the user input document processing unit 110. It then outputs, in parallel, the alignment reference document and the alignment target documents reordered according to the sentence order for comparative reading, such that each document in the input document set corresponds to one column and the vertical position (row) within each column corresponds to the recommended order for reading the alignment reference document and the alignment target documents.
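The column-per-document layout can be pictured with a plain-text rendering. This is a rough sketch under stated assumptions: the row representation (one dictionary per row, keyed by a document name), the fixed column width, and the `render_parallel` name are all hypothetical, and an actual system would presumably render a graphical view instead.

```python
def render_parallel(rows, column_names, width=30):
    """Render aligned sentences as fixed-width text columns, one per document.

    `rows` lists the rows in recommended reading order; each row maps a
    document name to the sentence shown on that row, and documents with no
    sentence on a row are left blank.
    """
    lines = [" | ".join(name.ljust(width) for name in column_names)]
    for row in rows:
        cells = [(row.get(name) or "")[:width].ljust(width) for name in column_names]
        lines.append(" | ".join(cells))
    return "\n".join(lines)
```

A row such as `{"ref": "A1", "doc2": "B3"}` places the third sentence of a target document next to the first sentence of the reference document, while `{"doc2": "B1"}` shows a target sentence that has no counterpart on that row.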

The additional information relative position output unit 132 outputs, for each alignment target document whose sentences have been reordered according to the sentence order for comparative reading, the original sentence order, for example by displaying the original sentence position beside each sentence.

Through click-based user interaction, the user interaction unit 133 allows the user, upon clicking the original sentence position shown beside a sentence, to select "previous" or "next" with respect to the original order of the alignment target document. Specifically, "previous" moves the screen view to the position that the preceding sentence, in the original order of the alignment target document, occupies after reordering according to the sentence order for comparative reading, and "next" likewise moves the view to the position of the following sentence. For the first and last sentences in the original order of the additional document, only "next" and "previous", respectively, can be selected.
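The "previous"/"next" behavior reduces to a lookup from original sentence positions to reordered rows. The following sketch assumes such a lookup list is available; the `navigation_targets` name and the dictionary return format are illustrative choices, not taken from the source.

```python
def navigation_targets(original_index, row_of_original):
    """Return the rows that "previous" and "next" should scroll to.

    `row_of_original[k]` is the row at which the sentence that was originally
    k-th in its document appears after reordering. The first sentence offers
    only "next" and the last sentence only "previous".
    """
    last = len(row_of_original) - 1
    if not 0 <= original_index <= last:
        raise IndexError("original_index is outside the document")
    targets = {}
    if original_index > 0:
        targets["previous"] = row_of_original[original_index - 1]
    if original_index < last:
        targets["next"] = row_of_original[original_index + 1]
    return targets
```

For a three-sentence document whose sentences ended up on rows 4, 0, and 7, clicking the middle sentence offers row 4 as "previous" and row 7 as "next".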

FIGS. 5 and 6 are flowcharts illustrating a discourse structure analysis-based comparative reading assistance method according to an embodiment of the present invention. As shown in FIG. 5, the part of the method performed by the user input document processing unit 110 comprises an input step (S310), a preprocessing step (S320), a similar phrase determining step (S330), and a sentence order resetting step (S340). As shown in FIG. 6, the part of the method performed by the output unit 130 comprises a parallel comparison document output step (S410), an additional information relative position output step (S420), and a user interaction step (S430).

The input step (S310) receives from the user the set of multiple documents (hereinafter, the input document set) and the information on the alignment reference document. The preprocessing step (S320) extracts basic discourse units through discourse structure analysis of the input document set. The similar phrase determining step (S330) analyzes the similarity between the sentences in the alignment reference document and the sentences in the plurality of alignment target documents, and identifies those sentences in the alignment target documents whose meaning is similar to that of sentences in the alignment reference document. The sentence order resetting step (S340) determines which line position among the lines of the alignment reference document is appropriate for placing each sentence of the alignment target documents side by side in parallel, changes the order of the sentences in the alignment target documents accordingly, and identifies the discourse relation between the sentences in the alignment target documents and the sentences in the alignment reference document that are placed side by side on the same line.

The parallel comparison document output step (S410) outputs, in parallel, the alignment reference document and the alignment target documents reordered according to the sentence order for comparative reading, such that each document in the input document set corresponds to one column and the vertical position (row) within each column corresponds to the recommended order for reading the alignment reference document and the alignment target documents. The additional information relative position output step (S420) outputs, for each alignment target document whose sentences have been reordered according to the sentence order for comparative reading, the original sentence order, for example by displaying the original sentence position beside each sentence. The user interaction step (S430) allows the user, through click-based user interaction, to select "previous" or "next" with respect to the original order of the alignment target document upon clicking the original sentence position shown beside a sentence.

The apparatus described above may be implemented as hardware components, software components, and/or a combination of hardware and software components. For example, the apparatus and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable array (FPA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. The processing device may run an operating system (OS) and one or more software applications that run on the operating system. The processing device may also access, store, manipulate, process, and generate data in response to execution of the software. For ease of understanding, the processing device is sometimes described as a single device, but a person of ordinary skill in the art will appreciate that the processing device may include a plurality of processing elements and/or a plurality of types of processing elements. For example, the processing device may include a plurality of processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

The software may include a computer program, code, instructions, or a combination of one or more of these, and may configure the processing device to operate as desired or may instruct the processing device independently or collectively. The software and/or data may be embodied in any type of machine, component, physical device, virtual equipment, or computer storage medium or device, in order to be interpreted by the processing device or to provide instructions or data to the processing device. The software may be distributed over networked computer systems and stored or executed in a distributed manner. The software and data may be stored on one or more computer-readable recording media.

The method according to the embodiments may be implemented in the form of program instructions executable through various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiments, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like.

Although the embodiments have been described above with reference to limited embodiments and drawings, a person of ordinary skill in the art can make various modifications and variations from the above description. For example, appropriate results may be achieved even if the described techniques are performed in an order different from the described method, and/or components of the described systems, structures, devices, circuits, and the like are coupled or combined in a form different from the described method, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents of the claims also fall within the scope of the claims that follow.

Description of reference numerals

100: Discourse structure analysis-based comparative reading assistance system

110: User input document processing unit

120: Control unit

130: Output unit

111: Input unit

112: Preprocessing unit

113: Similar phrase determining unit

114: Sentence order resetting unit

131: Parallel comparison document output unit

132: Additional information relative position output unit

133: User interaction unit

Claims

A discourse structure analysis-based comparative reading assistance method,
wherein, upon receiving from a user a set of multiple documents and information on the document serving as an alignment reference document, the method determines, for the sentences contained in the other documents of the set (hereinafter, alignment target documents), which line position among the lines of the alignment reference document is appropriate for placing each sentence side by side in parallel, and changes the order of the sentences in the alignment target documents accordingly, so that the sentences in the alignment target documents are displayed side by side in parallel with the lines of the alignment reference document.

The method of claim 1,
wherein the discourse structure analysis-based comparative reading assistance method comprises:
an input step (S310) of receiving from the user the set of multiple documents (hereinafter, the input document set) and the information on the alignment reference document;
a preprocessing step (S320) of extracting basic discourse units through discourse structure analysis of the input document set;
a similar phrase determining step (S330) of analyzing the similarity between the sentences in the alignment reference document and the sentences in the plurality of alignment target documents, and identifying those sentences in the alignment target documents whose meaning is similar to that of sentences in the alignment reference document;
a sentence order resetting step (S340) of determining, based on an automatic sentence alignment and syntax matrix-based coherence score, which line position among the lines of the alignment reference document is appropriate for placing each sentence of the alignment target documents side by side in parallel, changing the order of the sentences in the alignment target documents accordingly, and identifying the syntax relationship between the sentences in the alignment target documents and the sentences in the alignment reference document that are placed side by side on the same line;
a parallel comparison document output step (S410) of outputting, in parallel, the alignment reference document and the alignment target documents reordered according to the sentence order for comparative reading, such that each document in the input document set corresponds to one column and the vertical position (row) within each column corresponds to the recommended order for reading the alignment reference document and the alignment target documents;
an additional information relative position output step (S420) of outputting, for each alignment target document whose sentences have been reordered according to the sentence order for comparative reading, the original sentence order, for example by displaying the original sentence position beside each sentence; and
a user interaction step (S430) of allowing the user, through click-based user interaction, to select "previous" or "next" with respect to the original order of the alignment target document upon clicking the original sentence position shown beside a sentence.

The method of claim 2, wherein the automatic sentence alignment and syntax matrix-based coherence score is calculated by the following equation.
[Equation: image not reproduced]

In this equation, one symbol is a fixed constant defined as a real number greater than 0. Another symbol is defined as the determinant of the matrix obtained from the discourse role matrix, which results from discourse structure analysis of the document in question, by replacing each null entry with 0 and each non-null entry with the number of discourse roles in that entry (the value is 0 when the inverse of the discourse role matrix does not exist). The two remaining symbols are weights, each a non-negative real number, and at least one of them has a non-zero value.

A discourse structure analysis-based comparative reading assistance system,
wherein, upon receiving from a user a set of multiple documents and information on the document serving as an alignment reference document, the system determines, for the sentences contained in the other documents of the set (hereinafter, alignment target documents), which line position among the lines of the alignment reference document is appropriate for placing each sentence side by side in parallel, and changes the order of the sentences in the alignment target documents accordingly, so that the sentences in the alignment target documents are displayed side by side in parallel with the lines of the alignment reference document.

The system of claim 4,
wherein the discourse structure analysis-based comparative reading assistance system
comprises a user input document processing unit 110, a control unit 120, and an output unit 130,
wherein the user input document processing unit 110 consists of:
an input unit 111 for receiving the set of multiple documents (hereinafter, the input document set) and the information on the alignment reference document;
a preprocessing unit 112 for extracting basic discourse units through discourse structure analysis of the input document set;
a similar phrase determining unit 113 for analyzing the similarity between the sentences in the alignment reference document and the sentences in the plurality of alignment target documents, and identifying those sentences in the alignment target documents whose meaning is similar to that of sentences in the alignment reference document; and
a sentence order resetting unit 114 for determining, based on an automatic sentence alignment and syntax matrix-based coherence score, which line position among the lines of the alignment reference document is appropriate for placing each sentence of the alignment target documents side by side in parallel, changing the order of the sentences in the alignment target documents accordingly, and identifying the syntax relationship between the sentences in the alignment target documents and the sentences in the alignment reference document that are placed side by side on the same line,
and wherein the system further consists of:
a parallel comparison document output unit 131 for outputting, in parallel, the alignment reference document and the alignment target documents reordered according to the sentence order for comparative reading, such that each document in the input document set corresponds to one column and the vertical position (row) within each column corresponds to the recommended order for reading the alignment reference document and the alignment target documents;
an additional information relative position output unit 132 for outputting, for each alignment target document whose sentences have been reordered according to the sentence order for comparative reading, the original sentence order, for example by displaying the original sentence position beside each sentence; and
a user interaction unit 133 for allowing the user, through click-based user interaction, to select "previous" or "next" with respect to the original order of the alignment target document upon clicking the original sentence position shown beside a sentence.

The system of claim 5, wherein the automatic sentence alignment and syntax matrix-based coherence score is calculated by the following equation.
[Equation: image not reproduced]

In this equation, one symbol is a fixed constant defined as a real number greater than 0. Another symbol is defined as the determinant of the matrix obtained from the discourse role matrix, which results from discourse structure analysis of the document in question, by replacing each null entry with 0 and each non-null entry with the number of discourse roles in that entry (the value is 0 when the inverse of the discourse role matrix does not exist). The two remaining symbols are weights, each a non-negative real number, and at least one of them has a non-zero value.