KR102021057B1

KR102021057B1 - Apparatus and method for extracting paragraph in document

Info

Publication number: KR102021057B1
Application number: KR1020170135883A
Authority: KR
Inventors: 정회경; 이종원
Original assignee: 배재대학교 산학협력단
Priority date: 2017-10-19
Filing date: 2017-10-19
Publication date: 2019-09-11
Also published as: KR20190043857A

Abstract

Disclosed are an apparatus and method for extracting paragraphs in a document. The method may include: receiving a document and a search keyword; dividing the document into a plurality of paragraphs, numbering the paragraphs in the order in which they are arranged, and extracting, from among the plurality of paragraphs, the paragraphs that contain the search keyword; and sorting the extracted paragraphs by their assigned numbers and outputting them.

Description

Apparatus and method for extracting paragraphs in a document {APPARATUS AND METHOD FOR EXTRACTING PARAGRAPH IN DOCUMENT}

The present invention relates to an apparatus and method for extracting paragraphs in a document, which extract the paragraphs containing a search keyword from a document and output them.

In modern society, a large volume of diverse information is provided, so it is not easy for people to read every piece of information and grasp its content item by item, and doing so consumes a great deal of time.

To ease the difficulty of grasping information and to save time, paragraph extraction apparatuses that compress a document containing information and provide the result have been offered.

An existing paragraph extraction apparatus may determine search keywords on its own according to their frequency in a document, select the paragraphs containing those keywords, and output them.

However, such an apparatus compresses the document without regard to the user's search intent. Moreover, when the compression ratio is high, it becomes difficult to grasp the content accurately; when the compression ratio is low, the number of output paragraphs increases and the compression becomes meaningless.

Related prior art includes Korean Registered Patent Publication No. 10-1060594 (title of invention: Apparatus and method for extracting keywords from document data and constructing a related-word network; registration date: 2011.08.24).

An object of the present invention is to receive a document and a search keyword, and to extract and output the paragraphs of the document that contain the search keyword, thereby providing the document in compressed form according to the user's search intent.

Another object of the present invention is, when paragraphs extracted from a document on the basis of search keywords are duplicated, to remove the unnecessarily duplicated paragraphs before output, thereby providing the minimum number of paragraphs that still allows the content of the document to be grasped accurately, and thus supporting an optimal compression ratio.

A further object of the present invention is to calculate and output at least one of a weight of a search keyword in the document and a frequency of the search keyword in an arbitrary paragraph of the document, so that the importance of the search keyword or how often it is used can be ascertained.

To achieve the above objects, an apparatus for extracting paragraphs in a document may include: an interface unit that receives a document and a search keyword; an extraction unit that divides the document into a plurality of paragraphs, numbers the paragraphs in the order in which they are arranged, and extracts, from among the plurality of paragraphs, the paragraphs containing the search keyword; and a processor that sorts the extracted paragraphs by their assigned numbers and outputs them.

To achieve the above objects, a method for extracting paragraphs in a document may include: receiving a document and a search keyword; dividing the document into a plurality of paragraphs, numbering the paragraphs in the order in which they are arranged, and extracting, from among the plurality of paragraphs, the paragraphs containing the search keyword; and sorting the extracted paragraphs by their assigned numbers and outputting them.

According to the present invention, a document and a search keyword are received, and the paragraphs of the document that contain the search keyword are extracted and output, so that the document can be provided in compressed form according to the user's search intent.

In addition, according to the present invention, when paragraphs extracted from a document on the basis of search keywords are duplicated, the unnecessarily duplicated paragraphs are removed before output, so that the minimum number of paragraphs that still allows the content of the document to be grasped accurately is provided, and an optimal compression ratio can be supported.

In addition, according to the present invention, at least one of a weight of a search keyword in the document and a frequency of the search keyword in an arbitrary paragraph of the document is calculated and output, so that the importance of the search keyword or how often it is used can be ascertained.

FIG. 1 is a diagram illustrating an example of a network including an apparatus for extracting paragraphs in a document according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating the configuration of an apparatus for extracting paragraphs in a document according to an embodiment of the present invention.
FIG. 3 is a diagram for describing a method of searching for keywords in a document using a program in the apparatus for extracting paragraphs in a document according to an embodiment of the present invention.
FIG. 4 is a diagram illustrating an example of pseudocode for searching for keywords in a document in the apparatus for extracting paragraphs in a document according to an embodiment of the present invention.
FIG. 5 is a diagram for describing a method of extracting paragraphs in a document using a program in the apparatus for extracting paragraphs in a document according to an embodiment of the present invention.
FIG. 6 is a diagram for describing an example of receiving search keywords in the apparatus for extracting paragraphs in a document according to an embodiment of the present invention.
FIG. 7 is a diagram illustrating an example of pseudocode for extracting paragraphs in a document in the apparatus for extracting paragraphs in a document according to an embodiment of the present invention.
FIG. 8 is a diagram for describing a method of obtaining analysis information on search keywords in the apparatus for extracting paragraphs in a document according to an embodiment of the present invention.
FIG. 9 is a diagram illustrating an example of outputting the frequency and weight of each search keyword in a document in the apparatus for extracting paragraphs in a document according to an embodiment of the present invention.
FIG. 10 is a diagram illustrating an example of outputting the frequency of search keywords in an arbitrary paragraph of a document in the apparatus for extracting paragraphs in a document according to an embodiment of the present invention.
FIG. 11 is a diagram illustrating an example of pseudocode for calculating the weight of search keywords in a document and the frequency of search keywords in an arbitrary paragraph of the document in the apparatus for extracting paragraphs in a document according to an embodiment of the present invention.
FIGS. 12 and 13 are diagrams for describing compression results of the apparatus for extracting paragraphs in a document according to an embodiment of the present invention in comparison with compression results of an existing paragraph extraction apparatus.
FIG. 14 is a diagram illustrating another configuration example of the apparatus for extracting paragraphs in a document according to an embodiment of the present invention.
FIG. 15 is a flowchart illustrating a method for extracting paragraphs in a document according to an embodiment of the present invention.

Hereinafter, various embodiments of the present invention are described in detail with reference to the accompanying drawings and the content shown in them, but the present invention is not restricted or limited by the embodiments.

FIG. 1 is a diagram illustrating an example of a network including an apparatus for extracting paragraphs in a document according to an embodiment of the present invention.

Referring to FIG. 1, the network 100 may include a terminal 101 and an apparatus 103 for extracting paragraphs in a document.

The terminal 101 may be, for example, a smartphone or a tablet PC, and may transmit an analysis request for a document (e.g., an XML document) to the apparatus 103 through wired or wireless communication. Here, the analysis request may include a search keyword.

As a response to the analysis request, the terminal 101 may receive from the apparatus 103, and display, the document compressed using the search keyword.

In response to the analysis request for the document from the terminal 101, the apparatus 103 may compress the document and provide the compressed document to the terminal 101. In doing so, the apparatus 103 may extract the search keyword from the analysis request and compress the document using the extracted search keyword.

Specifically, the apparatus 103 may compress the document by dividing the document into a plurality of paragraphs, extracting from among them the paragraphs that contain the search keyword, and sorting the extracted paragraphs.
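The divide, number, and extract sequence described above can be sketched in Java as follows. This is only an illustrative sketch, not the patented implementation: the class name `ParagraphExtractor`, its method names, and the choice of a blank line as the paragraph separator are assumptions that do not come from the source.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper; names and the paragraph separator are assumptions.
final class ParagraphExtractor {

    // Split on blank lines (assumed separator) and number paragraphs from 1
    // in the order in which they appear in the document.
    static Map<Integer, String> numberParagraphs(String document) {
        Map<Integer, String> numbered = new LinkedHashMap<>();
        int number = 1;
        for (String block : document.split("\\R\\s*\\R")) {
            String text = block.trim();
            if (!text.isEmpty()) {
                numbered.put(number++, text);
            }
        }
        return numbered;
    }

    // Return the numbers of the paragraphs that contain the search keyword,
    // in ascending (document) order.
    static List<Integer> extract(Map<Integer, String> paragraphs, String keyword) {
        List<Integer> hits = new ArrayList<>();
        for (Map.Entry<Integer, String> entry : paragraphs.entrySet()) {
            if (entry.getValue().contains(keyword)) {
                hits.add(entry.getKey());
            }
        }
        return hits;
    }
}
```

Because the numbers are assigned before filtering, the extracted paragraphs keep their original document positions, which is what later allows them to be output in document order.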

In addition, the apparatus 103 may calculate at least one of a weight of the search keyword in the document and a frequency of the search keyword in an arbitrary paragraph of the document, and may add the calculated information to the compressed document and provide it to the terminal 101.

Meanwhile, the apparatus 103 may provide a compressed document to the terminal 101 in response to an analysis request received from the terminal 101, but is not limited thereto; it may also receive a document directly from a user and output a compressed document for the input document.

FIG. 2 is a diagram illustrating the configuration of an apparatus for extracting paragraphs in a document according to an embodiment of the present invention.

Referring to FIG. 2, the apparatus 200 for extracting paragraphs in a document according to an embodiment of the present invention may include an interface unit 201, an extraction unit 203, a processor 205, and a database 207.

The interface unit 201 may receive a document (e.g., an XML document) and a search keyword. The interface unit 201 may receive the search keyword directly from the user, or may provide a list of the keywords that the extraction unit 203 found in the document and receive, as the search keyword, the keyword chosen by a selection command on that list. The interface unit 201 may additionally receive the number of search keywords, or may determine that number by counting the selected keywords.

In addition, when a similar keyword (i.e., a synonym) having the same meaning as the search keyword is obtained from an external server (not shown), the interface unit 201 may present the similar keyword, which helps prevent content the user wants to find from being dropped from the paragraphs that are finally output. For example, when the search keyword 'person' is input, the interface unit 201 may obtain 'human' from the external server as a similar keyword for 'person' and present it.

The extraction unit 203 may divide the document into a plurality of paragraphs, extract keywords from the divided paragraphs (or from the document), and store them in the database 207. The extraction unit 203 may also assign a number to each of the paragraphs in the order in which they are arranged in the document, and extract, from among the plurality of paragraphs, the paragraphs containing the search keyword.

When there are a plurality of search keywords, the extraction unit 203 may extract the paragraphs containing each search keyword and, when several of the extracted paragraphs bear the same number, remove all but one of them from the extracted paragraphs. For example, when the paragraphs extracted for a first search keyword are the first paragraph (the paragraph assigned number 1), the third paragraph, and the fourth paragraph, and the paragraphs extracted for a second search keyword are the third paragraph and the fifth paragraph, the extraction unit 203 may remove one of the two copies of the third paragraph.

In addition, when an additional search request for the similar keyword presented through the interface unit 201 is input, the extraction unit 203 may further extract, from among the plurality of paragraphs, the paragraphs containing the similar keyword.
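A minimal Java sketch of this optional synonym search follows. The source does not describe the external server's interface, so a plain `Map` passed in by the caller stands in for it here; the class `SynonymExpansion` and its methods are hypothetical names introduced only for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; the synonym source is an in-memory stand-in for the external server.
final class SynonymExpansion {

    // Build the set of terms to search: the keyword itself, plus its similar
    // keywords only when the user has requested the additional search.
    static Set<String> expand(String keyword,
                              Map<String, List<String>> synonymSource,
                              boolean additionalSearchRequested) {
        Set<String> terms = new LinkedHashSet<>();
        terms.add(keyword);
        if (additionalSearchRequested) {
            terms.addAll(synonymSource.getOrDefault(keyword, Collections.emptyList()));
        }
        return terms;
    }

    // Return the 1-based numbers of the paragraphs containing any of the terms.
    static List<Integer> extract(List<String> paragraphs, Set<String> terms) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < paragraphs.size(); i++) {
            for (String term : terms) {
                if (paragraphs.get(i).contains(term)) {
                    hits.add(i + 1);
                    break; // one match is enough to extract the paragraph
                }
            }
        }
        return hits;
    }
}
```

With the 'person' and 'human' example above, a paragraph that mentions only 'human' is extracted only when the additional search is requested.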

The processor 205 may provide the document in compressed form by outputting the paragraphs extracted by the extraction unit 203 (e.g., paragraphs containing a search keyword or a similar keyword). The processor 205 may sort the extracted paragraphs by their assigned numbers before output (e.g., in ascending order of number), so that the order of the paragraphs in the document is preserved rather than rearranged.

For example, when the paragraphs extracted for the first search keyword are the first, third, and fourth paragraphs, and the paragraphs extracted for the second search keyword are the third and fifth paragraphs, the processor 205 may sort them by their assigned numbers and output them in the order of the first, third, fourth, and fifth paragraphs.
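The example above (hits 1, 3, 4 for one keyword and 3, 5 for another, output as 1, 3, 4, 5) can be reproduced with a short Java sketch. Using a `TreeSet` to remove duplicates and sort in one pass is a design choice made here for brevity, not something the source prescribes; `MergeAndSort` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical helper: merges per-keyword hit lists into one ordered list.
final class MergeAndSort {

    // Each inner list holds the paragraph numbers extracted for one keyword.
    // A TreeSet keeps a single copy of each number and iterates in ascending order.
    static List<Integer> merge(List<List<Integer>> perKeywordHits) {
        TreeSet<Integer> unique = new TreeSet<>();
        for (List<Integer> hits : perKeywordHits) {
            unique.addAll(hits);
        }
        return new ArrayList<>(unique);
    }
}
```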

In addition, the processor 205 may calculate and output, as analysis information on the search keyword, at least one of a weight of the search keyword in the document and a frequency of the search keyword in an arbitrary paragraph of the document. By outputting the weight of the search keyword in the document, the processor 205 makes it possible to ascertain the importance of the search keyword in the document. By outputting the frequency of the search keyword in an arbitrary paragraph of the document, the processor 205 makes it possible to ascertain how many times each search keyword is used in that paragraph.

When calculating the weights of the search keywords in the document, the processor 205 may sum the numbers of paragraphs containing each search keyword to obtain a total number (total frequency), and calculate the ratio of the number of paragraphs containing a specific search keyword (the frequency of that keyword) to the total number as the weight of that specific search keyword.

Specifically, when the search keywords include a first search keyword and a second search keyword, the processor 205 may calculate the total number by adding a first number, which is the number of paragraphs extracted as containing the first search keyword, and a second number, which is the number of paragraphs extracted as containing the second search keyword. The processor 205 may then calculate the ratio of the first number to the total number as the weight of the first search keyword, and the ratio of the second number to the total number as the weight of the second search keyword.
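As a worked illustration of this ratio, the Java sketch below computes weights from per-keyword paragraph counts. With a first number of 3 and a second number of 2, the total is 5 and the weights are 0.6 and 0.4. The class name `KeywordWeights` is hypothetical, and returning 0.0 when no paragraph matches any keyword is an assumption, since the source does not cover that case.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper illustrating weight = count / total of all counts.
final class KeywordWeights {

    // counts maps each search keyword to the number of paragraphs containing it.
    static Map<String, Double> weights(Map<String, Integer> counts) {
        int total = 0;
        for (int count : counts.values()) {
            total += count; // duplicates are counted once per keyword, as in the text
        }
        Map<String, Double> result = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            // Assumption: weight is 0.0 when the total is zero.
            double weight = total == 0 ? 0.0 : (double) entry.getValue() / total;
            result.put(entry.getKey(), weight);
        }
        return result;
    }
}
```

Note that the total is the sum of the per-keyword counts, so a paragraph containing two keywords contributes twice; the weights therefore always sum to 1 when the total is nonzero.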

When calculating the frequency of the search keyword in an arbitrary paragraph of the document, the processor 205 may take the number of times the search keyword appears (is used) in that paragraph as the frequency.
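One way to count occurrences within a paragraph is sketched below in Java. Counting non-overlapping, case-sensitive substring matches is an assumption; the source does not say how matches are delimited. `KeywordFrequency` is a hypothetical name.

```java
// Hypothetical helper: counts how often a keyword appears in one paragraph.
final class KeywordFrequency {

    // Counts non-overlapping, case-sensitive occurrences (an assumed matching rule).
    static int count(String paragraph, String keyword) {
        if (keyword.isEmpty()) {
            return 0; // avoid an endless loop on an empty keyword
        }
        int count = 0;
        int from = 0;
        while ((from = paragraph.indexOf(keyword, from)) != -1) {
            count++;
            from += keyword.length(); // continue after the current match
        }
        return count;
    }
}
```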

In addition, when outputting the paragraphs, the processor 205 may render the search keyword in an extracted paragraph so that it is distinguished from the other text in that paragraph. When there are a plurality of search keywords, the processor 205 may render each search keyword in a different form (e.g., a different color, font, or size), so that the search keywords contained in a paragraph are easy to recognize.
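The source leaves the rendering mechanism open (color, font, size). As a stand-in, the Java sketch below wraps each keyword in its own numbered tag that a renderer could map to a distinct style. The tag format `<k1>…</k1>` and the class name `KeywordHighlighter` are purely illustrative assumptions, and the sketch does not handle keywords that overlap one another or that collide with the tag text.

```java
import java.util.List;

// Hypothetical helper: marks each search keyword with its own tag.
final class KeywordHighlighter {

    // Keyword i (1-based) is wrapped as <ki>keyword</ki>; the tag format is an assumption.
    static String highlight(String paragraph, List<String> keywords) {
        String result = paragraph;
        for (int i = 0; i < keywords.size(); i++) {
            String keyword = keywords.get(i);
            String tag = "k" + (i + 1);
            result = result.replace(keyword, "<" + tag + ">" + keyword + "</" + tag + ">");
        }
        return result;
    }
}
```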

In addition, when at least a set number of consecutively numbered paragraphs are extracted, the processor 205 may output a continuity mark on those paragraphs. For example, when the extracted paragraphs are the first, third, fourth, and fifth paragraphs and the set number is '3', the processor 205 may output a continuity mark on each of the three consecutively numbered paragraphs, namely the third, fourth, and fifth paragraphs.
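Finding the paragraphs that qualify for the continuity mark amounts to detecting runs of consecutive numbers in the sorted list. The following Java sketch shows one way to do that; `ContinuityMarker` and its method are hypothetical names, and the input is assumed to be sorted in ascending order with duplicates already removed.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper: finds paragraph numbers that belong to a long enough consecutive run.
final class ContinuityMarker {

    // sortedNumbers must be ascending and free of duplicates (assumed precondition).
    static Set<Integer> consecutive(List<Integer> sortedNumbers, int minRun) {
        Set<Integer> marked = new TreeSet<>();
        int start = 0; // index where the current run begins
        for (int i = 1; i <= sortedNumbers.size(); i++) {
            boolean runEnds = i == sortedNumbers.size()
                    || sortedNumbers.get(i) != sortedNumbers.get(i - 1) + 1;
            if (runEnds) {
                if (i - start >= minRun) {
                    marked.addAll(sortedNumbers.subList(start, i));
                }
                start = i;
            }
        }
        return marked;
    }
}
```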

As another example, when there are a plurality of search keywords, the processor 205 may output an importance mark on a paragraph that contains all of the search keywords. For example, when the search keywords include a first, a second, and a third search keyword and the third paragraph contains all three, the processor 205 may output an importance mark on the third paragraph so that important paragraphs are easy to recognize.
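Selecting the paragraphs for the importance mark can be sketched in Java as a containment check against every keyword. `ImportanceMarker` is a hypothetical name, and returning no marks when fewer than two keywords are given reflects the text's condition that the mark applies when there are a plurality of search keywords.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: finds the paragraphs that contain every search keyword.
final class ImportanceMarker {

    // Returns 1-based numbers of the paragraphs containing all keywords.
    static List<Integer> containingAll(List<String> paragraphs, List<String> keywords) {
        List<Integer> important = new ArrayList<>();
        if (keywords.size() < 2) {
            return important; // the mark is described only for multiple keywords
        }
        for (int i = 0; i < paragraphs.size(); i++) {
            boolean all = true;
            for (String keyword : keywords) {
                if (!paragraphs.get(i).contains(keyword)) {
                    all = false;
                    break;
                }
            }
            if (all) {
                important.add(i + 1);
            }
        }
        return important;
    }
}
```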

The database 207 may store the keywords extracted from the document.

FIG. 3 is a diagram for describing a method of searching for keywords in a document using a program in the apparatus for extracting paragraphs in a document according to an embodiment of the present invention.

Referring to FIG. 3, in step 301, the apparatus may receive the file name of an XML document through the Java Scanner class.

In step 303, the apparatus may read the document with the input file name through the Java FileInputStream class.

In step 305, the apparatus may, through the Java Buffer class, divide the document into paragraphs (or strings) and store them in a buffer.

In step 307, the apparatus may, through the Java Contains method, search the paragraphs for keywords using keyword tags.

Meanwhile, the apparatus may implement steps 301 to 307 through, for example, the pseudocode shown in FIG. 4.
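The pseudocode of FIG. 4 is not reproduced in the text, so the following Java sketch is only one possible reading of steps 301 to 307, not the patent's code. It assumes that keywords are enclosed in `<keyword>…</keyword>` tags (the actual tag name is not given in the source), reads the buffered input line by line with `BufferedReader`, and uses `String.contains` to find lines carrying the tag. The class name `KeywordSearch` is hypothetical.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical sketch of steps 301-307; the tag name below is an assumption.
final class KeywordSearch {
    static final String OPEN = "<keyword>";
    static final String CLOSE = "</keyword>";

    // Steps 305 and 307: buffer the input, then collect the text inside keyword tags.
    static List<String> findKeywords(InputStream in) {
        List<String> keywords = new ArrayList<>();
        try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (!line.contains(OPEN)) {
                    continue; // no keyword tag on this line
                }
                int from = 0;
                int open;
                while ((open = line.indexOf(OPEN, from)) >= 0) {
                    int close = line.indexOf(CLOSE, open + OPEN.length());
                    if (close < 0) {
                        break; // unterminated tag; ignore the rest of the line
                    }
                    keywords.add(line.substring(open + OPEN.length(), close).trim());
                    from = close + CLOSE.length();
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return keywords;
    }

    // Steps 301 and 303: read a file name from the console and open that file.
    public static void main(String[] args) throws IOException {
        try (Scanner scanner = new Scanner(System.in)) {
            System.out.print("XML file name: ");
            String fileName = scanner.nextLine();
            try (InputStream in = new FileInputStream(fileName)) {
                System.out.println(findKeywords(in));
            }
        }
    }
}
```

Taking an `InputStream` rather than a file name in `findKeywords` keeps the tag search independent of where the document comes from; a production implementation would more likely use an XML parser than line-based matching.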

FIG. 5 is a diagram for describing a method of extracting paragraphs in a document using a program in the apparatus for extracting paragraphs in a document according to an embodiment of the present invention.

Referring to FIG. 5, in step 501, the apparatus may receive search keywords through the Java Scanner class. The apparatus may provide a list of the keywords found in the input document and receive, as the search keywords, the keywords chosen by a selection command on that list. For example, the apparatus may provide a list as shown in FIG. 6, and when 'geological information', 'ground information', and 'integrated management system' are selected, it may receive those three keywords as the search keywords. The apparatus may additionally receive the number of search keywords, or may determine that number by counting the selected keywords.

In step 503, the apparatus may, through the Java Contains method, extract from the document the paragraphs containing the search keywords.

In step 505, the apparatus may, through the Java Iterator method, remove duplicated paragraphs from the extracted paragraphs. For example, when the paragraph containing the first search keyword and the paragraph containing the second search keyword are the same, that paragraph is extracted from the document twice, so the apparatus may remove one copy of it from the extracted paragraphs. Likewise, when the same paragraph contains each of three search keywords, the apparatus may remove two of the three extracted copies through deduplication.

In step 507, the apparatus may, through the Java Ascending method, sort and output the extracted paragraphs. The apparatus sorts the paragraphs by their assigned numbers so that their order follows the order in which they are arranged in the document, which prevents the content of the document from being conveyed differently.

Meanwhile, the apparatus may implement steps 501 to 507 through, for example, the pseudocode shown in FIG. 7.
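Since FIG. 7 is not reproduced in the text, the Java sketch below is a hedged illustration of steps 505 and 507 only: duplicates are removed with `Iterator.remove()`, and the survivors are sorted in ascending order with `Collections.sort`. The class name `DuplicateRemoval` and the returned count of removed copies are assumptions added for illustration.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of steps 505 (deduplicate) and 507 (ascending sort).
final class DuplicateRemoval {

    // Removes repeated paragraph numbers in place, sorts the rest in ascending
    // order, and returns how many duplicate copies were removed.
    // The list must be mutable (e.g., an ArrayList).
    static int deduplicateAndSort(List<Integer> extracted) {
        Set<Integer> seen = new HashSet<>();
        int removed = 0;
        Iterator<Integer> it = extracted.iterator();
        while (it.hasNext()) {
            if (!seen.add(it.next())) { // add() returns false for a number already seen
                it.remove();
                removed++;
            }
        }
        Collections.sort(extracted);
        return removed;
    }
}
```

For a paragraph extracted three times, once per keyword, the method removes two copies, which matches the example in step 505.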

FIG. 8 is a diagram for describing a method of obtaining analysis information on search keywords in the apparatus for extracting paragraphs in a document according to an embodiment of the present invention.

Referring to FIG. 8, the apparatus may extract and output the paragraphs of the input document that contain the search keywords and, when outputting the paragraphs, may also output analysis information on the search keywords (e.g., the weight of each search keyword in the document and the frequency of each search keyword in an arbitrary paragraph of the document).

In step 801, the apparatus may calculate and output, as analysis information on the search keywords, a weight for each search keyword in the document according to [Equation 1]. Specifically, when there are a plurality of search keywords, the apparatus may calculate the ratio of the number of paragraphs containing an arbitrary search keyword ($n_i$) to the total number of paragraphs containing each of the search keywords ($N$) as the weight ($w_i$) of that search keyword.

[Equation 1]

$$w_i = \frac{n_i}{N}, \qquad N = \sum_{j} n_j$$

예컨대, 문서 내 문단 추출 장치는 검색 키워드가 제1 검색 키워드(예컨대, '지질정보'), 제2 검색 키워드(예컨대, '지반정보'), 제3 검색 키워드(예컨대, '통합관리시스템')를 포함하는 경우, 상기 제1 검색 키워드를 포함하여 추출된 문단의 제1 개수(제1 빈도수), 상기 제2 검색 키워드를 포함하여 추출된 문단의 제2 개수(제2 빈도수) 및 상기 제3 검색 키워드를 포함하여 추출된 문단의 제3 개수(제3 빈도수)를 합하여, 총 개수를 산출하고, 상기 총 개수 대비 상기 제1 개수의 비율을, 상기 제1 검색 키워드에 대한 가중치로서 산출할 수 있다.For example, a paragraph extracting apparatus in a document may include a search keyword as a first search keyword (eg, 'geological information'), a second search keyword (eg, 'ground information'), a third search keyword (eg, 'integrated management system'). In the case of including, the first number (first frequency) of the paragraph extracted by including the first search keyword, the second number (second frequency) of the paragraph extracted by including the second search keyword and the third The total number may be calculated by adding the third number (third frequency) of the extracted paragraphs including the search keyword, and the ratio of the first number to the total number may be calculated as a weight for the first search keyword. have.

As with the first search keyword, the apparatus may calculate a weight for each of the second and third search keywords and, as shown in FIG. 9, may output the frequency and weight of each search keyword in the document. The apparatus may further output the total number of paragraphs that include at least one of the first, second, and third search keywords (the sum of the first to third frequencies) and the number of paragraphs after duplicates are removed.
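The weight calculation of operation 801 can be sketched in Java as follows. This is a minimal illustration and not the pseudocode of FIG. 11: the class and method names are hypothetical, paragraph membership is assumed to be a plain substring test, and returning a weight of 0 when nothing matches is an assumption the text does not address.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch of the keyword-weight calculation (names are hypothetical). */
class KeywordWeights {

    /** Number of paragraphs that contain the keyword at least once. */
    static int paragraphCount(List<String> paragraphs, String keyword) {
        int count = 0;
        for (String paragraph : paragraphs) {
            if (paragraph.contains(keyword)) {
                count++;
            }
        }
        return count;
    }

    /** Weight of a keyword = its paragraph count / sum of all keywords' paragraph counts. */
    static Map<String, Double> weights(List<String> paragraphs, List<String> keywords) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        int total = 0;
        for (String keyword : keywords) {
            int n = paragraphCount(paragraphs, keyword);
            counts.put(keyword, n);
            total += n;
        }
        Map<String, Double> result = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            // Assumption: report 0.0 when no keyword matches any paragraph.
            result.put(entry.getKey(), total == 0 ? 0.0 : (double) entry.getValue() / total);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> paragraphs = List.of(
                "geological information and ground information",
                "geological information only",
                "integrated management system",
                "unrelated text");
        List<String> keywords = List.of(
                "geological information", "ground information", "integrated management system");
        System.out.println(weights(paragraphs, keywords));
    }
}
```

In this toy input the three keywords match 2, 1, and 1 paragraphs, so the total is 4 and the weights are 0.5, 0.25, and 0.25.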

In operation 803, the apparatus may calculate and output, as analysis information on the search keyword, the frequency of the search keyword in an arbitrary paragraph of the document, which may be expressed as in [Equation 2].

For example, when the search keywords include a first search keyword (e.g., 'geological information'), a second search keyword (e.g., 'ground information'), and a third search keyword (e.g., 'integrated management system'), the apparatus may calculate, as the frequency of each search keyword, a first count of how many times the first search keyword appears in an arbitrary paragraph of the document, a second count of how many times the second search keyword appears in that paragraph, and a third count of how many times the third search keyword appears in that paragraph. If, in the first paragraph, the first search keyword appears twice, the second search keyword appears once, and the third search keyword appears zero times, the apparatus may calculate frequencies of 2:1:0 for the first, second, and third search keywords. The apparatus may likewise calculate the frequency of each search keyword in the second and third paragraphs and may output the results as shown in FIG. 10.
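The per-paragraph frequency of operation 803 can be sketched as a count of keyword occurrences within one paragraph. The sketch below is illustrative only: the names are hypothetical, and counting non-overlapping, case-sensitive substring matches is an assumption, since the text does not say how occurrences are matched.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch of per-paragraph keyword frequencies (names are hypothetical). */
class KeywordFrequency {

    /** Counts non-overlapping occurrences of the keyword in the paragraph. */
    static int occurrences(String paragraph, String keyword) {
        if (keyword.isEmpty()) {
            return 0; // an empty keyword would otherwise match at every position
        }
        int count = 0;
        int from = paragraph.indexOf(keyword);
        while (from >= 0) {
            count++;
            from = paragraph.indexOf(keyword, from + keyword.length());
        }
        return count;
    }

    /** Frequencies of all keywords in one paragraph, in keyword order. */
    static List<Integer> frequencies(String paragraph, List<String> keywords) {
        List<Integer> result = new ArrayList<>();
        for (String keyword : keywords) {
            result.add(occurrences(paragraph, keyword));
        }
        return result;
    }

    public static void main(String[] args) {
        String first = "geological information, ground information, geological information";
        List<String> keywords = List.of(
                "geological information", "ground information", "integrated management system");
        System.out.println(frequencies(first, keywords));
    }
}
```

For the sample paragraph the result is the list 2, 1, 0, matching the 2:1:0 example above.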

Meanwhile, the apparatus may calculate the weight of each search keyword in the document and the frequency of each search keyword in an arbitrary paragraph of the document using the pseudocode shown in FIG. 11.

FIGS. 12 and 13 are diagrams for comparing the compression result of the apparatus for extracting paragraphs in a document according to an embodiment of the present invention with the compression results of existing paragraph extraction apparatuses. FIG. 12 shows the number of paragraphs extracted by the apparatus of the present invention and by existing paragraph extraction apparatuses (e.g., an intersection-based paragraph extraction apparatus and a union-based paragraph extraction apparatus), and FIG. 13 shows the compression ratios of the apparatus of the present invention and of the existing paragraph extraction apparatuses.

Referring to FIG. 12, each paragraph extraction apparatus may compress an input document by extracting and outputting, from the document, the paragraphs that include a plurality of search keywords.

The apparatus of the present invention may extract from the document the paragraphs that include each search keyword and remove paragraphs that were extracted more than once. In contrast, a paragraph extraction apparatus based on intersection may extract from the document only the paragraphs that include all of the plurality of search keywords. A paragraph extraction apparatus based on union may extract from the document all of the paragraphs that include each search keyword.
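The difference between the three approaches can be sketched as three functions that return paragraph numbers. This is an interpretation rather than the compared apparatuses' actual implementations: the names are hypothetical, paragraph numbers are assumed to start at 1, and the union-based variant is assumed to keep one hit per keyword, so a paragraph that matches two keywords appears twice.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

/** Illustrative comparison of the three extraction approaches (names are hypothetical). */
class ExtractionStrategies {

    /** Union-style: one hit per (keyword, paragraph) match, so numbers may repeat. */
    static List<Integer> unionWithDuplicates(List<String> paragraphs, List<String> keywords) {
        List<Integer> numbers = new ArrayList<>();
        for (String keyword : keywords) {
            for (int i = 0; i < paragraphs.size(); i++) {
                if (paragraphs.get(i).contains(keyword)) {
                    numbers.add(i + 1); // assumption: paragraph numbers start at 1
                }
            }
        }
        return numbers;
    }

    /** Intersection-style: only paragraphs that contain every keyword. */
    static List<Integer> intersection(List<String> paragraphs, List<String> keywords) {
        List<Integer> numbers = new ArrayList<>();
        for (int i = 0; i < paragraphs.size(); i++) {
            boolean containsAll = true;
            for (String keyword : keywords) {
                if (!paragraphs.get(i).contains(keyword)) {
                    containsAll = false;
                    break;
                }
            }
            if (containsAll) {
                numbers.add(i + 1);
            }
        }
        return numbers;
    }

    /** Described approach: union-style extraction followed by duplicate removal. */
    static List<Integer> deduplicated(List<String> paragraphs, List<String> keywords) {
        return new ArrayList<>(new LinkedHashSet<>(unionWithDuplicates(paragraphs, keywords)));
    }

    public static void main(String[] args) {
        List<String> paragraphs = List.of("a b", "a", "b", "x");
        List<String> keywords = List.of("a", "b");
        System.out.println(unionWithDuplicates(paragraphs, keywords));
        System.out.println(intersection(paragraphs, keywords));
        System.out.println(deduplicated(paragraphs, keywords));
    }
}
```

With the toy input, the union-style result lists paragraph 1 twice, the intersection keeps only paragraph 1, and the deduplicated result keeps paragraphs 1, 2, and 3 once each.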

For example, in Test 1, which analyzes a document composed of 680 paragraphs, the intersection-based paragraph extraction apparatus may extract 8 paragraphs, and the union-based paragraph extraction apparatus may extract 244 paragraphs from the document. In contrast, the apparatus of the present invention may extract 194 paragraphs from the document by removing 50 duplicate paragraphs. As shown in FIG. 13, the apparatus of the present invention may provide a compression ratio of about 78.0%, which is about 13.9% higher than that of the union-based paragraph extraction apparatus.

In addition, in Test 2, which analyzes a document composed of 101 paragraphs, the apparatus of the present invention may provide a compression ratio of 67.4%, which is about 14.0% higher than that of the union-based paragraph extraction apparatus.

The apparatus of the present invention may also provide a higher compression ratio than the existing paragraph extraction apparatuses in Test 3, which analyzes a document composed of 148 paragraphs, and in Test 4, which analyzes a document composed of 246 paragraphs.

As a result, the apparatus of the present invention differs both from the intersection-based paragraph extraction apparatus, which extracts few paragraphs but has low accuracy and therefore makes accurate paragraph extraction difficult, and from the union-based paragraph extraction apparatus, which has a low compression ratio and therefore leaves the user many paragraphs to read. The apparatus of the present invention derives an optimal set of paragraphs by extracting paragraphs from the document through a union operation and removing duplicate paragraphs through an intersection operation. It can thereby provide the minimum number of paragraphs that still allows the content of the document to be understood accurately.

FIG. 14 is a diagram illustrating another example configuration of an apparatus for extracting paragraphs in a document according to an embodiment of the present invention.

Referring to FIG. 14, the apparatus may implement, for example, a function of loading an XML document input by a user, a function of extracting keywords, a function of extracting the paragraphs that include the search keyword entered from among those keywords, a function of preserving the order of the paragraphs, and a function of detecting and removing duplicate paragraphs.

To this end, the apparatus may be composed of three layers, for example, a User View Layer, a Processing Layer, and an OS Layer. The apparatus may be implemented as a Java program and can therefore analyze documents in various environments without depending on the OS.

FIG. 15 is a flowchart illustrating a method of extracting paragraphs in a document according to an embodiment of the present invention.

Referring to FIG. 15, in operation 1501, the apparatus may receive a document and a search keyword. The apparatus may receive the search keyword directly from the user, or it may provide a list of keywords found in the document and receive, as the search keyword, a keyword selected by a selection command for the list.

In operation 1503, the apparatus may divide the document into a plurality of paragraphs and assign a number to each of the plurality of paragraphs in the order in which the paragraphs are arranged in the document.
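Operation 1503 can be sketched as follows. How paragraph boundaries are detected is not stated here, so the sketch assumes plain text in which paragraphs are separated by one or more blank lines; the class, field, and method names are hypothetical, and numbering is assumed to start at 1.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch of paragraph splitting and numbering (names are hypothetical). */
class ParagraphNumbering {

    /** A paragraph together with the number assigned from its position in the document. */
    static class Paragraph {
        final int number;
        final String text;

        Paragraph(int number, String text) {
            this.number = number;
            this.text = text;
        }
    }

    /** Splits on blank lines (an assumption) and numbers the paragraphs from 1. */
    static List<Paragraph> split(String document) {
        List<Paragraph> result = new ArrayList<>();
        int number = 1;
        // \R matches any line break; two breaks with optional whitespace between form a blank line.
        for (String block : document.split("\\R\\s*\\R")) {
            String text = block.trim();
            if (!text.isEmpty()) {
                result.add(new Paragraph(number++, text));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (Paragraph p : split("First.\n\nSecond.\n\n\nThird.")) {
            System.out.println(p.number + ": " + p.text);
        }
    }
}
```

Keeping the number next to the text is what later allows the extracted paragraphs to be deduplicated and put back in document order.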

In operation 1505, the apparatus may extract, from among the plurality of paragraphs, the paragraphs that include the search keyword. When a plurality of the extracted paragraphs are assigned the same number, the apparatus may remove all but one of those paragraphs from the extracted paragraphs.

In addition, when a similar keyword having the same meaning as the search keyword (i.e., a synonym) is obtained from an external server, the apparatus may provide the similar keyword. When an additional search request for the similar keyword is input, the apparatus may further extract, from among the plurality of paragraphs, the paragraphs that include the similar keyword.

In operation 1507, the apparatus may provide a compressed version of the document by sorting the extracted paragraphs according to their assigned numbers and outputting them. Because the extracted paragraphs are sorted by the numbers assigned to them (e.g., in ascending order of number), the order of the paragraphs in the document is not changed and the original arrangement is preserved.
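The duplicate removal of operation 1505 and the ordered output of operation 1507 can be sketched together. The sketch is illustrative: the names are hypothetical, and using `java.util.Iterator` with a `HashSet` and `List.sort` with a `Comparator` is one possible reading of the Java mechanisms named in the claims, not a confirmed implementation.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

/** Illustrative sketch of duplicate removal and ascending output (names are hypothetical). */
class OrderedOutput {

    /** An extracted paragraph carrying the number assigned in operation 1503. */
    static class Hit {
        final int number;
        final String text;

        Hit(int number, String text) {
            this.number = number;
            this.text = text;
        }
    }

    /** Keeps one hit per paragraph number, then sorts the hits by number in ascending order. */
    static List<Hit> dedupeAndSort(List<Hit> extracted) {
        List<Hit> result = new ArrayList<>(extracted);
        Set<Integer> seen = new HashSet<>();
        for (Iterator<Hit> it = result.iterator(); it.hasNext();) {
            if (!seen.add(it.next().number)) {
                it.remove(); // this paragraph number has already been kept once
            }
        }
        result.sort(Comparator.comparingInt((Hit h) -> h.number));
        return result;
    }

    public static void main(String[] args) {
        List<Hit> extracted = List.of(
                new Hit(3, "third"), new Hit(1, "first"), new Hit(3, "third"), new Hit(2, "second"));
        for (Hit hit : dedupeAndSort(extracted)) {
            System.out.println(hit.number + ": " + hit.text);
        }
    }
}
```

Removing through the iterator avoids modifying the list while it is being traversed by index, and sorting afterwards restores document order regardless of the order in which keywords were searched.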

Meanwhile, the apparatus may calculate and output, as analysis information on the search keyword, at least one of the weight of the search keyword in the document and the frequency of the search keyword in an arbitrary paragraph of the document.

When calculating the weight of a search keyword in the document, the apparatus may calculate a total number (total frequency) by summing the numbers of paragraphs that include each search keyword, and may calculate, as the weight for a specific search keyword, the ratio of the number of paragraphs that include the specific search keyword (the frequency of the specific search keyword) to the total number.

Specifically, when the search keywords include a first search keyword and a second search keyword, the apparatus may calculate the total number by summing a first number of paragraphs extracted as including the first search keyword and a second number of paragraphs extracted as including the second search keyword. The apparatus may then calculate the ratio of the first number to the total number as the weight for the first search keyword, and the ratio of the second number to the total number as the weight for the second search keyword.

When calculating the frequency of the search keyword in an arbitrary paragraph of the document, the apparatus may calculate, as the frequency, the number of times the search keyword appears (the number of uses) in that paragraph.

In addition, when outputting the paragraphs, the apparatus may output the search keyword in an extracted paragraph so that it is distinguished from the other characters in the extracted paragraph. When there are a plurality of search keywords, the processor 205 may output each of the plurality of search keywords in a different form (e.g., a different color, font, or size), so that the plurality of search keywords included in a paragraph can be easily recognized.
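The distinguishing output can be sketched in plain text by wrapping each search keyword in its own marker. The marker format `[[n:keyword]]` is purely an assumption for illustration, standing in for the colors, fonts, or sizes mentioned above; the names are hypothetical.

```java
import java.util.List;

/** Illustrative sketch of per-keyword highlighting (marker format is an assumption). */
class KeywordHighlighter {

    /** Wraps every occurrence of keyword i (1-based) as [[i:keyword]]. */
    static String highlight(String paragraph, List<String> keywords) {
        String result = paragraph;
        for (int i = 0; i < keywords.size(); i++) {
            String keyword = keywords.get(i);
            if (!keyword.isEmpty()) {
                // Limitation: a later keyword that occurs inside an earlier keyword
                // or inside a marker would be wrapped again.
                result = result.replace(keyword, "[[" + (i + 1) + ":" + keyword + "]]");
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(highlight(
                "geological information and ground information",
                List.of("geological information", "ground information")));
    }
}
```

A graphical front end would map the keyword index to a style instead of a text marker; the index-per-keyword idea is the part this sketch is meant to show.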

In addition, when at least a set number of consecutively numbered paragraphs are extracted, the apparatus may output a continuity identification mark on those paragraphs, or, when there are a plurality of search keywords, the apparatus may output an importance identification mark on a paragraph that includes all of the plurality of search keywords, so that relatively important paragraphs can be identified.
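Both identification marks can be sketched as simple checks over the extracted paragraphs. The names below are hypothetical, the input to `consecutive` is assumed to be already deduplicated and sorted in ascending order, and `minRun` stands for the "set number" whose value the text leaves open.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Illustrative sketch of the continuity and importance checks (names are hypothetical). */
class ParagraphMarks {

    /** Numbers that belong to a run of at least minRun consecutive paragraph numbers. */
    static Set<Integer> consecutive(List<Integer> sortedNumbers, int minRun) {
        Set<Integer> marked = new TreeSet<>();
        int start = 0; // index where the current run begins
        for (int i = 1; i <= sortedNumbers.size(); i++) {
            boolean runEnds = i == sortedNumbers.size()
                    || sortedNumbers.get(i) != sortedNumbers.get(i - 1) + 1;
            if (runEnds) {
                if (i - start >= minRun) {
                    marked.addAll(sortedNumbers.subList(start, i));
                }
                start = i;
            }
        }
        return marked;
    }

    /** True when the paragraph includes every search keyword. */
    static boolean containsAll(String paragraph, List<String> keywords) {
        for (String keyword : keywords) {
            if (!paragraph.contains(keyword)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(consecutive(List.of(2, 3, 4, 7, 9, 10), 3));
        System.out.println(containsAll("a and b", List.of("a", "b")));
    }
}
```

With a threshold of 3, only paragraphs 2, 3, and 4 in the sample input receive the continuity mark, because the run 9, 10 is too short.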

The apparatus described above may be implemented as hardware components, software components, and/or a combination of hardware and software components. For example, the apparatus and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable array (FPA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. A processing device may run an operating system (OS) and one or more software applications that run on the operating system. The processing device may also access, store, manipulate, process, and generate data in response to execution of the software. For ease of understanding, a single processing device is sometimes described as being used, but a person of ordinary skill in the art will appreciate that the processing device may include a plurality of processing elements and/or a plurality of types of processing elements. For example, the processing device may include a plurality of processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

Software may include a computer program, code, instructions, or a combination of one or more of these, and may configure a processing device to operate as desired or may instruct the processing device independently or collectively. Software and/or data may be embodied permanently or temporarily in any type of machine, component, physical device, virtual equipment, computer storage medium or device, or transmitted signal wave, in order to be interpreted by the processing device or to provide instructions or data to the processing device. Software may be distributed over networked computer systems and stored or executed in a distributed manner. Software and data may be stored on one or more computer-readable storage media.

The method according to the embodiments may be implemented in the form of program instructions executable by various computer means and stored on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions stored on the medium may be specially designed and configured for the embodiments, or may be known and available to those skilled in computer software. Examples of computer-readable storage media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler but also high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules in order to perform the operations of the embodiments, and vice versa.

Although the embodiments have been described with reference to a limited set of embodiments and drawings, a person of ordinary skill in the art can make various modifications and variations based on the above description. For example, appropriate results may be achieved even if the described techniques are performed in an order different from that described, and/or if components of the described systems, structures, devices, circuits, and the like are coupled or combined in a form different from that described, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents of the claims also fall within the scope of the claims that follow.

200: apparatus for extracting paragraphs in a document
201: interface unit
203: extraction unit
205: processor
207: database

Claims

A method of extracting paragraphs in a document, the method comprising:
receiving, by an interface unit of an apparatus for extracting paragraphs in a document, a document and a search keyword through a Java Scanner class;
dividing, by an extraction unit of the apparatus, the document into a plurality of paragraphs, numbering the plurality of paragraphs in the order in which they are arranged, and extracting, from among the plurality of paragraphs, a paragraph including the search keyword through a Java Contains Method;
obtaining, by the interface unit, a similar keyword having the same meaning as the search keyword from a database;
when an additional search request for the similar keyword is input, further extracting a paragraph including the similar keyword from among the plurality of paragraphs;
sorting and outputting, by a processor of the apparatus, the extracted paragraphs according to the assigned numbers through a Java Ascending Method;
calculating, by the processor, a weight for the search keyword in the document, wherein, when there are a plurality of search keywords, a total number (total frequency) is calculated by summing the numbers of paragraphs including each search keyword, and a ratio of the number of paragraphs including a specific search keyword (frequency of the specific search keyword) to the total number is calculated and output as the weight for the specific search keyword; and
calculating and outputting, by the processor, the number of times (number of uses) the search keyword is included in an arbitrary paragraph in the document as a frequency for the search keyword.

The method of claim 1,
wherein the extracting comprises,
when there are a plurality of search keywords:
extracting, from among the plurality of paragraphs, a paragraph including each search keyword; and
when a plurality of paragraphs assigned the same number exist among the extracted paragraphs, removing, through a Java Iterator Method, the paragraphs other than one of the plurality of paragraphs from the extracted paragraphs.

(Deleted)

The method of claim 1,
wherein the receiving comprises:
providing a list created by listing keywords found in the document; and
receiving, as the search keyword, a keyword selected according to a selection command for the list.

(Deleted)

The method of claim 1,
wherein the outputting comprises
outputting the search keyword in the extracted paragraph so as to be distinguished from other characters in the extracted paragraph and, when there are a plurality of search keywords, outputting each of the plurality of search keywords in a different form.

The method of claim 1,
wherein the outputting comprises:
outputting a continuity identification mark on the corresponding paragraphs when at least a set number of consecutively numbered paragraphs are extracted; or
outputting an importance identification mark on a paragraph including all of the plurality of search keywords when there are a plurality of search keywords.

An apparatus for extracting paragraphs in a document, the apparatus comprising:
an interface unit configured to receive a document and a search keyword through a Java Scanner class;
an extraction unit configured to divide the document into a plurality of paragraphs, number the plurality of paragraphs in the order in which they are arranged, and extract, from among the plurality of paragraphs, a paragraph including the search keyword through a Java Contains Method; and
a processor configured to sort and output the extracted paragraphs according to the assigned numbers through a Java Ascending Method,
wherein the interface unit
provides a similar keyword having the same meaning as the search keyword from a database,
wherein the extraction unit,
when an additional search request for the similar keyword is input, further extracts a paragraph including the similar keyword from among the plurality of paragraphs, and
wherein the processor
calculates a weight for the search keyword in the document such that, when there are a plurality of search keywords, a total number (total frequency) is calculated by summing the numbers of paragraphs including each search keyword and a ratio of the number of paragraphs including a specific search keyword (frequency of the specific search keyword) to the total number is calculated and output as the weight for the specific search keyword, and calculates and outputs the number of times (number of uses) the search keyword is included in an arbitrary paragraph in the document as a frequency for the search keyword.

The apparatus of claim 9,
wherein, when there are a plurality of search keywords,
the extraction unit
extracts, from among the plurality of paragraphs, a paragraph including each search keyword and, when a plurality of paragraphs assigned the same number exist among the extracted paragraphs, removes, through a Java Iterator Method, the paragraphs other than one of the plurality of paragraphs from the extracted paragraphs.

(Deleted)

The apparatus of claim 9,
wherein the interface unit
provides a list created by listing keywords found in the document, and receives, as the search keyword, a keyword selected according to a selection command for the list.

(Deleted)

The apparatus of claim 9,
wherein the processor
outputs a continuity identification mark on the corresponding paragraphs when at least a set number of consecutively numbered paragraphs are extracted, or
outputs an importance identification mark on a paragraph including all of the plurality of search keywords when there are a plurality of search keywords.