KR101370831B1

KR101370831B1 - System and method for extracting condensed issue sentence

Info

Publication number: KR101370831B1
Application number: KR1020120041849A
Authority: KR
Inventors: 손근영; 오경우
Original assignee: 줌인터넷 주식회사; (주)이스트소프트
Priority date: 2012-04-23
Filing date: 2012-04-23
Publication date: 2014-03-17
Also published as: KR20130119031A

Abstract

A system and method for extracting condensed issue sentences are disclosed. The system includes: a clustering module that groups related documents into the same cluster, drawing on a document information DB that records at least the title and URL of each of a plurality of documents; an issue sentence extraction module that extracts deduplicated valid issue sentences from the titles of the individual documents in a cluster and then generates, for each valid issue sentence, an issue sentence sequence whose unit strings are the word segments of the sentence; and an issue abbreviation generation module that extracts the unit strings common to the issue sentence sequences of a cluster to generate a substring. The substring generated by the issue abbreviation generation module is provided as the issue abbreviation for the documents belonging to that cluster.

Description

System and method for abbreviated issue sentence extraction {SYSTEM AND METHOD FOR EXTRACTING CONDENSED ISSUE SENTENCE}

The present invention relates to search services over communication networks such as the Internet, and more specifically to a system and method for extracting condensed issue sentences that can quickly provide users with per-issue summaries of documents, including news articles, books, and literature, made available over the Internet.

As use of the Internet grows, the variety and number of available documents keep increasing. A search service offered over a communication network such as the Internet crawls and indexes newly created documents such as news articles, books, and literature, and uses the indexed information as the basis for answering user searches.

Meanwhile, among this exponentially growing body of documents, a considerable share, news articles for example, are duplicates or only slightly altered accounts of the same issue. A user who wants to read through many documents one by one must spend considerable time and effort because so many of them carry similar content.
[Prior Art Document] Korean Laid-Open Patent Publication No. 2012-0011662, "Document Ranking System and Method"

The present invention addresses the above problem of conventional search services over communication networks such as the Internet. Its object is to provide a system and method for extracting condensed issue sentences that group documents covering various issues by issue, and that extract and present a condensed issue sentence representing each group of documents on the same issue, so that per-issue summaries of diverse documents reach the user quickly.

Another object of the present invention is to provide a system and method for extracting condensed issue sentences that analyze the recency of the grouped documents, or other users' interest in them, and present the condensed issue sentences for the grouped documents to the user in ranked order.

A condensed issue sentence extraction system according to the present invention includes: a clustering module that groups related documents into the same cluster, drawing on a document information DB that records at least the title and URL of each of a plurality of documents; an issue sentence extraction module that extracts deduplicated valid issue sentences from the titles of the individual documents in a cluster and then generates, for each valid issue sentence, an issue sentence sequence whose unit strings are the word segments of the sentence; and an issue abbreviation generation module that extracts the unit strings common to the issue sentence sequences of a cluster to generate a substring. The substring generated by the issue abbreviation generation module is provided as the issue abbreviation for the documents belonging to that cluster.

Here, the issue abbreviation generation module may generate the substring by applying a longest common subsequence algorithm to extract the longest sequence of unit strings common to the issue sentence sequences belonging to the same cluster. Alternatively, it may determine, for each unit string in each issue sentence sequence of the cluster, the probability with which that unit string occurs across all of the cluster's issue sentence sequences, and generate the substring from the unit strings whose probability falls within a preset range.

The document information DB may further include creation time information for each document. In that case the system may further include a ranking module that calculates a recency index from the creation times of the documents belonging to the same cluster, so that the issue abbreviations of a plurality of clusters can be provided ranked by recency index.

The system may also include a search log DB containing at least a user identifier, the URL of each document in the document information DB, and the time of the user's visit to each URL. In that case the system may further include a ranking module that calculates a user interest index from the number of user visits to the documents belonging to the same cluster, so that the issue abbreviations of a plurality of clusters can be provided ranked by interest index.

A condensed issue sentence extraction method according to the present invention may include: a clustering step of grouping related documents into the same cluster, drawing on a document information DB that records at least the title and URL of each of a plurality of documents; an issue sentence extraction step of extracting deduplicated valid issue sentences from the titles of the individual documents in a cluster and then generating, for each valid issue sentence, an issue sentence sequence whose unit strings are the word segments of the sentence; and an issue abbreviation generation step of generating a substring by extracting the unit strings common to the issue sentence sequences of a cluster.

Here, the issue abbreviation generation step may generate the substring by applying a longest common subsequence algorithm to extract the longest sequence of unit strings common to the issue sentence sequences belonging to the same cluster. Alternatively, the step may determine, for each unit string in each issue sentence sequence of the cluster, the probability with which that unit string occurs across all of the cluster's issue sentence sequences, and generate the substring from the unit strings whose probability falls within a preset range.

When the document information DB includes creation time information for each document, the method may further include a ranking step that calculates a recency index from the creation times of the documents belonging to the same cluster and ranks the issue abbreviations of a plurality of clusters by that recency index.

When a search log DB is used that contains at least a user identifier, the URL of each document in the document information DB, and the time of the user's visit to each URL, the method may further include a ranking step that counts user visits to the documents belonging to the same cluster, calculates a user interest index from the count, and ranks the issue abbreviations of a plurality of clusters by that interest index.

According to the present invention, documents covering various issues are grouped by issue, and a condensed issue sentence representing each group of documents on the same issue is extracted and presented, so that diverse documents can be delivered to the user quickly in the form of per-issue summaries.

Also according to the present invention, the recency of the grouped documents, or other users' interest in them, can be analyzed so that the condensed issue sentences for the grouped documents are presented to the user in ranked order.

As a result, users searching for news articles or other documents over a communication network such as the Internet can scan, in a short time, condensed issue sentences organized around the latest issues or the issues other users care about, which improves the usability of the search service.

FIG. 1 is a schematic diagram showing the network connections of a condensed issue sentence extraction system according to an embodiment of the present invention.
FIG. 2 is a system configuration diagram of a condensed issue sentence extraction system according to an embodiment of the present invention.
FIG. 3 shows an example of a clustering DB according to an embodiment of the present invention: records holding, for each cluster ID, the title, URL, and creation time of the documents belonging to that cluster.
FIG. 4 shows an example of an issue sentence sequence DB according to an embodiment of the present invention: records in which, for each cluster ID, each valid issue sentence extracted from the titles of the cluster's documents is split into unit strings and stored as an issue sentence sequence.

Preferred embodiments of the condensed issue sentence extraction system and method according to the present invention are described in detail below with reference to the accompanying drawings.

First, FIG. 1 is a schematic diagram showing the network configuration of a condensed issue sentence extraction system according to an embodiment of the present invention. Using user terminals 110a and 110b, users can connect over wired or wireless communication networks 120a and 120b to a search server 100a on which the condensed issue sentence extraction system 100 is installed. That is, users connect to the search server 100a through the user terminals 110a and 110b to obtain information about the news articles, books, or literature that are currently at issue. The search server 100a provides the user terminals 110a and 110b with the condensed issue sentences extracted by the extraction system 100, and may, if desired, provide them only when a user terminal 110a or 110b specifically requests them. The condensed issue sentence extraction system 100 may be integrated into the search server 100a that provides the Internet search service, or it may be built as a physically separate system that communicates with the search server 100a over a communication network.

FIG. 2 is a system configuration diagram of the condensed issue sentence extraction system 100 according to the present invention. As shown in FIG. 2, the system 100 may include a clustering module 12, an issue sentence extraction module 14, and an issue abbreviation generation module 16, and may additionally include a ranking module 18. The clustering module 12, issue sentence extraction module 14, issue abbreviation generation module 16, and ranking module 18 are controlled by a module controller 10. In particular, the module controller 10 may control each of the modules 12, 14, 16, and 18 as instructed by the search server 100a. Although not shown in FIG. 2, when the extraction system 100 is built at a location physically separate from the search server 100a, it may further include a communication module for communicating with the search server 100a.

The condensed issue sentence extraction system 100 according to the present invention may also include a document information DB 21, a search log DB 22, a clustering DB 23, an issue sentence sequence DB 24, an issue abbreviation DB 25, and a library 26, all controlled by a database management module 20.

The document information DB 21 is a database of document information for news articles, books, literature, and other documents provided over a communication network. It contains at least the title and URL of each document (URL: Uniform Resource Locator, a "resource locator" that records the type and location of a specific information resource on a computer network). The document information DB 21 further contains creation time information for each document, where "creation time" may mean not only the time the author actually wrote the document but also the time it was released or published. The service operator can use a search engine to collect the documents available on the Internet and periodically load the document information for each into the database.

The clustering module 12 reads the document information for individual documents from the document information DB 21 prepared in this way and, based on that information, groups related documents into the same cluster. Clustering here means classifying data into groups based on a notion such as similarity; a K-means clustering algorithm, for example, may be used. A person of ordinary skill in the art will understand that various clustering techniques can be applied, so a detailed description of the clustering method is omitted here.

Each cluster formed by the clustering module 12 may be stored in the clustering DB 23. When the clustering module 12 groups related documents into a cluster, the documents are assigned an identifier for that cluster (hereinafter, "cluster ID"), and the clustering DB 23 records the title, URL, and creation time of each document under its cluster ID. FIG. 3 shows an example of the document information for nine documents that all carry the cluster ID "32584206".

Next, the issue sentence extraction module 14 extracts deduplicated valid issue sentences from the titles of the individual documents belonging to the same cluster and then generates, for each valid issue sentence, an issue sentence sequence whose unit strings are the word segments of the sentence. Specifically, the module reads the clustering DB 23 and extracts the titles of the documents that share a cluster ID. It compares the extracted titles and removes duplicates. The remaining titles are adopted as valid issue sentences, and any special characters or punctuation marks in them are removed to leave plain sentences. The valid issue sentences are then split into word segments (for example, at each space). Each word segment becomes a unit string, so one valid issue sentence is divided into several unit strings. In this way each valid issue sentence is converted into an issue sentence sequence, an ordered list of unit strings, and each generated sequence is stored individually in the issue sentence sequence DB 24.
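The extraction steps described above can be sketched in Python as follows. This is a minimal illustration rather than the patented implementation: the function name `build_issue_sequences`, the exact-match deduplication rule, and the regular expression used to strip punctuation are all assumptions made for the example.

```python
import re


def build_issue_sequences(titles):
    """Turn the titles of one cluster into issue sentence sequences.

    Hypothetical helper: the patent does not name a function or specify
    how duplicates or special characters are detected.
    """
    # (1) Keep one copy of each title, preserving first-seen order
    #     (assumption: duplicates are exact string matches).
    valid_sentences = list(dict.fromkeys(titles))

    sequences = []
    for sentence in valid_sentences:
        # (2) Replace anything that is neither a word character nor
        #     whitespace with a space, leaving a plain sentence.
        plain = re.sub(r"[^\w\s]", " ", sentence)
        # (3) Split on whitespace; each word segment is a unit string.
        units = plain.split()
        # (4) The ordered list of unit strings is the issue sentence sequence.
        if units:
            sequences.append(units)
    return sequences
```

Because `\w` matches Unicode word characters for `str` patterns in Python 3, the same sketch would also handle Korean titles, where splitting on spaces yields the word segments the description refers to.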

The issue sentence sequence DB 24 holds the issue sentence sequences generated for documents sharing a cluster ID, recorded per cluster ID. For example, the titles of the documents grouped into one cluster in FIG. 3 pass through the issue sentence extraction module 14 in four stages, (1) selection of valid issue sentences, (2) conversion to plain sentences, (3) division into unit strings, and (4) generation of issue sentence sequences, and are then recorded as shown in FIG. 4.

Next, the issue abbreviation generation module 16 extracts the unit strings common to the issue sentence sequences belonging to the same cluster and generates a substring from them. That is, the module reads from the issue sentence sequence DB 24 the issue sentence sequences stored for each cluster ID, analyzes them to extract the common unit strings, and stores the resulting substring in the issue abbreviation DB 25 under the cluster ID. The module may extract the common unit strings in either of the following ways.

The first option is the Longest Common Subsequence algorithm (hereinafter, "LCS algorithm"). The LCS algorithm finds the longest sequence that is a subsequence of every one of several given sequences, that is, the longest run of elements that appears in the same order in each sequence. A person of ordinary skill in the art will readily understand it, so a detailed description is omitted here. Using this algorithm, the issue abbreviation generation module 16 can generate the substring by extracting the longest sequence of unit strings common to the issue sentence sequences of a cluster. For example, applying the LCS algorithm to the issue sentence sequences illustrated in FIG. 4 yields a substring made up of "이파니" (a personal name), "20kg", and "감량" (weight loss).
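As an illustration of the LCS approach, the following Python sketch computes a common subsequence of unit strings across several issue sentence sequences. The function names and the sample data are hypothetical. Note one simplification: the sketch folds the standard two-sequence dynamic-programming LCS pairwise over the cluster, which always returns a subsequence common to all inputs but is not guaranteed to be the longest one when there are more than two sequences. The patent does not say how the multi-sequence case is handled.

```python
from functools import reduce


def lcs_pair(a, b):
    """Longest common subsequence of two lists of unit strings."""
    m, n = len(a), len(b)
    # dp[i][j] = LCS length of a[:i] and b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if a[i] == b[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])

    # Walk back through the table to recover one LCS.
    result = []
    i, j = m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            result.append(a[i - 1])
            i -= 1
            j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return result[::-1]


def common_substring(sequences):
    """Fold lcs_pair over all sequences of a cluster (pairwise heuristic)."""
    if not sequences:
        return []
    return reduce(lcs_pair, sequences)


# Hypothetical issue sentence sequences for one cluster.
sequences = [
    ["A", "gives", "birth", "20kg", "loss"],
    ["A", "20kg", "loss", "secret"],
    ["actress", "A", "20kg", "weight", "loss"],
]
print(" ".join(common_substring(sequences)))
```

The unit strings come back in the order they appear in the titles, which is what lets the result be read as a short sentence.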

The second option is probability-based substring generation. Here the issue abbreviation generation module 16 determines, for each unit string in each issue sentence sequence of a cluster, the probability with which it occurs across all of the cluster's issue sentence sequences, and generates a substring from the unit strings whose probability falls within a preset range. For example, the module reads from the issue sentence sequence DB 24 each issue sentence sequence carrying the same cluster ID. It analyzes the unit strings of each sequence in turn, counts how many times each unit string is contained in the cluster's issue sentence sequences as a whole, and stores the count together with the unit string in the library 26. From each unit string and its occurrence count, the module then calculates the occurrence probability P of that unit string using the following equation.

[Equation 1]

P = (number of occurrences of the unit string) / (total number of issue sentence sequences) × 100

The service operator can set a reference probability in advance and select the unit strings whose occurrence probability falls within the reference range. With the LCS algorithm described as the first option, only unit strings with an occurrence probability of 100% are selected to form the substring, whereas with probability-based generation, unit strings with an occurrence probability below 100% can also be included. For example, setting the reference probability to 70% or higher for the issue sentence sequences illustrated in FIG. 4 yields a substring made up of "이파니" (a personal name), "출산" (childbirth), "20kg", and "감량" (weight loss).
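The probability-based option can be sketched in Python as below. This is an illustrative reading of Equation 1, not the patent's implementation. Two points are assumptions: each unit string is counted at most once per issue sentence sequence (so that P cannot exceed 100), and the selected unit strings are emitted in first-seen order, since the description does not fix an output order. The function name and sample data are hypothetical.

```python
from collections import Counter


def probability_substring(sequences, threshold=70.0):
    """Select unit strings whose occurrence probability P meets the threshold.

    P = (sequences containing the unit string) / (total sequences) * 100
    """
    total = len(sequences)
    if total == 0:
        return []

    # Count each unit string once per sequence (assumption, see lead-in).
    counts = Counter(unit for seq in sequences for unit in set(seq))
    probability = {unit: n / total * 100 for unit, n in counts.items()}

    # Emit in first-seen order across the cluster (assumption).
    first_seen = dict.fromkeys(unit for seq in sequences for unit in seq)
    return [unit for unit in first_seen if probability[unit] >= threshold]


# Hypothetical issue sentence sequences for one cluster.
sequences = [
    ["A", "birth", "20kg", "loss"],
    ["A", "20kg", "loss", "secret"],
    ["A", "birth", "20kg", "loss"],
    ["A", "birth", "after", "20kg", "loss"],
]
print(probability_substring(sequences, threshold=70.0))
print(probability_substring(sequences, threshold=100.0))
```

In this sample, "birth" occurs in three of four sequences (P = 75), so it survives a 70% threshold but is dropped at 100%, mirroring the contrast the description draws between the two options.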

The condensed issue sentence extraction system according to the present invention provides the substring generated by the issue abbreviation generation module 16, using either of the methods above, as the issue abbreviation for the documents belonging to the cluster. In particular, when probability-based generation is used, the library 26 records each unit string and its occurrence count per cluster, so the issue abbreviation can also be built from those counts. For example, arranging the unit strings of the issue abbreviation according to their occurrence counts can make the abbreviation easier for the user to read.

The system 100 may further include a ranking module 18. When the document information DB 21 contains creation time information for each document, the ranking module calculates a recency index from the creation times of the documents belonging to the same cluster and can present the issue abbreviations extracted for the various clusters ranked by that index. The recency index may be calculated from the URL and creation time of the individual documents in each cluster, for example by normalizing the time interval between a reference time and the creation time into an index. The issue abbreviation for each cluster can then be presented together with its rank according to the calculated recency index.
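The description leaves the normalization of the recency index open. The Python sketch below shows one possible choice purely for illustration: each document's age relative to the reference time is mapped into (0, 1] by exponential decay, and the cluster's index is the mean over its documents. The decay form, the `half_life_hours` parameter, the function name, and the sample data are all assumptions that do not appear in the patent.

```python
from datetime import datetime


def recency_index(created_times, reference_time, half_life_hours=24.0):
    """Mean exponentially decayed age of a cluster's documents, in [0, 1].

    Hypothetical normalization: a document created at the reference time
    scores 1.0 and the score halves every `half_life_hours`.
    """
    if not created_times:
        return 0.0
    scores = []
    for created in created_times:
        age_hours = (reference_time - created).total_seconds() / 3600.0
        age_hours = max(age_hours, 0.0)  # clamp documents dated in the future
        scores.append(0.5 ** (age_hours / half_life_hours))
    return sum(scores) / len(scores)


# Hypothetical clusters: cluster ID -> creation times of its documents.
now = datetime(2012, 4, 23, 12, 0)
clusters = {
    "c1": [datetime(2012, 4, 23, 12, 0), datetime(2012, 4, 22, 12, 0)],
    "c2": [datetime(2012, 4, 21, 12, 0)],
}
ranked = sorted(clusters, key=lambda cid: recency_index(clusters[cid], now), reverse=True)
print(ranked)
```

Sorting cluster IDs by the index in descending order gives the ranked order in which the corresponding issue abbreviations would be presented.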

In addition, the ranking module (18) may rank the issue abbreviations for the plurality of clusters based on user interest. For example, the system (100) may further include a search log DB (22). When a user views an individual document material through a search engine, the search log DB (22) may record a user identifier, consisting of digits or characters, that identifies the user terminal, the URL of the document material, and the time at which the user terminal viewed it (hereinafter, "visit time"). The ranking module (18) may use the search log DB (22) to count user visits to the plurality of document materials belonging to the same cluster and calculate an interest index from that count. For example, it may count how many times users viewed each URL corresponding to a document material in the cluster, sum the visit counts over those URLs, and normalize the sum to obtain the interest index for the cluster.
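The counting and normalization described here can be sketched as follows. The log record layout `(user_id, url, visit_time)` mirrors the three fields named in the description, but the normalization (dividing each cluster's visit total by the total over all clusters), the function name `interest_indices`, and the example URLs are assumptions.

```python
from collections import Counter


def interest_indices(cluster_urls, search_log):
    """Compute a normalized interest index per cluster.

    cluster_urls: dict mapping cluster id -> list of document URLs.
    search_log:   iterable of (user_id, url, visit_time) records.
    """
    # Count how many times each URL was viewed.
    visits_per_url = Counter(url for _user_id, url, _visit_time in search_log)
    # Sum the visit counts over the URLs that belong to each cluster.
    raw = {cid: sum(visits_per_url[url] for url in urls)
           for cid, urls in cluster_urls.items()}
    # Assumed normalization: share of all counted visits.
    total = sum(raw.values())
    return {cid: (count / total if total else 0.0) for cid, count in raw.items()}


# Hypothetical data for illustration only.
cluster_urls = {
    "cluster-a": ["http://example.com/1", "http://example.com/2"],
    "cluster-b": ["http://example.com/3"],
}
search_log = [
    ("u1", "http://example.com/1", "2012-04-23T09:00"),
    ("u2", "http://example.com/1", "2012-04-23T09:05"),
    ("u1", "http://example.com/2", "2012-04-23T09:10"),
    ("u3", "http://example.com/3", "2012-04-23T09:15"),
]
indices = interest_indices(cluster_urls, search_log)
```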

The abbreviated issue sentence extraction method according to the present invention may include: a clustering step of grouping mutually related document materials into the same cluster, from a document information DB for a plurality of document materials in which at least the title and URL of each document are recorded; an issue sentence extraction step of extracting valid issue sentences, with duplicates removed, from the title information of the individual document materials belonging to the same cluster, and then generating, for each valid issue sentence, an issue sentence sequence whose unit strings are the word phrases (space-delimited units) of the sentence; and an issue abbreviation generation step of generating a substring by extracting common unit strings from the plurality of issue sentence sequences belonging to the same cluster. The substring generated in this way may be provided as the issue abbreviation for the plurality of document materials belonging to that cluster.
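The issue sentence extraction step can be sketched as below. The duplicate test used here (exact match after collapsing whitespace) and the function name `extract_issue_sequences` are assumptions; the description does not specify how duplicate titles are detected.

```python
def extract_issue_sequences(titles):
    """Turn the titles of one cluster into issue sentence sequences.

    Duplicate titles are dropped (assumed: exact match after whitespace
    normalization), and each remaining title is split on spaces so that
    every word phrase becomes one unit string.
    """
    seen = set()
    sequences = []
    for title in titles:
        normalized = " ".join(title.split())
        if not normalized or normalized in seen:
            continue
        seen.add(normalized)
        sequences.append(normalized.split(" "))
    return sequences


# Hypothetical titles from one cluster.
titles = [
    "typhoon hits Jeju today",
    "typhoon  hits Jeju today",   # duplicate apart from spacing
    "typhoon Bolaven hits Jeju",
]
sequences = extract_issue_sequences(titles)
```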

In particular, in the issue abbreviation generation step, the substring may be generated by extracting the longest common unit strings from the plurality of issue sentence sequences belonging to the same cluster according to a longest common subsequence algorithm. Alternatively, in the issue abbreviation generation step, the substring may be generated from those unit strings that fall within a preset probability range, based on the probability with which each unit string of each issue sentence sequence belonging to the same cluster appears across the plurality of issue sentence sequences as a whole.
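Both generation approaches can be sketched at the word-phrase level. Several choices below are assumptions rather than details from the description: folding a pairwise longest common subsequence across the cluster (a simplification that does not always yield the true multi-sequence optimum), defining the probability as the fraction of sequences that contain a unit string, the default range of 0.5 to 1.0, and the function names.

```python
from collections import Counter
from functools import reduce


def lcs(a, b):
    """Longest common subsequence of two unit-string lists (dynamic programming)."""
    m, n = len(a), len(b)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if a[i] == b[j]:
                table[i + 1][j + 1] = table[i][j] + 1
            else:
                table[i + 1][j + 1] = max(table[i][j + 1], table[i + 1][j])
    # Walk back through the table to recover one longest subsequence.
    result, i, j = [], m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            result.append(a[i - 1])
            i, j = i - 1, j - 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return result[::-1]


def cluster_lcs(sequences):
    """Fold the pairwise LCS over all sequences of a cluster (assumed simplification)."""
    return reduce(lcs, sequences) if sequences else []


def probability_substring(sequences, low=0.5, high=1.0):
    """Keep unit strings whose appearance probability lies in [low, high].

    Probability is taken as the fraction of sequences containing the unit
    string; kept unit strings are emitted in order of first appearance.
    """
    if not sequences:
        return []
    containing = Counter()
    for seq in sequences:
        containing.update(set(seq))
    total = len(sequences)
    result = []
    for seq in sequences:
        for unit in seq:
            if unit not in result and low <= containing[unit] / total <= high:
                result.append(unit)
    return result


# Hypothetical issue sentence sequences from one cluster.
seqs = [
    ["typhoon", "hits", "Jeju", "today"],
    ["typhoon", "Bolaven", "hits", "Jeju"],
    ["typhoon", "Bolaven", "nears", "Jeju"],
]
by_lcs = cluster_lcs(seqs)
by_probability = probability_substring(seqs, low=0.6)
```

With these sample titles the LCS approach keeps only what every title shares, while the probability approach with a lower bound of 0.6 also admits unit strings found in two of the three titles.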

The method may further include a ranking step that uses a document information DB containing creation time information for each of the plurality of document materials to calculate a recency index based on the creation time information of each of the document materials belonging to the same cluster, and that ranks the issue abbreviations for the plurality of clusters based on the recency index.

In the ranking step, user visits to the plurality of document materials belonging to the same cluster may also be counted from a search log DB that includes at least a user identifier, URL information for each document material in the document information DB, and the user's visit time for each URL. An interest index is calculated from these counts, and the issue abbreviations for the plurality of clusters are ranked based on the interest index.

The abbreviated issue sentence extraction method described above may be performed by a general-purpose computer device. For example, the computer device may include one or more processors or central processing units (CPUs) connected to a main memory that includes random access memory (RAM) and read-only memory (ROM). As is well known in the art, ROM transfers data and instructions to the CPU unidirectionally, whereas RAM is typically used to transfer data and instructions bidirectionally. The RAM and ROM may take any suitable form of computer-readable medium. A mass storage device is bidirectionally connected to the processor to provide additional data storage capacity and may be any computer-readable recording medium. The mass storage device is used to store programs, data, and the like, and is typically an auxiliary storage device slower than the main memory, such as a hard disk, CD, or DVD. The processor may also be connected to a wired or wireless communication network through a network interface, and the procedures of the method described above may be performed over this network connection. In addition, the abbreviated issue sentence extraction method according to the present invention may be configured as one or more software programs and provided on a computer-readable recording medium from which they can be executed.

Preferred embodiments of the present invention have been described above, but a person of ordinary skill in the art to which the invention pertains will be able to implement it in modified forms without departing from its essential characteristics. The embodiments described here should therefore be considered in an illustrative rather than a restrictive sense. The scope of the invention is set out in the claims rather than in the foregoing description, and all differences within a scope equivalent to the claims should be construed as being included in the present invention.

Claims

An abbreviated issue sentence extraction system comprising: a clustering module that groups mutually related document materials into the same cluster, from a document information DB for a plurality of document materials in which at least a title and a URL of each document are recorded;
an issue sentence extraction module that extracts valid issue sentences, with duplicates removed, from the title information of the individual document materials belonging to the same cluster, and then generates, for each valid issue sentence, an issue sentence sequence whose unit strings are the word phrases of the sentence; and
an issue abbreviation generation module that generates a substring by extracting common unit strings from the plurality of issue sentence sequences belonging to the same cluster,
wherein the substring generated by the issue abbreviation generation module is provided as an issue abbreviation for the plurality of document materials belonging to the same cluster.

The system of claim 1,
wherein the issue abbreviation generation module generates the substring by extracting the longest common unit strings from the plurality of issue sentence sequences belonging to the same cluster according to a longest common subsequence algorithm.

The system of claim 1,
wherein the issue abbreviation generation module generates the substring from the unit strings that fall within a preset probability range, based on the probability with which each unit string of each issue sentence sequence belonging to the same cluster appears across the plurality of issue sentence sequences as a whole.

(Deleted)

A retrieval system comprising the abbreviated issue sentence extraction system according to any one of claims 1 to 3.

An abbreviated issue sentence extraction method comprising: a clustering step of grouping mutually related document materials into the same cluster, from a document information DB for a plurality of document materials in which at least a title and a URL of each document are recorded;
an issue sentence extraction step of extracting valid issue sentences, with duplicates removed, from the title information of the individual document materials belonging to the same cluster, and then generating, for each valid issue sentence, an issue sentence sequence whose unit strings are the word phrases of the sentence; and
an issue abbreviation generation step of generating a substring by extracting common unit strings from the plurality of issue sentence sequences belonging to the same cluster.

The method of claim 7,
wherein the issue abbreviation generation step generates the substring by extracting the longest common unit strings from the plurality of issue sentence sequences belonging to the same cluster according to a longest common subsequence algorithm.

The method of claim 7,
wherein the issue abbreviation generation step generates the substring from the unit strings that fall within a preset probability range, based on the probability with which each unit string of each issue sentence sequence belonging to the same cluster appears across the plurality of issue sentence sequences as a whole.

(Deleted)

A computer-readable recording medium containing a program for executing the abbreviated issue sentence extracting method according to any one of claims 7 to 9.