KR20090036918A

KR20090036918A - Method and system for serving document exploration service based on title clustering

Info

Publication number: KR20090036918A
Application number: KR1020070102219A
Authority: KR
Inventors: 곽태영; 이은지; 김병학
Original assignee: 엔에이치엔(주)
Priority date: 2007-10-10
Filing date: 2007-10-10
Publication date: 2009-04-15
Also published as: KR100902673B1

Abstract

A method and a system for providing a document exploration service based on title clustering are provided, which allow documents to be explored efficiently by mapping a title cluster, formed based on document titles, to a directory corresponding to the subject. A document classification unit (211) classifies documents according to subject. A title extraction unit (212) extracts the title of a document. A document comprises one or more fields, and the title extraction unit extracts the title in consideration of the attributes of the fields constituting the document. A cluster forming unit (213) forms a cluster based on the extracted titles. A directory mapping unit (214) maps the cluster to a directory belonging to the subject. A document providing unit (215) provides documents to a user terminal (230) and provides a means for visualizing the structure of the directory and cluster to which a document belongs.

Description

Method and system for serving document exploration service based on title clustering

The present invention relates to a method and system for providing a document exploration service based on title clustering.

Countless documents covering a wide range of interests exist on the web. Users can obtain information by submitting a query describing what they want to a search engine. However, entering a query for a topic of interest every time is cumbersome.

Alternatively, to reach desired information without going through steps such as entering a search term, a user may visit vertical sites and blogs that specialize in a particular field and obtain the latest information in that field.

The quality of the information found on such vertical sites and blogs is improving day by day, and these sites are developing into media from which the fastest and most in-depth information in a given field can be obtained.

However, visiting each site in order to read information scattered across many vertical sites and blogs can also be inconvenient for the user. To compensate, vertical sites and blogs provide RSS (Really Simple Syndication) feeds, and programs such as RSS readers may be used to subscribe to them.

However, RSS feeds provide information independently of one another, and documents with identical or nearly identical content are still treated as separate items. Additional effort is therefore required to improve the efficiency with which a user explores and reads information.

The present invention provides a method and system for providing a more efficient document exploration service by classifying documents by subject and mapping clusters, formed based on document titles, to directories belonging to the corresponding subject.

In addition, the present invention provides a method and system for providing a document exploration service that improves the efficiency of information exploration by classifying documents collected on the web and visualizing the clusters formed based on the titles of the classified documents together with the directories to which the clusters are mapped.

According to one aspect of the present invention, there is provided a method for providing a document exploration service, the method including: classifying a document according to subject; extracting a title of the document; forming a cluster based on the extracted title; and mapping the cluster to a predetermined directory belonging to the subject.

The method may further include providing the document to a user terminal, and the document providing step may provide a means for visualizing the structure of the directory and cluster to which the document belongs.

In the method, the document may include one or more fields, and the step of extracting the title may extract the title in consideration of the attributes of the fields constituting the document.

Meanwhile, the step of forming the cluster may divide the extracted title into syllable units and select a portion of the title that is shared with other documents as a central concept candidate of the cluster. The central concept candidate may be selected using n-gram analysis of the document titles.

The step of forming the cluster may form a cluster when the number of documents sharing a central concept candidate is equal to or greater than a predetermined number, and when there are multiple central concept candidates, the central concept may be selected in consideration of the length of each title candidate phrase.

According to another aspect of the present invention, there is provided a system for providing a document exploration service, the system including: a document classification unit that classifies a document according to subject; a title extraction unit that extracts a title of the document; a cluster forming unit that forms a cluster based on the extracted title; and a directory mapping unit that maps the cluster to a predetermined directory belonging to the subject.

The system may further include a document providing unit that provides the document to a user terminal, and the document providing unit may provide a means for visualizing the structure of the directory and cluster to which the document belongs.

In the system, the document may include one or more fields, and the title extraction unit may extract the title in consideration of the attributes of the fields constituting the document.

Meanwhile, the cluster forming unit may divide the extracted title into syllable units and select a portion of the title that is shared with other documents as a central concept candidate of the cluster. The central concept candidate may be selected using n-gram analysis of the document titles.

The cluster forming unit may form a cluster when the number of documents sharing a central concept candidate is equal to or greater than a predetermined number, and when there are multiple central concept candidates, the central concept may be selected in consideration of the length of each title candidate phrase.

Meanwhile, the method for providing a document exploration service may be performed by a computer, and a program for executing it on a computer may be recorded on a computer-readable recording medium.

Other aspects, features, and advantages beyond those described above will become apparent from the following drawings, claims, and detailed description of the invention.

According to a preferred embodiment of the present invention, a method and system for providing a more efficient document exploration service can be implemented by mapping a cluster formed based on document titles to a directory belonging to the corresponding subject.

In addition, according to a preferred embodiment of the present invention, a method and system for providing a document exploration service that improves the efficiency of information exploration can be implemented by classifying documents collected on the web and visualizing the clusters formed based on the titles of the classified documents together with the directories to which the clusters are mapped.

Hereinafter, embodiments of a method and system for providing a document exploration service based on title clustering according to the present invention will be described in detail with reference to the accompanying drawings. This is not intended to limit the present invention to specific embodiments, and it should be understood to cover all modifications, equivalents, and substitutes that fall within the spirit and technical scope of the present invention. In describing the present invention, a detailed description of related known technology is omitted where it is judged that it could obscure the gist of the invention. In addition, in the description with reference to the accompanying drawings, identical or corresponding components are given the same reference numerals, and redundant descriptions of them are omitted.

FIG. 1 is a flowchart of a method for providing a document exploration service according to an embodiment of the present invention, and FIG. 2 is a block diagram of a system for providing a document exploration service according to an embodiment of the present invention.

Referring to FIGS. 1 and 2, there are shown an exploration service providing server (210), a document classification unit (211), a title extraction unit (212), a cluster forming unit (213), a directory mapping unit (214), a document providing unit (215), an advertisement providing unit (216), an original document database (221), an exploration service database (222), an advertisement database (223), and a user terminal (230).

The step of classifying documents according to subject (S110) is the step in which the document classification unit (211) classifies the documents in the original document database (221).

The document classification unit (211) obtains information about the documents from the original document database (221) and obtains information about the classifications from the exploration service database (222). Based on the obtained information, it determines which classification each document matches and stores information about the matching relationship between documents and classifications in the exploration service database (222).

In this step, the document classification unit (211) may classify documents by subject by using the information contained in a document to determine, for example, whether the document includes specific keywords and whether it includes specific content.

For example, whether a document is a suitable match for the classification 'wine' may be determined by considering whether the document contains the classification name 'wine' itself or synonyms of the classification name, and whether it contains keywords that can be judged to be closely related to wine, such as 'sommelier' and 'decanting'.

Meanwhile, in determining whether a document matches a classification, the presence of classification-related keywords may be quantified and used as the criterion. For example, a predetermined score may be assigned for each keyword related to a specific subject that the document contains, and the document may be determined to match the classification when the sum of these scores exceeds a certain threshold.

In the classification step, a document is not necessarily determined to match only one classification. For example, where the classifications 'wine' and 'Japanese manga' both exist, a document containing reviews and the like of 'The Drops of God' (신의 물방울), a Japanese manga about wine, may be matched to both the 'wine' classification and the 'Japanese manga' classification at the same time.
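The scoring rule described above can be sketched as follows. This is only an illustrative sketch: the function name `classify`, the keyword weights, and the threshold are assumptions and are not given in the source, which only states that keyword scores are summed and compared against a criterion and that one document may match several classifications.

```python
def classify(text, categories, threshold):
    """Return every category whose summed keyword score exceeds the threshold.

    categories maps a category name to a {keyword: score} dict.
    The weights and the threshold are hypothetical values.
    """
    lowered = text.lower()
    matched = []
    for name, keyword_scores in categories.items():
        # Add the score of each category keyword that appears in the text.
        total = sum(score for kw, score in keyword_scores.items() if kw in lowered)
        if total > threshold:
            matched.append(name)
    return matched


# Hypothetical keyword tables for two classifications.
CATEGORIES = {
    "wine": {"wine": 3, "sommelier": 2, "decanting": 2},
    "japanese manga": {"manga": 3, "japanese": 2, "comic": 1},
}

review = "Review of a Japanese manga about wine and a sommelier"
print(classify(review, CATEGORIES, threshold=4))
```

Because each category is scored independently, a document such as the manga review can be returned for more than one classification.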

The original document database (221) stores information about original documents that can be classified and reorganized by the exploration service providing server (210). These original documents may ultimately be provided to the user terminal (230). The original documents may be collected from the web by a web robot or the like.

Meanwhile, the original documents stored in the original document database (221) of the present invention may include documents having predetermined attributes. For example, structured documents used on vertical sites and blogs may be used as original documents.

Such structured documents may store the content of a document divided into one or more regions or sections. These regions or sections may be called fields. For example, a document such as a blog post may include a title field, a body field, a creation time field, and a keyword field for the post.

For such documents, the author creates a document by entering content corresponding to each field name, so the field names and their corresponding content can be put to good use in the title extraction step described later, among others.

In addition, on such vertical sites and blogs, the relationships between documents may also be structured. These structured relationships may appear in the form of directories on the site.

For example, a vertical site devoted to movies may include directories such as 'movie reviews', 'movie rankings', and 'latest releases' for classifying the documents on the site, and a blog may likewise hold information about the directories that classify its posts.

The directory names on such vertical sites and blogs may be used as keywords associated with the subjects the site covers. These keywords may be used in the subject classification step described above to improve the accuracy of the classification.

In the present application, the term "document" may be understood as a general term for electronically recorded documents. A document may be written in a markup language such as HTML and may have an extension such as *.htm, but it is not to be construed as limited to files of a particular format or extension.

The exploration service database (222) stores information about the matching relationships between documents and classifications determined by the document classification unit (211). Whether each document matches each classification may be stored, and the presence of related keywords for each classification may be stored in quantified form.

Meanwhile, the method and form in which information is stored in the original document database (221) and the exploration service database (222) described above may be varied in many ways within the scope of the present invention.

The step of extracting the titles of the documents (S120) is the step in which the title extraction unit (212) extracts the titles of the documents stored in the original document database (221). The title of a document means a word, phrase, or sentence that encapsulates the content and subject of the document.

The title extraction unit (212) extracts the title of each document using the document information stored in the original document database (221) and stores the extracted titles in the exploration service database (222).

In this step, the title extraction unit (212) may extract the title of a document using the information contained in the document. The structure of the document, the frequency with which words appear in the document, and the attributes the document has when browsed on the user terminal (230) may be used as criteria for determining the title.

That is, the document information used in extracting the title may be understood to include not only the content text directly contained in the document but also information about the form in which the document is displayed on the user terminal (230).

For example, a website such as a blog may contain structured documents. Such documents may store information in individually named fields. The text contained in a field with a field name such as '제목' (title) or 'title' may be selected as the title.

As another example, when a document is browsed through a web browser or the like on the user terminal (230), text that is emphasized by being displayed larger than the other content in the document, or with otherwise distinguishing attributes, may also be considered as a candidate phrase for the title.
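The two title heuristics above (a title-named field first, then visually emphasized text) might be combined as in the following sketch. The dictionary layout of `document`, the `spans` list, and the `font_size` attribute are hypothetical representations chosen for illustration; the source does not specify how documents or display attributes are stored.

```python
# Field names treated as title fields, taken from the example in the text.
TITLE_FIELD_NAMES = ("title", "제목")


def extract_title(document):
    """Pick a title from a structured document, or return None.

    document is assumed to look like:
      {"fields": {field_name: text}, "spans": [{"text": ..., "font_size": ...}]}
    """
    fields = document.get("fields", {})
    # 1. Prefer the content of a field whose name marks it as the title.
    for name in TITLE_FIELD_NAMES:
        value = fields.get(name)
        if value:
            return value.strip()
    # 2. Otherwise fall back to the text rendered with the largest font.
    spans = document.get("spans", [])
    if spans:
        largest = max(spans, key=lambda span: span["font_size"])
        return largest["text"].strip()
    return None


post = {"fields": {"title": " Wine Etiquette ", "body": "..."}}
print(extract_title(post))
```

Font size stands in here for any "distinguishing attribute"; weight or color could be scored in the same way.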

The step of forming clusters of documents based on the extracted titles (S130) is the step in which the cluster forming unit (213) performs clustering based on the title information of the documents.

The cluster forming unit (213) forms clusters of documents based on the title information obtained from the exploration service database (222). Information about the clusters formed is stored in the exploration service database (222).

A cluster of documents means a group of documents that share a central concept. A cluster may be formed by considering whether the titles of the documents have portions in common. Each cluster may be named using its central concept.

A string in a document's title that is shared with other documents may become a candidate for the central concept of a cluster, and an independent cluster may be formed when the number of documents having the shared string is equal to or greater than a predetermined value.

For example, if the title of one document is 'Catching Up with the Sommelier: Wine Etiquette - Wine Enjoyed Together' (소믈리에 따라잡기: 와인 에티켓 - 함께 즐기는 와인) and the title of another is 'Table Manners Part 5 - Wine Etiquette' (테이블 매너 5편 - 와인 에티켓), the portion common to both titles, 'wine etiquette' (와인 에티켓), may be extracted as the central concept.

N-gram analysis may be used in extracting the overlapping portions of document titles as the central concept. In this case, each title may be split into syllable units and recombined into strings having a predetermined number of syllables.

Among these recombined strings, those that overlap between titles may become candidates for the central concept. In the preceding example, 'wine' (와인), which has two syllables, and 'wine etiquette' (와인 에티켓), which has five syllables, may be considered as central concept candidates.
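The syllable-level n-gram comparison described above can be illustrated with a short sketch. It relies on the fact that each Hangul syllable is a single character in a Python string, so character n-grams correspond to syllable n-grams. The function names, the minimum length of two, and the rule of discarding n-grams that begin or end with a space are assumptions made for this example, not details from the source.

```python
def char_ngrams(title, n):
    """Return the set of n-character substrings that do not start or end with whitespace."""
    grams = set()
    for i in range(len(title) - n + 1):
        gram = title[i:i + n]
        if gram[0].isspace() or gram[-1].isspace():
            continue
        grams.add(gram)
    return grams


def shared_ngrams(title_a, title_b, min_n=2):
    """Collect every n-gram (n >= min_n) that appears in both titles."""
    shared = set()
    max_n = min(len(title_a), len(title_b))
    for n in range(min_n, max_n + 1):
        shared |= char_ngrams(title_a, n) & char_ngrams(title_b, n)
    return shared


# The two example titles from the text, in the original Korean.
title_a = "소믈리에 따라잡기: 와인 에티켓 - 함께 즐기는 와인"
title_b = "테이블 매너 5편 - 와인 에티켓"

candidates = shared_ngrams(title_a, title_b)
# Print the longest candidates first.
print(sorted(candidates, key=len, reverse=True))
```

The result contains fragments as well as whole phrases, which is why a further selection step over the candidates is needed.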

When the titles of documents have several overlapping portions in this way, a process for choosing one of them as the central concept may be required. The number of syllables in the overlapping portion, the relationship between the overlapping portion and the classification name of the documents, and the number of documents containing the overlapping portion may be used as selection criteria.

In the example above, the overlapping portions are 'wine', 'etiquette', and 'wine etiquette'. Here, 'wine' is identical to the subject 'wine' under which the documents are classified, so it may not be appropriate as the central concept of a single cluster.

In terms of the number of documents having the overlapping portion as well, the number of documents sharing 'wine' may be too large to form a single cluster. Thus, when determining central concept candidates for a cluster, it may be necessary to restrict the number of documents sharing a candidate to a predetermined range.

The length of a central concept candidate may also be taken into account. A candidate that is too short may be a grammatical particle or a general term that is unsuitable for use as a specific category.

A long central concept candidate, on the other hand, can be expected to yield a cluster whose documents are highly related to one another and which is unlikely to contain noise, so the longest of the central concept candidates may be considered first.

In the example above, the other two central concept candidates are contained in 'wine etiquette', so the longest candidate, 'wine etiquette', may be considered first, and if it is judged to satisfy the other criteria, such as the number of documents sharing 'wine etiquette', it may be selected as the central concept of a single cluster.

In addition, if, among the documents under the subject 'wine', the number of documents sharing 'etiquette' and the number sharing 'wine etiquette' are very close, it may be more efficient to select the more specific 'wine etiquette' as the central concept.
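One way to combine the selection criteria discussed above (exclude the classification name, keep the document count within a range, prefer the longest candidate) is sketched below. The function name, the count bounds, and the document counts in the example are hypothetical; the source names the criteria but does not prescribe how they are combined.

```python
def pick_central_concept(candidates, category_name, min_docs, max_docs):
    """Choose a central concept from {candidate: number of documents sharing it}.

    Candidates equal to the category name, or shared by too few or too many
    documents, are discarded; the longest remaining candidate wins.
    """
    eligible = [
        candidate
        for candidate, count in candidates.items()
        if candidate != category_name and min_docs <= count <= max_docs
    ]
    if not eligible:
        return None
    # Length is measured without spaces, mirroring the syllable counts in the text.
    return max(eligible, key=lambda candidate: len(candidate.replace(" ", "")))


# Hypothetical document counts for the three candidates in the example.
counts = {"wine": 500, "etiquette": 12, "wine etiquette": 11}
print(pick_central_concept(counts, "wine", min_docs=3, max_docs=100))
```

With these made-up counts, 'wine' is rejected both as the classification name and for being too common, and 'wine etiquette' is preferred over 'etiquette' because it is longer.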

Among these central concepts, clusters may be formed for those with which at least a predetermined number of documents are associated. Information about the central concept of each cluster and the documents belonging to the cluster that shares it is stored in the exploration service database (222).
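The threshold rule for forming clusters can be sketched as follows. The function name, the use of plain substring matching, and the English example titles are assumptions for illustration only.

```python
from collections import defaultdict


def form_clusters(titles, concepts, min_docs):
    """Group document ids by central concept, keeping groups of at least min_docs.

    titles maps a document id to its title; concepts is a list of central concepts.
    """
    members = defaultdict(list)
    for doc_id, title in titles.items():
        for concept in concepts:
            if concept in title:
                members[concept].append(doc_id)
    # Drop concepts shared by fewer than min_docs documents.
    return {concept: ids for concept, ids in members.items() if len(ids) >= min_docs}


titles = {
    1: "Catching up with the sommelier: wine etiquette",
    2: "Table manners part 5 - wine etiquette",
    3: "Bordeaux travel notes",
}
print(form_clusters(titles, ["wine etiquette", "Bordeaux"], min_docs=2))
```

Here 'Bordeaux' appears in only one title, so no cluster is formed for it under a threshold of two.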

The step of mapping clusters of documents to directories (S140) is the step in which the directory mapping unit (214) maps each cluster of documents to a directory based on the central concept of the cluster.

A directory is a sub-concept of a document classification, that is, of a subject: a subtopic that may contain one or more clusters. For example, if documents are classified under the subject 'wine', its directories may include 'wine regions', 'history of wine', and 'wine etiquette', and the 'wine regions' directory may contain clusters whose central concepts are place names known as wine-producing areas, such as 'Bordeaux' and 'Burgundy'.

The directory mapping unit (214) obtains information about the directory structure and information about the clusters from the exploration service database (222) and determines which clusters are to be mapped to each directory. The resulting directory mapping information is stored in the exploration service database (222).

The directory to which a cluster is mapped may be determined by whether the central concept of the cluster contains a keyword related to that directory.

For example, if the directory is 'wine etiquette', the word 'etiquette' in the directory name may serve as the keyword for deciding whether a cluster belongs in the directory. When mapping clusters formed from documents already judged to fall under the 'wine' classification, it may be efficient to map directories using keywords that exclude the classification name 'wine' itself.

Meanwhile, it may be necessary to expand such keywords using a dictionary-style enumeration. 'Etiquette' may be expanded into synonyms, near-synonyms, and keywords written in a different language.

In this case, keywords such as '예절' (etiquette), 'etiquette', '매너' (manner), and 'manner' may be considered as additional keywords for directory mapping. This can improve the efficiency of directory mapping.
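The keyword-based directory mapping with expanded keywords might look like the sketch below. The table `DIRECTORY_KEYWORDS`, the function name, and the first-match rule are assumptions; the source only says that a cluster is mapped to a directory when its central concept contains a keyword related to that directory, and that the classification name itself may be excluded from the keywords.

```python
# Hypothetical expanded keyword sets for two directories under the 'wine' subject.
# The classification name 'wine' itself is deliberately left out of the keywords.
DIRECTORY_KEYWORDS = {
    "wine etiquette": {"etiquette", "manner", "예절", "매너"},
    "wine regions": {"region", "bordeaux", "burgundy"},
}


def map_cluster_to_directory(central_concept, directory_keywords):
    """Return the first directory whose keywords appear in the central concept, else None."""
    concept = central_concept.lower()
    for directory, keywords in directory_keywords.items():
        if any(keyword in concept for keyword in keywords):
            return directory
    return None


print(map_cluster_to_directory("Table Manners", DIRECTORY_KEYWORDS))
print(map_cluster_to_directory("Bordeaux", DIRECTORY_KEYWORDS))
```

Substring matching lets 'manner' also cover 'manners'; a real system would presumably need a more careful matching rule to avoid false positives.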

These keywords for directory mapping may also be stored in the exploration service database (222) as part of the information about the directory structure.

The step of providing documents (S150) is the step in which the document providing unit (215) provides the clustered documents to the user terminal (230), organized by directory.

This step is characterized in that it visualizes and presents the directory structure belonging to a given subject, that is, a classification, together with the containment relationships of the clusters belonging to each directory, so that users can easily explore documents in their fields of interest.

The user transmits information about a subject of interest to the exploration service providing server (210) through the user terminal (230). This may be done, for example, by clicking a link to a web page that provides the exploration service for that subject.

The document providing unit (215) receives a request from the user terminal (230) containing information about the user's subject of interest and, in response, may provide a web page containing links to the documents classified and clustered in the steps described above. The exploration service is thereby provided to the user.

한편, 문서 탐색 서비스를 제공하기 위해 문서 제공부(215)는 클러스터링된 문서들에 대한 정보를 탐색 서비스 데이터베이스(222)로부터 획득한다.Meanwhile, in order to provide a document search service, the document provider 215 obtains information about clustered documents from the search service database 222.

문서 제공부(215)의 응답이 제공되는 형태 및 양식은 씨에스에스(CSS, cascading style sheets)등을 이용하여 조절될 수 있다. 또한, 별도의 컨텐츠 매니지먼트 시스템(CMS, content management system)을 이용하는 것도 가능하다. The form and form in which the response of the document providing unit 215 is provided may be adjusted by using cascading style sheets (CSS). It is also possible to use a separate content management system (CMS).

Meanwhile, the search service providing server 210 according to an embodiment of the present invention may further include an advertisement providing unit 216.

The advertisement providing unit 216 may provide advertisement content to the user terminal 230. The advertisement content may be stored in the advertisement database 223, retrieved by the advertisement providing unit 216, and transmitted to the user terminal 230.

Factors that may be considered in determining the advertisement content to be transmitted to the user terminal 230 include information about the user and information about the documents the user is browsing.

For example, if the user has logged in while using the document search service, the user's personal information, such as age, occupation, gender, and region of residence, may be considered as a factor in determining the advertisement content.

Information about the documents viewed through the user terminal 230 may also be considered as a factor in determining the advertisement content, as may the query entered by the user.

In addition, information obtained while the user uses the document search service according to an embodiment of the present invention may also be considered as a factor in determining the content.

By determining the advertisement content to be provided using information about the user and about the documents the user is browsing, the effectiveness of the provided advertisements can be maximized.
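One way to combine these factors is a simple additive score, sketched below. The `Ad` record, the functions `score_ad` and `select_ad`, the keyword-overlap score, and the one-point bonus for a matching region are all hypothetical choices made for illustration. The description states only which kinds of information may be considered, not how they are weighted.

```python
from dataclasses import dataclass, field


@dataclass
class Ad:
    """Hypothetical advertisement record."""
    title: str
    keywords: set = field(default_factory=set)        # context terms the ad targets
    target_regions: set = field(default_factory=set)  # empty set = no regional targeting


def score_ad(ad, context_terms, user_profile=None):
    """Score = number of shared context terms, plus 1 if the user's region is targeted."""
    score = len(ad.keywords & context_terms)
    if user_profile and ad.target_regions:
        if user_profile.get("region") in ad.target_regions:
            score += 1
    return score


def select_ad(ads, context_terms, user_profile=None):
    """Return the highest-scoring ad, or None if there are no ads."""
    return max(ads, key=lambda ad: score_ad(ad, context_terms, user_profile), default=None)
```

Here `context_terms` would hold terms drawn from the browsing context, such as the topic, directory, and cluster names or the user's query, and `user_profile` would be available only when the user has logged in.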

The document search service screen according to an embodiment of the present invention will be described in more detail with reference to FIGS. 4 and 5.

The components shown in FIG. 2 need not be implemented in hardware; some components may be implemented as application programs that perform the same or similar functions. Each component may, of course, also be combined or separated within the spirit of the invention.

In this embodiment, the search service providing server 210, the original document database 221, and the search service database 222 are each configured independently, but they may be grouped into one or more elements within the scope of the present invention.

In this embodiment, the search service providing server 210, the original document database 221, the search service database 222, and the user terminal 230 may exchange information over a network.

These networks need not be one single network. The network may be configured as a LAN and a WAN using technologies such as ADSL, VDSL, Wi-Fi, WiBro, and HSDPA, and technologies such as VPN may be used to strengthen security.

FIG. 3 is a diagram illustrating the cluster and directory mapping of documents according to an embodiment of the present invention.

In this embodiment, the documents in the original document database 221 may be structured by a hierarchy connected in the order of classification (topic), directory, cluster, and document. By structuring the documents in this way, information scattered across many vertical sites can be browsed simply through a single page view provided by the document search service of this embodiment.

Referring to FIG. 3, the classification, the top-level concept under which the documents are structured, is 'wine'. Whether a particular document falls under the 'wine' classification is determined by the document classification unit 211.

A classification may include one or more directories as sub-concepts. The 'wine' classification includes directories named 'Wine regions', 'History of wine', and 'Wine etiquette'.

Because a directory name serves as the name of the group in which the user wants to browse documents, the user's browsing efficiency can be improved by using as directory names the document group names used by the vertical sites, blogs, and other sources of the documents stored in the original document database 221.

A directory may include one or more clusters beneath it. The 'Wine regions' directory includes clusters named 'Bordeaux' and 'Burgundy'. A cluster, likewise, may include one or more documents; document 3 illustrated in FIG. 3 is included in both the 'Bordeaux' cluster and the 'Burgundy' cluster.
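The hierarchy of FIG. 3 can be modeled with nested mappings, as in the sketch below. The directory '와인의 산지' is rendered here as 'Wine regions'. The variable `hierarchy`, the function `clusters_containing`, and the placement of documents 1, 2, and 4 are assumptions made for illustration; only the membership of document 3 in both clusters is stated in the description.

```python
# classification -> directory -> cluster -> list of documents.
# Documents 1, 2 and 4 are placed arbitrarily for illustration;
# document 3 belongs to both clusters, as in FIG. 3.
hierarchy = {
    "wine": {
        "Wine regions": {
            "Bordeaux": ["document 1", "document 2", "document 3"],
            "Burgundy": ["document 3", "document 4"],
        },
        "History of wine": {},
        "Wine etiquette": {},
    }
}


def clusters_containing(hierarchy, document):
    """Return every (classification, directory, cluster) path that holds the document."""
    paths = []
    for classification, directories in hierarchy.items():
        for directory, clusters in directories.items():
            for cluster, documents in clusters.items():
                if document in documents:
                    paths.append((classification, directory, cluster))
    return paths
```

Storing documents as list entries under each cluster, rather than giving each document a single parent, is what allows one document to appear under several clusters.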

As noted in the detailed description of FIGS. 1 and 2, the directory names used in vertical sites and blogs and the field information of the structured documents contained in those sites may be used to classify documents by topic and to cluster them.

FIG. 4 is a diagram illustrating a document search service screen according to an embodiment of the present invention. Referring to FIG. 4, a classification display area 400, a classification structure display area 410, a search position display area 420, and a cluster display area 430 are shown.

As described above, to provide the document search service of this embodiment, the document providing unit 215 may generate a web page and transmit it to the user terminal 230.

The web page provided by the document search service may have a screen layout such as that illustrated in FIG. 4. This layout may include display areas for visualizing the structure of the documents to be browsed.

The classification display area 400 shows information about the classification, the top-level concept under which the documents are structured. In this embodiment, the classification name 'wine' is displayed with attributes that set it apart from the surrounding items.

The classification structure display area 410 presents the user with a classification structure similar to that illustrated in FIG. 3. In FIG. 4, the directory 'Wine regions' and the cluster 'Bordeaux' that the user is browsing are displayed with attributes that distinguish them from the other items.

In addition, a search position display area 420 showing the directory and cluster that the user is browsing may be provided.

The cluster display area 430 provides a means of accessing the documents that belong to the cluster the user is browsing. In this embodiment, the central concept of that cluster is 'Bordeaux', and documents related to the central concept 'Bordeaux' may be presented in the cluster display area 430.

The cluster display area 430 provides document links 434 to the documents belonging to the cluster the user is browsing. A document link 434 may display the title of the document it references as its anchor text.

The document links 434 may be further grouped around the portions that their anchor texts have in common. In this embodiment, each overlapping phrase 432 is displayed first, and the corresponding document links 434 are listed in order beneath it.

In this embodiment, the 'Bordeaux' cluster includes a group whose overlapping phrase is 'Nature of Bordeaux' and a group whose overlapping phrase is 'Map of Bordeaux'. A cluster may also consist of only one group, in which case the group's overlapping phrase may be used as the name of the cluster.
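The regrouping of document links around overlapping phrases can be sketched as follows. The function `group_links_by_phrase` and the rule of assigning each title to the longest phrase it contains are assumptions for illustration; the description does not specify how the overlapping phrases are detected or how ties are resolved, so the phrases are simply passed in here.

```python
from collections import defaultdict


def group_links_by_phrase(titles, phrases):
    """Group titles under the longest overlapping phrase each one contains.

    Titles that contain none of the phrases are left out of the result
    (a simplification made for this sketch).
    """
    groups = defaultdict(list)
    by_length = sorted(phrases, key=len, reverse=True)  # prefer longer phrases
    for title in titles:
        for phrase in by_length:
            if phrase in title:
                groups[phrase].append(title)
                break
    return dict(groups)
```

For example, hypothetical titles such as "Nature of Bordeaux: climate" and "Nature of Bordeaux: soil" would be listed under the overlapping phrase "Nature of Bordeaux", while "Map of Bordeaux vineyards" would be listed under "Map of Bordeaux".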

A document link 434 is a link to an individual document. By selecting the link, the user accesses the content of the document containing the information he or she wants to browse. In this case, the user terminal 230 may open a new browser window to present the document's content, or the content may be presented in all or part of the browser window in which the document search service screen of FIG. 4 is displayed.

The user can change what he or she is browsing by selecting an item in any of the areas provided in this embodiment, for example by clicking it. In response, the document providing unit 215 updates the information in the classification structure display area 410, the search position display area 420, and the cluster display area 430.

Because the screen layout described above visually conveys the directory structure and the clustering structure, the user can efficiently browse information in a field of interest without individually visiting the vertical sites, blogs, and the like that he or she would otherwise visit.

FIG. 5 is a diagram illustrating a document search service screen that includes an advertisement display area according to an embodiment of the present invention. Referring to FIG. 5, a classification display area 400, a classification structure display area 410, a search position display area 420, a cluster display area 430, and an advertisement display area 510 are shown.

The advertisement display area 510 is the area in which the advertisement content provided by the advertisement providing unit 216 to the user terminal 230 is displayed. In the advertisement display area 510, text advertisement content 511 and animated advertisement content 512 are displayed.

The text advertisement content 511 and the animated advertisement content 512 may additionally include a link to a site containing further information about the advertiser.

The advertisements displayed in the advertisement display area 510 may be operated based on a pay per click (PPC) model, in which the advertising cost is charged according to the number of clicks, and/or a pay per view (PPV) model, in which the advertising cost is charged according to the number of impressions.
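The two billing models reduce to simple arithmetic, sketched below. The function `advertising_cost` and its unit-price parameters are hypothetical; the description names the PPC and PPV models but gives no prices or formula.

```python
def advertising_cost(clicks, impressions, price_per_click=0.0, price_per_view=0.0):
    """Total cost under PPC, PPV, or both combined.

    Setting one unit price to zero yields a pure PPC or pure PPV model.
    """
    return clicks * price_per_click + impressions * price_per_view
```

Under these assumptions, an advertisement with 10 clicks and 1000 impressions costs `10 * price_per_click` under pure PPC and `1000 * price_per_view` under pure PPV.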

As described above, information about the user and about the documents the user is browsing may be considered as factors in determining the advertisement content displayed in the advertisement display area 510.

For example, information about a document's topic, directory, cluster, and title may be considered as factors in determining the advertisement content. Referring to FIG. 5, the topic the user is browsing is 'wine', the directory is 'Wine regions', and the cluster is 'Bordeaux'. Based on this information, the effect of the advertisement can be maximized by providing text advertisement content 511 titled 'Bordeaux wine group purchase application'.

The method of providing a document search service can be written as a computer program. The codes and code segments constituting the program can be easily inferred by a computer programmer skilled in the art. The program is stored on computer-readable media and is read and executed by a computer, thereby implementing the method of providing a document search service. The information storage media include magnetic recording media, optical recording media, and carrier wave media.

The terminology used in this application is used only to describe particular embodiments and is not intended to limit the present invention. Singular expressions include plural expressions unless the context clearly indicates otherwise.

In this application, terms such as "comprise" or "have" are intended to specify the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and should be understood not to preclude in advance the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Terms such as first and second may be used to describe various components, but the components should not be limited by these terms. The terms are used only to distinguish one component from another.

The present invention has been described above with reference to its embodiments. Many embodiments other than those described above fall within the claims of the present invention. Those of ordinary skill in the art to which the present invention pertains will understand that the invention can be implemented in modified forms without departing from its essential characteristics. The disclosed embodiments should therefore be considered in a descriptive sense rather than a limiting one. The scope of the present invention is set forth in the claims rather than in the foregoing description, and all differences within the equivalent scope should be construed as being included in the present invention.

FIG. 1 is a flowchart of a method of providing a document search service according to an embodiment of the present invention.

FIG. 2 is a block diagram of a system for providing a document search service according to an embodiment of the present invention.

FIG. 3 is a diagram illustrating the cluster and directory mapping of documents according to an embodiment of the present invention.

FIG. 4 is a diagram illustrating a document search service screen according to an embodiment of the present invention.

FIG. 5 is a diagram illustrating a document search service screen that includes an advertisement display area according to an embodiment of the present invention.

<Description of reference numerals for the main parts of the drawings>

210: search service providing server
211: document classification unit
212: title extracting unit
213: cluster forming unit
214: directory mapping unit
215: document providing unit
216: advertisement providing unit
221: original document database
222: search service database
223: advertisement database
230: user terminal
400: classification display area
410: classification structure display area
420: search position display area
430: cluster display area
432: overlapping phrase
434: document link
510: advertisement display area
511: text advertisement content
512: animated advertisement content

Claims

Classifying documents according to subject matter;

Extracting a title of the document;

Forming a cluster based on the extracted title; and

Mapping the cluster to a predetermined directory belonging to the subject.

The method of claim 1,

further comprising providing the document to a user terminal,

wherein the providing of the document provides a means for visualizing the structure of the directory and the cluster to which the document belongs.

The method of claim 1,

wherein the document comprises one or a plurality of fields, and

the extracting of the title of the document comprises extracting the title in consideration of an attribute of a field constituting the document.

The method of claim 1,

wherein the forming of the cluster comprises

dividing the extracted title into syllable units and selecting a part of the title shared with other documents as a central concept candidate of the cluster.

The method of claim 4, wherein

the forming of the cluster comprises

selecting the central concept candidate using n-gram analysis of the extracted title.

The method of claim 4, wherein

the forming of the cluster comprises

forming a cluster when there are more than a predetermined number of documents sharing the central concept candidate.

The method of claim 4, wherein

when there are a plurality of the central concept candidates, the central concept is selected in consideration of the length of the title candidate phrase.

A computer-readable recording medium for recording a program for executing the method of any one of claims 1 to 7 on a computer.

A document classification unit for classifying documents according to subjects;

A title extractor for extracting a title of the document;

A cluster forming unit forming a cluster based on the extracted title; and

a directory mapping unit for mapping the cluster to a predetermined directory belonging to the subject.

The system of claim 9,

further comprising a document providing unit for providing the document to a user terminal,

wherein the document providing unit provides a means for visualizing the structure of the directory and the cluster to which the document belongs.

The system of claim 9,

wherein the title extracting unit extracts the title in consideration of an attribute of a field constituting the document.

The system of claim 9,

wherein the cluster forming unit

divides the extracted title into syllable units and selects a part of the title shared with other documents as a central concept candidate of the cluster.

The system of claim 12,

wherein the cluster forming unit

selects the central concept candidate using n-gram analysis of the extracted title.

The system of claim 12,

The cluster forming unit

The system of claim 12,

wherein, when there are a plurality of the central concept candidates, the central concept is selected in consideration of the length of the title candidate phrase.
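The title-based cluster formation recited in the claims (division into syllable units, n-gram analysis, a minimum number of sharing documents, and selection by phrase length) can be sketched as follows. This is an illustrative reading under stated assumptions, not the patented implementation: the functions `syllable_ngrams` and `form_cluster`, the treatment of each non-space character as one syllable unit, the removal of spaces, the `min_docs` threshold semantics, and the tie-breaking rule are all assumptions.

```python
from collections import defaultdict


def syllable_ngrams(title, min_n=2):
    """Return all n-grams (n >= min_n) over the title's syllable units.

    Assumption: each non-space character is one syllable unit, which
    holds for Hangul text; spaces are dropped.
    """
    units = [ch for ch in title if not ch.isspace()]
    return {
        "".join(units[i:i + n])
        for n in range(min_n, len(units) + 1)
        for i in range(len(units) - n + 1)
    }


def form_cluster(titles, min_docs=2, min_n=2):
    """Pick a central concept shared by at least min_docs titles.

    Returns (central_concept, sorted document indices), or None when no
    n-gram is shared by enough titles. Among candidates, the longest
    phrase wins; ties are broken by document count, then by phrase text.
    """
    docs_by_ngram = defaultdict(set)
    for doc_id, title in enumerate(titles):
        for gram in syllable_ngrams(title, min_n):
            docs_by_ngram[gram].add(doc_id)
    candidates = {g: d for g, d in docs_by_ngram.items() if len(d) >= min_docs}
    if not candidates:
        return None
    concept = max(candidates, key=lambda g: (len(g), len(candidates[g]), g))
    return concept, sorted(candidates[concept])
```

Preferring the longest shared phrase is one possible way of taking candidate length into account; a production system would likely also weigh how many documents share each candidate.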