KR20100013157A - Tag clustering apparatus based on related tags and tag clustering method thereof

Info

Publication number: KR20100013157A
Application number: KR1020080074701A
Authority: KR
Inventors: 이시화; 황대훈; 기노일; 최길준
Original assignee: 주식회사 메디오피아테크
Priority date: 2008-07-30
Filing date: 2008-07-30
Publication date: 2010-02-09
Also published as: KR101007056B1

Abstract

PURPOSE: A tag clustering apparatus based on related tags and a tag clustering method thereof are provided to form a tag cluster by gathering only highly related tags, thereby overcoming the deterioration of search results caused by inaccurate tags. CONSTITUTION: A related tag mapping module (132) extracts related tag pairs and maps them to each other. A frequency extracting module (134) extracts the appearance frequency of the same tags. A weight matrix generating module (136) generates a weight matrix. A tag clustering module (138) extracts only the related tags whose frequency is equal to or greater than a threshold value.

Description

Tag clustering apparatus based on related tags and tag clustering method

The present invention relates to an Internet environment based on Web 2.0, and more particularly to a method and apparatus that use related tags to improve search accuracy when content is searched for by tag in a web service environment.

With the development of the Internet, the number of users is growing rapidly and the web service environment is changing in various ways. Whereas conventional web services were static and passive, web services are gradually becoming dynamic and active, and Web 2.0 was introduced to reflect this trend.

Web 2.0 refers to a series of movements developed to draw out information sharing and participation among Internet users through the opening of information, and to continuously increase the value of the information created. In other words, in Web 2.0, netizens can freely participate in, produce, recreate, and share content on the basis of an open web environment.

In Web 2.0, information is produced by users and organized by the tags that users attach. Users can easily share this information, and various resources are thereby interrelated. The Web 2.0 phenomenon has thus become an essential strategy for all Internet sites, and various techniques have been introduced to implement Web 2.0 successfully.

One of these techniques is tagging. Tagging is widely used for everything from web documents such as blogs to multimedia content such as images and videos: users attach tags to the content they create so that it can be searched and classified easily.

However, because such tags are attached arbitrarily by the producer of the content, the precision of information retrieval is low. That is, tagging may be useful for placing information into a broad category, but because the category is so broad, retrieving useful information from it again is inefficient. For example, a user may tag a photograph of a computer with his or her own name. In that case, the photograph is not shown to someone searching for photographs of that computer. In other words, many of the tags attached to resources are inaccurate.

In addition, because tags are unstructured metadata, navigation during information retrieval is inefficient. For example, when a user looking for a photograph of a computer monitor performs a tag search using the name of the monitor, it is impossible to navigate using the fact that a monitor is a part of a computer; distinct tags are simply treated as entirely separate.

Therefore, there is an urgent need for a system that not only improves the accuracy of tag-based content search results but also enables information navigation among multiple tags by using the interrelationships between tags.

An object of the present invention is to provide an apparatus and method that remove inaccurate tags and gather only highly related tags to form a tag cluster, in order to overcome the deterioration of search results caused by inaccurate tags.

Another object of the present invention is to provide an apparatus and method for generating a topic map that identifies the interrelationships between tags by applying a semantic model (ontology model) to them, so that users can easily navigate among a plurality of tags.

One aspect of the present invention for achieving the above objects relates to a tag clustering apparatus. The tag clustering apparatus according to the present invention includes: a related tag mapping module that extracts related tag pairs, two tags at a time, from the tags related to the same content among the contents included in a predetermined population, and maps them to each other; a frequency extracting module that extracts the appearance frequency of the same tags during the tag mapping process; a weight matrix generating module that generates a weight matrix based on the frequencies of the related tag pairs; and a tag clustering module that generates a tag cluster by extracting from the weight matrix only those related tags whose frequency is equal to or greater than a threshold. In particular, the related tag mapping module extracts the related tag pairs by treating each of the following as the same tag: tags representing different pronunciations of the same loanword, tags representing a full form and its abbreviation, and tags representing synonyms.

The tag clustering apparatus according to one aspect of the present invention further includes: a topic generating module that extracts the tags in the tag cluster as topics; a topic pair extracting module that extracts, as topic pairs, the topics related to the related tag pairs, in descending order of frequency; a semantic relationship generating module (association generating module) that extracts the semantic relationship of each topic pair by applying a predetermined lexical knowledge model to the extracted topic pair; and an occurrence generating module that assigns to each topic pair the address of content suitable for that topic pair.

The tag clustering module included in the tag clustering apparatus according to one aspect of the present invention generates the tag cluster by selecting a base related tag pair from among the related tag pairs and then repeatedly selecting other related tag pairs that include the respective tags contained in the selected base related tag pair.

The tag clustering module included in the tag clustering apparatus according to one aspect of the present invention selects the related tag pair with the highest weight as the base related tag pair, and determines the threshold in consideration of the number of tags included in the generated tag cluster and the average weight of the related tag pairs.

Alternatively, the tag clustering module included in the tag clustering apparatus according to one aspect of the present invention determines the base related tag pair and the threshold according to a user selection.

The lexical knowledge model used in the tag clustering apparatus according to one aspect of the present invention includes at least one of RDF (Resource Description Framework), KQML (Knowledge Query and Manipulation Language), DAML-OIL (DARPA Agent Markup Language-Ontology Inference Layer), OWL (Web Ontology Language), and a topic map.

The occurrence generating module included in the tag clustering apparatus according to one aspect of the present invention assigns to the topic pair the URL (Uniform Resource Locator) of content that includes both tags of the related tag pair corresponding to that topic pair.

Another aspect of the present invention for achieving the above objects relates to a tag clustering method that includes: a related tag mapping step of extracting related tag pairs, two tags at a time, from the tags related to the same content among the contents included in a predetermined population, and mapping them to each other; a frequency extracting step of extracting the appearance frequency of the same tags during the tag mapping process; a weight matrix generating step of generating a weight matrix based on the frequencies of the related tag pairs; and a tag clustering step of generating a tag cluster by extracting from the weight matrix only those related tags whose frequency is equal to or greater than a threshold. In particular, the related tag mapping step includes treating each of the following as the same tag: tags representing different pronunciations of the same loanword, tags representing a full form and its abbreviation, and tags representing synonyms.

The tag clustering method according to another aspect of the present invention further includes: a topic generating step of extracting the tags in the tag cluster as topics; a topic pair extracting step of extracting, as topic pairs, the topics related to the related tag pairs, in descending order of frequency; a semantic relationship generating step of extracting the semantic relationship of each topic pair by applying a predetermined lexical knowledge model to the extracted topic pair; and an occurrence generating step of assigning to each topic pair the address of content suitable for that topic pair.

The tag clustering step included in the tag clustering method according to another aspect of the present invention includes: selecting a base related tag pair from among the related tag pairs; selecting another related tag pair that includes the respective tags contained in the selected base related tag pair; and repeatedly selecting yet another related tag pair that includes the respective tags contained in the other related tag pair just selected.

According to the present invention, inaccurate tags are removed and a tag cluster is formed using only highly related tags, so the deterioration of search results caused by inaccurate tags is overcome and the quality of search results improves.

In addition, according to the present invention, a topic map is generated by applying a semantic model (ontology model) to the interrelationships between tags, so those interrelationships become known and users can use them to navigate easily among a plurality of tags.

To fully understand the present invention, its operational advantages, and the objects achieved by practicing it, reference should be made to the accompanying drawings, which illustrate preferred embodiments of the present invention, and to the content described in those drawings.

Hereinafter, the present invention is described in detail by describing preferred embodiments with reference to the accompanying drawings. The present invention may, however, be embodied in many different forms and is not limited to the embodiments described. To describe the present invention clearly, parts unrelated to the description are omitted, and the same reference numerals in the drawings denote the same members.

Throughout the specification, when a part is said to "include" a component, this means that it may further include other components, not that it excludes them, unless specifically stated otherwise. In addition, terms such as "... unit", "module", and "block" used in the specification denote a unit that processes at least one function or operation, and such a unit may be implemented in hardware, in software, or in a combination of hardware and software.

FIG. 1 is a block diagram conceptually illustrating a tag clustering apparatus according to one aspect of the present invention.

The tag clustering apparatus 100 shown in FIG. 1 includes a tag management unit 110 and a display 190. The tag management unit 110 includes a tag reader 120, a tag clustering unit 130, and a topic map generation unit 150. The tag clustering unit 130 includes a related tag mapping module 132, a frequency extracting module 134, a weight matrix generating module 136, and a tag clustering module 138. The topic map generation unit 150 includes a topic generating module 152, a topic pair extracting module 154, a semantic relationship generating module 156, and an occurrence generating module 158.

A tag is metadata that a user creates directly for some piece of information, resource, or content. That is, a user can record metadata in the form of tags for all kinds of web resources, such as articles, images, videos, and bookmarks. Metadata here means a set of keywords or terms that the user judges to be relevant and appropriate for the information in question.

In this specification, clustering means a technique for grouping data with similar characteristics and extracting the features those data have in common. That is, tags that appear together among the tags attached to the same content or information can be regarded as highly related to each other. For convenience of description, such a pair of tags is called a related tag pair in this specification. Finding the related tag pairs among a large number of tags excludes meaningless tags, so the accuracy of tag-based search results can be improved.

In the present invention, an ontology is a set of vocabulary used in a specific field, and refers to a technique for conceptualizing and specifying the knowledge in an application domain. The purpose of an ontology is to define the semantic relationships among distributed information resources and, by integrating and sharing them, to provide improved information retrieval and semantic navigation.

The tags 180 are read by the tag reader 120 of the tag management unit 110 and passed to the tag clustering unit 130. In this process, the related tag mapping module 132 does not simply compare the tags as read; it treats similar tags that are judged to refer conceptually to the same object as one tag. This keeps the dimension of the weight matrix described below from growing excessively, reduces the amount of processing, and improves search results. For example, the related tag mapping module 132 may treat tags representing different pronunciations of the same loanword as the same tag: 컴퓨터, 콤퓨터, 컴퓨타, and 콤퓨타 can all be regarded as denoting the same object, computer. The related tag mapping module 132 may also treat tags representing a full form and its abbreviation as the same tag, for example 대한민국 (Republic of Korea) and 한국 (Korea). It may likewise treat tags representing synonyms as the same tag, for example 여름 (the Korean word for summer) and summer. In addition, the related tag mapping module 132 may convert meaningless strings caused by typing in the wrong Korean/English keyboard mode into the intended language. For example, given the tag 'zmffjtmxjfld', the most widely used two-set (dubeolsik) Korean/English key mapping can be applied so that 'zmffjtmxjfld' is treated as identical to the tag '클러스터링' (clustering).
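The tag unification described above can be pictured as a lookup step that runs before any pairs are formed. The following is only a minimal sketch under stated assumptions: the `CANONICAL` table, its entries, and the function names are hypothetical and are not given in the specification, and keyboard-mode conversion is not covered.

```python
# Hypothetical canonicalization table (an assumption for illustration):
# each key is a variant spelling, abbreviation, or synonym, and each
# value is the canonical tag that the variant is treated as.
CANONICAL = {
    "콤퓨터": "컴퓨터",    # variant pronunciations of the loanword "computer"
    "컴퓨타": "컴퓨터",
    "콤퓨타": "컴퓨터",
    "한국": "대한민국",    # abbreviation -> full form
    "summer": "여름",      # synonym across languages
}


def normalize_tag(tag: str) -> str:
    """Return the canonical form of one tag (plain table lookup)."""
    cleaned = tag.strip().lower()
    return CANONICAL.get(cleaned, cleaned)


def normalize_tags(tags):
    """Normalize a list of tags and drop duplicates, keeping first-seen order."""
    seen = []
    for tag in tags:
        canonical = normalize_tag(tag)
        if canonical not in seen:
            seen.append(canonical)
    return seen
```

Collapsing variants at this stage is what keeps the weight matrix small: four spellings of one word would otherwise occupy four rows and four columns.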

The related tag mapping module 132 extracts a set of related tag pairs, each consisting of two tags that are both related to the same content among the plurality of contents included in a predetermined population. That is, the related tag mapping module 132 maps tags that are all related to the same content to one another as related tag pairs. This mapping process is explained with reference to FIGS. 2A and 2B.

FIGS. 2A and 2B are diagrams conceptually illustrating the operation of the related tag mapping module 132 shown in FIG. 1.

A single piece of content may carry several tags. Referring to FIG. 2A, the first tag group 200 contains the tags woman, teacher, and school. Because these three tags are attached to the same piece of content, the related tag mapping module 132 maps them to one another and generates three related tag pairs: woman-teacher, teacher-school, and school-woman.
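The pairwise mapping for the first tag group can be sketched as taking every unordered pair of distinct tags on one content item. This is a minimal illustration; the function name `related_tag_pairs` is hypothetical, and pairs are stored in sorted order purely so that (a, b) and (b, a) are not counted twice.

```python
from itertools import combinations


def related_tag_pairs(tags):
    """Return every unordered pair of distinct tags attached to one content item."""
    unique = sorted(set(tags))          # drop repeats, fix an order
    return list(combinations(unique, 2))


# First tag group from FIG. 2A
first_group = ["woman", "teacher", "school"]
pairs = related_tag_pairs(first_group)
for a, b in pairs:
    print(f"{a}-{b}")
```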

Referring to FIG. 2B, the second tag group 300 contains the tags school, people, children, and boy. Because these tags are related to the same content, they are mapped to one another to generate four related tag pairs. The first tag group 200 and the second tag group 300 also share the tag school. It can therefore be seen that school forms related tag pairs both with the tags in the first tag group 200 (teacher, woman) and with the tags in the second tag group 300 (people, boy, children).

Referring again to FIG. 1, the frequency extracting module 134 extracts the frequency, that is, the number of times each related tag pair occurs across the contents. A frequency of 2 means that a given related tag pair occurs twice among the contents belonging to the given population. The higher the frequency, the more strongly related the tags of that pair are.

The process of forming related tag pairs and obtaining their frequencies is carried out using Equations 1 and 2 below.

$$T_G(i, j) = 0 \quad \text{if tag } i \text{ and tag } j \text{ are not related} \tag{1}$$

$$T_G(i, j) = k \quad \text{if tag } i \text{ and tag } j \text{ are found together } k \text{ times} \tag{2}$$

In Equations 1 and 2, $T_G(i, j)$ denotes the element in row $i$ and column $j$ of the weight matrix.

The frequencies extracted by the frequency extracting module 134 are passed to the weight matrix generating module 136, which generates a weight matrix from the received per-pair frequencies. FIG. 3A shows an example of a weight matrix generated by the weight matrix generating module 136.

Referring to FIG. 3A, the rows and columns of the weight matrix each correspond to the tags extracted from the content group, and each element of the matrix is the frequency of a related tag pair. For example, the related tag pair teacher-classroom has a weight of 2, and the related tag pair classroom-school has a weight of 4. For ease of understanding, related tag pairs with a weight of 4 are marked with circles and those with a weight of 2 are marked with triangles.
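Counting co-occurrences and filling a symmetric matrix can be sketched as follows. This is an illustration only: the sample `contents` list is invented so that it reproduces the two weights quoted above (classroom-school 4, teacher-classroom 2), and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def pair_frequencies(contents):
    """Count how many content items carry each unordered tag pair."""
    counts = Counter()
    for tags in contents:
        for pair in combinations(sorted(set(tags)), 2):
            counts[pair] += 1
    return counts


def weight_matrix(contents):
    """Build the symmetric matrix T_G; unrelated pairs keep weight 0."""
    counts = pair_frequencies(contents)
    tags = sorted({tag for pair in counts for tag in pair})
    index = {tag: n for n, tag in enumerate(tags)}
    matrix = [[0] * len(tags) for _ in tags]
    for (a, b), k in counts.items():
        matrix[index[a]][index[b]] = k
        matrix[index[b]][index[a]] = k
    return tags, matrix


# Invented sample population: one inner list of tags per content item.
contents = [
    ["school", "classroom", "teacher"],
    ["school", "classroom"],
    ["school", "classroom", "teacher"],
    ["school", "classroom", "woman"],
]
tags, T_G = weight_matrix(contents)
```

The diagonal stays at 0 in this sketch because a tag is never paired with itself.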

Once the weight matrix has been constructed in this way, the tag clustering module 138 generates a tag cluster by extracting only the related tag pairs whose frequency is equal to or greater than a threshold. The larger the threshold applied, the fewer related tag pairs the tag cluster contains; the smaller the threshold, the more tags it contains. The more tags a tag cluster contains, the higher the recall but the lower the precision. Precision indicates how closely the search results match the intended purpose when content is searched for with a given tag. For example, if a search intended to find Apple computers returns the fruit apple, precision is reduced. Recall indicates what fraction of the relevant content actually appears in the search results.

Setting the threshold that the tag clustering module 138 uses to generate tag clusters is very important; this is described in detail below with reference to FIGS. 5A and 5B.

The pseudo-code of the tag clustering algorithm is as follows.

// i : cluster number
// C(i) : the i-th cluster
// T(i, j) : frequency between tag i and tag j, i.e., the element in row i, column j of the weight matrix T_G
// Max(i, j) : the element of T_G with the maximum weight
// A_i : weight matrix of the tags included in cluster C(i)

i = 1
// Repeat until every tag whose weight is at or above the threshold is included in a cluster
Initialize A_i
Repeat {
    // Select the two tags i, j of the maximum-weight element Max(i, j) in T_G and add them to cluster C(i)
    select Max(i, j)
    Add tag i and tag j to C(i)
    Add element Max(i, j) to A_i
    While (T(i, j) >= threshold) {
        // Among the tags related to both tag i and tag j already added to the weight matrix A_i of cluster C(i),
        // select from T_G the element T(i, j) whose weight average is greater than or equal to the threshold, and add it to C(i)
        Add tag i and tag j of T_G to C(i)
        Add element T(i, j) to A_i
    }
    i = i + 1
} until (all T(i, j) >= threshold have been assigned to a cluster)

This clustering process is described in detail below with reference to FIG. 3B.

First, tag i (school) and tag j (classroom), which have the maximum weight in the weight matrix, are added to cluster C(i). Then, among the tags related to the tag i and tag j added to C(i), namely teacher, me, female, and woman, the elements T(i, j) whose weight average is greater than or equal to the threshold are selected from the weight matrix T_G and added to C(i). This operation is repeated until all remaining weights are below the threshold. Referring to FIG. 3B, the pair with the highest weight, school-classroom, is extracted first (marked with a circle), and then classroom-teacher and teacher-school, each with a weight of 2, are extracted (marked with triangles).

As shown, the tag clustering module 138 extracts mutually related tag pairs from an arbitrary group of tags and generates a tag cluster based on the frequencies of the extracted pairs, so the related tag pairs that belong to a tag cluster are closely related to one another.
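The walkthrough above can be approximated by a short greedy procedure. The sketch below is one simplified reading of the pseudo-code, not a definitive implementation: it assumes that a pair joins a cluster when its own weight meets the threshold and it shares a tag with the cluster, whereas the pseudo-code speaks of a weight average. The function name `cluster_tags` and the sample weights (other than the three FIG. 3B pairs) are hypothetical.

```python
def cluster_tags(weights, threshold):
    """Greedy clustering over a dict {(tag_a, tag_b): weight}.

    Simplifying assumption: only the individual pair weight is compared
    with the threshold, not a weight average.
    """
    # Keep only pairs at or above the threshold; order-insensitive keys.
    remaining = {frozenset(p): w for p, w in weights.items() if w >= threshold}
    clusters = []
    while remaining:
        # Base pair: the heaviest pair not yet assigned to any cluster.
        base = max(remaining, key=remaining.get)
        cluster = set(base)
        del remaining[base]
        grown = True
        while grown:                      # keep absorbing connected pairs
            grown = False
            for pair in list(remaining):
                if pair & cluster:        # shares at least one tag
                    cluster |= pair
                    del remaining[pair]
                    grown = True
        clusters.append(cluster)
    return clusters


weights = {
    ("school", "classroom"): 4,
    ("classroom", "teacher"): 2,
    ("teacher", "school"): 2,
    ("school", "woman"): 1,       # below the threshold, so excluded
    ("boy", "children"): 3,
}
clusters = cluster_tags(weights, threshold=2)
```

With a threshold of 2, the first cluster grows from school-classroom to include teacher, matching the order described for FIG. 3B, while woman is left out.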

The threshold that the tag clustering module 138 applies when performing tag clustering may be selected by the user, or an optimal threshold may be selected as follows.

FIGS. 5A and 5B are graphs illustrating the process of determining the threshold in the tag clustering module 138 shown in FIG. 1. FIG. 5A shows the average number of tags included in a tag cluster when different thresholds from 2 to 12 are applied. As shown, the number of tags included in a tag cluster decreases as the threshold increases. FIG. 5B shows the cohesion of the tags included in a tag cluster as the threshold is increased. The cohesion in FIG. 5B indicates how tightly the related tag pairs connected by weights are grouped within a given tag cluster, and is expressed as the average weight of the related tag pairs included in the tag cluster.

The cohesion of an arbitrary tag cluster C(i) is computed using Equation 3 below.

In Equation 3, A_i(j, k) denotes tag j and tag k connected by a weight within cluster C(i), and n denotes the total number of tags belonging to C(i).

The threshold can be selected based on the cohesion of the individual clusters evaluated with Equation 3. For this purpose, the average cohesion over all clusters is computed as in Equation 4 below.

In Equation 4, m denotes the number of tag clusters C(i) selected as the threshold is varied.

Referring to FIG. 5B, cohesion increases as the threshold increases, but beyond a certain value it changes very little. In the graph of FIG. 5B, that value is 9: once the threshold reaches 9 or more, cohesion is essentially unchanged. The user can therefore select 9 as the threshold by referring to a result such as the one in FIG. 5B.
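The cohesion-based threshold selection can be sketched as follows. Equations 3 and 4 are not reproduced in this text, so the sketch relies only on the verbal definition (cohesion as the average weight of the related tag pairs in a cluster, then averaged over the m clusters). Dividing by the number of weighted pairs is an assumption and may differ from the normalization by n in Equation 3. The function names, the `tolerance` parameter, and the sample curve are hypothetical.

```python
def cohesion(cluster, weights):
    """Average weight of the weighted pairs inside one cluster (assumed reading of Equation 3)."""
    inside = [w for pair, w in weights.items() if pair <= cluster]
    return sum(inside) / len(inside) if inside else 0.0


def mean_cohesion(clusters, weights):
    """Average cohesion over the m clusters obtained for one threshold (Equation 4 in words)."""
    if not clusters:
        return 0.0
    return sum(cohesion(c, weights) for c in clusters) / len(clusters)


def pick_threshold(cohesion_by_threshold, tolerance):
    """Smallest threshold after which every later step changes cohesion by less than tolerance."""
    thresholds = sorted(cohesion_by_threshold)
    for idx, t in enumerate(thresholds):
        later = thresholds[idx:]
        steps = [
            abs(cohesion_by_threshold[b] - cohesion_by_threshold[a])
            for a, b in zip(later, later[1:])
        ]
        if all(s < tolerance for s in steps):
            return t
    return None  # only reached when no thresholds were supplied


# Hypothetical cohesion curve, not data from FIG. 5B.
curve = {2: 1.0, 3: 2.0, 4: 3.0, 5: 3.05, 6: 3.06}
print(pick_threshold(curve, tolerance=0.1))
```

The plateau rule in `pick_threshold` is one way to automate what the text describes as reading the flattening point off the graph; the source itself leaves that choice to the user.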

Referring again to FIG. 1, the tag cluster generated by the tag clustering module 138 is passed to the topic map generator 150. The topic generation module 152 extracts the tags in the tag cluster as topics. The extracted topics are passed to the topic pair extraction module 154. The topic pair extraction module 154 extracts, from among the topics, the topics related to the related tag pairs as topic pairs in descending order of frequency. The extracted topic pairs are then passed to the semantic relation generation module 156, which extracts the semantic relation of each topic pair by applying a predetermined lexical knowledge model to it. The occurrence generation module 158 then assigns to each topic pair the address of content suited to that pair, reflecting the extracted semantic relation. The operation of each component of the topic map generator 150 is described in detail below.

The topic generation module 152 may provide a UI for constructing a topic map. The UI presents the user with information such as the name, description, creator, distributor, and creation date of the topic map. The topic generation module 152 automatically uses the tags from the tag clustering module 138 as topics. The base name of a topic lets the user tell the individual topics apart. The base topic of each topic map may be selected by the user, or it may be selected from the related tag pair with the highest frequency.

Once the topic generation module 152 has extracted the topics, the topic pair extraction module 154 extracts topic pairs from the tag cluster. In short, a tag cluster is made up of tags, and these tags become topics when they are used in a topic map. These terms are used in their ordinary sense in the art. The pseudo-code used by the topic pair extraction module 154 is, in brief, as follows.

// C(i) : cluster selected by the user

// A(i, j) : weight matrix of the tags included in C(i)

// Max(A(i, j)) : element with the maximum value among the elements of the weight matrix A(i, j)

// T(k) : set of all tags related to the tags of Max(A(i, j))

// B(l, m) : weight matrix of the tags included in T(k)

// Max(B(l, m)) : element with the maximum value among the elements of the weight matrix B(l, m)

// repeat until A(i, j) is empty

Repeat {

// extract the maximum element of the weight matrix

Extract Max(A(i, j))

// search C(i) for all tags related to the tags of Max(A(i, j)) and build T(k)

Find T(k)

// repeat until B(l, m) is empty

Repeat {

// extract Max(B(l, m)) from the elements of the weight matrix

Extract Max(B(l, m))

// remove Max(B(l, m)) from B(l, m)

Remove Max(B(l, m)) from B(l, m)

} until (B(l, m) == empty)

// remove B(l, m) from A(i, j)

Remove B(l, m) from A(i, j)

} until (A(i, j) == empty)

The pseudo-code above is explained below with reference to FIGS. 4A to 4E.

FIGS. 4A to 4E are diagrams conceptually illustrating how the topic pair extraction module 154 shown in FIG. 1 extracts topic pairs.

First, from among the generated tag clusters (see FIG. 4A), the element Max(A(i,j)) having the maximum weight is selected from the weight matrix A(i,j) of the cluster C(i) chosen by the user, and the selected tag i and tag j are extracted as a related tag pair (see FIG. 4B). Next, all tags related to tag i and tag j of Max(A(i,j)) are searched for in C(i) to build T(k) (see FIG. 4C). Then the element Max(B(l,m)) having the maximum value is selected from the weight matrix B(l,m) of the tags included in T(k) (see FIG. 4D). After that, Max(B(l,m)) is removed from B(l,m), and this process is repeated (see FIG. 4E).
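The topic pair extraction loop can be sketched in Python as below. This is only one reading of the pseudo-code: the function name `extract_topic_pairs` and the dictionary form of the weight matrix are assumptions, and B(l,m) is interpreted here as the pairs of A(i,j) whose two tags both belong to T(k).

```python
def extract_topic_pairs(weights):
    """Return topic pairs of one cluster, ordered as the nested loops would emit them.

    weights: dict mapping frozenset({tag_a, tag_b}) -> weight, standing in for A(i, j).
    """
    a = dict(weights)
    ordered = []
    while a:                                             # until A(i, j) is empty
        top = max(a, key=a.get)                          # Max(A(i, j))
        related = {t for p in a if p & top for t in p}   # T(k): tags tied to tag i or tag j
        b = {p: w for p, w in a.items() if p <= related} # B(l, m)
        # Emit the pairs of B in descending weight order (repeated Max/Remove).
        for pair in sorted(b, key=b.get, reverse=True):
            ordered.append((tuple(sorted(pair)), b[pair]))
        for pair in b:                                   # remove B(l, m) from A(i, j)
            del a[pair]
    return ordered


# Assumed weights for a school/classroom/teacher cluster.
cluster_weights = {
    frozenset({"school", "classroom"}): 3,
    frozenset({"classroom", "teacher"}): 2,
    frozenset({"teacher", "school"}): 2,
}
for pair, weight in extract_topic_pairs(cluster_weights):
    print(pair, weight)
```

Because the maximum pair always belongs to its own B(l,m), at least one pair leaves A(i,j) on every pass, so the outer loop terminates.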

Once the topic pair extraction module 154 has extracted the topic pairs, the semantic relation generation module 156 assigns semantic relations to them by applying a web-based ontology to the extracted topic pairs.

The foundation of an ontology is the set of concepts that exist in a given domain. For example, the topic "book" may have attributes such as author, publisher, number of pages, and price, and the topic "bid" may have attributes such as subject, date, method, and conditions. Topics can also be related to one another, the most basic relation being hierarchical inclusion. For example, the topic "children's storybook" is a sub-concept included in "book". As an ontology matures, it can carry richer content by defining the characteristics of attributes and more complex forms of relations. A system that treats the ontology as an independent central component and places it at the center of development and operation is an ontology-driven system, and the concept of the web ontology was introduced for this purpose. A web ontology is a definition or specification of vocabulary and concepts, and represents the components that correspond to the content a system handles in the field of information systems. In other words, an ontology describes a common vocabulary so that a computer can handle tasks that people used to judge or process intuitively or semantically, in order to build the semantic web. However, because representing every phenomenon is very difficult, the W3C designed a web ontology language, limited to the specific domain of the web, on the basis of the Extensible Markup Language (XML) and the Resource Description Framework (RDF).

The Web Ontology Language (OWL) is a semantic web markup language for publishing and sharing ontologies that support advanced web search, software agents, and knowledge management on the web. The ultimate goal of the semantic web is to make the web a source of knowledge that computers can also understand. It starts from the observation that the current web, made up of HTML documents, conveys information to people but does not let a computer program determine precisely what each document contains. OWL was developed as an extension of the Resource Description Framework (RDF) and originated from the DAML+OIL language. OWL is a language for defining web ontologies and the knowledge associated with them. It defines the propositions accumulated in an inference system, describes the relations between classes and their members, and consists of a set of classes and properties, together with the constraints applicable to them, that makes it possible to infer logically facts that are not stated syntactically.

A variety of ontology-based systems exist. Among them, knowledge query languages (e.g., KQML, Knowledge Query and Manipulation Language) and knowledge interchange formats (e.g., KIF, Knowledge Interchange Format) have been defined. In particular, DAML-OIL (DARPA Agent Markup Language - Ontology Inference Layer) of the U.S. Defense Advanced Research Projects Agency (DARPA) is accepted as a representative ontology representation language and format.

Alternatively, WordNet, an English-based lexical knowledge model, may be applied to the extracted related topic pairs. WordNet holds information on the semantic patterns and usage patterns of words, and can be regarded as a database of associations between words. The relations between two words, such as association, hypernym, hyponym, and synonym, can be derived through the publicly distributed Java WordNet Library (JWNL).

The semantic relation generation module 156 included in the tag clustering apparatus 100 according to the present invention can extract the relation between two selected topics from WordNet in order to assign a semantic relation to a topic pair automatically. For example, WordNet provides relations such as "has kind", "is a kind of", "has members", "is a member of", "has particulars", "is a particulars", "has part", and "is a part of". For the topic pair school-classroom, for instance, the relation "is part of" can be derived using WordNet, so the semantic relation "classroom is part of school" can be established.
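How a derived relation might be attached to a topic pair can be sketched as follows. The `RELATIONS` table is a hypothetical in-memory stand-in for a WordNet lookup and contains only the school-classroom example from the text. A real system would query WordNet itself, for example through JWNL, and nothing here reflects that library's API.

```python
# Hypothetical stand-in for WordNet: (subject, object) -> relation label.
RELATIONS = {
    ("classroom", "school"): "is part of",
}


def semantic_relation(topic_a, topic_b, relations=RELATIONS):
    """Return (subject, relation, object) for a topic pair, or None if no relation is known."""
    if (topic_a, topic_b) in relations:
        return (topic_a, relations[(topic_a, topic_b)], topic_b)
    if (topic_b, topic_a) in relations:
        return (topic_b, relations[(topic_b, topic_a)], topic_a)
    return None


triple = semantic_relation("school", "classroom")
if triple is not None:
    print(" ".join(triple))
```

The lookup checks both orders of the pair because the topic pair itself is unordered, while the relation has a direction.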

FIG. 6 shows an example of a topic map generated by the semantic relation generation module 156 shown in FIG. 1.

The topic map shown in FIG. 6 is built with "computer" as its base topic. In FIG. 6, relation (1) is a "has part of" relation, relation (2) is a "has company of" relation, and relation (3) is a "has kind of" relation. As shown, the topics in the topic map of FIG. 6 are not merely listed; the relations between them are also visible. Search performance is therefore improved.

The occurrence generating module 158 included in the topic map generator 150 of FIG. 1 assigns to each extracted topic pair the address of content suited to that pair. That is, the occurrence generation module 158 attaches the URL information of the corresponding content to each extracted topic. Because this step ties each topic to the content that corresponds to it, later search performance improves. For example, content carrying the tag "apple" may concern the fruit or Apple computers. In this case, the occurrence generation module 158 assigns the URLs of content about the fruit to the topics (that is, tags) in the topic map that represents apple the fruit, and assigns the URLs of content about the computer to the topics in the topic map that represents Apple the computer. As a result, apple the fruit and Apple the computer are clearly distinguished in search.
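The occurrence assignment can be sketched as below, assuming (as the claims suggest) that a topic pair receives the URLs of content tagged with both of its tags. The function name `assign_occurrences`, the data layout, and the example.com URLs are hypothetical.

```python
def assign_occurrences(topic_pairs, tagged_contents):
    """Map each topic pair to the URLs of contents that carry both tags of the pair.

    topic_pairs: iterable of (tag_a, tag_b) tuples.
    tagged_contents: dict mapping URL -> set of tags attached to that content.
    """
    occurrences = {}
    for a, b in topic_pairs:
        occurrences[(a, b)] = sorted(
            url for url, tags in tagged_contents.items() if a in tags and b in tags
        )
    return occurrences


# Hypothetical tagged contents illustrating the apple example.
contents = {
    "https://example.com/photo/1": {"apple", "fruit"},
    "https://example.com/photo/2": {"apple", "computer"},
    "https://example.com/photo/3": {"apple", "fruit", "tree"},
}
result = assign_occurrences([("apple", "fruit"), ("apple", "computer")], contents)
for pair, urls in result.items():
    print(pair, urls)
```

Requiring both tags is what separates the two senses of "apple" in this sketch: the fruit pair and the computer pair end up with disjoint URL lists.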

FIG. 7 is a flowchart of a tag clustering method according to another aspect of the present invention.

First, for each content item in a predetermined population, related tag pairs are extracted two tags at a time from the tags associated with the same content and mapped to each other (S710). Then the frequency, that is, the number of times each extracted related tag pair occurs across the contents, is extracted (S720).

The extracted frequencies are used to generate the weight matrix (S730). The weight matrix is generated in the manner described above for the weight matrix generation module 136 of FIG. 1.
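Steps S710 to S730 can be sketched as follows. The sketch assumes that the weight of a pair is simply its co-occurrence count; the exact weighting used by the weight matrix generation module 136 is described elsewhere in the specification and may differ. The function names and sample tag lists are hypothetical.

```python
from collections import Counter
from itertools import combinations


def pair_frequencies(contents):
    """Count how often each unordered tag pair occurs on the same content (S710, S720)."""
    counts = Counter()
    for tags in contents:
        # Every two distinct tags on one content item form a related tag pair.
        for a, b in combinations(sorted(set(tags)), 2):
            counts[frozenset({a, b})] += 1
    return counts


def weight_matrix(counts):
    """Build a symmetric nested-dict matrix with the pair frequency as weight (S730, assumed)."""
    matrix = {}
    for pair, freq in counts.items():
        a, b = sorted(pair)
        matrix.setdefault(a, {})[b] = freq
        matrix.setdefault(b, {})[a] = freq
    return matrix


# Hypothetical tag lists, one per content item.
sample = [
    ["school", "classroom", "teacher"],
    ["school", "classroom"],
    ["classroom", "school", "me"],
]
matrix = weight_matrix(pair_frequencies(sample))
print(matrix["school"]["classroom"])
```

Using `frozenset` keys keeps the pair unordered, so school-classroom and classroom-school are counted as the same related tag pair.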

Then a tag cluster is generated by extracting from the weight matrix only those related tags, among the related tag pairs, whose frequency is greater than or equal to the threshold (S740). As described above, the number of tags included in the tag cluster and their cohesion may be consulted when determining this threshold.

Once the tag cluster has been generated, topics and related topic pairs are extracted from it (S750). As described above, "topic" and "related topic pair" are terms used in ontologies and correspond to the tag and the related tag pair used in the tag cluster, respectively.

Then semantic relations are assigned to the extracted related topic pairs using an ontology (S760). Finally, the address of content suited to each topic pair is assigned to it (S770).

The tag clustering method and apparatus according to the present invention overcome the degradation of search results caused by inaccurate tags and the inefficiency of navigation caused by unstructured tags. For example, comparing the search results of Flickr, a tag-based site, with search results obtained by applying the tag clustering method of the present invention gives the following outcome.

The comparison was carried out using the tags attached to 120 images retrieved for each of the keywords computer, apple, and jaguar. The precision and recall of the Flickr site averaged 45.8%, whereas the system to which the present invention was applied achieved an average precision of 90.4% and an average recall of 42.8%.

The present invention has been described with reference to the embodiments shown in the drawings, but these are merely illustrative, and those of ordinary skill in the art will understand that various modifications and other equivalent embodiments are possible. For example, in addition to the CAST (Complexity Analysis of Sequence Tracts) algorithm applied in the tag clustering module 138 to extract highly relevant related tag pairs, techniques used in bioinformatics such as SEQ (Application in GCG), Sequence Clustering, BLASTCLUST, PROCLUST, TribeMCL, and GeneRAGE may of course be applied. That is, any technique that can form a tag cluster by extracting, from the related tag pairs, those having at least a predetermined weight may be applied to the tag clustering module 138.

Accordingly, the true scope of technical protection of the present invention should be determined by the technical spirit of the appended claims.

The present invention can be applied to tag-based search systems.

FIGS. 3A and 3B are diagrams illustrating the process of extracting high-frequency tags from the weight matrix based on frequency.

FIGS. 5A and 5B are graphs illustrating the process of determining the threshold in the tag clustering module 138 shown in FIG. 1.

Claims

A tag clustering apparatus, comprising:

a related tag mapping module that extracts related tag pairs, two tags at a time, from the tags associated with the same content among the contents included in a predetermined population, and maps them to each other;

a frequency extraction module that extracts the frequency with which the same tags appear during the tag mapping process;

a weight matrix generation module that generates a weight matrix based on the frequencies of the related tag pairs; and

a tag clustering module that generates a tag cluster by extracting from the weight matrix only those related tags, among the related tag pairs, whose frequency is greater than or equal to a threshold,

wherein the related tag mapping module treats tags representing different pronunciations of the same foreign word, tags representing a full term and its abbreviation, and tags representing synonyms as the same tag when extracting the related tag pairs.

The apparatus of claim 1, further comprising:

a topic generation module that extracts the tags in the tag cluster as topics;

a topic pair extraction module that extracts, from among the topics, topics related to the related tag pairs as topic pairs in descending order of frequency;

a semantic relation generation module that extracts the semantic relations of the topic pairs by applying a predetermined lexical knowledge model to the extracted topic pairs; and

an occurrence generation module that assigns to each topic pair the address of content suited to that topic pair.

The apparatus of claim 2, wherein the tag clustering module

generates the tag cluster by selecting a base related tag pair from among the related tag pairs and repeatedly selecting other related tag pairs that include the respective tags of the selected base related tag pair.

The apparatus of claim 3, wherein the tag clustering module

selects the related tag pair having the highest weight as the base related tag pair, and

determines the threshold in consideration of the average number of tags included in the generated tag cluster and the weights of the related tag pairs.

The tag clustering apparatus based on related tags, wherein the base related tag pair and the threshold are determined according to a user selection.

The apparatus of claim 2,

wherein the lexical knowledge model includes at least one of the Resource Description Framework (RDF), the Knowledge Query and Manipulation Language (KQML), the DARPA Agent Markup Language-Ontology Inference Layer (DAML-OIL), the Web Ontology Language (OWL), and a topic map.

The apparatus of claim 2, wherein the occurrence generation module

assigns to the topic pair the Uniform Resource Locator (URL) of content that includes all tags of the related tag pair corresponding to the topic pair.

A tag clustering method, comprising:

a related tag mapping step of extracting related tag pairs, two tags at a time, from the tags associated with the same content among the contents included in a predetermined population, and mapping them to each other;

a frequency extraction step of extracting the frequency with which the same tags appear during the tag mapping process;

a weight matrix generation step of generating a weight matrix based on the frequencies of the related tag pairs; and

a tag clustering step of generating a tag cluster by extracting from the weight matrix only those related tags, among the related tag pairs, whose frequency is greater than or equal to a threshold,

wherein the related tag mapping step treats tags representing different pronunciations of the same foreign word, tags representing a full term and its abbreviation, and tags representing synonyms as the same tag.

The method of claim 8, further comprising:

a topic generation step of extracting the tags in the tag cluster as topics;

a topic pair extraction step of extracting, from among the topics, topics related to the related tag pairs as topic pairs in descending order of frequency;

a semantic relation generation step of extracting the semantic relations of the topic pairs by applying a predetermined lexical knowledge model to the extracted topic pairs; and

an occurrence generation step of assigning to each topic pair the address of content suited to that topic pair.

The method of claim 9, wherein the tag clustering step comprises:

selecting a base related tag pair from among the related tag pairs;

selecting other related tag pairs that include the respective tags of the selected base related tag pair; and

repeatedly selecting further related tag pairs that include the respective tags of the other related tag pairs thus selected.

The method of claim 10, wherein the tag clustering step comprises:

selecting the related tag pair having the highest weight as the base related tag pair; and

determining the threshold in consideration of the average number of tags included in the generated tag cluster and the weights of the related tag pairs.

The method of claim 10, wherein the tag clustering step comprises

determining the base related tag pair and the threshold according to a user selection.

The method of claim 9,

wherein the lexical knowledge model includes at least one of RDF, KQML, DAML-OIL, OWL, and a topic map.

The method of claim 9, wherein the occurrence generation step comprises

assigning to the topic pair content that includes all tags of the related tag pair corresponding to the topic pair.