JP2011216021A

JP2011216021A - Clustering device, clustering method and clustering program

Info

Publication number: JP2011216021A
Application number: JP2010085418A
Authority: JP
Inventors: Takeharu Eda; 毅晴江田; Tomoharu Iwata; 具治岩田; Toshiro Uchiyama; 俊郎内山; Masashi Uchiyama; 匡内山
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2010-04-01
Filing date: 2010-04-01
Publication date: 2011-10-27
Anticipated expiration: 2030-04-01
Also published as: JP5439261B2

Abstract

PROBLEM TO BE SOLVED: To resolve tag ambiguity and prevent a sharp increase in the number of tags when tags are reused. SOLUTION: A dendrogram is constructed by performing hierarchical clustering on the entire tag set, and a bottom-up index, from which an upper layer can be identified from a lower layer, is generated in advance. When a higher-level application makes a request, the entire tag set is clustered into a plurality of partial tag sets by referring to the generated index.

Description

The present invention relates to a technique for reusing classification axes assigned to resource information in a collaborative classification system.

In recent years, collaborative classification systems (collaborative/social tagging systems) have flourished. In these systems, third parties other than the original creators classify and organize various kinds of resource information, such as URLs (bookmarks), photographs, videos, still images, books, and papers, and the classification results are widely shared so that fresh information can be provided to viewers.

FIG. 15 is a diagram showing an example of classification using classification axes (hereinafter referred to as tags) in a collaborative classification system. A URL is shown as an example of resource information. Based on the content of the URL, "Soccer news: Today's results", a classifier (the third party mentioned above) has generated several tags related to that content: "news", "soccer", and "interesting".

FIG. 16 is a diagram explaining the flow of reusing tags generated by classifiers. The collaborative classification system allows classifiers to create tags freely and to attach (associate) the created tags to the URL, and it stores the tags in advance in the classification result storage database 12 for reuse. That is, the tags, which can be regarded as the classifiers' classification results for the resource information, are collected, and in the process of generating the presentation information provided to viewers, the tags attached to describe the content of the URL are read from the classification result storage database 12 and reused.

On the other hand, because third parties are free to create tags and attach them to classification targets, problems such as tag ambiguity and an explosion in the number of tags have become apparent. A tag is ambiguous when a single tag has several meanings (a polysemous tag) or when tags are written differently but have the same meaning (synonymous tags). Such ambiguity, together with the explosion in the number of tags caused by influences such as the feelings and preferences of the third party (that is, the classifier) at the time of tagging, current trends, and the tendencies of tags attached by other third parties, makes tags difficult to reuse (see Non-Patent Document 1).

Despite these difficulties, tags attached by third parties are often used to describe tagged information. Such tags can accurately express the meaning of the resource information from a third-party viewpoint that is not necessarily present in the presentation information itself, and they make it possible to collect subjective tags expressing impressions and opinions (for example, "this is amazing", "cool", "read later") that are difficult to collect by automatic tagging (auto-tagging).

For example, services that support tagging include "flicker" (http://filckr.com) for photographs; the social bookmarking services "del.icio.us" (http://del.icio.us), "Hatena Bookmark" (http://b.hatena.ne.jp), and "goobookmark" (http://bookmark.goo.ne.jp); "youtube" (http://youtube.com) for videos; "Amazon (US)" (http://www.amazon.com) for books; and "CiteULike" (http://www.citeulike.org/) for academic papers. In actual services, the required number of tags is selected from the set of attached tags in descending order of usage frequency and used to describe the presentation information.

The question here is how many tags should be used to generate the presentation information. In product recommendation, there is a general hypothesis that recommending a more diverse range of products improves user satisfaction. Based on this hypothesis, a method for diversifying recommendation results has been proposed (Non-Patent Document 2). Diversification methods based on the same hypothesis have also been proposed for web search and image search (see Non-Patent Documents 3 and 4 for web search, and Non-Patent Documents 5 and 6 for image search). All of these are post-processing methods: after diverse products have been recommended to a user, the recommended products are clustered on demand in response to the user's request. FIG. 17 shows one such conventional method in which, after a request from the user, the region within a distance |m| of the request point r is computed and divided along specific directions to cluster the products. Here, clustering means dividing a whole set into subsets that cover part or all of it.

In other words, if the products above are regarded as tags, the conventional approach has been to generate the information to be provided using as many tags as possible.

Non-Patent Document 1: Scott Golder and 1 other, "The Structure of Collaborative Tagging Systems", Journal of Information Science, 2006
Non-Patent Document 2: Cai-Nicolas Ziegler and 3 others, "Improving recommendation lists through topic diversification", Proc. WWW, 2005
Non-Patent Document 3: Filip Radlinski and 1 other, "Improving personalized web search using result diversification", Proc. SIGIR, 2006
Non-Patent Document 4: Rakesh Agrawal and 1 other, "Diversifying search results", Proc. WSDM, 2009
Non-Patent Document 5: Kai Song and 3 others, "Diversifying the image retrieval results", Proc. ACM Multimedia, 2006
Non-Patent Document 6: Reinier H. van Leuken and 3 others, "Visual diversification of image search results", Proc. WWW, 2009
Non-Patent Document 7: Toshihiro Kamishima, "Clustering Methods in the Data Mining Field (1)", Journal of the Japanese Society for Artificial Intelligence, Vol. 18, No. 1, January 2003

However, when many tags are used at the time of tag reuse, ambiguous tags are included as described above, and multiple tags expressing similar content may be collected and displayed. This lowers viewer satisfaction and reduces the reusability of the tags.

The present invention has been made in view of the above problem, and its object is to resolve tag ambiguity and prevent an explosion in the number of tags when tags are reused.

The invention according to claim 1 comprises hierarchical clustering means that constructs a dendrogram in which, among a plurality of classification axes characterizing predetermined information, classification axes with high similarity are successively merged in a hierarchy, searches the dendrogram to generate an index from which an upper layer can be identified from a lower layer, and stores the index in storage means.

According to this invention, a dendrogram is constructed in which, among a plurality of classification axes characterizing predetermined information, classification axes with high similarity are successively merged in a hierarchy, and an index from which an upper layer can be identified from a lower layer is generated by searching the dendrogram and stored in the storage means. The subsequent clustering process can therefore be made faster.

The invention according to claim 2 further comprises partial clustering means that refers to the index read from the storage means and repeatedly merges classification axes having the same upper layer into clusters until the desired number of clusters is reached.

According to this invention, the index read from the storage means is referred to, and classification axes having the same upper layer are repeatedly merged into clusters until the desired number of clusters is reached. Tag ambiguity can therefore be resolved and an explosion in the number of tags prevented.

The invention according to claim 3 comprises a step of constructing a dendrogram in which, among a plurality of classification axes characterizing predetermined information, classification axes with high similarity are successively merged in a hierarchy, searching the dendrogram to generate an index from which an upper layer can be identified from a lower layer, and storing the index in storage means.

The invention according to claim 4 further comprises a step of referring to the index read from the storage means and repeatedly merging classification axes having the same upper layer into clusters until the desired number of clusters is reached.

The invention according to claim 5 causes a computer to execute each of the steps according to claim 3 or 4.

According to the present invention, tag ambiguity can be resolved and an explosion in the number of tags prevented when tags are reused.

FIG. 1 is a diagram explaining the concept of clustering.
FIG. 2 is a diagram showing an example of a dendrogram constructed by hierarchical clustering.
FIG. 3 is a diagram explaining the advantages and disadvantages of describing presentation information with diverse tags.
FIG. 4 is a diagram schematically showing the functional block configuration of the overall system.
FIG. 5 is a diagram schematically showing the functional block configuration of the clustering device.
FIG. 6 is a diagram showing the overall processing flow of the clustering device.
FIG. 7 is a diagram explaining clustering into partial tag sets.
FIG. 8 is a diagram showing the processing flow of the hierarchical clustering unit.
FIG. 9 is a diagram showing an example of a dendrogram and a bottom-up index.
FIG. 10 is a diagram showing the processing flow of the partial clustering unit.
FIG. 11 is a diagram showing an example of the processing flow of the partial clustering unit.
FIG. 12 is a diagram explaining the transitions of partial clustering.
FIG. 13 is a diagram explaining the transitions of the cluster to which a tag belongs.
FIG. 14 is a diagram showing an example of a clustered tag set.
FIG. 15 is a diagram showing an example of classification using classification axes in a collaborative classification system.
FIG. 16 is a diagram explaining the flow of reusing tags generated by classifiers.
FIG. 17 is a diagram explaining the conventional concept of clustering.

An embodiment of the present invention is described below with reference to the drawings. The present invention can, however, be carried out in many different modes and should not be construed as limited to the description of this embodiment.

Before the configuration and processing of the clustering device according to this embodiment are described, an overview of the present invention is given to aid understanding. In the collaborative classification system described in the background art, when the information to be provided to viewers is generated using the assigned classification axes (hereinafter referred to as tags), this embodiment selects tags that are as diverse as possible, as quickly as possible, from the various tags created by third parties. This resolves tag ambiguity and prevents an explosion in the number of tags when tags are reused.

That is, as shown in FIG. 1, all tags are clustered in advance (pre-processing), taking into account multiple tags beyond the target tags shown in FIG. 17, and the result of this pre-clustering is used as an index so that tags can be diversified faster than with the conventional approach.

Clustering methods include partition-optimization methods and hierarchical methods (see Non-Patent Document 7); this embodiment uses a hierarchical method. That is, it uses a dendrogram over the tag set constructed by hierarchical clustering. The characteristics of a dendrogram constructed by hierarchical clustering are described below.

FIG. 2 is a diagram showing an example of a dendrogram constructed by hierarchical clustering. The dendrogram has a binary tree structure. If highly similar tags are merged step by step, the result is eventually the entire tag set (agglomerative type); conversely, if the entire tag set is repeatedly split in two, the result is eventually individual tags (divisive type). The present invention is applicable to both agglomerative and divisive algorithms; the divisive type is described here.

The numbers attached to the intermediate nodes in FIG. 2 indicate the order in which the whole set is divided, proceeding from upper-layer nodes toward lower-layer nodes. For example, starting from the root node, cutting the dendrogram at intermediate node 2 divides the whole set into two. Cutting next at intermediate node 3 divides it into three. Cutting further at intermediate node 4 divides it into four (A to D in FIG. 2); in effect, the whole set has been clustered according to the numbers of the intermediate nodes. Because of this property, once a dendrogram has been constructed over a tag set, the set can be cut into any number of clusters up to the total number of tags. Here, "cutting" means clustering.
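The cutting procedure described above can be sketched in Python as follows. This is a minimal illustration, not the implementation of the embodiment: the `Node` class, the `cut` function, and the six-tag dendrogram are hypothetical, and the sketch assumes that the root carries split number 2 so that splitting every intermediate node numbered k or less yields exactly k clusters.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Node:
    """Intermediate dendrogram node; `order` is its split number (assumed: root = 2)."""
    order: int
    left: Union["Node", str]
    right: Union["Node", str]


def leaves(tree):
    """Return all tags under a subtree, left to right."""
    if isinstance(tree, str):
        return [tree]
    return leaves(tree.left) + leaves(tree.right)


def cut(tree, k):
    """Split every intermediate node whose order is <= k, yielding k clusters."""
    if isinstance(tree, str) or tree.order > k:
        return [leaves(tree)]
    return cut(tree.left, k) + cut(tree.right, k)


# Hypothetical six-tag dendrogram (not the one shown in FIG. 2).
dendrogram = Node(2,
                  Node(3, Node(5, "T1", "T2"), "T3"),
                  Node(4, "T4", Node(6, "T5", "T6")))

for k in range(1, 7):
    print(k, cut(dendrogram, k))
```

Because a child node always carries a larger split number than its parent, the same tree serves every value of k from 1 up to the number of tags without being rebuilt.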

In ordinary partition-optimization methods, by contrast, the number of clusters into which the tag set should be cut depends on the data and is one of the parameters to be tuned. With the hierarchical clustering used in this embodiment, once a dendrogram has been constructed over the tag set, a clustering result with any number of clusters can be obtained simply by cutting the dendrogram. In other words, performing hierarchical clustering and generating a dendrogram amounts to being able to produce any desired number of clusters.

This embodiment assumes that a diverse tag set is useful. A diverse tag set is a collection of tags whose meanings are not similar to one another, or a collection that contains some similar tags but also tags with very different meanings. An example of a diverse tag set is "Web", "essay", "Google", "read later", "HTTP". A non-diverse tag set, by contrast, is one whose tags all express semantically close content, for example "Web", "Web2.0", "WWW", "Internet", "HTTP".

For reference, FIG. 3 shows the advantages and disadvantages of describing presentation information with diverse tags. A diverse tag set with low mutual similarity, such as "soccer", "Brazil", "news", and "funny plays", lets the viewer grasp the presented information more appropriately than a tag set with high mutual similarity, such as "サッカー" (Japanese for soccer), "goal", "news", and "soccer".

Next, the overall configuration of a system that includes the clustering device according to this embodiment is described. FIG. 4 is a diagram schematically showing the functional block configuration of the overall system. The overall system consists of client terminals 5, used by tag classifiers and viewers of presentation information (hereinafter collectively called users), and a collaborative classification system 1 that can communicate with the client terminals 5 via a communication network 3 such as the Internet.

The collaborative classification system 1 is provided as a network service, which users access through a web browser or client application installed on the client terminal 5. Specifically, it consists of a communication unit 11 such as a LAN or router, a classification result storage database 12 that stores data, and a clustering device 13 that performs the actual processing.

As described in the background art, the collaborative classification system 1 includes an application that allows classifiers to create tags freely and attach the created tags to classification targets; the specific functions and processing of that application are not described here. One example of the collaborative classification system 1 is a social bookmark service (a system in which users share their bookmarks over a network).

The communication unit 11 provides, at the network level, communication from the collaborative classification system 1 to external or internal destinations, as well as communication between its internal components.

As described in the background art, the classification result storage database 12 stores, in readable form, the tags freely created by classifiers on the basis of various kinds of resource information such as URLs (bookmarks), photographs, videos, still images, books, and papers, as well as the tags attached to the resource information.

In this embodiment, it is sufficient that the tags have been generated in advance and stored in the classification result storage database 12. By executing the processing described below, the clustering device 13 obtains the same effect regardless of how the tags were generated or what kind of tags they are.

As shown in FIG. 5, the clustering device 13 consists of a communication interface 131, a hierarchical clustering unit 132, a partial clustering unit 133, a representative tag selection unit 134, and a storage unit 135.

The communication interface 131 mediates communication with the communication unit 11 and the classification result storage database 12.

The hierarchical clustering unit 132 consists of a clustering unit 132a, which hierarchically clusters a plurality of tags (hereinafter referred to as the entire tag set) to construct a dendrogram, and an index generation unit 132b, which searches the dendrogram to generate a bottom-up index.

The partial clustering unit 133 consists of an index reference unit 133a, which refers to the generated index, and a sorting unit 133b, which uses the referenced index to cluster the entire tag set into a plurality of partial tag sets.

The representative tag selection unit 134 selects a representative tag from each partial tag set.

The storage unit 135 stores the constructed dendrogram and the generated index in readable form. The storage unit 135 can be implemented, for example, with memory such as ROM or RAM, or with a storage device such as a hard disk.

The functional units of the clustering device 13 described above can be implemented on a single server device, or in a configuration in which the functions are distributed across multiple server devices.

Next, the processing flow of the clustering device 13 configured as above is described. FIG. 6 is a diagram showing the overall processing flow of the clustering device.

First, the hierarchical clustering unit 132 hierarchically clusters the entire tag set to construct a dendrogram and generates a bottom-up index in advance (S1).

Next, when a request arrives from the client terminal 5 or another higher-level application, the partial clustering unit 133 refers to the index generated in S1 and clusters the entire tag set into a plurality of partial tag sets (S2).

Finally, the representative tag selection unit 134 selects the most frequently occurring tag from each individually clustered partial tag set as its representative tag, and the resulting set of k diverse tags is returned to the requesting higher-level application (S3).
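Step S3 can be illustrated with the short Python sketch below. It assumes that a usage count is available for every tag; the function name `select_representatives`, the example clusters, and the counts are hypothetical, and breaking ties in favor of the tag listed first is an arbitrary choice made for this sketch.

```python
from collections import Counter


def select_representatives(clusters, tag_counts):
    """Pick the most frequently used tag from each cluster (ties: first listed)."""
    return [max(cluster, key=lambda tag: tag_counts[tag]) for cluster in clusters]


# Hypothetical partial tag sets produced by S2 and hypothetical usage counts.
clusters = [["soccer", "football", "goal"], ["news", "headline"], ["funny"]]
usage = Counter({"soccer": 120, "football": 45, "goal": 30,
                 "news": 200, "headline": 12, "funny": 8})

print(select_representatives(clusters, usage))
```

Since one tag is taken from each of the k clusters, the returned list has exactly k entries, one per group of similar tags.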

The reason why the generated index is bottom-up, and why clustering is performed using that bottom-up index, is explained here.

It is also possible to generate the index top-down and cluster into partial tag sets. Consider, for example, the case shown in FIG. 7, in which clustering into partial tag sets is performed using only the hatched tags (T1, T4, T7, T9, T13, T14) rather than all of the tags (T1 to T15). In this case, clustering can be achieved with fewer cuts than when all of the tags (T1 to T15) are clustered into partial tag sets, so the division order of the intermediate nodes (the circled numbers in FIG. 7) differs from that shown in FIG. 2, and the number of partial tag sets obtained also differs.

When a set is divided at an intermediate node, clustering into partial tag sets can be achieved by checking whether each resulting subset contains a hatched tag.

However, whether a subset contains a hatched tag cannot be determined without following the dendrogram down to the leaf nodes (the tags in the lowest layer), so clustering into partial tag sets cannot be done efficiently this way.

In this embodiment, therefore, the dendrogram over the entire tag set is materialized in advance as a bottom-up index, and clustering into partial tag sets proceeds bottom-up.

Next, the processing flow of the hierarchical clustering unit 132 is described in detail. FIG. 8 is a diagram showing the processing flow of the hierarchical clustering unit.

First, the clustering unit 132a reads a plurality of tags from the classification result storage database 12 and constructs a dendrogram in which highly similar tags are successively merged in a hierarchy (S11).

Finally, the index generation unit 132b searches the dendrogram constructed in S11, generates a bottom-up index from which an upper layer can be identified from a lower layer, and stores it in the storage unit 135 (S12).

As a result, S11 constructs a dendrogram (a binary tree over the entire tag set (T1 to T15)) as shown in FIG. 9(a), and S12 generates a bottom-up index as shown in FIG. 9(b).

In typical uses of hierarchical clustering, the dendrogram itself is used for little more than visualization aimed at understanding the relationships between tags. In this embodiment, however, the hierarchy is retained as a bottom-up index. A bottom-up index is an index keyed by the lower-layer nodes of the dendrogram: given a lower-layer node (hereinafter also called a child node), the corresponding upper-layer node (hereinafter also called a parent node) can be obtained. Once the dendrogram has been constructed, the bottom-up index can be generated by scanning the dendrogram once.
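A bottom-up index of this kind can be pictured as a child-to-parent mapping. The Python sketch below builds such a mapping in a single scan of a dendrogram. The nested-tuple representation, the node identifiers `N1` to `N4`, and the function name are assumptions made for illustration; the actual index layout of FIG. 9(b) is not reproduced here.

```python
def build_bottom_up_index(dendrogram):
    """Scan the dendrogram once and map each child id to its parent id."""
    index = {}
    stack = [dendrogram]
    while stack:
        node = stack.pop()
        if isinstance(node, str):      # leaf (a tag): nothing below it
            continue
        node_id, left, right = node
        for child in (left, right):
            child_id = child if isinstance(child, str) else child[0]
            index[child_id] = node_id  # key: lower-layer node, value: upper-layer node
            stack.append(child)
    return index


# Hypothetical dendrogram: (node id, left subtree, right subtree); leaves are tags.
dendrogram = ("N1", ("N2", "T1", "T2"), ("N3", "T3", ("N4", "T4", "T5")))
idx = build_bottom_up_index(dendrogram)

print(idx["T4"], idx[idx["T4"]])   # parent and grandparent of T4
```

Every node except the root appears exactly once as a key, so a lookup from any tag toward the root never has to descend into other subtrees.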

That is, because the hierarchical clustering unit 132 hierarchically clusters the entire tag set to construct a dendrogram and generates a bottom-up index in advance, the clustering performed later by the partial clustering unit 133 can be made faster.

Both divisive and agglomerative algorithms can be used for the hierarchical clustering. However, an agglomerative algorithm can generate the bottom-up index at the same time as it constructs the dendrogram, so it can produce the index faster than a divisive algorithm.
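The following Python sketch illustrates how an agglomerative procedure can record child-to-parent entries at the moment of each merge, so that the index is complete when the dendrogram is. The text above does not fix a similarity measure or linkage rule; single linkage, the one-dimensional `positions`, and all names here are assumptions chosen only to keep the example small.

```python
from itertools import combinations


def agglomerate(items, similarity):
    """Repeatedly merge the two most similar clusters, recording each merge
    as child -> parent entries of a bottom-up index."""
    clusters = {name: [name] for name in items}   # cluster id -> member tags
    index = {}
    next_id = 1
    while len(clusters) > 1:
        # Single linkage (assumed): similarity of the closest pair of members.
        a, b = max(
            combinations(clusters, 2),
            key=lambda pair: max(similarity(x, y)
                                 for x in clusters[pair[0]]
                                 for y in clusters[pair[1]]),
        )
        parent = f"N{next_id}"
        next_id += 1
        index[a] = parent                 # the index grows with every merge
        index[b] = parent
        clusters[parent] = clusters.pop(a) + clusters.pop(b)
    return index


# Hypothetical tags placed on a line; closer positions mean higher similarity.
positions = {"T1": 0.0, "T2": 0.1, "T3": 1.0, "T4": 1.2, "T5": 5.0}
index = agglomerate(positions, lambda x, y: -abs(positions[x] - positions[y]))
print(index)
```

A divisive algorithm would first have to finish the top-down splits and then scan the result to obtain the same mapping, which is the extra pass the agglomerative variant avoids.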

Next, the processing flow of the partial clustering unit 133 will be described in detail. FIG. 10 is a diagram showing the processing flow of the partial clustering unit.

First, the index reference unit 133a reads from the storage unit 135 the index generated by the hierarchical clustering unit 132 (see FIG. 9(b)) (S21).

Finally, the sorting unit 133b refers to the index read in S21 and repeatedly merges tags that share the same upper-layer (parent) node into partial tag sets (clusters) until the desired number of partial tag sets (number of clusters) is reached (S22).

That is, the partial clustering unit 133 is characterized by bottom-up processing: it first treats each tag as one cluster to initialize the number of clusters, and then merges clusters each time it finds a merge while climbing the dendrogram from the bottom. An example of the processing flow in S22 is described below, with T1, T4, T7, T9, T13, and T14 as the tags to be clustered. FIG. 11 is a diagram showing an example of the processing flow of the partial clustering unit.

First, the unit receives as input the designated number of clusters k requested by the application, the partial tag set T' to be clustered, and the bottom-up index IDX obtained in advance (S31). Here, k is 3, T' is T1, T4, T7, T9, T13, T14, and IDX is the index shown in FIG. 9(b).

Next, the following temporary variables are set (S32): the current temporary number of clusters c (= |T'|); the parent node list P, obtained by looking up in the bottom-up index IDX the division orders of the parent nodes of the tags in T' and sorting them in descending order; and the position pointer cp, set to the division order of the parent of the node with the largest division order in P. Because T' is T1, T4, T7, T9, T13, T14, the values at this point are c = 6, P = 5, 7, 8, 11, 14, 15, and cp = 13.

Next, the temporary number of clusters c is compared with the designated number of clusters k (S33). If c is greater than k, the processing of S34 to S39 described below is repeated until c equals k.

If the comparison in S33 shows that c is greater than k, the division order of the parent of the node with the largest division order in the parent node list P is obtained from the bottom-up index IDX and set in the position pointer cp (S34). Because P has not yet changed, cp = 13, the same as its initial value, is set (see time point A in FIGS. 12 and 13).

Next, it is determined whether the position pointer cp newly set in S34 is contained in the parent node list P (S35). At time point A in FIGS. 12 and 13, cp = 13 is not contained in P.

If the determination in S35 shows that cp is not contained in P, the division order of the parent of the node with the largest division order in P is obtained from IDX, the largest division order in P is replaced with that parent's division order, P is re-sorted in descending order, and the process returns to S33 (S36). This sets P = 5, 7, 8, 11, 13, 14.

Thereafter, the processing of S33 and S34 sets cp = 12 (see time point B in FIGS. 12 and 13). Likewise, the processing of S36, S33, and S34 sets P = 5, 7, 8, 11, 12, 13 and cp = 12 (see time point C in FIGS. 12 and 13).

If the determination in S35 shows that cp is contained in P, another partial tag set can be judged to exist under the same parent node as the partial tag set processed so far. The two partial tag sets are therefore merged by deleting the largest division order from P (S37).

Next, the parent node list P is re-sorted in descending order (S38), 1 is subtracted from the temporary number of clusters c (S39), and the process returns to S33.

Thereafter, the processing of S33 and S34 sets cp = 4 (see time point D in FIGS. 12 and 13). Suppose that, after the processing of S33 to S39 has been repeated in the same manner, the current point of processing is time point E in FIGS. 12 and 13.

If the comparison in S33 shows that c is not greater than k, the partial tag sets clustered into k groups are output (S40). In this example, the output consists of the partial tag set whose parent node is P = 3 (T13 and T14), the one whose parent node is P = 4 (T1 and T4), and the one whose parent node is P = 5 (T7 and T9). As described earlier, after S40 the representative tag selection unit 134 selects the most frequent tag from each partial tag set as its representative, and the resulting diverse set of k tags is returned to the requesting upper-level application.
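
The bottom-up loop can be illustrated with the sketch below. It is a simplified variant of S31 to S40, not a transcription of it: instead of the parent node list P and the position pointer cp, it keeps a map from each cluster's current node to its member tags and always climbs the cluster whose parent has the largest division order. It assumes that every child has a larger division order than its parent and that 1 <= k <= |T'|. The index `IDX` is a hypothetical eight-tag example, not the index of FIG. 9(b), and the function name is illustrative.

```python
def partial_clustering(tags, k, idx):
    """Merge the given tags into k clusters by climbing the bottom-up index."""
    if not 1 <= k <= len(tags):
        raise ValueError("k must be between 1 and the number of tags")
    # Each tag starts as its own cluster, located at its leaf node.
    clusters = {tag: [tag] for tag in tags}
    while len(clusters) > k:
        # Climb the cluster whose parent is deepest (largest division order).
        # The root has no entry in idx, so a cluster already at the root is skipped.
        node = max((n for n in clusters if n in idx), key=lambda n: idx[n])
        parent = idx[node]
        members = clusters.pop(node)
        if parent in clusters:
            # Another cluster already sits at this parent: merge the two.
            clusters[parent] = clusters[parent] + members
        else:
            # No cluster at the parent yet: just move this cluster up.
            clusters[parent] = members
    return sorted(sorted(m) for m in clusters.values())


# Hypothetical bottom-up index (child -> parent) for a balanced tree over T1..T8.
IDX = {
    "T1": 4, "T2": 4, "T3": 5, "T4": 5,
    "T5": 6, "T6": 6, "T7": 7, "T8": 7,
    4: 2, 5: 2, 6: 3, 7: 3,
    2: 1, 3: 1,
}

result = partial_clustering(["T1", "T3", "T5", "T7", "T8"], 3, IDX)
print(result)
```

Because only index lookups are needed, the loop never recomputes distances between tags; its cost depends on how far the requested tags must climb, not on the size of the entire tag set.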

As a result, as shown in FIG. 14, a diverse tag set that raises viewer satisfaction can be provided at high speed from the large number of tags assigned by third parties.

According to the present embodiment, hierarchical clustering is performed on the entire tag set to construct a dendrogram, and a bottom-up index from which the upper layer can be identified from the lower layer is generated in advance. When a request arrives from an upper-level application, the entire tag set is clustered into a plurality of partial tag sets by referring to the generated index. Clustering into partial tag sets can therefore be executed at high speed, and, when tags are reused, tag ambiguity can be resolved and an explosion in the number of tags can be prevented.

In addition, the set of tags collected in the cooperative classification system can be divided into two groups, objective tags on which users broadly agree and subjective tags that act as noise when used as classification axes, or such a division by hand can be supported. This makes it easier for the service provider to utilize the tags.

One tag selection method used in ordinary social bookmarking services is to take the top k tags in descending order of tagging count. Ordering by tagging count alone, however, does not necessarily select a diverse tag set of the kind described in the present embodiment.
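
The difference between the two selection methods can be seen in a small example. The tag names, tagging counts, and clusters below are invented for illustration and are not taken from the source; the helper names are likewise hypothetical.

```python
from collections import Counter

# Hypothetical tagging counts and partial tag sets (clusters).
counts = Counter({"news": 50, "headline": 40, "press": 30, "soccer": 20, "funny": 5})
clusters = [["news", "headline", "press"], ["soccer"], ["funny"]]


def top_k_by_count(counts, k):
    """Baseline: the k tags with the highest tagging counts."""
    return [tag for tag, _ in counts.most_common(k)]


def representatives(clusters, counts):
    """One representative per cluster: its most frequently assigned tag."""
    return [max(cluster, key=lambda tag: counts[tag]) for cluster in clusters]


top3 = top_k_by_count(counts, 3)
diverse = representatives(clusters, counts)
print(top3)
print(diverse)
```

With these made-up numbers, the count-based top three all come from the same cluster, whereas picking one representative per cluster covers all three.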

Another conceivable method is a supervised one in which a dictionary of the categories to which tags belong is built by hand and a tag set is chosen so that its tags belong to as many different categories as possible. In a cooperative classification system, however, third parties keep adding tags every hour and every minute, so a dictionary-based technique has difficulty inferring the category of unknown tags or tags with unclear meaning (for example, emoticons and pictograms). In the present embodiment, clustering uses distances between tags that reflect their similarity and are derived solely from the results of tagging by third parties, so a diverse tag set can be obtained in an unsupervised manner without requiring a dictionary or the like.

Finally, the clustering apparatus described in the present embodiment is implemented by a computer, and each process of each functional block is executed by a program. The clustering apparatus described in the present embodiment may also be recorded as a program, in readable form, on a recording medium such as an optical or magnetic storage device. Needless to say, a computer can be made to function as the clustering apparatus and perform the processing operations described above by incorporating the recording medium into the computer, by downloading the program recorded on the recording medium to the computer over any communication line, or by installing it from the recording medium, and then running the computer under the program.

DESCRIPTION OF SYMBOLS
1 … Cooperative classification system
11 … Communication unit
12 … Classification result storage database
13 … Clustering apparatus
131 … Communication interface
132 … Hierarchical clustering unit
132a … Clustering unit
132b … Index generation unit
133 … Partial clustering unit
133a … Index reference unit
133b … Sorting unit
134 … Representative tag selection unit
135 … Storage unit
3 … Communication network
5 … Client terminal
S1 to S3, S11 to S12, S21 to S22, S31 to S40 … Steps

Claims

A clustering apparatus comprising:
hierarchical clustering means for constructing a tree diagram by successively merging classification axes of high similarity among a plurality of classification axes that characterize predetermined information, searching the tree diagram to generate an index from which an upper layer can be identified from a lower layer, and storing the index in storage means.

The clustering apparatus according to claim 1, further comprising:
partial clustering means for referring to the index read from the storage means and repeatedly merging classification axes that share the same upper layer into clusters until a desired number of clusters is reached.

A clustering method comprising:
a step of constructing a tree diagram by successively merging classification axes of high similarity among a plurality of classification axes that characterize predetermined information, searching the tree diagram to generate an index from which an upper layer can be identified from a lower layer, and storing the index in storage means.

The clustering method according to claim 3, further comprising:
a step of referring to the index read from the storage means and repeatedly merging classification axes that share the same upper layer into clusters until a desired number of clusters is reached.

A clustering program that causes a computer to execute each step according to claim 3 or 4.