JP2010128771A

JP2010128771A - Clustering result display device, method thereof and program

Info

Publication number: JP2010128771A
Application number: JP2008302507A
Authority: JP
Inventors: Katsunori Kawaguchi; 克則川口; Satoshi Tokuno; 聡得能; Kazuo Mogi; 一男茂木; Akira Nakayama; 彰中山; Sadao Ishizaki; 貞大石崎
Original assignee: NTT Communications Corp
Current assignee: NTT Communications Corp
Priority date: 2008-11-27
Filing date: 2008-11-27
Publication date: 2010-06-10
Anticipated expiration: 2028-11-27
Also published as: JP5227146B2

Abstract

PROBLEM TO BE SOLVED: To allow clusters that suit an individual user's interests and preferences to be displayed at a higher rank, and to allow cluster indexes to be displayed using phrases known to the user, when displaying a clustering result.

SOLUTION: A document set obtained by searching a network with a search word input by a user is clustered by a predetermined method into a plurality of clusters, and one or more cluster representative elements are extracted for each cluster. Then, with reference to the previously prepared interest elements and interest element scores of the user or of a profile corresponding to the user, the total of the interest element scores of the interest elements that match the cluster representative elements of a cluster is obtained as the cluster score of that cluster, for every cluster. Each cluster is assigned a rank according to the obtained score, and the clustering result is displayed according to the rank order.

COPYRIGHT: (C)2010,JPO&INPIT

Description

The present invention relates to a clustering result display apparatus, a method thereof, and a program for displaying the result of clustering a large number of documents obtained by searching a communication network based on an arbitrary search word.

Clustering in network information retrieval refers to a technique for forming, from the set of documents hit by a search with an arbitrary search word, a plurality of subsets (clusters) grouped by some similarity among the documents. Representative clustering techniques are described in, for example, Non-Patent Document 1. Patent Document 1 discloses an invention in which the display order of the clusters obtained by clustering is rearranged based on the scores assigned to the individual retrieved documents.

In recent years, several services have emerged that present the result of clustering the document set returned by a Web search engine, using representative words that represent the documents belonging to each cluster (for example, Non-Patent Documents 2 and 3). In these services, selecting a representative word lets the user efficiently obtain the desired subset of the retrieved documents.

[Patent Document 1] International Publication No. 00/075809 pamphlet
[Non-Patent Document 1] Kazuaki Kishida, "Document Clustering Techniques: Literature Review", [online], Library and Information science, No. 49, 2003, [retrieved September 29, 2008], Internet <URL: http://koara.lib.keio.ac.jp/xoonips/modules/xoonips/download.php?file_id=20480>
[Non-Patent Document 2] Group Net Co., Ltd., "What is clustering?", [online], 2008, Group Net Co., Ltd., [retrieved September 29, 2008], Internet <URL: http://www.groupnet.co.jp/products/velocity/clustering.html>
[Non-Patent Document 3] Cuil, "Drilldowns", [online], 2008, Cuil, Inc., [retrieved September 29, 2008], Internet <URL: http://www.cuil.com/info/features/>

In the invention disclosed in Patent Document 1 and in the conventional services of Non-Patent Documents 2 and 3, results are displayed uniformly to all users without taking the interests and preferences of individual users into account.

FIG. 20(a) shows an example of search and clustering results for the search word "jaguar". In this example, a document set of 100 documents is classified into five clusters. With conventional services, when a clustering result such as that of FIG. 20(a) is displayed, the clusters are simply sorted by the number of documents they contain, as shown in FIG. 20(b), and the representative word extracted for each cluster is displayed as-is as the index of that cluster. As a result, the following problems can arise.

(1) The cluster the user wants is not displayed near the top
Normally, the display order of clusters is determined by a measure unrelated to the interests and preferences of individual users, such as the number of documents in each cluster, and is the same for everyone. Consequently, when a user is looking for a minor or specialized topic, a cluster rich in that topic is pushed down the list because it contains few documents, and finding it takes effort. Users are also likely to select an unsuitable cluster (one different from the cluster they want) more often, which incurs extra search cost such as backtracking and re-entry. These problems are especially pronounced when clustering produces a large number of clusters.

(2) The representative word of a cluster is unknown to the user
Each cluster is displayed with a representative word, which represents the documents belonging to it, as its index. If this representative word is unknown to the user, the user cannot infer the contents of the cluster and has difficulty selecting the cluster he or she wants.

An object of the present invention is to realize a clustering result display apparatus and method that, when displaying a clustering result, can display clusters matching the interests and preferences of individual users at higher ranks and can display cluster indexes using words and phrases known to the user.

The clustering result display apparatus of the present invention comprises a search unit, a clustering unit, an interest information storage unit, a calculation unit, and a visualization output unit.

The search unit searches the communication network for a search word input by the user, using an arbitrary search engine, and collects the matching document set.

The clustering unit clusters the collected document set by a predetermined method to classify it into a plurality of clusters and, for each cluster, extracts one or more representative elements and outputs them with cluster representative element scores attached.

The interest information storage unit stores, for each user or profile, one or more interest elements together with their interest element scores.

The calculation unit refers to the interest elements and interest element scores, stored in the interest information storage unit, of the user or of a profile corresponding to the user. For each cluster, it takes the elements for which an interest element matches a representative element of the cluster, sums the products of their interest element scores and cluster representative element scores to obtain the cluster score of that cluster, and assigns a rank to each cluster according to the score.

The visualization output unit visualizes and outputs the rank of each cluster, using, for each cluster, the representative element with the largest representative element score as the index of that cluster.

According to the clustering result display apparatus and method of the present invention, when a clustering result is displayed, clusters that match the interests and preferences of individual users can be displayed at higher ranks, and cluster indexes can be displayed using words and phrases known to the user.

[First Embodiment]
FIG. 1 shows an example of the functional configuration of the clustering result display device 10 of the present invention, and FIG. 2 shows an example of its processing flow. The clustering result display device 10 includes a search unit 11, a clustering unit 12, an interest information storage unit 13, a calculation unit 14, and a visualization output unit 15.

The search unit 11 searches the communication network 2, using an arbitrary search engine 3, for a search word w input by user u (u is a user number) from the user terminal 1 via the communication network 2, and collects the matching document set S (S1).

The clustering unit 12 first clusters the collected document set S to classify it into a plurality of clusters C_n (n is a cluster number). Then, for each cluster C_n, it extracts one or more cluster representative elements R_{n,i} (i is a representative element number) and outputs C_n and R_{n,i} (S2). A cluster representative element may be, for example, a representative word or a representative URL, but any element may be used as long as it can serve as an index expressing the contents or characteristics of the cluster. Because these are representative elements, the number extracted per cluster is preferably limited to a few. Alternatively, for example, only those elements whose cluster representative element score (described later) is at or above an arbitrarily set threshold may be extracted.

Any technique may be used for clustering and for extracting cluster representative elements; for example, STC (Suffix Tree Clustering), described in Non-Patent Document 1, can be used.

FIG. 3 shows an example in which a search and clustering were performed for the search word w = "jaguar" and cluster representative elements were extracted. In this example, a document set S of 100 documents is classified into five clusters with B_n documents each. For cluster C_2, for instance, the number of documents B_2 is 26 and two cluster representative elements (representative words) R_{2,i} have been extracted (R_{2,1} = feline, R_{2,2} = leopard).

The interest information storage unit 13 stores in advance, for each user or profile, one or more interest elements A_{u,j} (u is a user number or the number of the profile corresponding to user u, and j is an interest element number) together with their interest element scores SA_{u,j}. The type of interest element is made the same as the type of cluster representative element. For example, if the representative elements are representative words, the interest elements are interest words; if they are representative URLs, the interest elements are interest URLs.

A user's interest elements are keywords or the like that indicate the hobbies and preferences of an individual user. A profile's interest elements, by contrast, do not pertain to an individual user; they are keywords or the like that indicate the hobbies and preferences of the group of users belonging to a specific profile, such as an age group, gender, or address.

An interest element score indicates the degree of interest in each interest element for an individual user or an individual profile; for example, the degree of interest is expressed as the magnitude of the score.

Any technique may be used to obtain the interest elements and interest element scores. For example, they can be obtained by applying the TF-IDF method (see, for example, Reference 1) to the user's Web browsing history.
[Reference 1] Takenobu Tokunaga, "Information Retrieval and Language Processing", Language and Computation 5, University of Tokyo Press, November 1999, pp. 27-33

FIG. 4 shows an example of interest information for two users. In this example the degree of interest is expressed as the magnitude of the score, with a score of 1 indicating maximum interest and a score of 0 indicating minimum interest. For user 1, for example, six interest elements (interest words) have been extracted. The interest element score SA_{1,4} of the fourth interest element A_{1,4}, "pro-wrestling", is 0.6, and the interest element score SA_{1,6} of the sixth interest element A_{1,6}, "car", is 0.1, so user 1 has a greater degree of interest in "pro-wrestling" than in "car".

The calculation unit 14 first refers to the interest elements A_{u,j} and interest element scores SA_{u,j}, stored in the interest information storage unit 13, of user u or of the profile corresponding to user u, and extracts the elements for which a cluster representative element R_{n,i} of a cluster C_n extracted by the search unit 11 matches an interest element A_{u,j}. Next, for each cluster C_n, it sums the interest element scores SA_{u,j} of the matching elements and takes this sum as the cluster score SC_{u,n} of cluster C_n for user u. It then outputs the ranks P_u assigned to the clusters according to the cluster scores obtained in this way (S3). If no elements match at all, ranks may be assigned, for example, in order of cluster number or in order of the number of documents in each cluster.

A method for obtaining the cluster ranks P_u for user 1, based on the clustering result example of FIG. 3 and the interest information example of FIG. 4, is described with reference to FIG. 5. First, the elements for which a cluster representative element R_{n,i} of a cluster C_n extracted by the search unit 11 matches an interest element A_{1,j} are "car" for cluster C_1, "MacOS (registered trademark) X" and "Jobs" for C_3, and "pro-wrestling" for C_4, so these elements are extracted. Next, the total of the interest element scores SA_{1,j} of the matching elements for each cluster C_n is 0.1 (car) for cluster C_1, 0.7 (= 0.5 (MacOS X) + 0.2 (Jobs)) for C_3, and 0.6 (pro-wrestling) for C_4, so these become the cluster scores SC_{1,1}, SC_{1,3}, and SC_{1,4}, respectively. Clusters C_2 and C_5 have no matching elements, so SC_{1,2} and SC_{1,5} are set to 0. Because the example of FIG. 4 expresses the degree of interest as the magnitude of the score, ranking the SC_{u,n} from the largest value downward assigns ranks P_u to the clusters C_n in descending order of interest. In the example of FIG. 5, SC_{1,3} > SC_{1,4} > SC_{1,1} > SC_{1,2} = SC_{1,5}, so ranks P_u are assigned to the clusters C_n in this order. For clusters that tie, such as C_2 and C_5, a rule may be set as appropriate, for example ordering by the number of documents in each cluster.
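The ranking procedure of the first embodiment can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the names `clusters`, `interests`, `cluster_score`, and `rank_clusters` are hypothetical, only the representative elements and interest scores named in the text are included (the remaining interest elements of user 1 in FIG. 4 are omitted), and ties are broken by cluster number, which is just one possible rule.

```python
# Representative elements per cluster, limited to those named in the text for FIG. 3.
clusters = {
    1: ["car"],
    2: ["feline", "leopard"],
    3: ["MacOS X", "Jobs"],
    4: ["pro-wrestling"],
    5: ["Atari"],
}

# Interest element scores SA_{1,j} of user 1 that the text quotes from FIG. 4.
interests = {"pro-wrestling": 0.6, "car": 0.1, "MacOS X": 0.5, "Jobs": 0.2}


def cluster_score(representatives, interests):
    """Sum the interest scores of representative elements that match an interest element."""
    return sum(interests[r] for r in representatives if r in interests)


def rank_clusters(clusters, interests):
    """Return (cluster numbers in rank order, cluster scores).

    Clusters are ordered by descending cluster score; ties are broken by
    ascending cluster number (an assumed rule, not specified by the source).
    """
    scores = {n: cluster_score(reps, interests) for n, reps in clusters.items()}
    order = sorted(scores, key=lambda n: (-scores[n], n))
    return order, scores


order, scores = rank_clusters(clusters, interests)
print(order)
```

With these inputs the resulting order is C_3, C_4, C_1, C_2, C_5, which agrees with the ranking described for FIG. 5.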

Finally, the visualization output unit 15 visualizes and outputs the ranks P_u of the clusters C_n, using one of the cluster representative elements R_{n,i} of each cluster C_n as the index of that cluster (S4). In the above example, the cluster ranking P_u is C_3, C_4, C_1, C_2, C_5. Therefore, as shown in FIG. 6 for example, the ranks P_u can be visualized by using the representative element with representative element number i = 1 of each cluster as its index and simply arranging the indexes in rank order: "MacOS X → pro-wrestling → car → feline → Atari (registered trademark)". Alternatively, regardless of arrangement, the ranks may be visualized in the manner of a tag cloud by varying the display font according to the cluster rank P_u. FIG. 7 shows an example in which the cluster indexes are arranged in descending order of the number of documents in each cluster, as in conventional services, while the cluster ranks P_u obtained by the method of the present invention are visualized by font size.
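The tag-cloud style of visualization, in which font size varies with cluster rank, might be sketched as below. The function name `font_size_for_rank` and the point sizes are assumptions chosen for illustration; the source does not specify how rank maps to font size.

```python
def font_size_for_rank(rank, largest=28, step=4, smallest=12):
    """Map rank 1 to the largest font and shrink by `step` per rank, never below `smallest`."""
    return max(smallest, largest - (rank - 1) * step)


# Cluster indexes listed in document-count order (C_1 to C_5),
# paired with the ranks P_u from the worked example (C_3 is rank 1, C_4 is rank 2, ...).
index_terms = ["car", "feline", "MacOS X", "pro-wrestling", "Atari"]
ranks = [3, 4, 1, 2, 5]

for term, rank in zip(index_terms, ranks):
    print(f"{term}: {font_size_for_rank(rank)}pt")
```

Keeping the list order fixed and changing only the font size corresponds to the style of display described for FIG. 7.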

When clustering results are ranked in descending order of the number of documents per cluster, as is done conventionally, the cluster order is C_1, C_2, C_3, C_4, C_5. With the clustering result display apparatus and method of the present invention, the user's hobbies and preferences are taken into account and the cluster ranking P_u becomes C_3, C_4, C_1, C_2, C_5. Consequently, even when a user is looking for a minor or specialized topic, clusters that match that user's interests and preferences are more likely to be displayed near the top, which makes searching easier. The invention can also be expected to reduce unnecessary actions, such as backtracking and re-entry, that result from selecting a cluster different from the one the user wants.

As a secondary effect, when profile interest elements are stored in the interest information storage unit 13, information that cuts across profiles can be processed and provided to the user. For example, if interest word scores for a given interest word are available separately for men and women, normalizing the two scores and computing their respective proportions makes it possible to present the difference in the degree of interest in that word between men and women, as shown in FIG. 8(a). Likewise, if interest word scores are available by age group, normalizing the scores across age groups and computing their proportions makes it possible to present the difference in the degree of interest in that word among age groups, as shown in FIG. 8(b). Providing such reference information improves convenience for the user when searching.
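One plausible reading of this normalization is to divide each profile's score by the sum over all profiles, as in the sketch below. The function name `interest_shares` and the sample scores are hypothetical; the source does not state the exact normalization formula.

```python
def interest_shares(scores_by_profile):
    """Normalize per-profile interest scores for one word so that the shares sum to 1."""
    total = sum(scores_by_profile.values())
    if total == 0:
        # No profile shows any interest; report a zero share for every profile.
        return {profile: 0.0 for profile in scores_by_profile}
    return {profile: score / total for profile, score in scores_by_profile.items()}


# Hypothetical interest word scores for a single word, by gender profile.
shares = interest_shares({"male": 0.6, "female": 0.2})
print(shares)
```

The same function applies unchanged to age-group profiles, since it only depends on the set of scores passed in.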

[Second Embodiment]
In the first embodiment, the cluster score of each cluster is obtained from the applicable interest element scores alone, and ranks are assigned to the clusters accordingly. In the second embodiment, by contrast, the cluster score is obtained by additionally taking the cluster representative element scores into account, and ranks are assigned to the clusters according to that score.

The cluster ranking determined by the method of the first embodiment directly reflects the magnitude of the interest element scores applicable to each cluster. However, even if an interest element score is large, when the corresponding cluster representative element is contained in the cluster's document set only to a small degree, the topic the user wants is less likely to be present in that cluster. As a result, even a cluster displayed near the top may contain little of the topic the user wants in absolute terms, and benefits such as more efficient searching may not be fully obtained. Therefore, the cluster representative element score, which indicates the degree to which a cluster representative element is contained in the cluster's document set, is taken into account when determining the cluster score, so that clusters containing a large absolute amount of the topic the user wants are more likely to be displayed near the top.

FIG. 9 shows an example of the functional configuration of the clustering result display device 20 of the second embodiment, and FIG. 10 shows an example of its processing flow. The clustering result display device 20 comprises a search unit 11, a clustering unit 22, an interest information storage unit 13, a calculation unit 24, and a visualization output unit 15. In other words, the configuration is that of the clustering result display device 10 of the first embodiment with the clustering unit 12 replaced by the clustering unit 22 and the calculation unit 14 replaced by the calculation unit 24, so the description here focuses on the differences.

The clustering unit 22 extracts one or more cluster representative elements R_{n,i} for each cluster C_n classified by the same processing as in the clustering unit 12, assigns a cluster representative element score SR_{n,i} to each cluster representative element, and outputs C_n, R_{n,i}, and SR_{n,i} (S5).

As described above, the cluster representative element score SR_{n,i} is a value indicating the degree to which the cluster representative element R_{n,i} is contained in the document set constituting cluster C_n, and any technique may be used to obtain it as long as the result has this property. For example, it may be obtained using STC, as for clustering, or simply as the proportion of the documents constituting the cluster that contain the cluster representative element.

FIG. 11 shows an example in which each cluster representative element R_{n,i} of the clustering result example of FIG. 3 is given a cluster representative element score SR_{n,i} computed as "number of documents containing the element ÷ total number of documents in the cluster". For cluster 2, for example, the cluster representative element score SR_{2,1} of the cluster representative element (representative word) R_{2,1}, "feline", is 1.0, which means that the word "feline" appears in every document of cluster 2. The cluster representative element score SR_{2,2} of R_{2,2}, "leopard", is 0.6, which means that the word "leopard" appears in 60% of the documents of cluster 2.
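The ratio-based score can be sketched as follows. The function name `representative_element_score`, the substring test used to decide whether a document "contains" an element, and the five sample documents are all assumptions for illustration; the actual cluster 2 in the example has 26 documents whose contents are not given.

```python
def representative_element_score(element, documents):
    """Fraction of the cluster's documents whose text contains the element."""
    if not documents:
        return 0.0
    containing = sum(1 for text in documents if element in text)
    return containing / len(documents)


# Hypothetical five-document cluster standing in for cluster 2.
cluster_2 = [
    "jaguar feline leopard habitat",
    "feline leopard spots",
    "big feline leopard south america",
    "jaguar is a feline",
    "feline predators of the rainforest",
]

print(representative_element_score("feline", cluster_2))
print(representative_element_score("leopard", cluster_2))
```

In this toy cluster "feline" occurs in all five documents and "leopard" in three of them, mirroring the 1.0 and 0.6 scores quoted for FIG. 11.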

As in the processing of the calculation unit 14, the calculation unit 24 first extracts the elements for which a cluster representative element R_{n,i} of a cluster C_n matches an interest element A_{u,j} of user u. Next, for each matching element, it multiplies the cluster representative element score SR_{n,i} by the interest element score SA_{u,j} to obtain the representative element score SAR_{u,j,n,i} of that element. It then takes the sum of these as the cluster score SC_{u,n} of cluster C_n for user u, assigns ranks P_u to the clusters according to the cluster scores obtained in this way, and outputs the ranks P_u (S6).

A method for obtaining the cluster ranks P_1 for user 1, based on the clustering result example of FIG. 11 and the interest information example of FIG. 4, is described with reference to FIG. 12. First, the elements for which a cluster representative element R_{n,i} of a cluster C_n extracted by the search unit 11 matches an interest element A_{1,j} are "car" for cluster C_1, "MacOS X" and "Jobs" for C_3, and "pro-wrestling" for C_4, so these are extracted. Next, for each matching element, the cluster representative element score SR_{n,i} is multiplied by the interest element score SA_{u,j} to obtain the representative element score SAR_{u,j,n,i}. Specifically, the result is 0.1 (= 1.0 × 0.1) for "car", 0.5 (= 1.0 × 0.5) for "MacOS X", 0.08 (= 0.4 × 0.2) for "Jobs", and 0.6 (= 1.0 × 0.6) for "pro-wrestling". The representative element scores are then summed for each cluster C_n. Specifically, the total is 0.1 (car) for cluster C_1, 0.58 (= 0.5 (MacOS X) + 0.08 (Jobs)) for C_3, and 0.6 (pro-wrestling) for C_4, so these become the cluster scores SC_{1,1}, SC_{1,3}, and SC_{1,4}, respectively. Clusters C_2 and C_5 have no matching elements and their representative element scores are both 0, so SC_{1,2} and SC_{1,5} are both 0. As a result, SC_{1,4} > SC_{1,3} > SC_{1,1} > SC_{1,2} = SC_{1,5}, so ranks P_1 are assigned to the clusters C_n in this order.
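The second-embodiment calculation, in which each matching element contributes the product of its two scores, can be sketched as below. The names are hypothetical, the SR value for "Atari" is not given in the text and is set to 1.0 purely as a placeholder, and ties are again broken by cluster number as an assumed rule.

```python
# Cluster representative elements with scores SR_{n,i}, as quoted in the text for FIG. 11.
# The score for "Atari" is not stated in the text; 1.0 is a placeholder assumption.
clusters = {
    1: {"car": 1.0},
    2: {"feline": 1.0, "leopard": 0.6},
    3: {"MacOS X": 1.0, "Jobs": 0.4},
    4: {"pro-wrestling": 1.0},
    5: {"Atari": 1.0},
}

# Interest element scores SA_{1,j} of user 1 that the text quotes from FIG. 4.
interests = {"pro-wrestling": 0.6, "car": 0.1, "MacOS X": 0.5, "Jobs": 0.2}


def weighted_cluster_score(representatives, interests):
    """Sum SR * SA over representative elements that match an interest element."""
    return sum(
        sr * interests[element]
        for element, sr in representatives.items()
        if element in interests
    )


scores = {n: weighted_cluster_score(reps, interests) for n, reps in clusters.items()}
# Descending score, ties broken by ascending cluster number (assumed rule).
order = sorted(scores, key=lambda n: (-scores[n], n))
print(order)
```

Compared with the first embodiment, C_4 now moves ahead of C_3 because "Jobs" appears in only 40% of the documents of C_3, which reduces its contribution from 0.2 to 0.08.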

As described above, ranking the clusters by a cluster score that takes the cluster representative element scores into account means that the degree to which each element is contained in the cluster's document set is considered when ranks are assigned, so that clusters containing a large absolute amount of the topic the user wants are more likely to be displayed near the top.

[Third Embodiment]
In the second embodiment, the cluster score is obtained by taking into account the score of each cluster representative element in addition to the interest element scores applicable to each cluster, so that clusters containing a large absolute amount of the topic the user wants are more likely to be displayed near the top. In the third embodiment, each cluster score is further weighted according to the number of documents in the cluster, and the cluster ranking is determined by the weighted score.

The cluster score obtained by the method of the second embodiment is adjusted within each cluster so that the absolute amount of the topics the user is looking for is more readily reflected. However, because the number of documents differs from cluster to cluster, it is desirable to also take the number of documents in each cluster into account in order to rank the clusters in a way that better reflects that absolute amount. In the third embodiment, therefore, the cluster score obtained by the method of any of the above embodiments is further weighted according to the number of documents included in the cluster, and the clusters are ranked by the weighted score. Ranking in this way makes clusters containing a large absolute amount of the topics the user is looking for even more likely to be displayed near the top.

FIG. 13 shows a functional configuration example of the clustering result display device 30 of the third embodiment, and FIG. 14 shows an example of its processing flow. The clustering result display device 30 comprises a search unit 11, a clustering unit 32, an interest information storage unit 13, a calculation unit 34, and a visualization output unit 15. It differs from the clustering result display devices of the above embodiments in the clustering unit 32 and the calculation unit 34, so the description here focuses on those differences. The description takes the second embodiment as its basis.

The clustering unit 32 outputs the number of documents B_n of each cluster together with the clusters C_n, the cluster representative elements R_{n,i}, and the cluster representative element scores SR_{n,i} obtained by the same processing as in the clustering unit 22 (S7).

The calculation unit 34 multiplies the cluster score SC_{u,n} of each cluster C_n for user u, obtained by the same processing as in the calculation unit 24, by the number of documents B_n of that cluster, assigns a rank P_u to each cluster according to the resulting score SBC_{u,n}, and outputs the rank P_u (S8).

A method of obtaining the cluster rank P_1 for user 1, based on the clustering result example shown in FIG. 11 and the interest information example shown in FIG. 4, will be described with reference to FIG. 15. As a result of the same processing as in the calculation unit 24, the cluster scores SC_{1,n} of user 1 are SC_{1,1} = 0.1, SC_{1,2} = 0, SC_{1,3} = 0.58, SC_{1,4} = 0.6, and SC_{1,5} = 0. Multiplying each of these by the number of documents B_n of the corresponding cluster C_n gives SBC_{1,1} = 4 (= 0.1 × 40), SBC_{1,2} = 0, SBC_{1,3} = 12.18 (= 0.58 × 21), SBC_{1,4} = 6 (= 0.6 × 10), and SBC_{1,5} = 0. As a result, SBC_{1,3} > SBC_{1,4} > SBC_{1,1} > SBC_{1,2} = SBC_{1,5}, and the rank P_1 is assigned to the clusters C_n in this order.
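The document-count weighting in this example can be sketched in Python as follows. The function name `weighted_scores` and the dictionary layout are assumptions made for illustration. Clusters C_2 and C_5 are left out because their document counts are not stated in the text; since their cluster scores are 0, their weighted scores are 0 for any document count and they stay at the bottom of the ranking.

```python
def weighted_scores(cluster_scores, doc_counts):
    """Return {cluster: SBC} with SBC = SC * B for each cluster with a known document count."""
    return {c: cluster_scores[c] * doc_counts[c] for c in doc_counts}


# Cluster scores SC of user 1 and document counts B quoted in the text.
sc_user1 = {"C1": 0.1, "C3": 0.58, "C4": 0.6}
doc_counts = {"C1": 40, "C3": 21, "C4": 10}

sbc = weighted_scores(sc_user1, doc_counts)
# Rank by descending weighted score.
ranking = sorted(sbc, key=sbc.get, reverse=True)
print(ranking)
```

Compared with the second embodiment, C_3 now overtakes C_4, because its 21 documents outweigh the slightly higher unweighted score of C_4 with 10 documents.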

As described above, by further weighting each cluster score according to the number of documents included in the cluster and ranking the clusters by the weighted score, clusters containing a large absolute amount of the topics the user is looking for become even more likely to be displayed near the top.

[Fourth Embodiment]
Each cluster is displayed with one of its cluster representative elements as its index. However, if that cluster representative element is a word or phrase unknown to the user, the user cannot infer the contents of the cluster and has difficulty selecting the desired cluster. In the fourth embodiment, therefore, among the cluster representative elements obtained for each cluster during the processing of the second embodiment, the one with the largest representative element score is used as the index of that cluster.

The representative element score SAR_{u,j,n,i} is the value obtained by multiplying the interest element score SA_{u,j} of user u by the cluster representative element score SR_{n,i} of cluster C_n. It reflects the user's interests and preferences and also the degree to which the cluster representative element R_{n,i} is contained in the document set constituting the cluster. Using the cluster representative element with the largest representative element score as the index therefore means that the index is displayed as a word or phrase known to the user and that the contents of the cluster are appropriately reflected in the index, which makes the user's search more efficient.
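For reference, the quantities used in the second to fourth embodiments can be written compactly as below. The set M_{u,n} of matching index pairs and the symbol (j^*, i^*) for the selected index are notation introduced here for illustration only and do not appear in the original text.

```latex
\begin{align*}
M_{u,n} &= \{(j,i) \mid A_{u,j} = R_{n,i}\} \\
SAR_{u,j,n,i} &= SA_{u,j} \times SR_{n,i} \qquad \text{for } (j,i) \in M_{u,n} \\
SC_{u,n} &= \sum_{(j,i) \in M_{u,n}} SAR_{u,j,n,i} \\
SBC_{u,n} &= SC_{u,n} \times B_n \\
(j^*, i^*) &= \operatorname*{arg\,max}_{(j,i) \in M_{u,n}} SAR_{u,j,n,i}
\end{align*}
```

Here SC_{u,n} is taken to be 0 when M_{u,n} is empty, which matches the treatment of clusters without matching elements in the worked examples.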

FIG. 16 shows a functional configuration example of the clustering result display device 40 of the fourth embodiment, and FIG. 17 shows an example of its processing flow. The clustering result display device 40 is a configuration example based on the clustering result display device 20 of the second embodiment and comprises a search unit 11, a clustering unit 22, an interest information storage unit 13, a calculation unit 44, and a visualization output unit 45. That is, it differs from the configuration of the second embodiment in the calculation unit 44 and the visualization output unit 45. The configuration of the third embodiment can also be incorporated into the clustering result display device 40.

The calculation unit 44 outputs the rank P_u of each cluster, obtained by the same processing as in the calculation unit 24, together with the representative element scores SAR_{u,j,n,i} (S9).

The visualization output unit 45 visualizes and outputs the rank P_u of each cluster, using as the index of each cluster C_n the cluster representative element R_{n,i} that has the largest representative element score SAR_{u,j,n,i} among the cluster representative elements obtained for that cluster (S10).

FIG. 18 shows the matching elements, representative element scores, cluster score, and cluster rank obtained for each cluster for user 2, based on the clustering result example shown in FIG. 11 and the interest information example shown in FIG. 4, by the same method as in the second embodiment. Here, clusters with a cluster score of 0 are ranked so that those with more documents come first. If the results are displayed simply using the cluster representative element with representative element number i = 1 as the index of each cluster, the display order is "Atari → car → Felidae → MacOS X → pro wrestling", as shown in FIG. 19(a). In contrast, if the results are displayed using the cluster representative element with the largest representative element score as the index, the display becomes "game machine → Jaguar XJ → Felidae → MacOS X → pro wrestling", as shown in FIG. 19(b). Thus, according to the method of the fourth embodiment, "game machine", which is among user 2's interest words, is displayed instead of "Atari", which is not; and "Jaguar XJ", which has a relatively high interest element score, is displayed instead of "car", which is an interest word but has a relatively low interest element score.
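The index selection described here can be sketched in Python as follows. All numeric scores below are hypothetical placeholders, since the values of FIG. 11, FIG. 4, and FIG. 18 for user 2 are not reproduced in the text; the cluster keys `cluster_a` and `cluster_b` and the function name `pick_index` are likewise assumptions. Falling back to the first representative element when nothing matches is also an assumption, as the text does not spell out that case.

```python
def pick_index(reps, interest_scores):
    """Pick the representative element with the largest SAR = SR * SA.

    reps maps the representative elements of one cluster, in order
    i = 1, 2, ..., to their scores SR (dicts keep insertion order).
    """
    matched = {
        element: sr * interest_scores[element]
        for element, sr in reps.items()
        if element in interest_scores
    }
    if not matched:
        # Assumed fallback: use the first representative element (i = 1).
        return next(iter(reps))
    return max(matched, key=matched.get)


# Hypothetical scores chosen only to mirror the behaviour described in the text.
clusters = {
    "cluster_a": {"Atari": 1.0, "game machine": 0.5},
    "cluster_b": {"car": 1.0, "Jaguar XJ": 0.8},
}
interest_user2 = {"game machine": 0.7, "car": 0.2, "Jaguar XJ": 0.9}

indexes = {name: pick_index(reps, interest_user2) for name, reps in clusters.items()}
print(indexes)
```

With these placeholder values, "game machine" replaces "Atari" because "Atari" has no interest element score at all, and "Jaguar XJ" replaces "car" because its product of scores is larger.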

As described above, by using the cluster representative element with the largest representative element score as the index, the index is displayed as a word or phrase known to the user and the contents of the cluster are appropriately expressed in the index, which makes the user's search more efficient.

The clustering result display device and method of the present invention are not limited to the above embodiments and may be modified as appropriate without departing from the scope of the present invention. The processes described above need not be executed in time series in the order described; they may also be executed in parallel or individually, depending on the processing capability of the device executing them or as required.

When the processing functions of the clustering result display device and method of the present invention are realized by a computer, the processing contents of the functions that the device and method should have are described by a program. Executing this program on a computer realizes the processing functions of the clustering result display device and method on that computer. Part of the processing may instead be realized in hardware.

FIG. 1 is a diagram showing a configuration example of the clustering result display device of the first embodiment.
FIG. 2 is a diagram showing an example of the processing flow of the clustering result display device of the first embodiment.
FIG. 3 is a diagram showing an example of a clustering result in the clustering result display device of the first embodiment (MacOS, Apple, and Atari are registered trademarks).
FIG. 4 is a diagram showing an example of user interest information (MacBook is a registered trademark).
FIG. 5 is a diagram illustrating the calculation by which the clustering result display device of the first embodiment obtains the cluster rank for user 1.
FIG. 6 is a diagram showing a clustering result display image produced by the clustering result display device of the first embodiment.
FIG. 7 is a diagram showing another clustering result display image produced by the clustering result display device of the first embodiment.
FIG. 8 is a diagram illustrating the provision of information processed on the basis of the user profiles stored in the interest information storage unit.
FIG. 9 is a diagram showing a configuration example of the clustering result display device of the second embodiment.
FIG. 10 is a diagram showing an example of the processing flow of the clustering result display device of the second embodiment.
FIG. 11 is a diagram showing an example of a clustering result in the clustering result display device of the second embodiment.
FIG. 12 is a diagram illustrating the calculation by which the clustering result display device of the second embodiment obtains the cluster rank for user 1.
FIG. 13 is a diagram showing a configuration example of the clustering result display device of the third embodiment.
FIG. 14 is a diagram showing an example of the processing flow of the clustering result display device of the third embodiment.
FIG. 15 is a diagram illustrating the calculation by which the clustering result display device of the third embodiment obtains the cluster rank for user 1.
FIG. 16 is a diagram showing a configuration example of the clustering result display device of the fourth embodiment.
FIG. 17 is a diagram showing an example of the processing flow of the clustering result display device of the fourth embodiment.
FIG. 18 is a diagram illustrating the calculation by which the clustering result display device of the fourth embodiment obtains the cluster rank and display index for user 2.
FIG. 19 is a diagram showing the difference between the clustering result display of the second embodiment and that of the fourth embodiment.
FIG. 20 is a diagram showing a clustering result display image according to the prior art.

Explanation of symbols

1 User terminal
2 Communication network
3 Search engine
10, 20, 30, 40 Clustering result display device
11 Search unit
12, 22, 32 Clustering unit
13 Interest information storage unit
14, 24, 34, 44 Calculation unit
15, 45 Visualization output unit

Claims

A clustering result display device comprising:
a search unit that searches a communication network, using a search engine, for a search term input by a user and collects the corresponding document set;
a clustering unit that performs clustering by a predetermined method to classify the document set into a plurality of clusters, and extracts and outputs one or more cluster representative elements for each cluster;
an interest information storage unit in which one or more interest elements are stored together with their interest element scores for each user or profile;
a calculation unit that refers to the interest elements and interest element scores of the user, or of the profile corresponding to the user, stored in the interest information storage unit, obtains for each cluster, as the cluster score of that cluster, the sum of the interest element scores of the elements for which an interest element matches a cluster representative element of the cluster, and outputs the rank of each cluster assigned according to the score; and
a visualization output unit that visualizes and outputs the rank of each cluster, using any one of the cluster representative elements of each cluster as the index of that cluster.

In the clustering result display device according to claim 1,
The clustering unit further outputs a cluster representative element score for each cluster representative element,
The clustering result display device, wherein the calculation unit further multiplies the interest element score of each matching element by the cluster representative element score of that element to obtain a representative element score of the element, and uses the sum of the representative element scores as the cluster score of the cluster.

In the clustering result display device according to claim 2,
The calculation unit further outputs the representative element score,
The clustering result display device, wherein the visualization output unit uses, for each cluster, a cluster representative element having the largest representative element score as an index of the cluster.

In the clustering result display device according to any one of claims 1 to 3,
The clustering unit further outputs the number of documents constituting each cluster,
The clustering result display device, wherein the calculation unit assigns a rank to each cluster according to a score obtained by further multiplying the cluster score obtained for each cluster by the number of documents constituting that cluster.

In the clustering result display device according to claim 1,
The clustering result display device, wherein the visualization output unit visualizes the rank by arranging the clusters according to the rank.

In the clustering result display device according to claim 1,
The clustering result display device characterized in that the visualization output unit visualizes the rank by making a font for displaying an index of each cluster different according to the rank.

In the clustering result display device according to any one of claims 1 to 6,
The clustering result display device, wherein the cluster representative element is a representative word, and the interest element is an interest word.

The clustering result display device according to any one of claims 1 to 6,
The clustering result display device, wherein the cluster representative element is a representative URL, and the interest element is an interest URL.

A clustering result display method that executes:
a search step of searching a communication network, using an arbitrary search engine, for a search term input by a user and collecting the corresponding document set;
a clustering step of performing clustering by a predetermined method to classify the document set into a plurality of clusters, and extracting and outputting one or more cluster representative elements for each cluster;
a calculation step of obtaining for each cluster, as the cluster score of that cluster, the sum of the interest element scores of the elements for which a cluster representative element of the cluster matches an interest element of the user or of the profile corresponding to the user, and outputting the rank of each cluster assigned according to the score; and
a visualization output step of visualizing and outputting the rank of each cluster, using any one of the cluster representative elements of each cluster as the index of that cluster.

In the clustering result display method according to claim 9,
The clustering step further outputs a cluster representative element score for each of the cluster representative elements,
The clustering result display method, wherein the calculation step further multiplies the interest element score of each matching element by the cluster representative element score of that element to obtain a representative element score of the element, and uses the sum of the representative element scores as the cluster score of the cluster.

In the clustering result display method according to claim 10,
The calculation step further outputs the representative element score,
The clustering result display method, wherein the visualization output step uses, for each cluster, the cluster representative element having the largest representative element score as the index of the cluster.

The clustering result display method according to any one of claims 9 to 11,
The clustering step further outputs the number of documents constituting each cluster,
The clustering result display method, wherein the calculation step assigns a rank to each cluster according to a score obtained by further multiplying the cluster score obtained for each cluster by the number of documents constituting that cluster.

The clustering result display method according to any one of claims 9 to 12,
The method of displaying a clustering result, wherein the visualization output step visualizes the ranking by arranging the clusters according to the ranking.

The clustering result display method according to any one of claims 9 to 12,
The clustering result display method, wherein the visualization output step visualizes the ranking by making the font used to display the index of each cluster differ according to the ranking.

The clustering result display method according to any one of claims 9 to 14,
The clustering result display method, wherein the cluster representative element is a representative word, and the interest element is an interest word.

The clustering result display method according to any one of claims 9 to 14,
The clustering result display method, wherein the cluster representative element is a representative URL, and the interest element is an interest URL.

A program for causing a computer to function as the device according to any one of claims 1 to 8.