JP4708319B2

JP4708319B2 - Metadata similarity measurement device and metadata hierarchization device, metadata similarity measurement method and metadata hierarchization method, metadata similarity measurement program, metadata hierarchization program, and recording medium on which these programs are recorded

Info

Publication number: JP4708319B2
Application number: JP2006322050A
Authority: JP
Inventors: 滋藤村; 考藤村
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2006-11-29
Filing date: 2006-11-29
Publication date: 2011-06-22
Anticipated expiration: 2026-11-29
Also published as: JP2008134941A

Description

The present invention relates to a method and apparatus for measuring the similarity between metadata items attached to documents, a method and apparatus for automatically hierarchizing metadata, programs therefor, and a recording medium on which those programs are recorded.

Here, metadata in the present invention refers mainly to what blogs call tags, categories, or genres: words or expressions that concisely describe the content of a document and that users set in order to organize the documents they have written.

A document group in the present invention is a set of documents to which the same metadata is attached; specifically, for example, a set of blog articles written by multiple users and given the same tag.

As for the relationship between metadata and a document group in the present invention, the metadata can be regarded as a representative value that describes the content of the document group.

Many blogs, which are now in wide use, provide a function for attaching metadata that describes the content of an article, as described above. A search service that uses aggregated metadata of this kind has also been released by Technorati as tag search.

However, because bloggers attach blog metadata themselves in order to organize their own articles, many synonymous tags and spelling variants exist. In addition, the above service does not take the relationships between tags into account, so similar tags are scattered, which impairs convenience.

The similarity between metadata items can be regarded as the similarity between the document groups to which those metadata items are attached. As a technique for measuring the similarity between documents, Patent Document 1 below describes, as part of a topic word extraction method, a technique that uses the frequency of occurrence of a term within a document as the term's weight and measures similarity in the range 0 to 1 based on the proportion of shared terms.

However, that technique obtains the similarity between documents, not between document groups, so it cannot be used directly to measure the similarity between metadata items.

Non-Patent Document 1 below therefore describes a technique that applies a document-to-document similarity measure computed in the range 0 to 1 and measures the similarity between blog metadata items as the average of the similarities between documents taken across the two document groups.
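The averaging approach described above can be illustrated with a minimal Python sketch. It assumes documents are already tokenized into word lists and uses raw term counts with cosine similarity as the document-to-document measure; the exact weighting used in Non-Patent Document 1 is not stated here, and the function names `cosine` and `average_pairwise_similarity` are hypothetical.

```python
import math
from collections import Counter
from itertools import product

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors (0 to 1)."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def average_pairwise_similarity(group_a, group_b):
    """Mean similarity over every pair of documents taken across two groups."""
    vecs_a = [Counter(doc) for doc in group_a]
    vecs_b = [Counter(doc) for doc in group_b]
    pairs = list(product(vecs_a, vecs_b))
    if not pairs:
        return 0.0
    return sum(cosine(x, y) for x, y in pairs) / len(pairs)

# Toy data (assumed): two tags, two short articles each.
soccer = [["goal", "match"], ["league", "team"]]
football = [["goal", "team"], ["stadium", "fans"]]
print(average_pairwise_similarity(soccer, football))
```

In this toy case two of the four cross-group pairs share no word and contribute 0, which pulls the average down. The number of pairs also grows with the product of the group sizes.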

The residual IDF (Inverse Document Frequency), an index that uses the estimated document frequency based on the Poisson distribution, is known from Non-Patent Document 2 below, and hierarchical clustering techniques are known from Non-Patent Document 3 below.
Patent Document 1: JP 2004-258723 A
Non-Patent Document 1: Christopher H. Brooks, Nancy Montanez. Improved Annotation of the Blogosphere via Autotagging and Hierarchical Clustering. 15th International World Wide Web Conference, May 2006.
Non-Patent Document 2: Kenji Kita et al., "Information Retrieval Algorithms", Kyoritsu Shuppan, first edition, first printing, January 1, 2002, pp. 41-45.
Non-Patent Document 3: Toshihiro Kamishima, "Clustering Methods in the Data Mining Field (1): Let's Try Clustering!", Journal of the Japanese Society for Artificial Intelligence, vol. 18, no. 1, pp. 59-65 (2003).

In the technique of Non-Patent Document 1, the similarity between two articles is 0 when they share no words at all. At the level of individual articles it is common for no words to be shared, so the similarity between metadata items, obtained as an average, also tends to be small.

This tendency is also indicated by the fact that, in Non-Patent Document 1, the average similarity between blog articles given the same tag is 0.2 to 0.4.

Consequently, when the average similarity between articles is used as the similarity between tags in order to judge whether the tags are similar, the measured similarities fall within a narrow range. A slight change in the threshold therefore changes the result greatly, which makes the threshold difficult to set.

An object of the present invention is to provide a similarity measurement apparatus and method in which the threshold is easier to set, by using feature quantities obtained on the assumption that terms occurring disproportionately in a particular document group are the characteristic terms that describe the content of that group. A further object is to provide a metadata hierarchization apparatus and method that use this similarity measurement apparatus, and to improve convenience by integrating and hierarchizing highly related metadata.

The present invention concerns a method for obtaining the similarity between metadata items in a document collection to which metadata is attached, built around a method for scoring the characteristic words of a document group given the entire document collection, and a clustering technique that uses this similarity. The document collections with metadata envisaged here are mainly blogs, social bookmarking services, and the like.

These services let users set metadata freely, but as a result metadata proliferates through synonyms and spelling variants. Measuring the similarity between metadata items is a common idea for addressing this, but no technique existed for directly obtaining the feature quantity of a document group to which a metadata item is attached.

Alternatively, one could obtain a feature vector for each document in a document group with the TF-IDF method or the like and measure the similarity between metadata items by the average cosine similarity of documents across the document groups, or set the feature vector of a metadata item to the average of the feature vectors of the documents in its group. The former has a problem of computational efficiency, and the latter has the problem that averaging gives common words too much weight.

In Patent Document 1 and Non-Patent Document 1 as well, the similarity computation is basically cosine similarity under a vector space model of terms, so the problem comes down to how the terms are scored.

The most readily conceivable way to obtain a feature vector for a document group by treating the group as a single document is to apply the TF-IDF method. The TF-IDF method rests on two assumptions: that a term used many times in a document is important, and that a word used only in particular documents is characteristic.

When a document group is treated as a single document, the number of distinct terms per document (here, per document group) increases.

The IDF component therefore becomes less effective at suppressing common words. When a document group is treated as a single document, common words also receive large values in the feature quantity because of their large TF values, and there is a concern that they will strongly affect the similarity computation.

The present invention, in contrast, uses a term weighting technique based on the assumption that the terms occurring disproportionately in a document group are the important ones. Common words are therefore unlikely to become features, and improved accuracy can be expected.

In other words, the present invention is a technique optimized for situations in which the document groups are determined in advance and the feature quantities over all documents have also been computed, for example when tagged blog articles have already been collected.

To solve the above problem, the inter-metadata similarity measurement apparatus according to claim 1 is an apparatus for measuring the similarity between metadata items attached to documents, comprising: document search means for searching for documents to which the same metadata is attached; feature quantity extraction means for extracting, from the document group retrieved by the document search means, as the feature quantity of the metadata, a vector whose elements are pairs of a term in the document group and a score representing the relevance of that term to the content of the documents, wherein the feature quantity extraction means obtains, for the document group retrieved by the document search means, the probability under a Poisson distribution that the term never occurs in a document from the term statistics of the entire document collection, obtains as a residual DF value the difference between the document frequency of the term in the document group and the document frequency based on the Poisson distribution, excludes terms whose residual DF value is at or below a predetermined value as unimportant words, and outputs a vector whose elements are pairs of a term and its residual DF value as the feature quantity of the document group; and similarity calculation means for calculating the similarity between metadata items using the feature quantities.

The inter-metadata similarity measurement apparatus according to claim 2 is characterized in that, in claim 1, the feature quantity extraction means uses, as the score, the difference between the document frequency of a term in the document group and the document frequency estimated by a Poisson distribution based on the frequency of occurrence of that term across all documents.

The inter-metadata similarity measurement method according to claim 6 is a method for measuring the similarity between metadata items attached to documents, comprising: a document search step in which document search means searches for documents to which the same metadata is attached; a feature quantity extraction step in which feature quantity extraction means extracts, from the document group retrieved in the document search step, as the feature quantity of the metadata, a vector whose elements are pairs of a term in the document group and a score representing the relevance of that term to the content of the documents, the feature quantity extraction step including a step of obtaining, for the document group retrieved in the document search step, the probability under a Poisson distribution that the term never occurs in a document from the term statistics of the entire document collection, a step of obtaining as a residual DF value the difference between the document frequency of the term in the document group and the document frequency based on the Poisson distribution, a step of excluding terms whose residual DF value is at or below a predetermined value as unimportant words, and a step of outputting a vector whose elements are pairs of a term and its residual DF value as the feature quantity of the document group; and a similarity calculation step in which similarity calculation means calculates the similarity between metadata items using the feature quantities extracted in the feature quantity extraction step.

The inter-metadata similarity measurement method according to claim 7 is characterized in that, in claim 6, the feature quantity extraction step uses, as the score, the difference between the document frequency of a term in the document group and the document frequency estimated by a Poisson distribution based on the frequency of occurrence of that term across all documents.

With the above configuration, the feature quantity of the document group to which a metadata item is attached is obtained directly and statistically, so the similarity can be obtained with high accuracy. In addition, the similarity between metadata items does not collapse to small values, and the spread of similarity values can be widened, which makes the threshold easy to set.

The inter-metadata similarity measurement apparatus according to claim 3 is characterized in that, in claim 1 or 2, the feature quantity extraction means adopts as the score the logarithm of the term's feature value, in order to reduce extreme differences in score between terms.

The inter-metadata similarity measurement apparatus according to claim 4 is characterized in that, in any one of claims 1 to 3, the feature quantity extraction means adopts as the score a value normalized using the number of documents in the document group, in order to reduce extreme differences in score caused by the volume of documents.

The inter-metadata similarity measurement method according to claim 8 is characterized in that, in claim 6 or 7, the feature quantity extraction step adopts as the score the logarithm of the term's feature value, in order to reduce extreme differences in score between terms.

The inter-metadata similarity measurement method according to claim 9 is characterized in that, in any one of claims 6 to 8, the feature quantity extraction step adopts as the score a value normalized using the number of documents in the document group, in order to reduce extreme differences in score caused by the volume of documents.

With the above configuration, differences in score do not become extreme even when the number of documents grows, which improves reliability.

The metadata hierarchization apparatus according to claim 5 is characterized by comprising, in addition to the inter-metadata similarity measurement apparatus according to any one of claims 1 to 4, combining and hierarchizing means that integrates similar metadata items into clusters and hierarchizes them.

The metadata hierarchization method according to claim 10 is characterized by comprising, in addition to the steps of the inter-metadata similarity measurement method according to any one of claims 6 to 9, a combining and hierarchizing step in which combining and hierarchizing means integrates similar metadata items into clusters and hierarchizes them.

With the above configuration, highly related metadata items can be combined and hierarchized, which improves convenience.

The inter-metadata similarity measurement program according to claim 11 is a program for causing a computer to execute the steps of the inter-metadata similarity measurement method according to any one of claims 6 to 9.

The metadata hierarchization program according to claim 12 is a program for causing a computer to execute the steps of the metadata hierarchization method according to claim 10.

The recording medium according to claim 13, on which the inter-metadata similarity measurement program is recorded, is a computer-readable recording medium on which the program according to claim 11 is recorded.

The recording medium according to claim 14, on which the metadata hierarchization program is recorded, is a computer-readable recording medium on which the program according to claim 12 is recorded.

(1) According to the inventions of claims 1 to 14, the feature quantity of the document group to which a metadata item is attached is obtained directly and statistically, so the similarity can be obtained with high accuracy. In addition, the similarity between metadata items does not collapse to small values, and the spread of similarity values can be widened, which makes the threshold easy to set.

Furthermore, using the similarity obtained by the present invention, a search system based on blog tags, for example, can offer users similar tags with higher accuracy.

In addition, grouping similar blog tags yields a larger set of blog articles sharing a similar concept than was previously available. This can be applied to marketing, for example to find out which region is attracting the most attention among articles on the concept of "travel".
(2) According to the inventions of claims 3, 4, 8, and 9, differences in score do not become extreme even when the number of documents grows, which improves reliability.
(3) According to the inventions of claims 5 and 10, highly related metadata items can be combined and hierarchized, which improves convenience.

Embodiments of the present invention are described below with reference to the drawings, but the present invention is not limited to the following embodiments. First, the inter-metadata similarity measurement apparatus and method are described.

FIG. 1 is a diagram showing the configuration of an inter-metadata similarity measurement apparatus that is an example of an embodiment of the present invention. The inter-metadata similarity measurement apparatus 10 of this embodiment is a general-purpose computer that operates according to a predetermined program, and comprises document search means 11 that searches a metadata-attached document DB (database) 14 for documents to which specific metadata is attached, feature quantity extraction means 12 that extracts the feature quantity of the metadata, and similarity calculation means 13 that measures the similarity between the target metadata items using a metadata feature quantity DB (database) 15 in which the obtained feature quantities are stored.

The overall processing flow is shown in FIG. 2. First, the documents to which the metadata whose similarity is to be measured is attached are retrieved to obtain a document group (step S11). Next, the feature quantity of the metadata is extracted using the term statistics of the retrieved document group and the term statistics obtained from all documents in the metadata-attached document DB 14 (step S12). Finally, the similarity between the feature quantities of the metadata items is calculated and output as the similarity (step S13).

Each means is described in detail below, following the processing procedure shown in FIG. 2. Like a conventional general-purpose search system, the document search means 11 searches the collection of metadata-attached documents for documents that have the same metadata as the specified metadata and passes the result to the next means as a list.

To reduce the amount of computation, the number of documents passed to the next means as the search result may be limited to roughly several thousand to several tens of thousands.

As shown in FIG. 3, the metadata-attached document DB 14 holds a document ID, metadata, and body text, associated with one another for each document. The document ID is an identifier uniquely assigned to each document. The metadata is a term that concisely describes the content of the body. The body is the text of the document itself. Additional information, such as the URL in FIG. 3, can also be associated and stored.
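As an illustration of the record layout and the search step, the following Python sketch models one record of the metadata-attached document DB and a search that returns the documents carrying a given metadata item. The class name `Document`, its field names, the in-memory list standing in for the DB, and the assumption that a document may carry several metadata items are all hypothetical and are not taken from FIG. 3.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Document:
    doc_id: int           # identifier unique to each document
    metadata: frozenset   # tags/categories attached to the document (assumed: possibly several)
    body: str             # the text of the document itself
    url: str = ""         # optional additional information

def search_by_metadata(documents, tag, limit=None):
    """Return the documents that carry `tag`, optionally capped at `limit` hits."""
    hits = [d for d in documents if tag in d.metadata]
    return hits if limit is None else hits[:limit]

# Toy stand-in for the DB (assumed data).
db = [
    Document(1, frozenset({"soccer"}), "J-League match report"),
    Document(2, frozenset({"travel"}), "Weekend trip to Kyoto"),
    Document(3, frozenset({"soccer", "travel"}), "Away game road trip"),
]
print([d.doc_id for d in search_by_metadata(db, "soccer")])
```

The `limit` parameter corresponds to capping the number of documents handed to the next stage in order to bound the computation.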

The feature quantity extraction means 12 extracts the feature quantity of a metadata item using the statistics of each term in the document group obtained by the document search means 11, that is, the documents to which the specific metadata is attached, together with the statistics of each term in the entire document collection. The feature quantity is expressed as a vector whose elements are pairs of a term and a residual DF (Document Frequency) value that indicates whether the term describes the content of the document group.

The residual DF value is the difference between the document frequency of a term in the document group and its document frequency in that group as estimated by a Poisson distribution. For example, let n be the total number of documents in the retrieved document group, df_i the document frequency of term i in the document group, N the total number of documents, and F_i the global frequency of term i across all documents. The residual DF value is then obtained by the following equation (1).

$$\text{residual DF}_i = df_i - n\left(1 - e^{-F_i/N}\right) \tag{1}$$

As an example, consider the residual DF value of the term "J-League" in the group of blog entries given the category "soccer". If the number of collected blog entries is 400,000, the number of blog entries given the category "soccer" is 1,000, the document frequency of "J-League" within this entry group is 300, and the global frequency of "J-League" across all collected blog entries is 700, then the residual DF is 300 - 1000(1 - exp(-700/400000)) = 298.25....
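The worked example above can be checked numerically. The short Python sketch below evaluates the same expression; the function and parameter names are hypothetical.

```python
import math

def residual_df(df_in_group, group_size, global_freq, total_docs):
    """Observed document frequency in the group minus the Poisson-based estimate."""
    p_absent = math.exp(-global_freq / total_docs)  # probability the term never occurs in a document
    expected_df = group_size * (1.0 - p_absent)     # estimated document frequency within the group
    return df_in_group - expected_df

# Values from the "soccer" / "J-League" example.
value = residual_df(df_in_group=300, group_size=1000, global_freq=700, total_docs=400_000)
print(round(value, 2))  # 298.25
```

Because "J-League" is rare in the collection as a whole, the Poisson estimate for a group of 1,000 documents is under two documents, so almost all of the 300 observed documents count as excess.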

FIG. 4 shows an example of the processing flow of the feature quantity extraction means 12. In FIG. 4, a document group is first received as input from the document search means 11 (step S121). Next, for the document group retrieved by the document search means 11, the probability $P = e^{-F_i/N}$ that a term never occurs in a document is obtained under a Poisson distribution from the term statistics of the entire document collection (step S122).

Next, the difference between the document frequency of the term in the document group and the Poisson-based document frequency n(1 - P) is obtained as the residual DF value (step S123). Terms whose residual DF value is, for example, 1.0 or less are then excluded as unimportant words (step S124). Finally, a vector whose elements are pairs of a term and its residual DF value is output as the feature quantity of the document group (step S125).
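Steps S121 to S125 can be sketched in Python as follows. This is a minimal illustration under stated assumptions: documents arrive already tokenized as word lists (tokenization is out of scope), the global term frequencies and total document count are supplied by the caller, and the names `extract_features`, `global_freq`, and `threshold` are hypothetical.

```python
import math
from collections import Counter

def extract_features(group_docs, global_freq, total_docs, threshold=1.0):
    """Return {term: residual DF} for one document group (steps S121 to S125)."""
    n = len(group_docs)                      # S121: the retrieved document group
    df = Counter()
    for doc in group_docs:
        df.update(set(doc))                  # count each term at most once per document
    features = {}
    for term, df_i in df.items():
        p_absent = math.exp(-global_freq[term] / total_docs)  # S122: P = e^(-F_i/N)
        rdf = df_i - n * (1.0 - p_absent)                     # S123: residual DF
        if rdf > threshold:                                   # S124: drop terms at or below threshold
            features[term] = rdf
    return features                                           # S125: term/score pairs

# Toy data (assumed): three "soccer" entries in a collection of 10,000 documents.
group = [["jleague", "goal"], ["jleague", "today"], ["jleague", "goal"]]
freq = {"jleague": 10, "goal": 20, "today": 5000}
print(extract_features(group, freq, total_docs=10_000))
```

In the toy data the common word "today" has a Poisson estimate larger than its observed document frequency, so its residual DF is negative and it is dropped, while "jleague" and "goal" are kept.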

One index that uses the estimated document frequency based on the Poisson distribution is the residual IDF (Inverse Document Frequency) (see Non-Patent Document 2). The residual IDF has conventionally been used to address the problem that the TF (Term Frequency)-IDF word weighting method does not work well for words with a high global frequency, because such words also end up with a large DF value and hence a small IDF value. It is based on the assumption that, because important words are used repeatedly within the same document, their actual DF value is smaller than the DF value estimated by the Poisson distribution.

Documents given the same metadata, on the other hand, are likely to be similar both in content and in the terms they use. The residual DF value in the feature quantity extraction means 12 of this embodiment differs in that it is based on the assumption that, within a group of documents given the same metadata, terms that describe the content are likely to occur disproportionately in the documents of that group, so that their DF value is larger than the DF value estimated by the Poisson distribution.

In other words, the residual IDF is a technique for finding the words that are characteristic within the entire document collection, whereas the residual DF value in the feature quantity extraction means 12 of this embodiment is a technique for finding the words that are characteristic within a particular document group. The two thus differ in the scope over which characteristic words are sought.

The residual DF value also has the advantage that it can be interpreted intuitively as a document frequency, so the meaning of the number is easy to grasp. However, as the number of documents given the same metadata grows, the spread of residual DF values becomes very large. To reduce this spread, the logarithm of the residual DF value, or a value normalized by the number of documents, may be adopted as the feature quantity instead. A simple example of normalization is to divide the residual DF value by the number of documents.
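The two rescaling options can be sketched as follows. The natural logarithm is an assumption, since the base is not specified, and the function names are hypothetical. The sketch also assumes the input scores are positive, which holds when terms with a residual DF of 1.0 or less have already been excluded.

```python
import math

def log_scaled(features):
    """Replace each residual DF value by its natural logarithm (base is an assumption)."""
    return {term: math.log(score) for term, score in features.items()}

def normalized(features, group_size):
    """Divide each residual DF value by the number of documents in the group."""
    return {term: score / group_size for term, score in features.items()}

# Assumed scores for a group of 1,000 documents.
scores = {"jleague": 298.25, "goal": 12.0}
print(log_scaled(scores))
print(normalized(scores, group_size=1000))
```

Both transforms keep the ordering of terms within a group. The logarithm compresses differences between terms, while dividing by the group size makes scores from groups of different sizes comparable.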

As shown in FIG. 5, the metadata feature DB 15 stores metadata, words, and scores, associated with one another for each item of metadata. Here, the metadata serves as the identifier. The words are those used in the document group to which the metadata is assigned, and the score is the value calculated by the feature extraction means 12.

The similarity calculation means 13 calculates the similarity between metadata using the metadata features obtained by the feature extraction means 12. Because each metadata feature is a vector whose elements are pairs of a word and a score such as the residual DF value, the similarity between two items of metadata is obtained as the cosine of their vectors. For example, if C_i and C_j are the feature vectors of two items of metadata, the similarity is given by equation (2):

\mathrm{sim}(C_i, C_j) = \frac{C_i \cdot C_j}{|C_i| \, |C_j|} \qquad (2)
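The cosine similarity between two metadata feature vectors can be sketched as below. The sketch assumes that each feature vector is held as a dictionary mapping a word to its score, which is only one possible representation of the word and score pairs; the function name `cosine_similarity` is hypothetical.

```python
import math

def cosine_similarity(c_i, c_j):
    """Cosine of two sparse feature vectors given as {word: score} dictionaries."""
    # Only words present in both vectors contribute to the dot product.
    dot = sum(score * c_j[word] for word, score in c_i.items() if word in c_j)
    norm_i = math.sqrt(sum(score * score for score in c_i.values()))
    norm_j = math.sqrt(sum(score * score for score in c_j.values()))
    if norm_i == 0.0 or norm_j == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_i * norm_j)
```

Two vectors that share no words yield 0, and two vectors with identical proportions yield 1, so metadata whose document groups use the same characteristic words score close to 1.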

Next, the metadata hierarchization device and method are described. FIG. 6 shows the configuration of a metadata hierarchization device that is one example of an embodiment of the present invention. The metadata hierarchization device 20 of this embodiment is a general-purpose computer that operates according to a predetermined program. It comprises an inter-metadata similarity measuring device 21, consisting for example of the inter-metadata similarity measuring device 10 of FIG. 1, and combining/hierarchizing means 22, which uses a metadata similarity DB (database) 23 that stores the obtained similarities and builds a hierarchy by combining metadata in descending order of similarity.

The overall processing flow is shown in FIG. 7. First, the inter-metadata similarity measuring device 21 calculates the similarities between metadata from the documents with metadata (step S21). Next, the metadata are combined and arranged in a hierarchy using a hierarchical clustering technique (step S22).

Each means is described in detail below, following the processing procedure shown in FIG. 7. As shown in FIG. 8, the metadata similarity DB 23 stores metadata 1, metadata 2, and a similarity, associated with one another. Here, metadata 1 and metadata 2 are the two items of metadata for which the similarity was calculated, and the similarity is the value calculated by the similarity measuring device 21.
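One way to populate records of the form (metadata 1, metadata 2, similarity) is sketched below. It assumes the metadata features are available as a dictionary from metadata name to a {word: score} vector and that every unordered pair is stored once; the function names and the in-memory list standing in for the DB are hypothetical.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine of two sparse vectors given as {word: score} dictionaries."""
    dot = sum(s * v[w] for w, s in u.items() if w in v)
    nu = math.sqrt(sum(s * s for s in u.values()))
    nv = math.sqrt(sum(s * s for s in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def build_similarity_rows(features):
    """Return (metadata1, metadata2, similarity) rows for every unordered pair,
    sorted so that the most similar pairs come first."""
    rows = [
        (m1, m2, cosine(features[m1], features[m2]))
        for m1, m2 in combinations(sorted(features), 2)
    ]
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows
```

Sorting the rows by descending similarity matches the order in which the combining step consumes them, although a real database would more likely rely on an index than on a sorted list.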

FIG. 9 shows an example in which the similarities between metadata are visualized as a graph using the metadata similarity DB 23. Based on one month of blog data, the graph plots the similarity between the category "gourmet" and the 200 categories set by the largest number of users, ordered so that the category set by the most users is at the left end of the x-axis. Only the names of tags with a similarity of 0.6 or higher are displayed, and the tags with high similarity are indeed ones that are intuitively related in content.

FIG. 10 and FIG. 11 are graphs that visualize, in the same manner as FIG. 9, the similarity between the categories "daily life" and "movie", respectively, and the 200 categories set by the largest number of users.

The results in FIGS. 9 to 11 show that similarity is especially high between categories that are considered to denote the same concept and differ only through spelling variation or synonymy. They also show that a category with a vague concept, such as "daily life", has a lower similarity peak than "movie" or "gourmet".

Thus, according to this embodiment, the difference between strongly related and weakly related categories appears clearly, so a good similarity measure is obtained.

The combining/hierarchizing means 22 uses the obtained similarities to merge metadata into sets in descending order of similarity, thereby grouping synonymous metadata together and giving the metadata a hierarchy. Specifically, it uses a technique called hierarchical clustering (see Non-Patent Document 3). Hierarchical clustering methods differ in how the distance between clusters is defined, and include the shortest distance method, the longest distance method, and the group average method; any of them may be used in this means.

As an example, FIG. 12 shows part of the result of clustering blog categories by the shortest distance method, using the similarities obtained by the similarity measuring device 21. FIG. 12 depicts the growth of a cluster of categories related to "food" and shows that a hierarchy is formed as merges are performed one after another.

The termination condition for clustering can be set in several ways: clustering may stop when the number of clusters reaches a preset threshold, it may stop at a preset similarity threshold, or both thresholds may be combined.
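The merging procedure and the termination conditions described above can be sketched as follows. This is a minimal, unoptimized illustration of agglomerative clustering with the shortest distance method, not the implementation of this embodiment. It assumes that similarities are supplied as a dictionary keyed by unordered pairs, that missing pairs have similarity 0, and that the parameter names `min_similarity` and `target_clusters` are hypothetical.

```python
def single_linkage(items, sim, min_similarity=0.0, target_clusters=1):
    """Merge clusters in descending order of similarity.

    items: list of metadata names.
    sim:   dict mapping frozenset({a, b}) to the similarity of a and b.
    Stops when the number of clusters reaches `target_clusters` or when the
    best remaining similarity falls below `min_similarity`.
    Returns the final clusters and the merge history (the hierarchy).
    """
    clusters = [frozenset([m]) for m in items]
    merges = []

    def link(a, b):
        # Shortest distance method: the closest (most similar) cross-cluster pair.
        return max(sim.get(frozenset((x, y)), 0.0) for x in a for y in b)

    while len(clusters) > max(1, target_clusters):
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = link(clusters[i], clusters[j])
                if best is None or s > best[0]:
                    best = (s, i, j)
        s, i, j = best
        if s < min_similarity:
            break  # similarity threshold reached
        merges.append((clusters[i], clusters[j], s))
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters, merges
```

Replacing `max` with `min` in `link` would give the longest distance method, and an average over the cross-cluster pairs would give the group average method; the merge history records which clusters were joined at which similarity and therefore encodes the resulting hierarchy.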

Various embodiments other than the examples above are also conceivable. For example, the upper levels of an existing classification system may be reused for the hierarchy, with highly similar metadata attached at its leaves, which simplifies the hierarchization problem.

Further, a program may be constructed for causing a computer to execute the inter-metadata similarity measuring method and the metadata hierarchization method described above.

A recording medium on which the program is recorded may also be supplied to a system or device, and the CPU (MPU) of that system or device may read and execute the program stored on the recording medium. In this case, the program read from the recording medium itself realizes the functions of the embodiments described above. Examples of recording media on which the program may be recorded include CD-ROM, DVD-ROM, CD-R, CD-RW, MO, and HDD.

FIG. 1 is a block diagram of an inter-metadata similarity measuring device according to one embodiment of the present invention.
FIG. 2 is a flowchart showing the processing flow of the inter-metadata similarity measuring device according to one embodiment of the present invention.
FIG. 3 is an explanatory diagram showing an example of the document DB with metadata used in one embodiment of the present invention.
FIG. 4 is a flowchart showing the processing flow of the feature extraction means according to one embodiment of the present invention.
FIG. 5 is an explanatory diagram showing an example of the metadata feature DB used in one embodiment of the present invention.
FIG. 6 is a block diagram of a metadata hierarchization device according to another embodiment of the present invention.
FIG. 7 is a flowchart showing the processing flow of the metadata hierarchization device according to another embodiment of the present invention.
FIG. 8 is an explanatory diagram showing an example of the metadata similarity DB used in another embodiment of the present invention.
FIG. 9 is a graph showing similarities for the category "gourmet" according to an embodiment of the present invention.
FIG. 10 is a graph showing similarities for the category "daily life" according to an embodiment of the present invention.
FIG. 11 is a graph showing similarities for the category "movie" according to an embodiment of the present invention.
FIG. 12 is an explanatory diagram showing an example of a clustering result according to another embodiment of the present invention.

Explanation of symbols

10, 21: inter-metadata similarity measuring device; 11: document search means; 12: feature extraction means; 13: similarity calculation means; 14: document DB with metadata; 15: metadata feature DB; 20: metadata hierarchization device; 22: combining/hierarchizing means; 23: metadata similarity DB.

Claims

1. An inter-metadata similarity measuring device that measures the similarity between metadata assigned to documents, comprising:
document search means for searching for a group of documents to which the same metadata is assigned;
feature extraction means for extracting, from the document group retrieved by the document search means, as a feature of the metadata, a vector whose elements are pairs of a word in the document group and a score representing the relationship between that word and the document content, wherein the feature extraction means obtains, from statistics of the word over the entire document set, the probability under a Poisson distribution that the word does not appear in a document, obtains as a residual DF value the difference between the document frequency of the word in the document group and the document frequency based on the Poisson distribution, excludes as insignificant any word whose residual DF value is equal to or less than a predetermined value, and outputs as the feature of the document group a vector whose elements are pairs of a word and its residual DF value; and
similarity calculation means for calculating the similarity between metadata using the feature.

2. The inter-metadata similarity measuring device according to claim 1, wherein the feature extraction means
uses, as the score, the difference between the document frequency of a word in the document group and the document frequency estimated by a Poisson distribution based on the appearance frequency of the word over all documents.

3. The inter-metadata similarity measuring device according to claim 1, wherein the feature extraction means
adopts, as the score, the logarithm of the numerical feature value of a word, in order to reduce extreme score disparities between words.

4. The inter-metadata similarity measuring device according to claim 1, wherein the feature extraction means
adopts, as the score, a value normalized by the number of documents in the document group, in order to reduce extreme score disparities caused by differences in the number of documents.

5. A metadata hierarchization device comprising, in addition to the inter-metadata similarity measuring device according to any one of claims 1 to 4, combining/hierarchizing means for combining similar metadata into clusters and arranging them in a hierarchy.

6. An inter-metadata similarity measuring method for measuring the similarity between metadata assigned to documents, comprising:
a document search step in which document search means searches for a group of documents to which the same metadata is assigned;
a feature extraction step in which feature extraction means extracts, from the document group retrieved in the document search step, as a feature of the metadata, a vector whose elements are pairs of a word in the document group and a score representing the relationship between that word and the document content, the feature extraction step including a step of obtaining, from statistics of the word over the entire document set, the probability under a Poisson distribution that the word never appears in a document, a step of obtaining as a residual DF value the difference between the document frequency of the word in the document group and the document frequency based on the Poisson distribution, a step of excluding as insignificant any word whose residual DF value is equal to or less than a predetermined value, and a step of outputting as the feature of the document group a vector whose elements are pairs of a word and its residual DF value; and
a similarity calculation step in which similarity calculation means calculates the similarity between metadata using the feature extracted in the feature extraction step.

7. The inter-metadata similarity measuring method according to claim 6, wherein the feature extraction step
uses, as the score, the difference between the document frequency of a word in the document group and the document frequency estimated by a Poisson distribution based on the appearance frequency of the word over all documents.

8. The inter-metadata similarity measuring method according to claim 6 or 7, wherein the feature extraction step
adopts, as the score, the logarithm of the numerical feature value of a word, in order to reduce extreme score disparities between words.

9. The inter-metadata similarity measuring method according to any one of claims 6 to 8, wherein the feature extraction step
adopts, as the score, a value normalized by the number of documents in the document group, in order to reduce extreme score disparities caused by differences in the number of documents.

10. A metadata hierarchization method comprising, in addition to the steps of the inter-metadata similarity measuring method according to any one of claims 6 to 9, a combining/hierarchizing step in which combining/hierarchizing means combines similar metadata into clusters and arranges them in a hierarchy.

11. An inter-metadata similarity measuring program for causing a computer to execute each step of the inter-metadata similarity measuring method according to any one of claims 6 to 9.

12. A metadata hierarchization program for causing a computer to execute each step of the metadata hierarchization method according to claim 10.

13. A recording medium on which an inter-metadata similarity measuring program is recorded, wherein the program according to claim 11 is recorded on a computer-readable recording medium.

14. A recording medium on which a metadata hierarchization program is recorded, wherein the program according to claim 12 is recorded on a computer-readable recording medium.