JP2012168678A

JP2012168678A - Inter-document similarity calculation device, inter-document similarity calculation method and inter-document similarity calculation program

Info

Publication number: JP2012168678A
Application number: JP2011028181A
Authority: JP
Inventors: Mitsugi Miura; 貢三浦
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2011-02-14
Filing date: 2011-02-14
Publication date: 2012-09-06
Anticipated expiration: 2031-02-14
Also published as: JP5617674B2

Abstract

PROBLEM TO BE SOLVED: To provide an inter-document similarity calculation device capable of calculating similarity with high accuracy while preventing the load of calculating the similarity between documents from becoming excessive.

SOLUTION: A device 100 comprises: a unit (101) that, for each sentence included in each of multiple documents, where N is the total number of characters constituting the sentence, generates suffix information representing, for each integer i from 0 to N-1, the suffix that remains after removing the first i characters of the sentence; a unit (102) that selects, from those suffixes, the suffixes generated from multiple sentences as reference suffixes; a unit (103) that generates, for each of the multiple documents, similarity basic information indicating whether the document includes each reference suffix; and a unit (104) that calculates a similarity representing how similar a first document and a second document are, based on the similarity basic information generated for the first document and the similarity basic information generated for the second document.

Description

The present invention relates to an inter-document similarity calculation device that calculates a similarity representing the degree to which a plurality of documents resemble one another.

Inter-document similarity calculation devices that calculate a similarity representing the degree to which a plurality of documents resemble one another are known. One such device, described in Patent Document 1, performs morphological analysis on each of a plurality of documents to divide the document into words.

The device then counts, for each document and for each word, the number of times the word appears in the document. For each document, it generates similarity basic information representing a vector whose component for each word holds the number of occurrences of that word.

The device calculates a similarity whose value increases as the angle between the vector represented by the similarity basic information generated for a first document and the vector represented by the similarity basic information generated for a second document decreases.

Patent Document 1: JP 2002-108894 A

The above device, however, performs morphological analysis based on words registered in a dictionary in advance. If a document contains a word that is not registered in the dictionary, the device may be unable to use the correct word as a component of the vector represented by the similarity basic information. The same problem arises in an inter-document similarity calculation device configured to use feature words preset by a user as the components of that vector.

As a result, the above device may be unable to calculate the similarity with high accuracy.

It is also conceivable to configure the device so that character strings generated as an index according to the N-gram method are used as the components of the vector represented by the similarity basic information. With this configuration, the similarity can be expected to be calculated with high accuracy even when a document contains a word that is not registered in the dictionary.

However, part of a first word may coincide with the whole of a second word. Suppose, for example, that the first word is "プリンタ" (printer) and the second word is "プリン" (pudding). If the device generates "プリン" as an index entry, it may calculate an excessively large similarity between a first document about printers and a second document about pudding. In this case as well, the similarity may not be calculated with high accuracy.

In addition, the index is generated so as to include character strings that have no particular meaning, such as "ンが食". The total number of generated index entries is therefore relatively large, and the number of components (dimensions) of the vector represented by the similarity basic information becomes excessive. As a result, the load on the device when calculating the similarity from the similarity basic information may become excessive.
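The two N-gram problems described above can be reproduced with a minimal sketch. The helper `char_ngrams` and both example sentences are assumptions made for illustration; the source gives only the words "プリンタ" and "プリン" and the fragment "ンが食", not the sentences they come from.

```python
def char_ngrams(text: str, n: int) -> list[str]:
    """Return every run of n consecutive characters in text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# Hypothetical sentences: one about a printer, one about pudding.
printer_doc = "プリンタを買った"
pudding_doc = "プリンが食べたい"

# The trigram "プリン" is indexed for both documents even though the topics differ.
shared = set(char_ngrams(printer_doc, 3)) & set(char_ngrams(pudding_doc, 3))
print(shared)

# The index also contains fragments with no meaning of their own, such as "ンが食".
print(char_ngrams(pudding_doc, 3))
```

Every position in a sentence contributes one index entry per N-gram length, which is why the vector dimension grows quickly under this scheme.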

Accordingly, an object of the present invention is to provide an inter-document similarity calculation device capable of solving the problems described above, namely that the similarity between documents cannot always be calculated with high accuracy and that the load of calculating the similarity between documents can become excessive.

To achieve this object, an inter-document similarity calculation device according to one aspect of the present invention includes:
suffix information generation means for generating, for each sentence included in each of a plurality of documents, suffix information representing the suffixes of the sentence, where N is the total number of characters constituting the sentence and the suffix for each integer i from 0 to N-1 is the character string that remains after removing the first i characters of the sentence;
reference suffix selection means for selecting, as reference suffixes, those suffixes represented by the generated suffix information that were generated from a plurality of sentences;
similarity basic information generation means for generating, for each of the plurality of documents, similarity basic information indicating whether the document includes each of the selected reference suffixes; and
similarity calculation means for calculating a similarity representing the degree to which a first document and a second document among the plurality of documents resemble each other, based on the similarity basic information generated for the first document and the similarity basic information generated for the second document.

An inter-document similarity calculation method according to another aspect of the present invention is a method including:
generating, for each sentence included in each of a plurality of documents, suffix information representing the suffixes of the sentence, where N is the total number of characters constituting the sentence and the suffix for each integer i from 0 to N-1 is the character string that remains after removing the first i characters of the sentence;
selecting, as reference suffixes, those suffixes represented by the generated suffix information that were generated from a plurality of sentences;
generating, for each of the plurality of documents, similarity basic information indicating whether the document includes each of the selected reference suffixes; and
calculating a similarity representing the degree to which a first document and a second document among the plurality of documents resemble each other, based on the similarity basic information generated for the first document and the similarity basic information generated for the second document.

An inter-document similarity calculation program according to another aspect of the present invention is a program for causing an information processing device to implement:
suffix information generation means for generating, for each sentence included in each of a plurality of documents, suffix information representing the suffixes of the sentence, where N is the total number of characters constituting the sentence and the suffix for each integer i from 0 to N-1 is the character string that remains after removing the first i characters of the sentence;
reference suffix selection means for selecting, as reference suffixes, those suffixes represented by the generated suffix information that were generated from a plurality of sentences;
similarity basic information generation means for generating, for each of the plurality of documents, similarity basic information indicating whether the document includes each of the selected reference suffixes; and
similarity calculation means for calculating a similarity representing the degree to which a first document and a second document among the plurality of documents resemble each other, based on the similarity basic information generated for the first document and the similarity basic information generated for the second document.

Configured as described above, the present invention can calculate the similarity with high accuracy while preventing the load of calculating the similarity between documents from becoming excessive.

FIG. 1 is a block diagram outlining the inter-document similarity calculation device according to the first embodiment of the present invention.
FIG. 2 is an explanatory diagram conceptually showing the vectors represented by the similarity basic information.
FIG. 3 is a flowchart of the inter-document similarity calculation program executed by the inter-document similarity calculation device according to the first embodiment of the present invention.
FIG. 4 is a flowchart of the reference suffix selection process executed by the inter-document similarity calculation device according to the first embodiment of the present invention.
FIG. 5 is a flowchart of the similarity basic information generation process executed by the inter-document similarity calculation device according to the first embodiment of the present invention.
FIG. 6 is an explanatory diagram conceptually showing the relationship among sentences, suffix information, and reference suffixes.
FIG. 7 is a block diagram outlining the inter-document similarity calculation device according to the second embodiment of the present invention.

Embodiments of the inter-document similarity calculation device, the inter-document similarity calculation method, and the inter-document similarity calculation program according to the present invention are described below with reference to FIGS. 1 to 7.

<First Embodiment>
(Configuration)
As shown in FIG. 1, the inter-document similarity calculation device 10 according to the first embodiment is an information processing device. The inter-document similarity calculation device 10 may be a personal computer, a server device, a mobile phone terminal, a PHS (Personal Handy-phone System) terminal, a PDA (Personal Data Assistance, Personal Digital Assistant), a car navigation terminal, a game terminal, or the like.

The inter-document similarity calculation device 10 includes a central processing unit (CPU) and a storage device (memory and a hard disk drive (HDD)), neither of which is shown. The device implements the functions described below by having the CPU execute a program stored in the storage device.

(Functions)
FIG. 1 is a block diagram showing the functions of the inter-document similarity calculation device 10 configured as described above.
The functions of the inter-document similarity calculation device 10 include a document information storage unit 11, a suffix information generation unit (suffix information generation means) 12, a reference suffix selection unit (reference suffix selection means) 13, a similarity basic information generation unit (similarity basic information generation means) 14, and a similarity calculation unit (similarity calculation means) 15.

The document information storage unit 11 stores a plurality of pieces of document information. Document information is information representing a document. A document includes at least one sentence. A sentence is a character string made up of a plurality of characters. The document information stored in the document information storage unit 11 may be information entered by a user or information received from another information processing device.

The suffix information generation unit 12 generates suffix information for each sentence included in each of the documents represented by the document information stored in the document information storage unit 11. Where N is the total number of characters constituting the sentence, the suffix information represents, for each integer i from 0 to N-1, the suffix that remains after removing the first i characters of the sentence (that is, N suffixes in total).

In this example, the suffix information is information representing a suffix array. A suffix array is an array of the suffixes sorted in dictionary order.

For example, when the sentence is "BANANA", the suffix information generation unit 12 takes the six suffixes "BANANA", "ANANA", "NANA", "ANA", "NA", and "A", sorts them in dictionary order, and generates suffix information representing the resulting array: "A", "ANA", "ANANA", "BANANA", "NA", and "NANA".
The suffix information may instead be information representing a suffix tree.
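The "BANANA" example can be reproduced with a minimal sketch. The function names are assumptions for illustration, and plain sorting is used here for clarity rather than any particular suffix array construction algorithm.

```python
def suffix_array(sentence: str) -> list[str]:
    """Return the N suffixes of the sentence, sorted in dictionary order."""
    suffixes = [sentence[i:] for i in range(len(sentence))]  # i = 0 .. N-1
    return sorted(suffixes)

def suffix_array_positions(sentence: str) -> list[int]:
    """Same ordering, but storing only the start position of each suffix."""
    return sorted(range(len(sentence)), key=lambda i: sentence[i:])

print(suffix_array("BANANA"))
print(suffix_array_positions("BANANA"))
```

Storing start positions instead of the strings themselves is the usual way to keep a suffix array compact, since each suffix can be recovered as `sentence[i:]`.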

The reference suffix selection unit 13 selects, as reference suffixes, those suffixes represented by the suffix information generated by the suffix information generation unit 12 that were generated from a plurality of sentences. The reference suffix selection unit 13 may instead be configured to select, as reference suffixes, the suffixes generated from more sentences than a preset threshold number (an integer of 2 or more).

The similarity basic information generation unit 14 generates similarity basic information for each of the documents represented by the document information stored in the document information storage unit 11. The similarity basic information indicates whether the document includes each of the reference suffixes selected by the reference suffix selection unit 13.

In this example, the similarity basic information is information representing a vector that has one component for each of the reference suffixes selected by the reference suffix selection unit 13.

Specifically, the similarity basic information generation unit 14 sets the value of each component of the vector to a positive value when the document includes the reference suffix represented by that component, and to 0 when the document does not include it.

Furthermore, in this example, the similarity basic information generation unit 14 sets the value of each component of the vector to the number of times the document includes the reference suffix represented by that component (that is, the number of times the reference suffix appears in the document) multiplied by an increment value.

The increment value is the amount by which the component for a reference suffix increases each time the number of occurrences of that reference suffix in the document increases by one. In this example, the similarity basic information generation unit 14 sets the increment value so that it becomes larger as the total number of characters constituting the reference suffix increases (for example, a value directly proportional to that number of characters).

In this way, the similarity basic information generation unit 14 can be said to generate, as the similarity basic information, information representing a vector whose component for each reference suffix takes a larger value as the number of occurrences of that reference suffix in the document increases.
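The component values described above can be written compactly as follows. The notation is introduced here for illustration and does not appear in the original: $d$ is a document, $s_k$ is the $k$-th reference suffix, $c(d, s_k)$ is the number of times $s_k$ appears in $d$, $\lvert s_k \rvert$ is its length in characters, $w(s_k)$ is the increment value, and $\alpha$ is an assumed positive proportionality constant for the directly proportional case that the text gives as an example.

```latex
v_{d,k} = c(d, s_k)\, w(s_k), \qquad w(s_k) = \alpha\,\lvert s_k \rvert, \qquad \alpha > 0
```

When the document does not contain $s_k$, $c(d, s_k) = 0$ and the component is 0; otherwise the component is positive, which matches the positive-or-zero rule stated for the vector components.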

The similarity calculation unit 15 calculates a similarity representing the degree to which any two of the documents represented by the document information stored in the document information storage unit 11 resemble each other. It calculates the similarity based on the similarity basic information generated by the similarity basic information generation unit 14 for one of the two documents (the first document) and the similarity basic information generated by the similarity basic information generation unit 14 for the other (the second document).

Specifically, the similarity calculation unit 15 calculates, as the similarity, a value that increases as the angle between the vector represented by the similarity basic information generated for the first document and the vector represented by the similarity basic information generated for the second document decreases (in this example, the cosine of the angle formed by the two vectors).
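The cosine of the angle between two vectors can be computed as in the following minimal sketch. The function name is an assumption, and returning 0.0 when either vector is all zeros is also an assumption, since the source does not say how a document containing no reference suffix is handled.

```python
import math

def cosine_similarity(v1: list[float], v2: list[float]) -> float:
    """Cosine of the angle between v1 and v2 (vectors of equal length)."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # assumption: a zero vector is treated as dissimilar to everything
    return dot / (norm1 * norm2)

print(cosine_similarity([1, 2, 0], [2, 4, 0]))  # same direction
print(cosine_similarity([1, 0], [0, 1]))        # orthogonal
```

Because every component is non-negative, the result lies between 0 and 1, and a smaller angle gives a larger value.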

FIG. 2 is an explanatory diagram conceptually showing a vector V1 for the first document and a vector V2 for the second document when "猫" (cat), "白い" (white), and "鼠" (mouse) are selected as the reference suffixes.

In this example, the similarity calculation unit 15 calculates the similarity for every combination of the documents represented by the document information stored in the document information storage unit 11. The similarity calculation unit 15 may instead be configured to calculate the similarity only for specific combinations of those documents (for example, combinations designated by a user).

(Operation)
Next, the operation of the inter-document similarity calculation device 10 described above is explained.
The CPU of the inter-document similarity calculation device 10 executes the inter-document similarity calculation program shown in the flowcharts of FIGS. 3 to 5.

Specifically, when the inter-document similarity calculation device 10 starts processing the inter-document similarity calculation program, it first generates suffix information for each sentence, that is, for each sentence included in each of the documents represented by the document information stored in the document information storage unit 11 (step S101).

For example, as shown in FIGS. 6(A) and 6(B), suppose that the first document contains "黒い猫も白い猫も鳴いた" ("Both the black cat and the white cat meowed") as sentence #1 and the second document contains "私の白い猫も鳴いた" ("My white cat meowed too") as sentence #2. In this case, the device generates information representing the suffix array shown in FIG. 6(A) as the suffix information for sentence #1, and information representing the suffix array shown in FIG. 6(B) as the suffix information for sentence #2.

The device then executes the reference suffix selection process for each of the suffixes represented by the generated suffix information (step S102). Specifically, it executes the reference suffix selection process shown in FIG. 4 for each suffix.

That is, the inter-document similarity calculation device 10 first obtains the number of sentences from which the suffix being processed was obtained, that is, the number of sentences containing that suffix (the base sentence count) (step S201).

Next, the device determines whether the obtained base sentence count is larger than a preset threshold number (1 in this example) (step S202).
If the base sentence count is larger than the threshold number, the device determines "Yes", proceeds to step S203, and selects the suffix being processed as a reference suffix. The device then ends the reference suffix selection process.

If the base sentence count is equal to or less than the threshold number, the device determines "No" and ends the reference suffix selection process without selecting the suffix being processed as a reference suffix.

For example, as shown in FIGS. 6(A) and 6(B), suppose that the first document contains "黒い猫も白い猫も鳴いた" as sentence #1 and the second document contains "私の白い猫も鳴いた" as sentence #2. In this case, the device selects the suffixes shown in FIG. 6(C) as the reference suffixes.
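Steps S201 to S203 can be sketched as follows. This is an illustration under one stated assumption: a suffix counts as "generated from" a sentence when the identical string is a suffix of that sentence. The function and variable names are hypothetical, and the text alone does not confirm that the output matches FIG. 6(C) exactly.

```python
from collections import defaultdict

def select_reference_suffixes(sentences: list[str], threshold: int = 1) -> list[str]:
    """Return the suffixes generated from more than `threshold` sentences."""
    # Map each suffix string to the set of sentences it was generated from.
    origins = defaultdict(set)
    for sentence_id, sentence in enumerate(sentences):
        for i in range(len(sentence)):
            origins[sentence[i:]].add(sentence_id)

    selected = []
    for suffix, sentence_ids in origins.items():
        base_sentence_count = len(sentence_ids)  # step S201
        if base_sentence_count > threshold:      # step S202
            selected.append(suffix)              # step S203
    return sorted(selected, key=len, reverse=True)

sentences = ["黒い猫も白い猫も鳴いた", "私の白い猫も鳴いた"]
for suffix in select_reference_suffixes(sentences):
    print(suffix)
```

Under this assumption the two sentences share seven suffixes, from "白い猫も鳴いた" down to "た". Raising `threshold` to 2 or more, as the optional configuration allows, would select nothing for this two-sentence input.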

Next, the inter-document similarity calculation device 10 executes the similarity basic information generation process for each document, that is, for each of the documents represented by the document information stored in the document information storage unit 11 (step S103).

Specifically, the device executes the similarity basic information generation process shown in FIG. 5 for each document.

That is, the device executes a loop (steps S301 to S305) that processes the selected reference suffixes one at a time in order.

In the loop, the inter-document similarity calculation device 10 first obtains the total number of characters constituting the reference suffix being processed (the character count of the reference suffix) (step S302). Next, it obtains the number of times the reference suffix being processed is included in the document for which the similarity basic information is being generated (the reference suffix count) (step S303).

The device then multiplies the obtained reference suffix count by the obtained character count of the reference suffix, which serves as the increment value, and takes the product as the value of the component representing the reference suffix being processed (the component value) (step S304).

After executing the loop (steps S301 to S305) for all of the reference suffixes selected in step S203, the device proceeds to step S306.

The device then generates similarity basic information representing a vector that has a component for each of the selected reference suffixes (step S306). After that, the device ends the similarity basic information generation process.
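The loop of steps S301 to S306 can be sketched as follows. The function names, the example document, and the three example reference suffixes are hypothetical. Treating the document as a single string and counting overlapping substring occurrences are assumptions; the source says only that the number of occurrences of the reference suffix in the document is obtained.

```python
def count_occurrences(text: str, pattern: str) -> int:
    """Count occurrences of pattern in text, allowing overlaps."""
    count, start = 0, text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count

def similarity_basic_information(document: str, reference_suffixes: list[str]) -> list[int]:
    """Build the vector for one document, one component per reference suffix."""
    vector = []
    for suffix in reference_suffixes:                       # S301: loop start
        char_count = len(suffix)                            # S302: increment value
        suffix_count = count_occurrences(document, suffix)  # S303
        vector.append(char_count * suffix_count)            # S304: component value
    # S305: loop end
    return vector                                           # S306

print(similarity_basic_information("the cat sat on the cat mat", ["cat", "at", "dog"]))
# prints [6, 8, 0]
```

Here "cat" appears twice (2 x 3 characters), "at" appears four times (4 x 2 characters), and "dog" does not appear, so its component is 0.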

Next, the inter-document similarity calculation device 10 calculates the similarity for every combination of the documents represented by the document information stored in the document information storage unit 11 (step S104). Specifically, it calculates, as the similarity, the cosine of the angle formed by the vector represented by the similarity basic information generated for the first document and the vector represented by the similarity basic information generated for the second document.
The device then ends the processing of the inter-document similarity calculation program.
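Putting steps S101 to S104 together, an end-to-end sketch might look like the following. All names and the three example documents are hypothetical. The same assumptions apply as for the individual steps: a suffix is selected when the identical string is a suffix of more than `threshold` sentences, occurrences are counted as overlapping substrings within each sentence, and a document with an all-zero vector is given similarity 0.0.

```python
import math
from collections import defaultdict
from itertools import combinations

def count_occurrences(text: str, pattern: str) -> int:
    """Count occurrences of pattern in text, allowing overlaps."""
    count, start = 0, text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count

def pairwise_similarities(documents: dict, threshold: int = 1) -> dict:
    # S101: generate every suffix of every sentence, remembering its source sentence.
    origins = defaultdict(set)
    for doc_id, sentences in documents.items():
        for n, sentence in enumerate(sentences):
            for i in range(len(sentence)):
                origins[sentence[i:]].add((doc_id, n))
    # S102: keep the suffixes generated from more than `threshold` sentences.
    reference = sorted(s for s, src in origins.items() if len(src) > threshold)
    # S103: one vector per document; component = occurrences x suffix length.
    vectors = {
        doc_id: [len(s) * sum(count_occurrences(t, s) for t in sentences)
                 for s in reference]
        for doc_id, sentences in documents.items()
    }
    # S104: cosine similarity for every pair of documents.
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm else 0.0
    return {(a, b): cosine(vectors[a], vectors[b])
            for a, b in combinations(documents, 2)}

docs = {
    "d1": ["the black cat meowed", "the white cat meowed"],
    "d2": ["my white cat meowed"],
    "d3": ["the printer is broken"],
}
sims = pairwise_similarities(docs)
for pair, value in sims.items():
    print(pair, round(value, 3))
```

In this toy input the two cat documents score close to 1, while the printer document shares no reference suffix with either and scores 0.0.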

以上、説明したように、第１実施形態に係る文書間類似度算出装置１０によれば、文書間の類似度を算出する際の文書間類似度算出装置１０の負荷が過大となることを防止しながら、高い精度にて文書間の類似度を算出することができる。 As described above, the inter-document similarity calculation device 10 according to the first embodiment can calculate the similarity between documents with high accuracy while preventing the load on the inter-document similarity calculation device 10 from becoming excessive when the similarity is calculated.

また、第１実施形態に係る文書間類似度算出装置１０は、文書が含む基準接尾部の数が１だけ増える毎に当該基準接尾部を表す成分としての値を増加させる増分値を、当該基準接尾部を構成する文字の総数が多くなるほど大きくするように構成されている。 Further, the inter-document similarity calculation device 10 according to the first embodiment is configured such that the increment value, by which the value of the component representing a reference suffix is increased each time the number of that reference suffix included in the document increases by 1, becomes larger as the total number of characters constituting the reference suffix increases.

ところで、接尾部を構成する文字の総数が多くなるほど、当該接尾部は、当該接尾部を含む文書の特徴をよく表す。従って、上記のように構成された文書間類似度算出装置１０によれば、より一層高い精度にて文書間の類似度を算出することができる。 By the way, as the total number of characters constituting the suffix portion increases, the suffix portion better represents the characteristics of the document including the suffix portion. Therefore, according to the inter-document similarity calculation apparatus 10 configured as described above, it is possible to calculate the similarity between documents with higher accuracy.

なお、第１実施形態の変形例に係る文書間類似度算出装置１０は、生成された接尾部情報が表す接尾部の中から、複数の文書に基づいて生成された接尾部を、基準接尾部として選択するように構成される。 Note that the inter-document similarity calculation device 10 according to a modification of the first embodiment is configured to select, as the reference suffixes, the suffixes generated from a plurality of documents from among the suffixes represented by the generated suffix information.

ところで、同一の接尾部を含む文書の数が多くなるほど、当該接尾部は、当該接尾部を含む文書の特徴をよく表す。従って、このように文書間類似度算出装置１０を構成することにより、より一層高い精度にて文書間の類似度を算出することができる。 By the way, the greater the number of documents that include the same suffix, the better that suffix represents the characteristics of the documents that include it. Therefore, by configuring the inter-document similarity calculation device 10 in this way, the similarity between documents can be calculated with even higher accuracy.
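The selection rule of this modification can be sketched as follows. The function name `select_reference_suffixes`, the `min_docs` parameter, and the layout of `docs` (document identifier mapped to a list of sentences) are hypothetical choices made for the illustration.

```python
from collections import defaultdict


def select_reference_suffixes(docs: dict[str, list[str]],
                              min_docs: int = 2) -> list[str]:
    """Keep only suffixes that were generated from at least min_docs documents."""
    owners: dict[str, set[str]] = defaultdict(set)
    for doc_id, sentences in docs.items():
        for sentence in sentences:
            for i in range(len(sentence)):  # i = 0 .. N-1
                owners[sentence[i:]].add(doc_id)
    return sorted(s for s, ids in owners.items() if len(ids) >= min_docs)


# "bc" and "c" occur in d1 and d2; repeats inside d1 alone would not qualify.
docs = {"d1": ["abc", "xbc"], "d2": ["zbc"], "d3": ["q"]}
print(select_reference_suffixes(docs))  # ['bc', 'c']
```

Tracking the set of source documents rather than a raw count means a suffix repeated many times inside a single document is not selected.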

また、文書間類似度算出装置１０は、生成された接尾部情報と、当該接尾部情報を生成する基となった文書を識別するための文書識別情報と、を対応付けて記憶するように構成されていてもよい。また、文書間類似度算出装置１０は、生成された類似度基礎情報と、当該類似度基礎情報を生成する基となった文書を識別するための文書識別情報と、を対応付けて記憶するように構成されていてもよい。 Further, the inter-document similarity calculation device 10 is configured to store the generated suffix information in association with the document identification information for identifying the document that is the basis for generating the suffix information. May be. Further, the inter-document similarity calculation device 10 stores the generated similarity basic information and the document identification information for identifying the document that is the basis for generating the similarity basic information in association with each other. It may be configured.

＜第２実施形態＞
次に、本発明の第２実施形態に係る文書間類似度算出装置について図７を参照しながら説明する。
第２実施形態に係る文書間類似度算出装置１００は、
複数の文書のそれぞれが含む文毎に、当該文を構成する文字の総数をＮにより表した場合に、０からＮ−１までの整数ｉのそれぞれに対する、当該文の先頭からｉ文字を除いた残余の文字列である接尾部を表す情報である接尾部情報を生成する接尾部情報生成部（接尾部情報生成手段）１０１と、
上記生成された接尾部情報が表す接尾部の中から、複数の文に基づいて生成された接尾部を、基準接尾部として選択する基準接尾部選択部（基準接尾部選択手段）１０２と、
上記複数の文書のそれぞれに対して、当該文書が上記選択された基準接尾部のそれぞれを含むか否かを表す類似度基礎情報を生成する類似度基礎情報生成部（類似度基礎情報生成手段）１０３と、
上記複数の文書のうちの第１の文書に対して上記生成された類似度基礎情報と、当該複数の文書のうちの第２の文書に対して上記生成された類似度基礎情報と、に基づいて、当該第１の文書と当該第２の文書とが類似している程度を表す類似度を算出する類似度算出部（類似度算出手段）１０４と、
を備える。
<Second Embodiment>
Next, an inter-document similarity calculation device according to a second embodiment of the present invention will be described with reference to FIG. 7.
The inter-document similarity calculation device 100 according to the second embodiment includes:
a suffix information generating unit (suffix information generating means) 101 that generates, for each sentence included in each of a plurality of documents, suffix information representing, for each integer i from 0 to N-1 where N is the total number of characters constituting the sentence, the suffix that is the character string remaining after removing i characters from the head of the sentence;
a reference suffix selecting unit (reference suffix selecting means) 102 that selects, as reference suffixes, the suffixes generated from a plurality of sentences from among the suffixes represented by the generated suffix information;
a similarity basic information generation unit (similarity basic information generation means) 103 that generates, for each of the plurality of documents, similarity basic information indicating whether or not the document includes each of the selected reference suffixes; and
a similarity calculation unit (similarity calculation means) 104 that calculates a similarity indicating the degree to which a first document and a second document among the plurality of documents are similar, based on the similarity basic information generated for the first document and the similarity basic information generated for the second document.
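The four units 101 to 104 can be sketched end to end as below. All function names and the sample data are hypothetical. Two assumptions are made: a document "includes" a reference suffix when one of its sentences yields that suffix, and unit 104 uses the cosine measure of the first embodiment, which the second embodiment does not mandate.

```python
import math
from collections import defaultdict


def generate_suffix_info(docs):
    """Unit 101: map each suffix to the (document, sentence index) pairs it came from."""
    origins = defaultdict(set)
    for doc_id, sentences in docs.items():
        for k, sentence in enumerate(sentences):
            for i in range(len(sentence)):  # i = 0 .. N-1
                origins[sentence[i:]].add((doc_id, k))
    return dict(origins)


def select_reference_suffixes(origins):
    """Unit 102: keep suffixes generated from two or more sentences."""
    return sorted(s for s, where in origins.items() if len(where) >= 2)


def generate_basic_info(origins, refs, doc_id):
    """Unit 103: 1 if the document yields the reference suffix, otherwise 0."""
    return [int(any(d == doc_id for d, _ in origins[r])) for r in refs]


def similarity(u, v):
    """Unit 104: cosine of the angle between the vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


docs = {"A": ["abc", "xyz"], "B": ["zbc"], "C": ["qyz"]}
origins = generate_suffix_info(docs)
refs = select_reference_suffixes(origins)  # ['bc', 'c', 'yz', 'z']
vectors = {d: generate_basic_info(origins, refs, d) for d in docs}
print(round(similarity(vectors["A"], vectors["B"]), 3))  # 0.707
print(similarity(vectors["B"], vectors["C"]))            # 0.0
```

Because only suffixes shared by at least two sentences become vector components, the vector length is bounded by the number of shared suffixes rather than by the total number of suffixes.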

これによれば、文書間の類似度を算出する際の文書間類似度算出装置１００の負荷が過大となることを防止しながら、高い精度にて文書間の類似度を算出することができる。 Accordingly, it is possible to calculate the similarity between documents with high accuracy while preventing an excessive load on the inter-document similarity calculation apparatus 100 when calculating the similarity between documents.

以上、上記実施形態を参照して本願発明を説明したが、本願発明は、上述した実施形態に限定されるものではない。本願発明の構成及び詳細に、本願発明の範囲内において当業者が理解し得る様々な変更をすることができる。 Although the present invention has been described with reference to the above embodiment, the present invention is not limited to the above-described embodiment. Various changes that can be understood by those skilled in the art can be made to the configuration and details of the present invention within the scope of the present invention.

例えば、文書間類似度算出装置は、算出された類似度に基づいて、複数の文書を分類する（例えば、クラスタリングする）ように構成されていてもよい。 For example, the inter-document similarity calculation apparatus may be configured to classify (for example, cluster) a plurality of documents based on the calculated similarity.
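The text does not specify a clustering method. As one possible illustration, the sketch below groups documents whose pairwise similarity reaches a threshold (single-link grouping with union-find). The function name `cluster_by_threshold` and the threshold value 0.5 are hypothetical.

```python
def cluster_by_threshold(doc_ids, sims, threshold=0.5):
    """Group documents connected by a chain of pairs with similarity >= threshold."""
    parent = {d: d for d in doc_ids}

    def find(x):
        # Follow parent links to the representative, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), value in sims.items():
        if value >= threshold:
            parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for d in doc_ids:
        groups.setdefault(find(d), []).append(d)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical similarities for three documents.
sims = {("A", "B"): 0.9, ("A", "C"): 0.1, ("B", "C"): 0.2}
print(cluster_by_threshold(["A", "B", "C"], sims))  # [['A', 'B'], ['C']]
```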

なお、上記各実施形態において文書間類似度算出装置の各機能は、ＣＰＵがプログラム（ソフトウェア）を実行することにより実現されていたが、回路等のハードウェアにより実現されていてもよい。 In the above embodiments, each function of the inter-document similarity calculation device is realized by the CPU executing a program (software), but may be realized by hardware such as a circuit.

また、上記各実施形態においてプログラムは、記憶装置に記憶されていたが、コンピュータが読み取り可能な記録媒体に記憶されていてもよい。例えば、記録媒体は、フレキシブルディスク、光ディスク、光磁気ディスク、及び、半導体メモリ等の可搬性を有する媒体である。 In each of the above embodiments, the program is stored in the storage device, but may be stored in a computer-readable recording medium. For example, the recording medium is a portable medium such as a flexible disk, an optical disk, a magneto-optical disk, and a semiconductor memory.

また、上記実施形態の他の変形例として、上述した実施形態及び変形例の任意の組み合わせが採用されてもよい。 In addition, as another modified example of the above-described embodiment, any combination of the above-described embodiments and modified examples may be employed.

＜付記＞
上記実施形態の一部又は全部は、以下の付記のように記載され得るが、以下には限られない。 <Appendix>
A part or all of the above embodiment can be described as the following supplementary notes, but is not limited thereto.

（付記１）
複数の文書のそれぞれが含む文毎に、当該文を構成する文字の総数をＮにより表した場合に、０からＮ−１までの整数ｉのそれぞれに対する、当該文の先頭からｉ文字を除いた残余の文字列である接尾部を表す情報である接尾部情報を生成する接尾部情報生成手段と、
前記生成された接尾部情報が表す接尾部の中から、複数の文に基づいて生成された接尾部を、基準接尾部として選択する基準接尾部選択手段と、
前記複数の文書のそれぞれに対して、当該文書が前記選択された基準接尾部のそれぞれを含むか否かを表す類似度基礎情報を生成する類似度基礎情報生成手段と、
前記複数の文書のうちの第１の文書に対して前記生成された類似度基礎情報と、当該複数の文書のうちの第２の文書に対して前記生成された類似度基礎情報と、に基づいて、当該第１の文書と当該第２の文書とが類似している程度を表す類似度を算出する類似度算出手段と、
を備える文書間類似度算出装置。
(Appendix 1)
An inter-document similarity calculation device comprising:
suffix information generating means for generating, for each sentence included in each of a plurality of documents, suffix information representing, for each integer i from 0 to N-1 where N is the total number of characters constituting the sentence, the suffix that is the character string remaining after removing i characters from the head of the sentence;
reference suffix selection means for selecting, as reference suffixes, the suffixes generated from a plurality of sentences from among the suffixes represented by the generated suffix information;
similarity basic information generating means for generating, for each of the plurality of documents, similarity basic information indicating whether or not the document includes each of the selected reference suffixes; and
similarity calculating means for calculating a similarity indicating the degree to which a first document and a second document among the plurality of documents are similar, based on the similarity basic information generated for the first document and the similarity basic information generated for the second document.

これによれば、文書間の類似度を算出する際の文書間類似度算出装置の負荷が過大となることを防止しながら、高い精度にて文書間の類似度を算出することができる。 According to this, it is possible to calculate the similarity between documents with high accuracy while preventing an excessive load on the inter-document similarity calculation apparatus when calculating the similarity between documents.

（付記２）
付記１に記載の文書間類似度算出装置であって、
前記類似度基礎情報生成手段は、前記選択された基準接尾部のそれぞれに対して、前記文書が当該基準接尾部を含む場合に当該基準接尾部を表す成分として、正の値を有し、一方、当該文書が当該基準接尾部を含まない場合に当該成分として０を有するベクトルを表す情報を、前記類似度基礎情報として生成するように構成され、
前記類似度算出手段は、前記第１の文書に対して生成された前記類似度基礎情報が表すベクトルと、前記第２の文書に対して生成された前記類似度基礎情報が表すベクトルと、の間の角度が小さくなるほど大きくなる値を前記類似度として算出するように構成された文書間類似度算出装置。
(Appendix 2)
The inter-document similarity calculation device according to appendix 1, wherein
the similarity basic information generating means is configured to generate, as the similarity basic information, information representing a vector that, for each of the selected reference suffixes, has a positive value as the component representing the reference suffix when the document includes the reference suffix and has 0 as that component when the document does not include the reference suffix, and
the similarity calculating means is configured to calculate, as the similarity, a value that increases as the angle between the vector represented by the similarity basic information generated for the first document and the vector represented by the similarity basic information generated for the second document decreases.
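One value that satisfies the condition in Appendix 2 is the cosine of the angle, which the first embodiment uses in step S104. The symbols v1, v2, d1, d2 and theta below are introduced here only for illustration and do not appear in the original text.

```latex
\[
\mathrm{sim}(d_1, d_2) = \cos\theta
  = \frac{\mathbf{v}_1 \cdot \mathbf{v}_2}
         {\lVert \mathbf{v}_1 \rVert \, \lVert \mathbf{v}_2 \rVert},
\qquad 0 \le \theta \le \frac{\pi}{2}.
\]
```

Because every component is either positive or 0, the dot product of two nonzero vectors is nonnegative, so the angle lies between 0 and pi/2. On that interval the cosine decreases strictly as the angle grows, so a smaller angle always gives a larger similarity.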

（付記３）
付記２に記載の文書間類似度算出装置であって、
前記類似度基礎情報生成手段は、前記文書が含む前記基準接尾部の数が多くなるほど、当該基準接尾部を表す成分として、大きくなる値を有する前記ベクトルを表す情報を、前記類似度基礎情報として生成するように構成された文書間類似度算出装置。
(Appendix 3)
The inter-document similarity calculation device according to appendix 2, wherein
the similarity basic information generating means is configured to generate, as the similarity basic information, information representing the vector in which the component representing a reference suffix has a value that increases as the number of that reference suffix included in the document increases.

（付記４）
付記３に記載の文書間類似度算出装置であって、
前記類似度基礎情報生成手段は、前記文書が含む前記基準接尾部の数が１だけ増える毎に当該基準接尾部を表す成分としての値を増加させる増分値を、当該基準接尾部を構成する文字の総数が多くなるほど大きくするように構成された文書間類似度算出装置。
(Appendix 4)
The inter-document similarity calculation device according to appendix 3, wherein
the similarity basic information generating means is configured such that the increment value, by which the value of the component representing a reference suffix is increased each time the number of that reference suffix included in the document increases by 1, becomes larger as the total number of characters constituting the reference suffix increases.

ところで、接尾部を構成する文字の総数が多くなるほど、当該接尾部は、当該接尾部を含む文書の特徴をよく表す。従って、上記のように文書間類似度算出装置を構成することにより、より一層高い精度にて文書間の類似度を算出することができる。 By the way, as the total number of characters constituting the suffix portion increases, the suffix portion better represents the characteristics of the document including the suffix portion. Therefore, by configuring the inter-document similarity calculation apparatus as described above, it is possible to calculate the similarity between documents with higher accuracy.

（付記５）
付記１乃至付記４のいずれか一項に記載の文書間類似度算出装置であって、
前記基準接尾部選択手段は、前記生成された接尾部情報が表す接尾部の中から、複数の文書に基づいて生成された接尾部を、前記基準接尾部として選択するように構成された文書間類似度算出装置。
(Appendix 5)
The inter-document similarity calculation device according to any one of appendix 1 to appendix 4, wherein
the reference suffix selection means is configured to select, as the reference suffixes, the suffixes generated from a plurality of documents from among the suffixes represented by the generated suffix information.

ところで、同一の接尾部を含む文書の数が多くなるほど、当該接尾部は、当該接尾部を含む文書の特徴をよく表す。従って、上記のように文書間類似度算出装置を構成することにより、より一層高い精度にて文書間の類似度を算出することができる。 By the way, the greater the number of documents that include the same suffix, the better that suffix represents the characteristics of the documents that include it. Therefore, by configuring the inter-document similarity calculation device as described above, the similarity between documents can be calculated with even higher accuracy.

（付記６）
付記１乃至付記５のいずれか一項に記載の文書間類似度算出装置であって、
前記接尾部情報は、接尾辞木、又は、接尾辞配列を表す情報である文書間類似度算出装置。
(Appendix 6)
The inter-document similarity calculation device according to any one of appendix 1 to appendix 5, wherein
the suffix information is information representing a suffix tree or a suffix array.
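To illustrate the suffix array option, the sketch below builds a suffix array for a single string with a naive sort and counts occurrences of a pattern by binary search. The function names are hypothetical, and indexing one string rather than every sentence of every document is a simplification made for the example.

```python
from bisect import bisect_left, bisect_right


def build_suffix_array(text: str) -> list[int]:
    """Start positions of all suffixes of text in lexicographic order.

    Naive construction that sorts the suffix strings directly; it is meant
    only to show the data structure, not to be efficient.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def count_occurrences(text: str, sa: list[int], pattern: str) -> int:
    """Count the positions at which pattern occurs in text.

    Truncating the sorted suffixes to len(pattern) characters keeps them
    sorted, so the matches form one contiguous run found by bisection.
    Materializing the truncated list is done here for clarity only.
    """
    prefixes = [text[i:i + len(pattern)] for i in sa]
    return bisect_right(prefixes, pattern) - bisect_left(prefixes, pattern)


text = "banana"
sa = build_suffix_array(text)
print(sa)                                  # [5, 3, 1, 0, 4, 2]
print(count_occurrences(text, sa, "ana"))  # 2
```

Since equal prefixes sit next to each other in the sorted order, suffixes that share leading characters can be found without comparing every pair of suffixes.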

（付記７）
複数の文書のそれぞれが含む文毎に、当該文を構成する文字の総数をＮにより表した場合に、０からＮ−１までの整数ｉのそれぞれに対する、当該文の先頭からｉ文字を除いた残余の文字列である接尾部を表す情報である接尾部情報を生成し、
前記生成された接尾部情報が表す接尾部の中から、複数の文に基づいて生成された接尾部を、基準接尾部として選択し、
前記複数の文書のそれぞれに対して、当該文書が前記選択された基準接尾部のそれぞれを含むか否かを表す類似度基礎情報を生成し、
前記複数の文書のうちの第１の文書に対して前記生成された類似度基礎情報と、当該複数の文書のうちの第２の文書に対して前記生成された類似度基礎情報と、に基づいて、当該第１の文書と当該第２の文書とが類似している程度を表す類似度を算出する、文書間類似度算出方法。
(Appendix 7)
An inter-document similarity calculation method comprising:
generating, for each sentence included in each of a plurality of documents, suffix information representing, for each integer i from 0 to N-1 where N is the total number of characters constituting the sentence, the suffix that is the character string remaining after removing i characters from the head of the sentence;
selecting, as reference suffixes, the suffixes generated from a plurality of sentences from among the suffixes represented by the generated suffix information;
generating, for each of the plurality of documents, similarity basic information indicating whether or not the document includes each of the selected reference suffixes; and
calculating a similarity indicating the degree to which a first document and a second document among the plurality of documents are similar, based on the similarity basic information generated for the first document and the similarity basic information generated for the second document.

（付記８）
付記７に記載の文書間類似度算出方法であって、
前記選択された基準接尾部のそれぞれに対して、前記文書が当該基準接尾部を含む場合に当該基準接尾部を表す成分として、正の値を有し、一方、当該文書が当該基準接尾部を含まない場合に当該成分として０を有するベクトルを表す情報を、前記類似度基礎情報として生成し、
前記第１の文書に対して生成された前記類似度基礎情報が表すベクトルと、前記第２の文書に対して生成された前記類似度基礎情報が表すベクトルと、の間の角度が小さくなるほど大きくなる値を前記類似度として算出する、文書間類似度算出方法。
(Appendix 8)
The inter-document similarity calculation method according to appendix 7, comprising:
generating, as the similarity basic information, information representing a vector that, for each of the selected reference suffixes, has a positive value as the component representing the reference suffix when the document includes the reference suffix and has 0 as that component when the document does not include the reference suffix; and
calculating, as the similarity, a value that increases as the angle between the vector represented by the similarity basic information generated for the first document and the vector represented by the similarity basic information generated for the second document decreases.

（付記９）
付記８に記載の文書間類似度算出方法であって、
前記文書が含む前記基準接尾部の数が多くなるほど、当該基準接尾部を表す成分として、大きくなる値を有する前記ベクトルを表す情報を、前記類似度基礎情報として生成する、文書間類似度算出方法。
(Appendix 9)
The inter-document similarity calculation method according to appendix 8, comprising
generating, as the similarity basic information, information representing the vector in which the component representing a reference suffix has a value that increases as the number of that reference suffix included in the document increases.

（付記１０）
情報処理装置に、
複数の文書のそれぞれが含む文毎に、当該文を構成する文字の総数をＮにより表した場合に、０からＮ−１までの整数ｉのそれぞれに対する、当該文の先頭からｉ文字を除いた残余の文字列である接尾部を表す情報である接尾部情報を生成する接尾部情報生成手段と、
前記生成された接尾部情報が表す接尾部の中から、複数の文に基づいて生成された接尾部を、基準接尾部として選択する基準接尾部選択手段と、
前記複数の文書のそれぞれに対して、当該文書が前記選択された基準接尾部のそれぞれを含むか否かを表す類似度基礎情報を生成する類似度基礎情報生成手段と、
前記複数の文書のうちの第１の文書に対して前記生成された類似度基礎情報と、当該複数の文書のうちの第２の文書に対して前記生成された類似度基礎情報と、に基づいて、当該第１の文書と当該第２の文書とが類似している程度を表す類似度を算出する類似度算出手段と、
を実現させるための文書間類似度算出プログラム。
(Appendix 10)
An inter-document similarity calculation program for causing an information processing device to implement:
suffix information generating means for generating, for each sentence included in each of a plurality of documents, suffix information representing, for each integer i from 0 to N-1 where N is the total number of characters constituting the sentence, the suffix that is the character string remaining after removing i characters from the head of the sentence;
reference suffix selection means for selecting, as reference suffixes, the suffixes generated from a plurality of sentences from among the suffixes represented by the generated suffix information;
similarity basic information generating means for generating, for each of the plurality of documents, similarity basic information indicating whether or not the document includes each of the selected reference suffixes; and
similarity calculating means for calculating a similarity indicating the degree to which a first document and a second document among the plurality of documents are similar, based on the similarity basic information generated for the first document and the similarity basic information generated for the second document.

（付記１１）
付記１０に記載の文書間類似度算出プログラムであって、
前記類似度基礎情報生成手段は、前記選択された基準接尾部のそれぞれに対して、前記文書が当該基準接尾部を含む場合に当該基準接尾部を表す成分として、正の値を有し、一方、当該文書が当該基準接尾部を含まない場合に当該成分として０を有するベクトルを表す情報を、前記類似度基礎情報として生成するように構成され、
前記類似度算出手段は、前記第１の文書に対して生成された前記類似度基礎情報が表すベクトルと、前記第２の文書に対して生成された前記類似度基礎情報が表すベクトルと、の間の角度が小さくなるほど大きくなる値を前記類似度として算出するように構成された文書間類似度算出プログラム。
(Appendix 11)
The inter-document similarity calculation program according to appendix 10, wherein
the similarity basic information generating means is configured to generate, as the similarity basic information, information representing a vector that, for each of the selected reference suffixes, has a positive value as the component representing the reference suffix when the document includes the reference suffix and has 0 as that component when the document does not include the reference suffix, and
the similarity calculating means is configured to calculate, as the similarity, a value that increases as the angle between the vector represented by the similarity basic information generated for the first document and the vector represented by the similarity basic information generated for the second document decreases.

（付記１２）
付記１１に記載の文書間類似度算出プログラムであって、
前記類似度基礎情報生成手段は、前記文書が含む前記基準接尾部の数が多くなるほど、当該基準接尾部を表す成分として、大きくなる値を有する前記ベクトルを表す情報を、前記類似度基礎情報として生成するように構成された文書間類似度算出プログラム。
(Appendix 12)
The inter-document similarity calculation program according to appendix 11, wherein
the similarity basic information generating means is configured to generate, as the similarity basic information, information representing the vector in which the component representing a reference suffix has a value that increases as the number of that reference suffix included in the document increases.

本発明は、複数の文書が互いに類似している程度を表す類似度を算出する文書間類似度算出装置、及び、複数の文書を分類する文書分類装置等に適用可能である。 The present invention can be applied to an inter-document similarity calculation device that calculates a degree of similarity indicating a degree of similarity between a plurality of documents, a document classification device that classifies a plurality of documents, and the like.

１０文書間類似度算出装置
１１文書情報記憶部
１２接尾部情報生成部
１３基準接尾部選択部
１４類似度基礎情報生成部
１５類似度算出部
１００文書間類似度算出装置
１０１接尾部情報生成部
１０２基準接尾部選択部
１０３類似度基礎情報生成部
１０４類似度算出部
DESCRIPTION OF SYMBOLS
10 Inter-document similarity calculation device
11 Document information storage unit
12 Suffix information generating unit
13 Reference suffix selecting unit
14 Similarity basic information generation unit
15 Similarity calculation unit
100 Inter-document similarity calculation device
101 Suffix information generating unit
102 Reference suffix selecting unit
103 Similarity basic information generation unit
104 Similarity calculation unit

Claims

An inter-document similarity calculation device comprising:
suffix information generating means for generating, for each sentence included in each of a plurality of documents, suffix information representing, for each integer i from 0 to N-1 where N is the total number of characters constituting the sentence, the suffix that is the character string remaining after removing i characters from the head of the sentence;
reference suffix selection means for selecting, as reference suffixes, the suffixes generated from a plurality of sentences from among the suffixes represented by the generated suffix information;
similarity basic information generating means for generating, for each of the plurality of documents, similarity basic information indicating whether or not the document includes each of the selected reference suffixes; and
similarity calculating means for calculating a similarity indicating the degree to which a first document and a second document among the plurality of documents are similar, based on the similarity basic information generated for the first document and the similarity basic information generated for the second document.

The inter-document similarity calculation device according to claim 1, wherein
the similarity basic information generating means is configured to generate, as the similarity basic information, information representing a vector that, for each of the selected reference suffixes, has a positive value as the component representing the reference suffix when the document includes the reference suffix and has 0 as that component when the document does not include the reference suffix, and
the similarity calculating means is configured to calculate, as the similarity, a value that increases as the angle between the vector represented by the similarity basic information generated for the first document and the vector represented by the similarity basic information generated for the second document decreases.

The inter-document similarity calculation device according to claim 2, wherein
the similarity basic information generating means is configured to generate, as the similarity basic information, information representing the vector in which the component representing a reference suffix has a value that increases as the number of that reference suffix included in the document increases.

The inter-document similarity calculation device according to claim 3, wherein
the similarity basic information generating means is configured such that the increment value, by which the value of the component representing a reference suffix is increased each time the number of that reference suffix included in the document increases by 1, becomes larger as the total number of characters constituting the reference suffix increases.

The inter-document similarity calculation device according to any one of claims 1 to 4, wherein
the reference suffix selection means is configured to select, as the reference suffixes, the suffixes generated from a plurality of documents from among the suffixes represented by the generated suffix information.

The inter-document similarity calculation device according to any one of claims 1 to 5, wherein
the suffix information is information representing a suffix tree or a suffix array.

An inter-document similarity calculation method comprising:
generating, for each sentence included in each of a plurality of documents, suffix information representing, for each integer i from 0 to N-1 where N is the total number of characters constituting the sentence, the suffix that is the character string remaining after removing i characters from the head of the sentence;
selecting, as reference suffixes, the suffixes generated from a plurality of sentences from among the suffixes represented by the generated suffix information;
generating, for each of the plurality of documents, similarity basic information indicating whether or not the document includes each of the selected reference suffixes; and
calculating a similarity indicating the degree to which a first document and a second document among the plurality of documents are similar, based on the similarity basic information generated for the first document and the similarity basic information generated for the second document.

The inter-document similarity calculation method according to claim 7, comprising:
generating, as the similarity basic information, information representing a vector that, for each of the selected reference suffixes, has a positive value as the component representing the reference suffix when the document includes the reference suffix and has 0 as that component when the document does not include the reference suffix; and
calculating, as the similarity, a value that increases as the angle between the vector represented by the similarity basic information generated for the first document and the vector represented by the similarity basic information generated for the second document decreases.

The inter-document similarity calculation method according to claim 8, comprising
generating, as the similarity basic information, information representing the vector in which the component representing a reference suffix has a value that increases as the number of that reference suffix included in the document increases.

An inter-document similarity calculation program for causing an information processing device to implement:
suffix information generating means for generating, for each sentence included in each of a plurality of documents, suffix information representing, for each integer i from 0 to N-1 where N is the total number of characters constituting the sentence, the suffix that is the character string remaining after removing i characters from the head of the sentence;
reference suffix selection means for selecting, as reference suffixes, the suffixes generated from a plurality of sentences from among the suffixes represented by the generated suffix information;
similarity basic information generating means for generating, for each of the plurality of documents, similarity basic information indicating whether or not the document includes each of the selected reference suffixes; and
similarity calculating means for calculating a similarity indicating the degree to which a first document and a second document among the plurality of documents are similar, based on the similarity basic information generated for the first document and the similarity basic information generated for the second document.