JP5184438B2

JP5184438B2 - Document signature generation apparatus, document signature generation method, and document signature generation program for detecting similar documents

Info

Publication number: JP5184438B2
Application number: JP2009118477A
Authority: JP
Inventors: 隆広野田; 俊介小長井; 良治片岡
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2009-05-15
Filing date: 2009-05-15
Publication date: 2013-04-17
Anticipated expiration: 2029-05-15
Also published as: JP2010267108A

Description

The present invention relates to a document signature generation technique for detecting similar documents in a large-scale document set, such as one that makes up the index of a Web full-text search engine.

Methods that use document signatures have long been known as a way to search efficiently among the documents in a large-scale document set. For example, Non-Patent Document 1 introduces several methods for creating document signatures. Document signatures are much smaller than the original document set, and searching with them is faster than searching the original document set.

To apply the document signature method to similar-document search, Non-Patent Document 2 introduces a document signature generation technique that extends the "Superimposed Coding" method introduced in Non-Patent Document 1 and takes word frequency into account.

The following shows how an F-bit document signature is generated with the technique introduced in Non-Patent Document 2. First, an F-bit hash value is calculated for each word in the document. Next, each hash value is treated as a bit string and converted into an F-element vector whose elements are in {−1, 1}: the element is 1 at each position where the bit is 1 and −1 at each position where the bit is 0. Each word is thus represented by an F-element vector. The elements at the same position of all the vectors are then summed, and an F-element vector is formed whose element is 1 where the sum is non-negative and 0 where it is negative. The bit string formed from the elements of this vector is the document signature.
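The background procedure described above can be sketched roughly as follows. This is a minimal illustration, not the implementation from Non-Patent Document 2: the function names, the use of MD5, and taking the leading F bits of the digest are all assumptions made for the example.

```python
import hashlib

def word_hash_bits(word, f=64):
    """Return the leading f bits of the word's MD5 digest as a list of 0/1.

    Assumption for illustration: f <= 128 and the leading bits are used.
    """
    digest = hashlib.md5(word.encode("utf-8")).digest()
    value = int.from_bytes(digest, "big") >> (128 - f)
    return [(value >> (f - 1 - i)) & 1 for i in range(f)]

def background_signature(words, f=64):
    """Unweighted signature: every word votes +1 or -1 at each bit position."""
    sums = [0] * f
    for word in words:
        for i, bit in enumerate(word_hash_bits(word, f)):
            sums[i] += 1 if bit == 1 else -1
    # A non-negative sum yields 1, a negative sum yields 0.
    return [1 if s >= 0 else 0 for s in sums]
```

Because the per-position sums do not depend on the order in which the words are added, this sketch returns the same signature for any reordering of the same words.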

Non-Patent Document 1: Faloutsos, C. and Christodoulakis, S. "Description and performance analysis of signature file methods for office filing." ACM Transactions on Office Information Systems, Vol. 5, No. 3, July 1987, pp. 237-257. Retrieved May 1, 2009. <Internet URL: http://doi.acm.org/10.1145/27641.28057>

Non-Patent Document 2: Henzinger, M. "Finding near-duplicate web pages: a large-scale evaluation of algorithms." In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (Seattle, Washington, USA, August 06-11, 2006), SIGIR '06. ACM, New York, NY, pp. 284-291. Retrieved May 1, 2009. <Internet URL: http://doi.acm.org/10.1145/1148170.1148222>

The present invention solves two problems of the background art when it is used as a document signature generation method for similar-document search. The two problems are described below as Problem 1 and Problem 2.

Problem 1 is that the background art treats all words in a document equally, and therefore treats large and small differences alike. Consider the following two pairs of sentences.

(Example 1-a) 私は大阪に行きました。 (I went to Osaka, written with the particle に.)

(Example 1-b) 私は大阪へ行きました。 (I went to Osaka, written with the particle へ.)

(Example 2-a) 私は東京へ行きました。 (I went to Tokyo.)

(Example 2-b) 私は大阪へ行きました。 (I went to Osaka.)

Examples 1-a and 1-b differ in only one word: the particle 「に」 versus 「へ」, both of which mean "to". Likewise, Examples 2-a and 2-b differ in only one word: "Tokyo" (東京) versus "Osaka" (大阪). Because the background art treats every word equally, it makes no distinction between the difference separating 1-a from 1-b and the difference separating 2-a from 2-b. However, Examples 1-a and 1-b are similar documents, whereas Examples 2-a and 2-b are not.

Problem 2 is that the background art does not consider word order within a document, so it generates the same document signature even when the order of the sentences has been swapped. The following two texts are an example.

(Example 3-a) 私は大阪に行きました。その後、東京に行きました。 (I went to Osaka. After that, I went to Tokyo.)

(Example 3-b) 私は東京に行きました。その後、大阪に行きました。 (I went to Tokyo. After that, I went to Osaka.)

Because the background art does not consider word order, it generates the same document signature for Examples 3-a and 3-b. It therefore cannot satisfy a requirement that such documents be distinguished by their document signatures rather than treated as similar documents.

To address Problem 1, the importance of each word is taken into account when the document signature is calculated. Specifically, where Non-Patent Document 2 turns each word into a bit string and then converts it into a vector with elements in {−1, 1}, the elements are instead set to {−w_t, w_t} according to the importance of the word. Here w_t is a numerical value of the importance of each word: larger for more important words and smaller for less important ones.

To address Problem 2, the contribution of each word to the document signature is determined according to its appearance position. Specifically, where Non-Patent Document 2 turns each word into a bit string and then converts it into a vector with elements in {−1, 1}, the elements are instead set to {−w_p, w_p} according to the appearance position of the word. Here w_p is a numerical value of the contribution due to the appearance position (the importance corresponding to the appearance position). Because important information generally tends to be concentrated at the beginning of a document, w_p is made larger for words that appear earlier and smaller for words that appear later.

To solve Problem 1 and Problem 2 at the same time, where Non-Patent Document 2 turns each word into a bit string and then converts it into a vector with elements in {−1, 1}, the elements are instead set to {−w_t w_p, w_t w_p} according to both the importance of the word and its appearance position.
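As a small worked illustration of the weighting idea, the sketch below uses hand-picked 4-bit "hash values" and weights rather than real hashes. All bit strings, weight values, and function names here are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def weighted_vector(bits, w_t=1.0, w_p=1.0):
    """Map a bit string to a vector: bit 1 -> +w_t*w_p, bit 0 -> -w_t*w_p."""
    w = w_t * w_p
    return [w if b == 1 else -w for b in bits]

def sign_bits(vectors):
    """Sum the vectors position by position; non-negative sums give 1, negative sums 0."""
    sums = [sum(column) for column in zip(*vectors)]
    return [1 if s >= 0 else 0 for s in sums]

# Hypothetical bit strings: one important word plus one of two trivial words.
important_bits = [1, 0, 1, 0]
trivial_a_bits = [0, 1, 0, 0]
trivial_b_bits = [0, 1, 1, 1]

# Unweighted (background art): swapping the trivial word changes the signature.
print(sign_bits([weighted_vector(important_bits), weighted_vector(trivial_a_bits)]))  # [1, 1, 1, 0]
print(sign_bits([weighted_vector(important_bits), weighted_vector(trivial_b_bits)]))  # [1, 1, 1, 1]

# Weighted: the important word dominates, so both variants give the same signature.
heavy = weighted_vector(important_bits, w_t=3.0, w_p=1.0)
light_a = weighted_vector(trivial_a_bits, w_t=0.5, w_p=0.5)
light_b = weighted_vector(trivial_b_bits, w_t=0.5, w_p=0.5)
print(sign_bits([heavy, light_a]))  # [1, 0, 1, 0]
print(sign_bits([heavy, light_b]))  # [1, 0, 1, 0]
```

With these particular numbers, the low-weight word can no longer flip the sign of any position, which is the effect Problem 1 calls for.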

The document signature generation apparatus for detecting similar documents according to claim 1 of the present invention comprises: importance calculation means for obtaining an importance w by calculating at least one of an importance w_t of each word included in a document and an importance w_p corresponding to the appearance position of each word; and document signature calculation means for generating, for each word, a vector having {−w, w} as elements according to the importance w of the word obtained by the importance calculation means, obtaining an F-element document signature vector from the generated vectors, and using the document signature vector as document signature information.

(1) According to the present invention, Problem 1 can be solved, so that document signature generation is no longer affected by trivial word differences. In more cases than before, the present invention generates the same document signature for similar documents even where the background art would have generated different signatures because of word differences. Using the document signature of the present invention therefore makes it possible to detect more similar documents than the background art.
(2) According to the present invention, Problem 2 can also be solved, so that differences in word order within a document can be taken into account when the document signature is generated. The background art generates the same document signature for multiple documents that differ only in word order, whereas the present invention generates different document signatures when the word order differs. The background art sometimes falsely detected documents differing only in word order as similar documents; using the document signature of the present invention avoids such false detections.

FIG. 1 is a block diagram showing an embodiment of the document signature generation apparatus of the present invention.
FIG. 2 is a flowchart showing the document signature calculation method in an embodiment of the document signature generation method of the present invention.

Embodiments of the present invention are described below with reference to the drawings, but the present invention is not limited to the following embodiments. FIG. 1 is a block diagram of an embodiment of the document signature generation apparatus of the present invention. In the figure, broken-line arrows indicate the flow of data and solid-line arrows indicate the flow of processing.

In FIG. 1, reference numeral 100 denotes a document set database, which is a storage device that stores the documents subject to similar-document detection.

Reference numeral 200 denotes word statistical information calculation means, which calculates the importance w_t of each word included in each document stored in the document set database 100.

Reference numeral 300 denotes a word importance database, which is a storage device that stores the importance w_t of each word calculated by the word statistical information calculation means 200.

Reference numeral 400 denotes document signature calculation means, which obtains the importance w_p corresponding to the appearance position of each word included in each document stored in the document set database 100 and, based on this importance w_p and the importance w_t stored in the word importance database 300, calculates a document signature for each document in the document set.

Reference numeral 500 denotes a document signature database, which is a storage device that stores the document signature of each document.

The functions of the word statistical information calculation means 200 and the document signature calculation means 400 described below are achieved by, for example, a computer.

The importance calculation means of the present invention is achieved by the word statistical information calculation means 200 obtaining the importance w_t of each word and the document signature calculation means 400 obtaining the importance w_p corresponding to the appearance position of each word.

Next, the operation of the apparatus configured as described above is explained. First, the word statistical information calculation means 200 calculates the importance of words. It reads documents from the document set database 100 and extracts the words contained in each document with a morphological analyzer.

Next, the IDF (inverse document frequency) of each word is calculated, and the IDF is recorded in the word importance database 300 as the importance w_t of each word t (formula (1) below).

Here, idf in formula (1) denotes the IDF of word t, N denotes the total number of documents in the document set, and df_t denotes the number of documents that contain word t.
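Formula (1) itself does not appear in this text. The sketch below therefore assumes the common form w_t = log(N / df_t), which matches the symbols defined above but may differ from the patent's exact formula. The function name and the input format (a list of already-tokenized documents) are also assumptions.

```python
import math

def idf_weights(documents):
    """Return {word: log(N / df_t)} for a list of tokenized documents.

    Assumption: the IDF has the common form log(N / df_t); the patent's
    formula (1) is not reproduced in this text and may differ.
    """
    n = len(documents)
    df = {}
    for words in documents:
        for word in set(words):  # count each document at most once per word
            df[word] = df.get(word, 0) + 1
    return {word: math.log(n / count) for word, count in df.items()}
```

Under this assumed form, a word that occurs in every document receives a weight of 0 and a word that occurs in few documents receives a large weight.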

Next, the document signature calculation means 400 calculates a document signature for each document in the document set according to the flowchart shown in FIG. 2. The document signature calculation means 400 reads documents from the document set database 100 and extracts the words contained in each document with a morphological analyzer (step S1). If a morphological analyzer is not available, n-grams, for example, may be extracted as the words.

Next, an F-bit hash value is calculated for each extracted word using a hash function (step S2). The hash function should satisfy two conditions: similar words do not produce nearby hash values, and hash collisions do not occur easily. Standardized hash functions such as MD5 (Message Digest Algorithm 5), SHA (Secure Hash Algorithm)-1, and SHA-256 satisfy these conditions. Then the word importance of each word is obtained from the word importance database 300.
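Step S2 could be sketched with Python's standard `hashlib` module as follows. The text does not say how an F-bit value is derived when F differs from the digest length, so taking the leading F bits of the digest is an assumption here, as are the function name and the UTF-8 encoding.

```python
import hashlib

def f_bit_hash(word, f=64, algorithm="sha256"):
    """Return an f-bit integer taken from the leading bits of the word's digest.

    Assumption: truncation to the leading f bits; the source only states
    that an F-bit hash value is calculated.
    """
    digest = hashlib.new(algorithm, word.encode("utf-8")).digest()
    total_bits = len(digest) * 8
    if not 0 < f <= total_bits:
        raise ValueError("f must be between 1 and the digest length in bits")
    return int.from_bytes(digest, "big") >> (total_bits - f)
```

For example, `f_bit_hash("東京", 64)` and `f_bit_hash("大阪", 64)` each return an integer below 2**64, and passing `algorithm="md5"` or `algorithm="sha1"` selects the other hash functions named above.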

Further, the importance w_p corresponding to the appearance position p of the word is calculated. In this embodiment, on the assumption that a word is more important the closer it is to the beginning of the document, when a word appears p-th in a document whose length is N words,
w_p = (N − p) / N … (2)
is used as the importance of the appearance position.
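Formula (2) translates directly into a one-line function. The sketch below assumes that p is counted from 1, since the text says "p-th" without stating the index base. The function name is illustrative.

```python
def position_weight(p, n):
    """w_p = (N - p) / N for the p-th word of an N-word document.

    Assumption: p is 1-based, so 1 <= p <= n.
    """
    if n <= 0 or not 1 <= p <= n:
        raise ValueError("expected 1 <= p <= n")
    return (n - p) / n
```

With 1-based positions, the first word of a 4-word document gets 0.75 and the last word gets 0, so the final word does not contribute to the signature under this reading.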

Then, for each word, the hash value of the word is treated as a bit string, and an F-element vector (first vector) is obtained in which the element at each position where the bit is 0 is −w_t w_p and the element at each position where the bit is 1 is w_t w_p (step S3). Here, w_p = 1 when the importance w_p corresponding to the appearance position is not used, and w_t = 1 when the word importance w_t is not used.

Further, the elements at the same position of the F-element vectors are summed to form a new vector (second vector) (step S4), and an F-element document signature vector is obtained whose element is 1 where the sum is non-negative and 0 where it is negative (step S5). Finally, the document signature vector, expressed as an F-bit bit string, is stored in the document signature database 500 as the document signature (step S6).
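Steps S2 to S5 can be put together in a short sketch like the one below. It is an illustration under several assumptions rather than the patented implementation: the input is already tokenized (step S1 is skipped), SHA-256 truncated to its leading F bits serves as the hash, p is 1-based, words missing from the importance table default to w_t = 1, and storage in a database (step S6) is left out. All names are hypothetical.

```python
import hashlib

def document_signature(words, word_importance=None, use_position=True, f=64):
    """Compute an f-bit signature (list of 0/1) for an already-tokenized document.

    word_importance: optional {word: w_t}; None or a missing word means w_t = 1
    (an assumption for this sketch). Requires f <= 256.
    """
    n = len(words)
    sums = [0.0] * f                                   # second vector (step S4)
    for p, word in enumerate(words, start=1):          # p assumed 1-based
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        value = int.from_bytes(digest, "big") >> (256 - f)   # step S2
        w_t = 1.0 if word_importance is None else word_importance.get(word, 1.0)
        w_p = (n - p) / n if use_position else 1.0     # formula (2), or 1 if unused
        w = w_t * w_p
        for i in range(f):                             # step S3 folded into step S4
            bit = (value >> (f - 1 - i)) & 1
            sums[i] += w if bit else -w
    return [1 if s >= 0 else 0 for s in sums]          # step S5
```

Calling `document_signature(words, use_position=False)` with no importance table reduces to the unweighted background method. Passing a table in which one word has a weight larger than the combined weight of all the others makes the signature equal to that word's own hash bits.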

The hash values calculated by the document signature calculation means 400 may, for example, be stored in a memory (not shown) for use.

Needless to say, some or all of the functions of the means in the document signature generation apparatus of this embodiment can be implemented as a computer program, and the present invention can be realized by executing that program on a computer. Likewise, the procedure of the document signature generation method of this embodiment can be implemented as a computer program and executed by a computer. A program for realizing these functions on a computer can be recorded on a computer-readable recording medium, such as an FD (Floppy (registered trademark) Disk), MO (Magneto-Optical disk), ROM (Read Only Memory), memory card, CD (Compact Disk)-ROM, DVD (Digital Versatile Disk)-ROM, CD-R, CD-RW, HDD, or removable disk, and stored or distributed in that form. The program can also be provided over a network, for example via the Internet or electronic mail.

100 ... Document set database
200 ... Word statistical information calculation means
300 ... Word importance database
400 ... Document signature calculation means
500 ... Document signature database

Claims

1. A document signature generation apparatus that generates document signature information for detecting similar documents, comprising:
importance calculation means for obtaining an importance w by calculating at least one of an importance w_t of each word included in a document and an importance w_p corresponding to the appearance position of each word; and
document signature calculation means for generating, for each word, a vector having {−w, w} as elements according to the importance w of the word obtained by the importance calculation means, obtaining an F-element document signature vector from the generated vectors, and using the document signature vector as document signature information.

2. The document signature generation apparatus for detecting similar documents according to claim 1, wherein the document signature calculation means:
calculates an F-bit hash value for each word included in the document;
treats each calculated hash value as a bit string and, based on the importance w_t of each word obtained by the importance calculation means and the importance w_p corresponding to the appearance position of each word, obtains a first vector in which the element at each position where the bit is 0 is −w_t w_p and the element at each position where the bit is 1 is w_t w_p;
calculates the sum of the elements at the same position of the first vectors to obtain a second vector; and
obtains an F-element document signature vector in which each non-negative element of the second vector is set to 1 and each negative element is set to 0.

3. A document signature generation method for generating document signature information for detecting similar documents, comprising:
an importance calculation step in which importance calculation means obtains an importance w by calculating at least one of an importance w_t of each word included in a document and an importance w_p corresponding to the appearance position of each word; and
a document signature calculation step in which document signature calculation means generates, for each word, a vector having {−w, w} as elements according to the importance w of the word obtained by the importance calculation means, obtains an F-element document signature vector from the generated vectors, and uses the document signature vector as document signature information.

4. The document signature generation method for detecting similar documents according to claim 3, wherein the document signature calculation step includes:
a step of calculating an F-bit hash value for each word included in the document;
a step of treating each calculated hash value as a bit string and, based on the importance w_t of each word obtained by the importance calculation means and the importance w_p corresponding to the appearance position of each word, obtaining a first vector in which the element at each position where the bit is 0 is −w_t w_p and the element at each position where the bit is 1 is w_t w_p;
a step of calculating the sum of the elements at the same position of the first vectors to obtain a second vector; and
a step of obtaining an F-element document signature vector in which each non-negative element of the second vector is set to 1 and each negative element is set to 0.

5. A document signature generation program for detecting similar documents, the program causing a computer to function as each of the means according to claim 1.