JP7193000B2

JP7193000B2 - Similar document search method, similar document search program, similar document search device, index information creation method, index information creation program, and index information creation device

Info

Publication number: JP7193000B2
Application number: JP2021541969A
Authority: JP
Inventors: 謙介馬場; 智哉野呂; 茂紀福田; 清司大倉
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2019-08-30
Filing date: 2019-08-30
Publication date: 2022-12-20
Anticipated expiration: 2039-08-30
Also published as: WO2021038887A1; JPWO2021038887A1

Description

The embodiments of the present invention relate to a similar document search method, a similar document search program, a similar document search device, an index information creation method, an index information creation program, and an index information creation device.

Conventionally, one form of natural language processing by computer is search processing that retrieves, from documents stored in a database, documents similar to an input document. For example, sample inquiry documents and the reply documents corresponding to those samples are registered in a database. A conversational interface such as a chatbot can then be built that searches for a sample similar to an input inquiry document and outputs the reply document corresponding to that sample.

One way to retrieve samples similar to the input document from the samples registered in the database is to evaluate how many of the words appearing in two documents are shared. For example, the following search method is conceivable. A plurality of hash functions (sometimes called Min-hash functions), each of which calculates one hash value from the set of words contained in a document, are defined. Each hash function holds a correspondence that maps different words to different values, and outputs as its hash value the minimum of the values corresponding to the words in a given word set. For each sample registered in the database, a vector listing the hash values calculated with these hash functions is generated in advance. A vector is then calculated in the same way from the word set of the input document and the same hash functions, and the samples whose registered vectors approximate it are retrieved.

Japanese Laid-open Patent Publication No. 2018-173909 (JP 2018-173909 A)

However, in the conventional technology described above, when the input document is short, or when the expressions used in documents are diverse and few of them match the samples, the degree of word commonality is evaluated as low and search accuracy decreases. For example, input documents to a chatbot are mainly spoken language, with short sentences and diverse expressions, so the probability that common words appear in both the input document and a sample is low overall. As a result, the accuracy of retrieving samples for an input document tends to be low.

In one aspect, an object is to provide a similar document search method, a similar document search program, a similar document search device, an index information creation method, an index information creation program, and an index information creation device that improve the accuracy of searching for similar documents.

In one proposal, in the similar document search method, a computer executes a generating process, a calculating process, and a searching process. The generating process generates, based on a set of words contained in the search target documents and inter-word information indicating the closeness of meaning between words, a hash function that assigns to each word in the set a value that is closer to the value of a predetermined reference word the closer the word's meaning is to that reference word. The calculating process calculates summary information for each of a plurality of search target documents based on the generated hash function, and calculates summary information for an input document based on the generated hash function. The searching process searches the plurality of search target documents for a document similar to the input document, based on a comparison between the calculated summary information of the search target documents and the summary information of the input document.

In one aspect, the accuracy of searching for similar documents is improved.

FIG. 1 is a block diagram illustrating an example of the functional configuration of an information processing apparatus according to an embodiment.
FIG. 2 is an explanatory diagram illustrating an example of the processing flow of the information processing apparatus according to the embodiment.
FIG. 3 is a flowchart illustrating an example of hash function generation processing.
FIG. 4 is an explanatory diagram for explaining the hash functions.
FIG. 5 is an explanatory diagram for explaining the degree of similarity between words.
FIG. 6 is an explanatory diagram for explaining hash values obtained by Minhash.
FIG. 7 is an explanatory diagram for explaining an overview of the operation of the information processing apparatus according to the embodiment.
FIG. 8 is an explanatory diagram for explaining the narrowing down of search target documents.
FIG. 9 is an explanatory diagram illustrating a display example of an operation screen.
FIG. 10 is a diagram illustrating an example of a computer that executes a program.

Hereinafter, a similar document search method, a similar document search program, a similar document search device, an index information creation method, an index information creation program, and an index information creation device according to embodiments will be described with reference to the drawings. Configurations having the same functions in the embodiments are denoted by the same reference numerals, and overlapping descriptions are omitted. Note that the similar document search method, similar document search program, similar document search device, index information creation method, index information creation program, and index information creation device described in the following embodiments are merely examples and do not limit the embodiments. The following embodiments may also be combined as appropriate to the extent that they do not contradict one another.

[Overview]
This embodiment exemplifies an information processing apparatus that retrieves, from samples registered in a database (hereinafter also referred to as search target documents), documents similar to an input document (hereinafter also referred to as an inquiry document).

This information processing apparatus first performs pre-processing and then performs search processing. In the pre-processing, a hash value is calculated for each search target document using hash functions, and an index structure such as a tree structure (for example, a search tree) for searching the search target documents is created from the calculated hash values.

In the search processing, the hash value of the inquiry document is calculated using the hash functions. Next, the information processing apparatus compares the hash value of the inquiry document with the hash values of the search target documents, and looks for close ones among the hash values of the search target documents represented in the index structure. The information processing apparatus then takes the search target documents whose hash values are close to that of the inquiry document as the search result of similar documents.

A plurality of hash functions for calculating hash values are defined in advance based on the set of words obtained by extracting the words contained in the search target documents registered in the database. Specifically, letting W be the set of words and h be the set of all injections from W to {0, 1, ..., |W|-1} (a plurality of hash functions), the information processing apparatus generates a plurality of hash functions in advance.

The information processing apparatus performs the following processing on the search target documents and the inquiry document to obtain, for each document, a vector of hash values that serves as summary information summarizing the document with the hash functions.
• Randomly select a hash function from h.
• Extract the words contained in the search target document to obtain a set of words.
• Take as the hash value the smallest of the integers that the selected function assigns to those words.
• Obtain a plurality of hash values by repeating the random selection of a function a plurality of times.

Documents that are similar to each other have a high proportion of common words (Jaccard coefficient). For a hash function selected at random from h, the probability that the hash values of two documents match equals their Jaccard coefficient. Therefore, when the hash values of documents are compared to find similar documents, the Jaccard coefficient can be estimated probabilistically from the proportion of matching vector elements, and the Hamming distance between the hash values (for example, the number of mismatches) reflects the closeness (similarity) between the documents.
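The conventional Min-hash scheme described here can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the names `make_minhash_functions`, `signature`, and `estimated_jaccard`, the English word sets, and the choice of k=200 are all hypothetical and do not appear in the source. Each hash function is modeled as a random bijection from the vocabulary onto {0, ..., |W|-1}.

```python
import random

def make_minhash_functions(vocabulary, k, seed=0):
    """Return k random bijections word -> {0, ..., |W|-1}, each stored as a dict."""
    rng = random.Random(seed)
    words = sorted(vocabulary)
    functions = []
    for _ in range(k):
        shuffled = words[:]
        rng.shuffle(shuffled)
        functions.append({word: rank for rank, word in enumerate(shuffled)})
    return functions

def signature(word_set, functions):
    """Vector of hash values: the minimum value among the document's words, per function."""
    return [min(h[w] for w in word_set) for h in functions]

def estimated_jaccard(sig_a, sig_b):
    """Proportion of matching elements, which estimates the Jaccard coefficient."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def jaccard(a, b):
    """Exact Jaccard coefficient of two word sets."""
    return len(a & b) / len(a | b)

# Hypothetical vocabulary and documents (already split into words).
vocabulary = {"cat", "kitty", "meal", "pet food", "flower", "water", "give"}
doc_a = {"cat", "meal", "give"}
doc_b = {"cat", "pet food", "give"}

functions = make_minhash_functions(vocabulary, k=200)
sig_a = signature(doc_a, functions)
sig_b = signature(doc_b, functions)

print(jaccard(doc_a, doc_b))            # exact value: 2 shared words / 4 distinct words = 0.5
print(estimated_jaccard(sig_a, sig_b))  # estimate from matching elements; varies with k and seed
```

Because each function is injective, two documents with no words in common never produce a matching element, which is the limitation the embodiment goes on to address.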

Let n be the number of search target documents, m the (average) number of words per search target document, and k the number of hash functions. The computational cost of calculating the vectors of the search target documents is then O(kmn). The cost of calculating the vector of one inquiry document is O(km), and the cost of a neighborhood search using this vector is O(log(n)). The k hash functions are generated randomly in advance. If k is O(log(n)), the collision probability, that is, the probability that the same hash values are calculated from different search target documents, can be made sufficiently small.

However, when similarity is evaluated by Hamming distance, the magnitude of the difference between corresponding vector elements carries no meaning, and when the elements do not match (when no common words appear), the similarity is 0. Consequently, when few common words appear, the accuracy of similarity search tends to be low.

Therefore, in the present embodiment, the information processing apparatus generates, based on inter-word distance information indicating the closeness of meaning between words, hash functions that assign values in order of closeness of meaning to a predetermined reference word, so that words closer in meaning receive closer values. Specifically, the information processing apparatus extracts the words contained in the search target documents to obtain a set of words, and randomly selects a reference word. Next, based on the inter-word distance information, the information processing apparatus sorts the words in the set in order of closeness of meaning to the reference word, and assigns unique values in the sorted order (for example, integer values that increase along the sorted order). The information processing apparatus generates a plurality of hash functions by repeating this processing.

For example, when the set of words is {cat (猫), meal (ご飯), ..., kitty (にゃんこ), pet food (えさ)} and the reference word is {cat}, the information processing apparatus sorts the words in order of closeness of meaning to {cat}. This yields {cat, kitty, ..., meal, pet food}. To the set of words sorted in this way, the information processing apparatus assigns integer values in the sorted order, such as {cat=0, kitty=1, ..., meal=5, pet food=6}.
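The construction of one such hash function can be sketched in Python as follows. This is a minimal sketch, not the embodiment's implementation: `make_hash_function`, `toy_distance`, and the numeric distances are hypothetical, and only the four words named in the example are used, so the assigned integers differ from the elided seven-word example in the text.

```python
def make_hash_function(vocabulary, reference, distance):
    """Rank every word by its distance to the reference word, nearest first."""
    ordered = sorted(vocabulary, key=lambda word: (distance(word, reference), word))
    return {word: rank for rank, word in enumerate(ordered)}

# Hypothetical semantic distances to the reference word "cat"; real values
# would come from the inter-word distance information.
TOY_DISTANCE_TO_CAT = {"cat": 0.0, "kitty": 0.1, "meal": 0.7, "pet food": 0.8}

def toy_distance(word, reference):
    assert reference == "cat"  # this toy table only covers one reference word
    return TOY_DISTANCE_TO_CAT[word]

h = make_hash_function({"cat", "meal", "kitty", "pet food"}, "cat", toy_distance)
print(h)  # {'cat': 0, 'kitty': 1, 'meal': 2, 'pet food': 3}
```

Breaking ties on the word itself is a design choice of the sketch that keeps the assignment unique and deterministic; the source only requires that the assigned values be unique.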

The hash value produced by a hash function generated in this way is a value that corresponds to the semantic distance (also called the inter-word distance) from the predetermined word. This makes it possible to evaluate similarity not only by the Hamming distance between hash values but also by the Euclidean distance. Therefore, even when no common words appear and the hash values do not match (the elements do not match), the similarity can be evaluated by the Euclidean distance, which improves the accuracy of similarity search.
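A small numeric illustration may help show why the Euclidean distance adds information here. The three vectors are invented for illustration only; they stand for hash-value vectors computed with hash functions of the kind described, where numerically close values indicate semantically close words.

```python
import math

def hamming_distance(sig_a, sig_b):
    """Number of vector elements that differ."""
    return sum(a != b for a, b in zip(sig_a, sig_b))

def euclidean_distance(sig_a, sig_b):
    """Ordinary Euclidean distance between two hash-value vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(sig_a, sig_b)))

# Hypothetical hash-value vectors.
query = [0, 2, 1]
near = [1, 3, 0]   # every element differs from the query, but only by 1
far = [5, 6, 6]    # every element differs from the query by a wide margin

print(hamming_distance(query, near), hamming_distance(query, far))  # 3 3
print(euclidean_distance(query, near) < euclidean_distance(query, far))  # True
```

The Hamming distance treats both candidates as equally dissimilar, while the Euclidean distance ranks `near` ahead of `far`.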

[Configuration example]
FIG. 1 is a block diagram illustrating an example of the functional configuration of the information processing apparatus according to the embodiment. As shown in FIG. 1, the information processing apparatus 100 is an apparatus that, for an input inquiry document 101, finds a similar document 102 from among the search target documents stored in a search target document database 121. For example, a personal computer or the like can be used as the information processing apparatus 100.

The information processing apparatus 100 has an index generation module 111 that performs the pre-processing for creating an index structure, a search module 112 that performs the search processing, and storage units that store various data related to the processing (the search target document database 121, an inter-word distance information storage unit 122, a hash function storage unit 123, and an index storage unit 124). That is, the information processing apparatus 100 is an example of a similar document search device and an index information creation device.

The search target document database 121 is a database in which the search target documents to be searched for the inquiry document 101 are registered. The search target documents in the search target document database 121 may be registered in advance, may be added through dialogue with a user of the information processing apparatus 100 over a conversational interface such as a chatbot, or may be collected automatically from a network.

The inter-word distance information storage unit 122 stores, for each word, inter-word distance information representing the closeness of meaning (inter-word distance) to other words. Specifically, the inter-word distance information is, for example, a function d(v, w) representing the semantic distance (inter-word distance) between each word (v) and another word (w). That is, the inter-word distance information is an example of inter-word information indicating the closeness of meaning between words.

The hash function storage unit 123 stores the plurality of different hash functions generated by the index generation module 111. Each hash function holds a correspondence that associates each word that can appear in the search target documents with a unique integer (in order of closeness of meaning to a predetermined word), accepts a word set, and outputs one hash value. Different hash functions hold different correspondences.

The index storage unit 124 stores an index structure for searching for search target documents similar to the inquiry document 101. The index structure is generated by an index generation unit 132 based on the vectors (summary information) calculated from the search target documents using the plurality of hash functions, and is an example of index information for searching the summary information of each search target document.

For example, a search tree can be used as the index structure. The search tree includes a plurality of nodes (leaf nodes and the nodes leading to the leaf nodes) connected in a tree structure. A leaf node of the search tree points to a search target document. For example, a leaf node includes the vector of a search target document and identification information (for example, a document ID) that identifies that search target document. However, a leaf node does not have to include the vector.

Two child nodes are connected to each node other than a leaf node. Each node other than a leaf node has a threshold for a specific dimension of the vector. When the hash value of that dimension in the input vector is greater than or equal to the threshold, the search proceeds to the right child node; when it is less than the threshold, the search proceeds to the left child node. By tracing the search tree from the root node toward the leaf nodes in this way, a search target document close to a given vector can be found efficiently.
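The descent rule just described can be sketched in Python as follows. This is a minimal sketch under assumptions: the class names `Node` and `Leaf`, the function `descend`, and the hand-built two-level tree with its thresholds and document IDs are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Union

@dataclass
class Leaf:
    doc_ids: List[str]            # identifiers of the search target documents

@dataclass
class Node:
    dimension: int                # which element of the vector to test
    threshold: int                # hash-value threshold for that element
    left: Union["Node", "Leaf"]   # taken when the value is below the threshold
    right: Union["Node", "Leaf"]  # taken when the value is at or above the threshold

def descend(tree, vector):
    """Follow thresholds from the root until a leaf node is reached."""
    node = tree
    while isinstance(node, Node):
        if vector[node.dimension] >= node.threshold:
            node = node.right
        else:
            node = node.left
    return node.doc_ids

# Hypothetical hand-built tree over two-dimensional vectors.
tree = Node(0, 3,
            left=Node(1, 2, left=Leaf(["doc_a"]), right=Leaf(["doc_b"])),
            right=Leaf(["doc_c"]))

print(descend(tree, [1, 4]))  # ['doc_b']
```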

The index generation module 111 has a hash function generation unit 131 and the index generation unit 132.

The hash function generation unit 131 generates a plurality of hash functions based on the search target documents stored in the search target document database 121 and the inter-word distance information stored in the inter-word distance information storage unit 122, and stores the generated hash functions in the hash function storage unit 123.

Specifically, the hash function generation unit 131 extracts the words contained in the search target documents to obtain a set of words, and randomly selects a reference word. Next, the hash function generation unit 131 refers to the inter-word distance information in the inter-word distance information storage unit 122 and sorts the words in the set in order of closeness of meaning to the reference word. The hash function generation unit 131 then generates a hash function by assigning unique values to the words in the set in the sorted order (for example, integer values that increase along the sorted order). The hash function generation unit 131 generates a plurality of hash functions by repeating this hash function generation processing.

For example, suppose that a function d(v, w) representing the semantic distance (inter-word distance) between each word (v) and another word (w) is given as the inter-word distance information. This function d(v, w) can be created in advance with reference to similarities between words, vector representations of words, and the like. The function d(v, w) satisfies the following.
• 0 ≤ d(v, w) for any v, w ∈ W.
• d(w, w) = 0 for any w ∈ W.

As for a hash function h, it is assumed that there is a w ∈ W such that h(u) > h(v) ⇔ d(u, w) > d(v, w) for any u, v ∈ W.

The hash function generation unit 131 randomly selects w from W and sorts all v ∈ W in ascending order of d(v, w). Next, the hash function generation unit 131 assigns the integers 0, 1, 2, ... to the sorted words w, v_1, v_2, .... Note that, instead of assigning integers, the hash function generation unit 131 may use the values of d(v, w) as they are (assuming that there are no duplicate values).

The index generation unit 132 generates an index structure based on the search target documents stored in the search target document database 121 and the hash functions stored in the hash function storage unit 123, and stores the generated index structure in the index storage unit 124.

Specifically, the index generation unit 132 extracts a word set from each search target document and inputs the extracted word set to each of the plurality of hash functions to calculate a vector of hash values, that is, the summary information of the search target document. Next, the index generation unit 132 generates an index structure so that the plurality of vectors corresponding to the plurality of search target documents can be searched efficiently. For example, the index generation unit 132 generates a search tree by repeatedly focusing on one dimension of the vectors and determining a threshold for the hash value of that dimension such that the set of vectors is split in two. In doing so, the index generation unit 132 generates intermediate nodes so that, as far as possible, a single vector is associated with each leaf node of the search tree.
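One way such a search tree could be built is sketched here in Python. The source does not specify how the dimension or the threshold is chosen, so the choices in this sketch are assumptions: dimensions are cycled by depth, the threshold is the median value of the chosen dimension (moved up when it equals the minimum so that both halves are non-empty), and nodes are plain dictionaries. The names `build_tree` and `lookup` and the sample vectors are hypothetical.

```python
def build_tree(items, depth=0):
    """items: list of (doc_id, vector) pairs. Returns a tree of nested dicts."""
    distinct_vectors = {tuple(vector) for _, vector in items}
    if len(distinct_vectors) <= 1:
        # Ideally one vector per leaf; identical vectors share a leaf.
        return {"leaf": [doc_id for doc_id, _ in items]}
    k = len(items[0][1])
    # Assumption: cycle through dimensions, skipping any whose values are all equal.
    for offset in range(k):
        dim = (depth + offset) % k
        values = sorted(vector[dim] for _, vector in items)
        if values[0] != values[-1]:
            break
    # Assumption: split at the median so the set of vectors is divided in two.
    threshold = values[len(values) // 2]
    if threshold == values[0]:
        threshold = next(x for x in values if x > values[0])
    left = [(d, v) for d, v in items if v[dim] < threshold]
    right = [(d, v) for d, v in items if v[dim] >= threshold]
    return {"dim": dim, "threshold": threshold,
            "left": build_tree(left, depth + 1),
            "right": build_tree(right, depth + 1)}

def lookup(tree, vector):
    """Descend from the root: right when value >= threshold, otherwise left."""
    while "leaf" not in tree:
        branch = "right" if vector[tree["dim"]] >= tree["threshold"] else "left"
        tree = tree[branch]
    return tree["leaf"]

# Hypothetical hash-value vectors of three search target documents.
items = [("doc_a", [0, 2, 1]), ("doc_b", [1, 3, 0]), ("doc_c", [5, 6, 6])]
index = build_tree(items)
print(lookup(index, [1, 3, 0]))  # ['doc_b']
```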

The search module 112 has an inquiry reception unit 133, a hash value calculation unit 134, a search unit 135, and an output unit 136.

The inquiry reception unit 133 receives the inquiry document 101. The inquiry reception unit 133 may receive the inquiry document 101 input by the user as a character string, or may convert the audio signal of an inquiry spoken by the user into a character string. The inquiry reception unit 133 may also receive a character string or an audio signal from another information processing apparatus.

The hash value calculation unit 134 generates a vector corresponding to the inquiry document 101, that is, the summary information of the inquiry document 101, based on the plurality of hash functions stored in the hash function storage unit 123. Specifically, the hash value calculation unit 134 extracts a word set from the inquiry document 101 and inputs the extracted word set to each of the plurality of hash functions to calculate a vector of hash values.

Based on the index structure stored in the index storage unit 124 and the vector of the inquiry document 101, the search unit 135 searches for the search target document most similar to the inquiry document 101 by neighborhood search. Specifically, the search target document most similar to the inquiry document 101 is the one with the largest number of dimensions whose hash values match when the vectors are compared.

For example, the search unit 135 traces the search tree stored in the index storage unit 124 from the root node toward the leaf nodes while comparing the hash value of a specific dimension in the vector of the inquiry document 101 with the threshold at each node, and reaches a specific leaf node. The search unit 135 selects the search target document corresponding to the reached leaf node.

Note that when two or more search target documents are associated with the reached leaf node, that is, when the number of dimensions with matching hash values is the same and the search tree cannot narrow the search target documents down to one, the search unit 135 compares the hash values to obtain Euclidean distances. The search unit 135 then takes the document with the smaller Euclidean distance as the most similar search target document.
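The selection rule, preferring more matching dimensions and breaking ties by Euclidean distance, can be sketched in Python as follows. The function name `most_similar` and the candidate vectors are hypothetical, and the sketch assumes the candidates have already been collected (for example, from one leaf node).

```python
import math

def most_similar(query, candidates):
    """candidates: list of (doc_id, vector) pairs.

    More matching elements wins; the smaller Euclidean distance breaks ties.
    """
    def sort_key(item):
        _, vector = item
        matches = sum(q == x for q, x in zip(query, vector))
        return (-matches, math.dist(query, vector))
    return min(candidates, key=sort_key)[0]

# Hypothetical hash-value vectors.
query = [0, 2, 1]
tied = [("doc_b", [0, 3, 0]), ("doc_c", [0, 6, 6])]  # one matching element each

print(most_similar(query, tied))                            # doc_b
print(most_similar(query, tied + [("doc_d", [0, 2, 9])]))   # doc_d
```

In the second call `doc_d` wins because it matches in two dimensions, even though its Euclidean distance to the query is the largest of the three.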

The output unit 136 outputs the retrieved search target document as the similar document 102. For example, the output unit 136 may display the character string of the similar document 102 on a display or the like, or may convert the similar document 102 into an audio signal and play it through a speaker. The output unit 136 may also transmit the character string or audio signal of the similar document 102 to another information processing apparatus.

Instead of outputting the similar document 102, the output unit 136 may execute processing that is associated in advance with the retrieved search target document in the search target document database 121. Specifically, it is assumed that predetermined processing (for example, schedule registration or mail transmission) is registered in the search target document database 121 for each search target document. The output unit 136 reads the processing associated with the retrieved search target document from the search target document database 121 and executes it, which makes it possible to perform processing that corresponds to the inquiry document 101.

[Operation example]
FIG. 2 is an explanatory diagram illustrating an example of the processing flow of the information processing apparatus according to the embodiment. As shown in FIG. 2, the information processing apparatus 100 performs pre-processing (S1) that creates the index structure and search processing (S2) that retrieves and outputs the similar document 102 for the inquiry document 101.

First, the pre-processing (S1) will be described. In the pre-processing (S1), search target documents are first read from the search target document database 121 and input to the hash function generation unit 131 (S11).

The hash function generation unit 131 receives the inter-word distances from the inter-word distance information storage unit 122 (S13), and generates a plurality of hash functions 123a based on the input search target documents and the inter-word distances (S12).

FIG. 3 is a flowchart illustrating an example of hash function generation processing. As shown in FIG. 3, the hash function generation unit 131 receives the search target documents as input (S31) and extracts the words contained in the search target documents (appearing words) (S32) to obtain a set of words (word set) (S33).

Next, the hash function generation unit 131 generates the plurality of hash functions 123a by repeating the processing of S34 to S39 k times, where k is the number of hash functions to be generated.

Specifically, the hash function generation unit 131 randomly selects one word from the word set (S35), receives as input the inter-word distance information indicating the distances between words (S36), and looks up the distances between the selected word and the other words in the inter-word distance information (S37). Next, the hash function generation unit 131 sorts the words of the word set in ascending order of distance from the selected word (S38), and assigns to each sorted word, as its hash value, an integer value that increases along the sorted order (S39).

The hash function generation unit 131 then outputs the plurality of hash functions 123a obtained by repeating the processing of S34 to S39 k times and stores them in the index generation unit 132 (S40).
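The generation loop of S34 to S39 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the function name `generate_hash_functions`, the `coords` table standing in for the inter-word distance information, the fixed random seed, and the choice to start the assigned integers at 0 are all hypothetical.

```python
import random

def generate_hash_functions(word_set, distance, k, seed=0):
    """Build k tables mapping each word to an integer rank (S34 to S39)."""
    rng = random.Random(seed)   # seeded only so that the sketch is repeatable
    words = sorted(word_set)    # fixed order so that ties break deterministically
    tables = []
    for _ in range(k):
        reference = rng.choice(words)                                  # S35
        ranked = sorted(words, key=lambda w: distance(reference, w))   # S37, S38
        tables.append({w: rank for rank, w in enumerate(ranked)})      # S39
    return tables

# Hypothetical one-dimensional "meaning" coordinates standing in for the
# inter-word distance information; real distances would come from elsewhere.
coords = {"neko": 0.0, "nyanko": 0.1, "gohan": 3.0,
          "esa": 3.2, "hana": 8.0, "mizu": 7.0}

def distance(a, b):
    return abs(coords[a] - coords[b])

tables = generate_hash_functions(coords.keys(), distance, k=6)
```

Each table plays the role of one hash function: the reference word receives the smallest value, and values grow with distance from it.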

FIG. 4 is an explanatory diagram of the hash functions 123a. As shown in FIG. 4, each of h_1, h_2, ... in the hash functions 123a is one hash function.

For example, h_1 uses 猫 (neko, "cat") as its reference word, and each word in the word set is assigned an integer according to its inter-word distance from 猫. Likewise, h_2 uses ごはん (gohan, "food"), h_3 uses にゃんこ (nyanko, "kitty"), h_4 uses えさ (esa, "feed"), h_5 uses 花 (hana, "flower"), and h_6 uses 水 (mizu, "water") as its reference word, and in each case every word in the word set is assigned an integer according to its inter-word distance from that reference word.

Here, the case of obtaining the hash values of the following documents A to C with the hash functions 123a illustrated in FIG. 4 is described.
Document A: 「猫にごはんをやる」 ("Give the cat its food")
Document B: 「にゃんこにえさをやる」 ("Give the kitty its feed")
Document C: 「花に水をやる」 ("Water the flowers")

The word set of document A is {猫, ごはん}, that of document B is {にゃんこ, えさ}, and that of document C is {花, 水}. Each digit of a document's hash value is the minimum value that the corresponding hash function h_1 to h_6 assigns to the words of that document. With the hash functions 123a illustrated in FIG. 4, the hash value H_A of document A is therefore H_A = 001133. Similarly, the hash value H_B of document B is H_B = 110022, and the hash value H_C of document C is H_C = 424200.
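The calculation of these hash values can be illustrated with a short Python sketch. The contents of FIG. 4 are not reproduced in the text, so the rank orders in `ORDERS` below are hypothetical: they were chosen only so that the results agree with the values of H_A, H_B, and H_C quoted above. The romanized keys and the function name `signature` are likewise assumptions.

```python
# Hypothetical rank orders (closest first) for h_1 to h_6. They are NOT taken
# from FIG. 4; they are one assignment consistent with the quoted hash values.
ORDERS = [
    ["neko", "nyanko", "gohan", "esa", "hana", "mizu"],   # h_1, reference: neko
    ["gohan", "esa", "mizu", "hana", "neko", "nyanko"],   # h_2, reference: gohan
    ["nyanko", "neko", "esa", "gohan", "hana", "mizu"],   # h_3, reference: nyanko
    ["esa", "gohan", "mizu", "hana", "neko", "nyanko"],   # h_4, reference: esa
    ["hana", "mizu", "esa", "gohan", "neko", "nyanko"],   # h_5, reference: hana
    ["mizu", "hana", "esa", "gohan", "nyanko", "neko"],   # h_6, reference: mizu
]
# Each table maps a word to its position in the corresponding order.
TABLES = [{word: rank for rank, word in enumerate(order)} for order in ORDERS]

def signature(word_set, tables=TABLES):
    """Return one value per hash function: the minimum over the document's words."""
    return [min(table[word] for word in word_set) for table in tables]

H_A = signature({"neko", "gohan"})    # document A
H_B = signature({"nyanko", "esa"})    # document B
H_C = signature({"hana", "mizu"})     # document C
```

With these tables, `H_A`, `H_B`, and `H_C` evaluate to `[0, 0, 1, 1, 3, 3]`, `[1, 1, 0, 0, 2, 2]`, and `[4, 2, 4, 2, 0, 0]`.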

Comparing the hash values of documents A to C, the Hamming distances are 6 for (H_A, H_B), 6 for (H_A, H_C), and 6 for (H_B, H_C). The Euclidean distances are 1 for (H_A, H_B), 6.9 for (H_A, H_C), and 6.2 for (H_B, H_C). Thus, even where similarity is hard to judge by the Hamming distance (when the documents share no common word), the Euclidean distance makes it possible to identify documents that are similar to each other (documents A and B in the illustrated example).
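The two distance measures can be checked with a few lines of Python. This is a minimal sketch using the standard definitions (the number of differing positions, and the L2 norm of the difference); the function names are illustrative. With these definitions the Hamming distances and the A to C and B to C Euclidean distances agree with the figures above, while A to B evaluates to the square root of 6, about 2.4, rather than 1. The source does not say how its figure was derived, and either way A and B remain the closest pair.

```python
import math

H_A = [0, 0, 1, 1, 3, 3]
H_B = [1, 1, 0, 0, 2, 2]
H_C = [4, 2, 4, 2, 0, 0]

def hamming(u, v):
    """Number of positions at which the two vectors differ."""
    return sum(a != b for a, b in zip(u, v))

def euclidean(u, v):
    """Standard L2 distance between the two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

pairs = {"A-B": (H_A, H_B), "A-C": (H_A, H_C), "B-C": (H_B, H_C)}
hamming_distances = {name: hamming(u, v) for name, (u, v) in pairs.items()}
euclidean_distances = {name: euclidean(u, v) for name, (u, v) in pairs.items()}
```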

Returning to FIG. 2, after S12 the index generation unit 132 generates an index structure 124a based on the input search target documents and the generated hash functions 123a (S14), and stores the generated index structure 124a in the index storage unit 124.

In the search processing (S2), the inquiry document 101 received by the inquiry reception unit 133 is input to the hash value calculation unit 134 (S21). Based on the plurality of hash functions stored in the hash function storage unit 123, the hash value calculation unit 134 generates a plurality of hash values for the input inquiry document 101 (S22), thereby obtaining a vector corresponding to the inquiry document 101.

Next, the search unit 135 compares the hash values in the vector of each search target document in the index structure stored in the index storage unit 124 with those in the vector of the inquiry document 101 (S23), and retrieves the search target document most similar to the inquiry document 101. The output unit 136 outputs the similar document 102 retrieved by the search unit 135 (S24).
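The flow from S14 through S23 can be sketched as follows. This is a simplified illustration, not the index structure 124a itself, whose form the text does not specify: a plain dictionary scanned linearly stands in for it. The two rank tables, the document ids, the romanized words, and the rule of ignoring query words absent from a table are all assumptions made for the example.

```python
import math

def signature(word_set, tables):
    """One value per table: the minimum rank among the words the table knows."""
    # Assumption: words missing from a table are skipped; if none are known,
    # the largest possible value is used.
    return [min((t[w] for w in word_set if w in t), default=len(t)) for t in tables]

def build_index(documents, tables):
    """S14: map each document id to its vector of hash values."""
    return {doc_id: signature(words, tables) for doc_id, words in documents.items()}

def search(query_words, index, tables):
    """S22 and S23: hash the query, then return the id of the closest vector."""
    q = signature(query_words, tables)
    return min(index, key=lambda doc_id: math.dist(q, index[doc_id]))

# Hypothetical rank tables: one referenced on "kaigi", one on "chosei".
tables = [
    {"kaigi": 0, "uchiawase": 1, "schedule": 2, "chosei": 3},
    {"chosei": 0, "schedule": 1, "uchiawase": 2, "kaigi": 3},
]
documents = {
    "schedule": {"schedule", "chosei"},
    "meeting": {"uchiawase", "chosei"},
}
index = build_index(documents, tables)
best = search({"kaigi", "chosei"}, index, tables)
```

Here the query vector is `[0, 0]`, the indexed vectors are `[2, 0]` and `[1, 0]`, and `best` is `"meeting"`.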

FIG. 5 is an explanatory diagram of the similarity between words. Specifically, FIG. 5 is an overhead view of how words W1 to W6 are arranged in a high-dimensional space representing inter-word similarity. Each of the words W1 to W6 in FIG. 5 is a word contained in a document. Words W1 to W3 are similar to one another, for example words related to 猫 ("cat"), and form the cluster indicated by a dotted line. Likewise, words W4 to W6 are similar to one another, for example words related to 犬 ("dog"), and form a cluster separate from that of words W1 to W3.

As FIG. 5 shows, in a high-dimensional space representing inter-word similarity it is difficult to evaluate similarity well with a simple projection. For example, under orthogonal projection onto axis A1, similar words (words W1 to W3, or words W4 to W6) take close values. Under orthogonal projection onto a different axis A2, however, dissimilar words (for example, W1 and W4) may end up closer together than similar words (for example, W1 and W3).
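The effect of the projection axis can be reproduced with a small numeric example. The two-dimensional coordinates below are hypothetical and chosen purely for illustration (the space in FIG. 5 is high-dimensional and its coordinates are not given); the helper names are also assumptions.

```python
# Hypothetical 2-D positions: W1 and W3 belong to one cluster, W4 to another.
points = {"W1": (1.0, 4.0), "W3": (1.5, 6.0), "W4": (5.0, 4.2)}

A1 = (1.0, 0.0)   # unit vector along the first coordinate
A2 = (0.0, 1.0)   # unit vector along the second coordinate

def project(point, axis):
    """Orthogonal projection of a point onto a unit-length axis (dot product)."""
    return sum(p * a for p, a in zip(point, axis))

def gap(a, b, axis):
    """Distance between two words after projection onto the axis."""
    return abs(project(points[a], axis) - project(points[b], axis))

similar_closer_on_a1 = gap("W1", "W3", A1) < gap("W1", "W4", A1)
dissimilar_closer_on_a2 = gap("W1", "W4", A2) < gap("W1", "W3", A2)
```

On A1 the similar pair W1 and W3 is closer, whereas on A2 the dissimilar pair W1 and W4 is closer, so both flags are true for these coordinates.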

In the present embodiment, a reference word is selected at random and hash values based on MinHash are used, so the projection used is one based on distance from the reference word (the reference point).

FIG. 6 is an explanatory diagram of hash values based on MinHash. Here, word W1 is 猫 (neko, "cat"), W2 is にゃんこ (nyanko, "kitty"), W3 is キャット (kyatto, "cat"), W4 is 犬 (inu, "dog"), W5 is 鼠 (nezumi, "mouse"), W6 is ドッグ (doggu, "dog"), and W7 is わんこ (wanko, "doggy"). The reference point is word W1.

As shown in FIG. 6, with word W1 as the reference point, the hash function 123a assigns the values 1: 猫, 2: にゃんこ, 3: キャット, 4: 犬, 5: 鼠, 6: ドッグ, 7: わんこ. Because the values reflect distance from the reference point, the ordering by similarity is preserved better the closer a word is to the reference point (for example, 1: 猫, 2: にゃんこ, 3: キャット). Far from the reference point, however (for example, 5: 鼠, 6: ドッグ), dissimilar words may receive close values.

In the present embodiment, the minimum among the hash values of the word set is used (MinHash), so the projection preserves the ordering by similarity and is suitable for expressing the similarity of words.

FIG. 7 is an explanatory diagram outlining the operation of the information processing apparatus 100 according to the embodiment. As shown in FIG. 7, in the preprocessing (S1), the information processing apparatus 100 uses the inter-word similarity given by the inter-word distance information to generate, for the set of words contained in the search target documents 121a of the search target document database 121, a plurality of hash functions, each of which takes a predetermined word as a reference and assigns closer values to words closer in meaning. The information processing apparatus 100 then converts each search target document 121a by MinHash using the generated hash functions, generates an index structure for searching the resulting vectors, and stores it in the index storage unit 124.

In the search processing (S2), the information processing apparatus 100 converts the input inquiry document 101 by MinHash using the same plurality of hash functions generated in S1. It then compares the vector of hash values calculated from the inquiry document 101 with the vectors of hash values stored in the index storage unit 124, thereby retrieving the search target document 121a most similar to the inquiry document 101.

In the illustrated example, for the inquiry document 101 「会議を調整したい」 ("I want to arrange a conference"), the search target documents 121a 「スケジュールの調整」 ("schedule arrangement") and 「打ち合わせの調整」 ("meeting arrangement") both match it in the hash value for the word 調整 ("arrangement"), so both are close in Hamming distance. With respect to the word 会議 ("conference") in the inquiry document 101, however, 打ち合わせ ("meeting") is closer in meaning than スケジュール ("schedule") and is therefore closer in Euclidean distance. Consequently, for the inquiry document 101 「会議を調整したい」, the similar document 102 obtained is 「打ち合わせを調整したい」 ("I want to arrange a meeting").

The output unit 136 may output the single most similar document 102 among the retrieved search target documents, or may output a plurality of similar documents 102 whose similarity, obtained from the Hamming distance and the Euclidean distance, is equal to or greater than a predetermined threshold.
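The two output modes can be sketched as a single selection function. This is an illustrative sketch only: the function name `select_results`, the use of a dictionary of precomputed similarity scores, and the sample scores are assumptions, and how the similarity itself is computed is left outside the example.

```python
def select_results(scores, threshold=None):
    """Return the best document id, or every id whose score meets the threshold."""
    if threshold is None:
        # Single-result mode: only the most similar document.
        return [max(scores, key=scores.get)]
    # Threshold mode: all documents at or above the threshold, best first.
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [doc_id for doc_id in ranked if scores[doc_id] >= threshold]

scores = {"doc_a": 0.9, "doc_b": 0.4, "doc_c": 0.7}   # hypothetical similarities
top_one = select_results(scores)
above_half = select_results(scores, threshold=0.5)
```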

FIG. 8 is an explanatory diagram of the narrowing down of search target documents. Specifically, FIG. 8 is a graph showing an example of the relationship between the similarity threshold and the number of hits among the search target documents 121a.

Graph 10 shows the relationship between a threshold on the similarity between the inquiry document 101 and the search target documents 121a, and the number of search target documents 121a whose similarity exceeds the threshold (the number of hits). The similarity combines the number of dimensions in which the hash values of the inquiry vector and the search target vector match with the Euclidean distance between the hash values.
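The text does not give the formula by which the match count and the Euclidean distance are combined. The sketch below shows one possible combination, stated here purely as an assumption: the number of matching dimensions plus a term in (0, 1] that shrinks as the Euclidean distance grows, so that the match count dominates and the distance refines the score between integer steps. The function name `similarity` is hypothetical.

```python
import math

def similarity(u, v):
    """Hypothetical score: matching dimensions plus a distance-based refinement."""
    matches = sum(a == b for a, b in zip(u, v))      # dimensions with equal hash values
    return matches + 1.0 / (1.0 + math.dist(u, v))   # refinement term lies in (0, 1]

identical = similarity([0, 0], [0, 0])   # 2 matches, distance 0
near = similarity([0, 0], [1, 0])        # 1 match, distance 1
far = similarity([0, 0], [2, 0])         # 1 match, distance 2
```

Under this assumed formula, vectors with the same number of matching dimensions are still ranked by how far apart their remaining values are.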

(A) With a method that calculates vectors without considering related words, the similarity between the inquiry document 101 and each search target document 121a is calculated as low overall. The relationship between the similarity threshold and the number of hits therefore follows curve 11. That is, even with a low similarity threshold the number of hits is small, and many similar search target documents 121a are missed.

(B) With a method that calculates vectors by treating related words as identical, the similarity between the inquiry document 101 and each search target document 121a is calculated as high overall. The relationship between the similarity threshold and the number of hits therefore follows curve 12. That is, even with a high similarity threshold the number of hits is large, and it is difficult to narrow down the similar search target documents 121a efficiently.

(C) With the method of the present embodiment, the similarity between the inquiry document 101 and each search target document 121a takes the Euclidean distance into account. The relationship between the similarity threshold and the number of hits is therefore continuous, as in curve 13. As a result, the search target documents 121a similar to the inquiry document 101 can be narrowed down efficiently.

FIG. 9 is an explanatory diagram showing a display example of the operation screen. The operation screen 20 is presented to the user of the information processing apparatus 100, for example in a dialogue interface such as a chatbot, and accepts various operations from the user. In the operation screen 20, the display area 21 displays processing results and the like, and the input area 22 is used to enter documents and the like. For example, the information processing apparatus 100 searches for the similar document 102 for the inquiry document 101 entered in the input area 22 and produces output in the display area 21 according to the search result.

For example, the information processing apparatus 100 searches for the similar document 102 using the input document 21a 「スケジュール調整したい」 ("I want to arrange a schedule") as the inquiry document 101, and displays the search result 21b in the display area 21. Specifically, for that inquiry document 101, the schedule registration processing associated with 「スケジュールの調整」 ("schedule arrangement"), which was obtained from the search target documents 121a as the similar document 102, is executed.

As with the dialogue interface such as a chatbot in the example of FIG. 9, the information processing apparatus 100 can search for the similar document 102 appropriately even when the inquiry document 101 is mainly spoken language, each sentence is short, and the expressions vary widely.

[Effects]
As described above, the information processing apparatus 100 includes the hash function generation unit 131, the index generation unit 132, the hash value calculation unit 134, and the search unit 135. Based on the set of words contained in the search target documents 121a and inter-word distance information indicating the closeness in meaning of the words, the hash function generation unit 131 generates a hash function that takes a predetermined word as a reference and assigns each word in the set a value that is closer the closer the word is in meaning to that reference word. The index generation unit 132 calculates summary information for each of the plurality of search target documents 121a based on the generated hash function. The hash value calculation unit 134 calculates summary information for the input document (the inquiry document 101) based on the generated hash function. The search unit 135 searches the plurality of search target documents 121a for a document similar to the input document, based on a comparison between the calculated summary information of the search target documents 121a and the summary information of the inquiry document 101.

Consequently, in the information processing apparatus 100, the hash values contained in the summary information of the search target documents 121a and of the inquiry document 101, both calculated with the generated hash function, reflect distance in meaning from the predetermined word. When comparing the summary information of a search target document 121a with that of the inquiry document 101, the information processing apparatus 100 can therefore assess the degree of similarity using not only, for example, the Hamming distance between the two but also the Euclidean distance, which improves the accuracy of searching for similar documents.

Each of the index generation unit 132 and the hash value calculation unit 134 calculates, as the hash value, the minimum among the values that the generated hash function assigns to the words in the set of words contained in the search target document 121a or the inquiry document 101. In this way, the information processing apparatus 100 can calculate the summary information of the search target document 121a or the inquiry document 101 with a Min-hash function.

The hash function generation unit 131 generates a plurality of hash functions by repeatedly reselecting the predetermined word and generating a hash function. Each of the index generation unit 132 and the hash value calculation unit 134 then calculates, as the summary information, a vector containing the plurality of hash values calculated by the generated hash functions from the set of words contained in the search target document 121a or the inquiry document 101. As a result, when comparing the summary information of a search target document 121a with that of the inquiry document 101, the information processing apparatus 100 can assess the degree of similarity by comparing vectors that list a plurality of hash values, for example by obtaining the Hamming distance or the Euclidean distance.

The index generation unit 132 of the information processing apparatus 100 generates index information that associates the calculated summary information of each of the plurality of search target documents 121a with that search target document 121a. The search unit 135 compares the summary information of the inquiry document 101 with the summary information associated with the search target documents 121a in the index information. By generating the index information in advance, the information processing apparatus 100 can quickly search the plurality of search target documents 121a for a document similar to the inquiry document 101.

[Others]
The components of each illustrated device need not be physically configured as illustrated. That is, the specific form of distribution and integration of the devices is not limited to that illustrated, and all or part of them may be functionally or physically distributed or integrated in arbitrary units according to various loads, usage conditions, and the like.

For example, although the present embodiment illustrates the information processing apparatus 100 having both the index generation module 111 and the search module 112, the index generation module 111 and the search module 112 may be provided in different information processing apparatuses. That is, the preprocessing (S1) and the search processing (S2) may be performed by different information processing apparatuses.
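When the preprocessing and the search run on different apparatuses, the hash functions and the index produced by the former have to reach the latter in some form. The text does not specify how; the sketch below assumes, purely for illustration, that both are exchanged as a JSON document. The function names and the field names `hash_functions` and `index` are hypothetical.

```python
import json

def export_artifacts(tables, index):
    """Indexing side: serialize the rank tables and the document vectors."""
    return json.dumps({"hash_functions": tables, "index": index}, ensure_ascii=False)

def import_artifacts(payload):
    """Search side: restore the rank tables and the document vectors."""
    data = json.loads(payload)
    return data["hash_functions"], data["index"]

# Hypothetical artifacts produced by the preprocessing (S1).
tables = [{"neko": 0, "nyanko": 1}, {"nyanko": 0, "neko": 1}]
index = {"doc_a": [0, 1], "doc_b": [1, 0]}

payload = export_artifacts(tables, index)
restored_tables, restored_index = import_artifacts(payload)
```

The search side must use exactly the same tables as the indexing side, since vectors computed with different hash functions are not comparable.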

All or any part of the various processing functions performed by the information processing apparatus 100 may be executed on a CPU (or a microcomputer such as an MPU or MCU (Micro Controller Unit)). All or any part of these functions may also be executed in a program analyzed and executed by a CPU (or a microcomputer such as an MPU or MCU), or in hardware using wired logic. The processing functions performed by the information processing apparatus 100 may also be executed by a plurality of computers cooperating through cloud computing.

The various processes described in the above embodiment can be realized by executing a program prepared in advance on a computer. An example of a computer (hardware) that executes a program having the same functions as the above embodiment is described below. FIG. 10 is a diagram illustrating an example of a computer that executes the program.

As shown in FIG. 10, the computer 1 includes a CPU 201 that executes various arithmetic processes, an input device 202 that accepts data input, a monitor 203, and a speaker 204. The computer 1 also includes a medium reading device 205 that reads programs and the like from a storage medium, an interface device 206 for connecting to various devices, and a communication device 207 for wired or wireless communication with external devices. The computer 1 further includes a RAM 208 that temporarily stores various information, and a hard disk device 209. The units (201 to 209) in the computer 1 are connected to a bus 210.

The hard disk device 209 stores a program 211 for executing the various processes described in the above embodiment. The hard disk device 209 also stores various data 212 referenced by the program 211 (for example, the information of the inter-word distance information storage unit 122, the search target document database 121, the hash function storage unit 123, and the index storage unit 124). The input device 202 accepts, for example, input of operation information from the operator of the computer 1. The monitor 203 displays, for example, various screens operated by the operator. The interface device 206 is connected to, for example, a printing device. The communication device 207 is connected to a communication network such as a LAN (Local Area Network) and exchanges various information with external devices over the network.

The CPU 201 reads the program 211 stored in the hard disk device 209, loads it into the RAM 208, and executes it, thereby performing the various processes of the hash function generation unit 131, the index generation unit 132, the inquiry reception unit 133, the hash value calculation unit 134, the search unit 135, and the output unit 136. The program 211 need not be stored in the hard disk device 209. For example, the computer 1 may read and execute the program 211 stored on a storage medium readable by the computer 1. Such a storage medium is, for example, a portable recording medium such as a CD-ROM, a DVD, or a USB (Universal Serial Bus) memory, a semiconductor memory such as a flash memory, or a hard disk drive. Alternatively, the program 211 may be stored in a device connected to a public line, the Internet, a LAN, or the like, and the computer 1 may read the program from that device and execute it.

Reference Signs List
1 … Computer
10 … Graph
11 to 13 … Curves
20 … Operation screen
21 … Display area
21a … Input document
21b … Search result
22 … Input area
100 … Information processing apparatus
101 … Inquiry document
102 … Similar document
111 … Index generation module
112 … Search module
121 … Search target document database
121a … Search target document
122 … Inter-word distance information storage unit
123 … Hash function storage unit
123a … Hash function
124 … Index storage unit
124a … Index structure
131 … Hash function generation unit
132 … Index generation unit
133 … Inquiry reception unit
134 … Hash value calculation unit
135 … Search unit
136 … Output unit
201 … CPU
202 … Input device
203 … Monitor
204 … Speaker
205 … Medium reading device
206 … Interface device
207 … Communication device
208 … RAM
209 … Hard disk device
210 … Bus
211 … Program
212 … Various data
A1, A2 … Axes
W1 to W7 … Words

Claims

A similar document search method characterized in that a computer executes processing comprising:
generating, based on a set of words contained in search target documents and inter-word information indicating the closeness in meaning of the words, a hash function that takes a predetermined word as a reference and assigns each word in the set a value that is closer the closer the word is in meaning to that reference word;
calculating summary information for each of a plurality of the search target documents based on the generated hash function;
calculating summary information for an input document based on the generated hash function; and
searching the plurality of search target documents for a document similar to the input document, based on a comparison between the calculated summary information of the search target documents and the summary information of the input document.

The similar document search method according to claim 1, characterized in that each of the calculating processes calculates, as a hash value, the minimum among the values that the generated hash function assigns to the words in the set of words contained in the search target document or the input document.

The similar document search method according to claim 2, characterized in that:
the generating process generates a plurality of hash functions by repeatedly reselecting the predetermined word and generating the hash function; and
each of the calculating processes calculates, as the summary information, a vector containing a plurality of hash values calculated by the generated plurality of hash functions from the set of words contained in the search target document or the input document.

The similar document search method according to claim 1, characterized in that:
the computer further executes processing of generating index information that associates the calculated summary information of each of the plurality of search target documents with that search target document; and
the searching process compares the summary information of the input document with the summary information associated with the search target documents in the index information.

A similar document search program characterized by causing a computer to execute processing comprising:
generating, based on a set of words contained in search target documents and inter-word information indicating the closeness in meaning of the words, a hash function that takes a predetermined word as a reference and assigns each word in the set a value that is closer the closer the word is in meaning to that reference word;
calculating summary information for each of a plurality of the search target documents based on the generated hash function;
calculating summary information for an input document based on the generated hash function; and
searching the plurality of search target documents for a document similar to the input document, based on a comparison between the calculated summary information of the search target documents and the summary information of the input document.

A similar document search device characterized by comprising:
a hash function generation unit that generates, based on a set of words contained in search target documents and inter-word information indicating the closeness in meaning of the words, a hash function that takes a predetermined word as a reference and assigns each word in the set a value that is closer the closer the word is in meaning to that reference word;
a first calculation unit that calculates summary information for each of a plurality of the search target documents based on the generated hash function;
a second calculation unit that calculates summary information for an input document based on the generated hash function; and
a search unit that searches the plurality of search target documents for a document similar to the input document, based on a comparison between the calculated summary information of the search target documents and the summary information of the input document.

An index information creation method in which a computer executes processing comprising:
generating, based on a set of words included in search target documents and inter-word information indicating the closeness in meaning between words, a hash function that assigns a value to each word included in the set of words such that, relative to a predetermined reference word, words closer in meaning to the reference word are assigned closer values;
calculating summary information for each of a plurality of search target documents based on the generated hash function; and
generating index information for searching the calculated summary information of each of the plurality of search target documents.
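The index creation step can be sketched as follows. This is only an illustration under stated assumptions: the hash functions are given as precomputed word-to-value tables (hypothetical values), the summary information is a tuple of per-function minimum values, and the index is a plain dictionary from summary tuple to document identifiers. The claims do not prescribe this layout, and `build_index` is a hypothetical name.

```python
from collections import defaultdict


def summarize(words, hash_functions):
    """Summary information: the minimum hash value per function, as a tuple."""
    return tuple(
        min((h[w] for w in words if w in h), default=len(h))
        for h in hash_functions
    )


def build_index(documents, hash_functions):
    """Associate each summary with the search target documents that produce it."""
    index = defaultdict(list)
    for doc_id, words in documents.items():
        index[summarize(words, hash_functions)].append(doc_id)
    return dict(index)


# Hypothetical hash functions: values follow closeness in meaning to
# "price" (first table) and to "login" (second table).
hash_functions = [
    {"price": 0, "cost": 1, "fee": 2, "password": 3, "login": 4},
    {"login": 0, "password": 1, "fee": 2, "cost": 3, "price": 4},
]
documents = {
    "d_billing": ["price", "fee"],
    "d_account": ["login", "password"],
    "d_pricing": ["fee", "price", "cost"],
}

index = build_index(documents, hash_functions)
```

Keying the index by summary means a later search only has to compare the input document's summary against the stored summaries, not against the documents themselves.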

An index information creation program that causes a computer to execute processing comprising:
generating, based on a set of words included in search target documents and inter-word information indicating the closeness in meaning between words, a hash function that assigns a value to each word included in the set of words such that, relative to a predetermined reference word, words closer in meaning to the reference word are assigned closer values;
calculating summary information for each of a plurality of search target documents based on the generated hash function; and
generating index information for searching the calculated summary information of each of the plurality of search target documents.

An index information creation device comprising:
a hash function generation unit that generates, based on a set of words included in search target documents and inter-word information indicating the closeness in meaning between words, a hash function that assigns a value to each word included in the set of words such that, relative to a predetermined reference word, words closer in meaning to the reference word are assigned closer values;
a calculation unit that calculates summary information for each of a plurality of search target documents based on the generated hash function; and
an index information generation unit that generates index information for searching the calculated summary information of each of the plurality of search target documents.