JP5345918B2

JP5345918B2 - Document search method, document search apparatus, and document search program

Info

Publication number: JP5345918B2
Application number: JP2009236366A
Authority: JP
Inventors: 俊郎内山; 克人別所; 毅晴江田; 匡内山
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2009-10-13
Filing date: 2009-10-13
Publication date: 2013-11-20
Anticipated expiration: 2029-10-13
Also published as: JP2011085991A

Abstract

PROBLEM TO BE SOLVED: To reduce the computational cost of a search by calculating in advance the distances or similarities between the feature vectors of the words that make up a search term and the feature vectors of the documents to be searched.

SOLUTION: In step S3, search result determination means selects documents from the set of documents to be searched in ascending order of the distance between the feature vector of the search term and the feature vector of each document. That distance is obtained as the sum, weighted by the word weights, of the distances between the feature vectors of the words constituting the search term and the feature vectors of the documents to be searched, which have been calculated in advance and are combined in steps S1 and S2. Similarities may be used in place of distances, in which case documents are selected in descending order of similarity.

COPYRIGHT: (C)2011, JPO&INPIT

Description

The present invention relates to a technique for fast similarity search in which the features of a document are represented by a vector formed as the weighted average of the feature vectors of its constituent words, and the search is based on the distance between vectors.

With the spread of the Internet, there is a growing need to search huge document collections at high speed. Search techniques that represent document features as vectors and rank documents by the distance between vectors are widely used.

Many techniques for speeding up multidimensional vector search have been proposed. Before 1998, the proposals were mainly multidimensional vector search techniques based on tree structures.

However, as the number of dimensions grows, tree-based multidimensional vector search runs into the so-called "curse of dimensionality" and costs as much as a linear search (see, for example, Non-Patent Document 1). In response, a speed-up technique using a data structure called the VA-file (see Non-Patent Document 2) and locality-sensitive hashing (LSH), which overcomes the problems of the VA-file (see Non-Patent Document 3), have been proposed. LSH guarantees search accuracy probabilistically while keeping the search cost to (number of hashes) × (number of dimensions). In theory it works even when the number of dimensions is large and allows fast similarity search, which makes it the most promising of the fast similarity search techniques.

Non-Patent Document 1: Kevin Beyer, Jonathan Goldstein, Raghu Ramakrishnan, Uri Shaft, "When Is 'Nearest Neighbor' Meaningful?", ICDT 2005, pp. 158-172
Non-Patent Document 2: Roger Weber, Hans-J. Schek, Stephen Blott, "A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces", Very Large Data Bases, 1998, pp. 194-205
Non-Patent Document 3: Piotr Indyk, Rajeev Motwani, "Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality", Annual ACM Symposium on Theory of Computing, 1998, pp. 603-613
Non-Patent Document 4: Kenji Kita, Kazuhiko Tsuda, Masami Shishibori, "Information Retrieval Algorithms" (情報検索アルゴリズム), Kyoritsu Shuppan, January 1, 2002, pp. 34-35

In the prior art described above, every method has the problem that cost increases in proportion to the number of dimensions of the vectors being handled.

An object of the present invention is to avoid a cost increase proportional to the number of vector dimensions, and to perform search processing at high speed, in the case where the feature vector of a search term (document) is calculated as the weighted average of the feature vectors of the words that constitute the search term.

The present invention calculates in advance the distances between the feature vectors of all words that can be constituents of a search term (document) and the feature vectors of the documents to be searched. It relies on the property that the ordering of the distances between the feature vector of the search term and the feature vectors of the searched documents is equivalent to the ordering of the weighted sums of the distances between the feature vectors of the words and the feature vectors of the searched documents. At search time, only the weighted sum of the precomputed distances is calculated, which avoids computation proportional to the number of vector dimensions. The same applies when similarity is used in place of distance.

That is, the document search method of the present invention selects documents from a set of documents to be searched on the basis of distance from, or similarity to, the feature vector of a given search term, where the feature vector of the search term is expressed as the weighted average of the feature vectors of the words constituting the search term, using the weights of those words. The method includes a step in which search result determination means selects documents from the set in ascending order of distance, or descending order of similarity, between the feature vector of the search term and the feature vectors of the searched documents. That distance or similarity is calculated as the sum, weighted by the word weights, of precomputed distances or similarities between the feature vectors of the words constituting the search term and the feature vectors of the searched documents.

Likewise, the document search apparatus of the present invention selects documents from a set of documents to be searched on the basis of distance from, or similarity to, the feature vector of a given search term, where the feature vector of the search term is expressed as the weighted average of the feature vectors of the words constituting the search term, using the weights of those words. The apparatus includes search result determination means that selects documents from the set in ascending order of distance, or descending order of similarity, between the feature vector of the search term and the feature vectors of the searched documents. That distance or similarity is calculated as the sum, weighted by the word weights, of precomputed distances or similarities between the feature vectors of the words constituting the search term and the feature vectors of the searched documents.

The principle of the present invention is as follows. Let d be the feature vector of the search term (document), g_k (k = 1, ..., K) the feature vectors of the K documents to be searched, x_i (i = 1, ..., n) the feature vectors of the n words contained in the search term (document), T_i the weight of each word, and T_all the total weight. Writing t_i for the weights normalized so that they sum to 1, the following holds: T_all = T_1 + ... + T_n, t_i = T_i / T_all, and therefore t_1 + ... + t_n = 1.

Here, a "weight" is the product of a "local weight", calculated from the frequency with which the word appears, and a "global weight" of the word, which takes the entire document set into account. Non-Patent Document 4 describes several ways of computing it.

The precondition for using the present invention, namely that "the feature vector of a search term (document) is calculated as the weighted average of the feature vectors of the words that constitute the search term", means that the feature vector d of the search term is expressed in terms of the feature vectors x_i (i = 1, ..., n) of its constituent words and their weights t_i (i = 1, ..., n) as equation (1): d = t_1 x_1 + ... + t_n x_n.
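As a minimal sketch of this precondition, the following Python fragment normalizes raw word weights and forms the weighted-average vector d. The function names and the sample vectors are hypothetical and chosen only for illustration; they do not come from the patent.

```python
from typing import List, Sequence


def normalize_weights(raw_weights: Sequence[float]) -> List[float]:
    """Scale raw weights T_i so that they sum to 1 (t_i = T_i / T_all)."""
    total = sum(raw_weights)  # T_all
    if total <= 0:
        raise ValueError("total weight must be positive")
    return [w / total for w in raw_weights]


def query_vector(word_vectors: Sequence[Sequence[float]],
                 raw_weights: Sequence[float]) -> List[float]:
    """Return d = sum_i t_i * x_i, the weighted average of the word vectors."""
    t = normalize_weights(raw_weights)
    dim = len(word_vectors[0])
    return [sum(t_i * x[j] for t_i, x in zip(t, word_vectors))
            for j in range(dim)]


# Two hypothetical 3-dimensional word vectors with raw weights 1 and 3.
x = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
d = query_vector(x, [1.0, 3.0])
print(d)  # [0.25, 0.75, 0.0]
```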

Under this condition, for several definitions of distance (or similarity), the ordering over different k of the distances (or similarities) between the feature vector d of the search term (document) and the feature vector g_k of each searched document is the same as the ordering of the weighted sums of the distances (or similarities) between the feature vectors x_i of the constituent words and g_k.

Calculating the distance (or similarity) between x_i and g_k has a computational cost that depends on the number of vector dimensions. When the relationship above holds, however, these values can be calculated in advance, as in the document search method and document search apparatus according to the invention shown in FIG. 1 and FIG. 2, so this calculation is avoided at actual search time.
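The offline step can be sketched as building a table with one row per vocabulary word and one column per searched document. The sketch below assumes squared Euclidean distance and a plain in-memory dictionary; the names `build_distance_table`, `words`, and `docs` are hypothetical, and a real system would presumably persist the table in storage.

```python
from typing import Dict, List, Sequence


def squared_euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def build_distance_table(word_vectors: Dict[str, Sequence[float]],
                         doc_vectors: Sequence[Sequence[float]]
                         ) -> Dict[str, List[float]]:
    """Precompute, for every word, its distance to each of the K documents.

    This is the only place where the cost depends on the vector dimension.
    """
    return {word: [squared_euclidean(x, g) for g in doc_vectors]
            for word, x in word_vectors.items()}


# Hypothetical 2-dimensional vocabulary and three searched documents.
words = {"search": [1.0, 0.0], "vector": [0.0, 1.0]}
docs = [[1.0, 0.0], [0.0, 0.0], [1.0, 1.0]]
table = build_distance_table(words, docs)
print(table["search"])  # [0.0, 1.0, 1.0]
print(table["vector"])  # [2.0, 1.0, 1.0]
```

At search time only rows of this table are read, so the per-query work depends on the number of query words and the number of documents, not on the vector dimension.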

That is, the document search method and apparatus shown in FIG. 1 and FIG. 2 select documents from a set of documents to be searched on the basis of distance from, or similarity to, the feature vector of a given search term (or document), where that feature vector is expressed as the weighted average of the feature vectors of the words constituting the search term, using the weights of those words. The search result determination means retrieves documents in ascending order of distance, or descending order of similarity, between the feature vector of the search term and the feature vectors of the searched documents, calculated as the sum, weighted by the word weights, of the precomputed distances or similarities between the feature vectors of the words constituting the search term and the feature vectors of the searched documents. For each word constituting the search term, the distance between the word's feature vector and the feature vector of each searched document is calculated in advance.

As shown in FIG. 1, a specific form of the document search method has three steps. In the word weight calculation step (S1), word weight calculation means calculates the weight of each word constituting the input search term from the frequency with which the word appears in the search term and the importance of the word, which has been calculated and stored in advance. In the distance calculation step (S2), distance calculation means takes, for each word, the precomputed distances between the word's feature vector and the feature vectors of the searched documents, forms their weighted sum using the calculated word weights, and treats the result as the distance between the feature vector of the search term and the feature vector of the searched document; this is done for every searched document. In the search result determination step (S3), search result determination means takes as the search result the searched documents that correspond to the calculated distances selected in ascending order. Similarity may be used instead of distance, with high similarity taking the place of small distance. In that case, the searched documents corresponding to the calculated similarities selected in descending order form the search result.

A document search apparatus corresponding to this method includes: input means 11 for inputting a search term; word importance storage means 12 for storing the importance of words; word weight calculation means 13 for calculating the weight of each word constituting the input search term from the frequency with which the word appears in the search term and the importance of the word read from word importance storage means 12; distance calculation means 15, which takes the distances between each word's feature vector and the feature vectors of the searched documents, calculated in advance and stored in word-to-search-target distance storage means 14, forms their weighted sum using the word weights calculated by word weight calculation means 13, and treats the result as the distance between the feature vector of the search term and the feature vector of the searched document, for every searched document; and search result determination means 16, which takes as the search result the searched documents corresponding to the calculated distances selected in order from the smallest. Means that use similarity instead of distance, and high similarity instead of small distance, are also included.

In the document search method and apparatus described above, the distance or similarity measure used is one for which the ordering of the distance or similarity values between the feature vector of the search term and the feature vectors of the searched documents is equivalent to the ordering of the weighted sums of the distances or similarities between the feature vectors of the words constituting the search term and the feature vectors of the searched documents.

Examples of the distance measure are the squared Euclidean distance between the feature vector of a word and the feature vector of a searched document, the Kullback-Leibler divergence, and the cross-entropy that uses this divergence. An example of the similarity measure is the inner product of the two feature vectors.

That is, when the squared Euclidean distance between the two vectors is used as the distance measure, the relationship is given by equation (2).
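A derivation consistent with the definitions above can be sketched as follows. It uses only d = Σ t_i x_i and Σ t_i = 1; the norm notation and the layout are assumptions of this sketch, not a reproduction of the patent's own equation.

```latex
\begin{aligned}
\sum_{i=1}^{n} t_i \,\lVert x_i - g_k \rVert^2
  &= \sum_{i=1}^{n} t_i \lVert x_i \rVert^2 - 2\, d \cdot g_k + \lVert g_k \rVert^2 \\
  &= \lVert d - g_k \rVert^2
     + \Bigl( \sum_{i=1}^{n} t_i \lVert x_i \rVert^2 - \lVert d \rVert^2 \Bigr)
\end{aligned}
```

The bracketed term does not depend on k, so sorting the searched documents by the weighted sum on the left gives the same order as sorting them by the squared distance between d and g_k.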

In the examples that follow as well, the "weights" in the "weighted sum of the distances between the word feature vectors and the feature vector of the searched document" may be either the weights before normalization or the weights after normalization.

When the inner product of the feature vectors x_i and g_k is used as the similarity measure, equation (3) is obtained: d · g_k = t_1 (x_1 · g_k) + ... + t_n (x_n · g_k).

This is exactly the weighted sum of the inner products of the word feature vectors x_i and the feature vector g_k of the searched document. The weighted sum of these inner products can be regarded as the weighted sum of the similarities between the two vectors.
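The linearity of the inner product can be checked numerically with a few lines of Python. The vectors and weights below are hypothetical values picked so that the arithmetic is exact in floating point.

```python
import math
from typing import Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two vectors of equal length."""
    return sum(p * q for p, q in zip(a, b))


# Hypothetical word vectors x_i, normalized weights t_i, and one document g_k.
x = [[1.0, 2.0], [3.0, 4.0]]
t = [0.25, 0.75]
g = [2.0, -1.0]

# d = sum_i t_i * x_i
d = [sum(t_i * x_i[j] for t_i, x_i in zip(t, x)) for j in range(2)]

direct = dot(d, g)                                        # d . g_k
weighted = sum(t_i * dot(x_i, g) for t_i, x_i in zip(t, x))  # sum_i t_i (x_i . g_k)

print(direct, weighted)  # 1.5 1.5
```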

When the Kullback-Leibler divergence between the feature vectors x_i and g_k is used as the distance measure, both feature vectors are regarded as probability distributions, and the weighted sum of the Kullback-Leibler divergences from g_k to x_i can be regarded as the weighted sum of the distances between x_i and g_k. That is, calculating the Kullback-Leibler divergence from the feature vector d to the feature vector g_k gives equation (4).

Another distance measure available when x_i and g_k are regarded as probability distributions is the cross-entropy of x_i and g_k, which uses the Kullback-Leibler divergence from g_k to x_i. The weighted sum of these cross-entropies can likewise be regarded as a weighted sum of the distances between x_i and g_k.
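One way to see why both measures preserve the ordering is the decomposition sketched below. It assumes that d, x_i, and g_k are probability vectors with components indexed by j, that the divergence is taken with d (or x_i) as the first argument, and that H denotes cross-entropy; the argument order and notation are assumptions of this sketch, since the text states the direction in more than one way.

```latex
\begin{aligned}
\mathrm{KL}(d \,\|\, g_k)
  &= \sum_{j} d_j \log \frac{d_j}{g_{k,j}} \\
  &= \sum_{i=1}^{n} t_i \,\mathrm{KL}(x_i \,\|\, g_k)
     + \Bigl( \sum_{j} d_j \log d_j
     - \sum_{i=1}^{n} t_i \sum_{j} x_{i,j} \log x_{i,j} \Bigr), \\
H(d, g_k)
  &= -\sum_{j} d_j \log g_{k,j}
   = \sum_{i=1}^{n} t_i \, H(x_i, g_k)
\end{aligned}
```

The bracketed term in the first identity is independent of k, so the weighted sum of per-word divergences ranks the documents in the same order as the divergence computed from d. For cross-entropy the weighted sum equals the value computed from d exactly.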

The present invention can also take the form of a document search program that causes a computer to function as each of the means constituting the document search apparatus described above.

According to the invention described above, the distances or similarities between the feature vectors of the words constituting a search term and the feature vectors of the searched documents can be calculated in advance, which reduces the computational cost at search time.

FIG. 1 is a flowchart explaining the principle of one aspect of the document search method of the present invention.
FIG. 2 is a block diagram showing one aspect of the document search apparatus of the present invention.
FIG. 3 is a block diagram showing a document search apparatus according to an embodiment of the present invention.
FIG. 4 is a flowchart explaining a document search method according to an embodiment of the present invention.

A document search apparatus 100 according to an embodiment of the invention is described below with reference to FIG. 3.

As shown in FIG. 3, the document search apparatus 100 has a CPU 51, a memory 52, a display 53, a keyboard 54, a processing program 55, processing target storage means 56, an OS 57, word-to-search-target distance storage means 58, and word importance storage means 59.

The CPU 51, in cooperation with the processing program 55 running on the OS (operating system) 57, functions as the means for executing steps S101 and S102 shown in FIG. 4 and steps S103 to S106 according to the invention.

That is, the CPU 51 functions as search result determination means that executes the step (S106) of selecting searched documents from the set of searched documents in ascending order (smallest first) of the distance between the feature vector of the search term and the feature vector of each searched document, where that distance is calculated as the sum, weighted by the word weights, of the precomputed distances between the feature vectors of the words constituting the search term and the feature vectors of the searched documents. Similarity may be used instead of distance. In that case, the searched documents are selected from the set in descending order (highest first) of the similarity between the feature vector of the search term and the feature vector of each searched document.

The CPU 51 also functions as word weight calculation means. When there is a word constituting the search term for which the weight calculation, and the weighted-sum calculation of the distance between the feature vector of the search term and the feature vectors of the searched documents using that weight, have not yet been performed (S103), it executes the step (S104) of calculating the weight of the word from the frequency with which the word appears in the search term and the importance of the word. In addition, the CPU 51 functions as calculation means that executes the step (S105) of forming the weighted sum of the distances between the word feature vectors and the feature vector of a searched document, using the calculated word weights, as the distance between the feature vector of the search term and the feature vector of that searched document, and of performing this calculation for every searched document. Similarity may be used instead of distance.

The memory 52 temporarily stores the processing target, that is, a search term (or document) entered from the keyboard 54, which is one form of the input means 11 shown in FIG. 2, or read from the processing target storage means 56, so that it can be used in the processing of the processing program 55. The memory 52 also temporarily stores information from the storage means 58 and 59 for use in that processing.

The display 53 is display means for showing the search term entered from the keyboard 54 and the search results obtained from it in steps S101 to S106 described below.

The processing program 55 is an application program that causes the CPU 51 to execute S101 to S106 as described above, and is configured to run on the OS (operating system) 57. Within this procedure, S103 to S105 correspond to the functional means 13 to 15 shown in FIG. 2, respectively. The processing program 55 is stored in storage means such as the hard disk device on which the OS 57 is installed.

The processing target storage means 56 stores search terms entered from the keyboard 54 as processing targets. Any well-known storage means, such as a hard disk device or a file server, may be used as the processing target storage means 56.

The word-to-search-target distance storage means 58 stores the precomputed distances between the feature vectors of words and the feature vectors of the searched documents. Examples of the distance are the squared distance defined by equation (2), the divergence defined by equation (4), and the cross-entropy to which this divergence is applied. The storage means 58 may instead store the similarity between the two vectors in place of the distance. An example of the similarity is the inner product defined by equation (3).

The word importance storage means 59 stores in advance the importance of each word, which is used together with the word's frequency of appearance in the search term to calculate the weight of the word. An example of the importance is the "global weight" disclosed in Non-Patent Document 4.

As with the storage means 56, well-known storage means may be used for the storage means 58 and 59.

Steps S101 to S106, executed by the CPU 51 of the document search apparatus 100, are described below with reference to the flowchart shown in FIG. 4. For S101 to S106, the distances between the feature vectors of words and the feature vectors of the searched documents have been calculated in advance. The distance used is, for example, the squared Euclidean distance between the two vectors given by equation (2).

S101: The search term (or document) to be processed is read from the processing target storage means 56 into the memory 52.

S102: All elements of the array D in the memory 52, which holds the distances between the feature vector of the search term and the feature vectors of the search targets (searched documents), are initialized to 0. The number of elements in the array D equals the number K of searched documents, and each element is written D_k (k = 1, ..., K).

S103: From the words constituting the search term (or document) read into the memory 52, a word that has not yet been processed in S104 and S105 is taken out. If no unprocessed word remains, processing proceeds to S106.

S104: The importance of the word being processed is read from the word importance storage means 59, and its product with the frequency of the word in the search term is taken as the weight of the word. For example, the weight may be determined by the method described in Non-Patent Document 4 for weighting an "index term" (which corresponds to a "word" in the present invention).
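Step S104 can be sketched as multiplying each word's count in the query by a stored importance value. The function name, the dictionary standing in for the word importance storage means, the sample values, and the choice to give words without a stored importance a weight of 0 are all assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List


def word_weights(query_words: List[str],
                 importance: Dict[str, float]) -> Dict[str, float]:
    """Weight of each query word = frequency in the query * stored importance.

    Words with no stored importance get weight 0.0 (an assumption of this sketch).
    """
    frequency = Counter(query_words)
    return {word: count * importance.get(word, 0.0)
            for word, count in frequency.items()}


# Hypothetical stored importance values (for example, a global weight).
importance = {"document": 0.5, "search": 2.0}
weights = word_weights(["document", "search", "search"], importance)
print(weights)  # {'document': 0.5, 'search': 4.0}
```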

S105: For the word being processed, the distances between the word's feature vector and the feature vectors of the searched documents (array data with K elements, one per search target) are read from the word-to-search-target distance storage means 58. Each element of this array is multiplied by the weight of the word calculated in S104 and added to the corresponding element of the array D, which holds the distances between the feature vector of the search term and the feature vectors of the search targets.

Writing T for the weight of the word and D^W_k for the k-th element of the distances between the word's feature vector and the feature vectors of the search targets, this processing can be expressed as equation (5): D_k ← D_k + T · D^W_k (k = 1, ..., K).

S103 to S105 are repeated until no unprocessed word constituting the search term remains.
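The loop over S102 to S105 can be sketched as follows. The table layout (one list of K precomputed distances per word), the function name, and the sample values are assumptions of this sketch.

```python
from typing import Dict, List


def accumulate_distances(weights: Dict[str, float],
                         table: Dict[str, List[float]],
                         num_docs: int) -> List[float]:
    """Accumulate weighted precomputed distances into the array D."""
    D = [0.0] * num_docs                 # S102: initialize all elements to 0
    for word, T in weights.items():      # S103: take the next unprocessed word
        row = table[word]                # S105: read the K precomputed distances
        for k in range(num_docs):
            D[k] += T * row[k]           # D_k <- D_k + T * D^W_k
    return D


# Hypothetical precomputed distances for two words against three documents.
table = {"search": [0.0, 1.0, 1.0], "vector": [2.0, 1.0, 1.0]}
weights = {"search": 1.0, "vector": 3.0}
D = accumulate_distances(weights, table, num_docs=3)
print(D)  # [6.0, 4.0, 4.0]
```

The work per query is proportional to (number of query words) × K and involves no vector of the original dimension.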

S106: The element values of the array D of distances between the feature vector of the search term and the feature vectors of the search targets are sorted in ascending order, and the search targets corresponding to the top Y entries are output as the search result. The search result is shown on the display 53. The number Y may be fixed in advance, or the user may set it by entering any number with the keyboard 54.
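Selecting the top Y entries does not require sorting the whole array. The sketch below uses `heapq.nsmallest` from the Python standard library as one possible alternative to the full sort described in S106; the function name `top_y` and the sample distances are hypothetical.

```python
import heapq
from typing import List, Sequence


def top_y(D: Sequence[float], y: int) -> List[int]:
    """Return the indices of the y searched documents with the smallest distance."""
    return heapq.nsmallest(y, range(len(D)), key=lambda k: D[k])


# Hypothetical accumulated distances for four searched documents.
D = [6.0, 4.0, 5.0, 1.0]
print(top_y(D, 2))  # [3, 1]
```

For the similarity variant, `heapq.nlargest` with the same arguments would return the documents with the highest similarity instead.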

In S101 to S106 described above, the distances between the feature vectors of the words and the feature vectors of the searched documents are calculated in advance, but similarities may be calculated instead of distances. In that case, in S106 the elements of the array D of similarities between the feature vector of the search word and the feature vectors of the search targets are sorted in descending order, and the search targets corresponding to the top Y elements are output as the search result. As the similarity index, for example, the inner product of the two feature vectors given by equation (3) above may be used. As the distance index, instead of the squared Euclidean distance, either the Kullback-Leibler divergence given by equation (4) above or the cross entropy based on that divergence may be used.
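The similarity variant differs from the distance variant only in what is precomputed and in the sort direction. A minimal sketch, assuming inner-product similarities have already been stored per word in a hypothetical table `word_doc_sim`:

```python
def rank_by_similarity(weights, word_doc_sim, num_docs, y):
    """Accumulate weighted similarities, then return the top-y indices, largest first."""
    S = [0.0] * num_docs
    for word, t in weights.items():
        for k, s in enumerate(word_doc_sim[word]):
            S[k] += t * s
    # Descending order, because a larger similarity means a closer match.
    return sorted(range(num_docs), key=lambda k: S[k], reverse=True)[:y]


# Hypothetical precomputed inner products for K = 3 searched documents.
word_doc_sim = {
    "search": [0.2, 0.9, 0.5],
    "vector": [0.4, 0.1, 0.8],
}
weights = {"search": 3.0, "vector": 2.0}

best = rank_by_similarity(weights, word_doc_sim, num_docs=3, y=2)
print(best)
```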

As described above, when the feature vector of a search word (or search document) is given by the weighted average of the feature vectors of its words, the document search apparatus 100 according to the embodiment of the invention treats the weighted average of the distances or similarities between the feature vectors of the words and the feature vector of a searched document as the distance or similarity between the feature vector of the search word and the feature vector of the searched document. This is because the ordering of the distance or similarity values between the feature vector of the search word and the feature vectors of the searched documents is equivalent to the ordering of the weighted sums of the distances or similarities between the feature vectors of the words constituting the search word and the feature vectors of the searched documents. Therefore, if the distances or similarities between words and searched documents are calculated in advance, a search can be performed merely by comparing the weighted averages.
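The equivalence of orderings can be checked directly for the inner product and the squared Euclidean distance. The following is a sketch rather than text from the patent; the symbols w_i (word feature vectors), d (feature vector of a searched document), T_i (word weights), a_i (normalized weights), and q (feature vector of the search word) are notation assumed here.

```latex
% Search word vector as a weighted average of its word vectors
q = \sum_i a_i w_i, \qquad a_i = \frac{T_i}{\sum_j T_j}, \qquad \sum_i a_i = 1

% Inner product: the weighted average of word similarities is exact
\langle q, d \rangle = \sum_i a_i \langle w_i, d \rangle

% Squared Euclidean distance: equal up to a term that does not depend on d
\sum_i a_i \lVert w_i - d \rVert^2
  = \lVert q - d \rVert^2
  + \Bigl( \sum_i a_i \lVert w_i \rVert^2 - \lVert q \rVert^2 \Bigr)
```

The bracketed term is the same for every searched document d, so ranking documents by the weighted average gives the same order as ranking them by the distance to q. Multiplying by the positive constant \sum_j T_j does not change the order either, so the unnormalized weighted sum with the weights T_i can be used as is.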

Therefore, according to the document search apparatus 100, the computationally expensive part of calculating the distance or similarity between the feature vectors of the words constituting a search word and the feature vectors of the searched documents can be completed in advance, which reduces the calculation cost when searching with that search word.
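The split between offline and search-time work can be illustrated end to end. The sketch below uses squared Euclidean distance and toy two-dimensional feature vectors; all names and values are hypothetical. It builds the word-to-document distance table once, ranks with weighted sums only, and compares the result with a direct computation that first forms the search word's feature vector.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


# Hypothetical toy feature vectors for words and searched documents.
word_vecs = {"search": [1.0, 0.0], "vector": [0.0, 1.0]}
doc_vecs = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]

# Offline step: table of word-to-document distances (the role of storage means 58).
table = {w: [sq_dist(v, d) for d in doc_vecs] for w, v in word_vecs.items()}


def rank_fast(weights):
    """Search-time ranking using only weighted sums of precomputed distances."""
    D = [sum(t * table[w][k] for w, t in weights.items())
         for k in range(len(doc_vecs))]
    return sorted(range(len(D)), key=lambda k: D[k])


def rank_direct(weights):
    """Reference ranking: build the weighted-average vector, then measure distances."""
    total = sum(weights.values())
    dim = len(doc_vecs[0])
    q = [sum(t * word_vecs[w][i] for w, t in weights.items()) / total
         for i in range(dim)]
    return sorted(range(len(doc_vecs)), key=lambda k: sq_dist(q, doc_vecs[k]))


weights = {"search": 3.0, "vector": 1.0}
print(rank_fast(weights), rank_direct(weights))
```

Both functions return the same order for this data; only `rank_direct` touches the feature vectors at search time.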

In addition, because searched documents are output from the set of searched documents in ascending order of the distance, or descending order of the similarity, between the feature vector of the search word and the feature vectors of the searched documents, searching is also efficient for the user.

When the processing program 55 is stored on a computer-readable recording medium, the program itself, as read from the medium by a computer, realizes the functional means constituting the document search apparatus according to the present invention. Therefore, the processing program 55, or a recording medium storing it, such as a CD-ROM, DVD-ROM, CD-R, MO, or HDD, also constitutes one aspect of the present invention.

12, 59 ... Word importance storage means
13 ... Word weight calculation means
14, 58 ... Word/search-target distance storage means
15 ... Distance calculation means
16 ... Search result determination means
55 ... Processing program (document search program)
100 ... Document search apparatus

Claims

A document search method that, when the feature vector of a predetermined search word is represented as a weighted average of the feature vectors of the words constituting the search word, using the weights of those words, selects searched documents from a set of searched documents based on their distance from or similarity to the feature vector of the search word, the method comprising:
a step in which word weight calculation means calculates the weight of each word constituting the input search word from the appearance frequency of the word in the search word and the importance of the word calculated in advance;
a step in which distance calculation means calculates, for all searched documents, the weighted sum, using the calculated word weights, of the distances or similarities calculated in advance for each word between the feature vector of the word and the feature vector of the searched document, as the distance or similarity between the feature vector of the search word and the feature vector of the searched document; and
a step in which search result determination means determines, as the search result, the searched documents in the set of searched documents that correspond to the calculated distances selected in ascending order or the calculated similarities selected in descending order.

The document search method according to claim 1, wherein the distance or similarity index used is one for which the ordering of the distance or similarity values between the feature vector of the predetermined search word and the feature vectors of the searched documents is equivalent to the ordering of the weighted sums of the distances or similarities between the feature vectors of the words constituting the search word and the feature vectors of the searched documents.

The document search method according to claim 2, wherein the distance index is any one of the squared Euclidean distance between the feature vector of the word and the feature vector of the searched document, the Kullback-Leibler divergence, and the cross entropy based on that divergence, and
the similarity index is the inner product of the two feature vectors.

A document search apparatus that, when the feature vector of a predetermined search word is represented as a weighted average of the feature vectors of the words constituting the search word, using the weights of those words, selects searched documents from a set of searched documents based on their distance from or similarity to the feature vector of the search word, the apparatus comprising:
word weight calculation means for calculating the weight of each word constituting the input search word from the appearance frequency of the word in the search word and the importance of the word calculated in advance;
distance calculation means for calculating, for all searched documents, the weighted sum, using the calculated word weights, of the distances or similarities calculated in advance for each word between the feature vector of the word and the feature vector of the searched document, as the distance or similarity between the feature vector of the search word and the feature vector of the searched document; and
search result determination means for determining, as the search result, the searched documents in the set of searched documents that correspond to the calculated distances selected in ascending order or the calculated similarities selected in descending order.

The document search apparatus according to claim 4, wherein the distance or similarity index used is one for which the ordering of the distance or similarity values between the feature vector of the predetermined search word and the feature vectors of the searched documents is equivalent to the ordering of the weighted sums of the distances or similarities between the feature vectors of the words constituting the search word and the feature vectors of the searched documents.

The document search apparatus according to claim 5, wherein the distance index is any one of the squared Euclidean distance between the feature vector of the word and the feature vector of the searched document, the Kullback-Leibler divergence, and the cross entropy based on that divergence, and
the similarity index is the inner product of the two feature vectors.

A document search program for causing a computer to function as each of the means constituting the document search apparatus according to any one of claims 4 to 6.