JP2003006214A

JP2003006214A - Document retrieval processing method, system, and storage medium

Info

Publication number: JP2003006214A
Application number: JP2001193444A
Authority: JP
Inventors: Daiki Suzuki; 大記鈴木
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2001-06-26
Filing date: 2001-06-26
Publication date: 2003-01-10

Abstract

PROBLEM TO BE SOLVED: To provide a document retrieval processing system that increases the processing speed of document retrieval. SOLUTION: A CPU 101 generates a document vector that characterizes each document, a document similarity between documents from their document vectors, and a reference similarity between each document vector and a plurality of reference vectors that exist independently of the documents. The CPU 101 retrieves similar documents by the document similarity computed from the document vectors, and performs control so that the reference similarity is also used in the retrieval.

Description

Detailed Description of the Invention

【０００１】[0001]

BACKGROUND OF THE INVENTION 1. Field of the Invention The present invention relates to a document search processing method and apparatus for searching document data, and to a storage medium.

【０００２】[0002]

2. Description of the Related Art In recent years, as opportunities to handle large amounts of document data have increased, the means for searching for desired document data have become increasingly diverse. This is because a simple keyword search (specifying a keyword and searching according to whether or not it appears) no longer adequately satisfies users' demands.

[0003] For this reason, the mainstream approach is to express a document as a vector whose dimensions are the meanings, fields, or words themselves that characterize the content of the document and whose values are the corresponding feature amounts, and to obtain the similarity between documents from a value such as the inner product of their document vectors.

[0004] To improve the accuracy of this document similarity, that is, to capture the features of a document in more depth, the number of vector dimensions tends to be in the hundreds or thousands.

【０００５】[0005]

PROBLEMS TO BE SOLVED BY THE INVENTION However, with the conventional technique described above, although the accuracy of the similarity improves, the load of generating the document similarity increases, which reduces the speed of document search processing.

[0006] The present invention has been made to solve the above problems of the prior art. Its first object is to provide a document search processing method and apparatus that improve the processing speed of document search.

[0007] A second object of the present invention is to provide a storage medium storing a control program for controlling the document search processing apparatus of the present invention described above.

【０００８】[0008]

MEANS FOR SOLVING THE PROBLEMS To achieve the first object, the document search processing method according to claim 1 of the present invention comprises: a document vector generation step of generating a document vector that characterizes a document; a document similarity generation step of generating a document similarity between documents from their document vectors; a reference similarity generation step of generating a similarity between a document vector and each of a plurality of reference vectors that exist independently of the documents; and a document search step of searching for similar documents by the document similarity generated in the document similarity generation step using the document vectors generated in the document vector generation step, while also using the reference similarity generated in the reference similarity generation step.

[0009] To achieve the first object, the document search processing method according to claim 2 of the present invention is the document search processing method according to claim 1, wherein the document vector and the reference similarity of each document to be searched are generated and held before the search.

[0010] To achieve the first object, the document search processing apparatus according to claim 3 of the present invention comprises: document vector generation means for generating a document vector that characterizes a document; document similarity generation means for generating a document similarity between documents from their document vectors; reference similarity generation means for generating a similarity between a document vector and each of a plurality of reference vectors that exist independently of the documents; and document search means for searching for similar documents by the document similarity generated by the document similarity generation means using the document vectors generated by the document vector generation means, while also using the reference similarity generated by the reference similarity generation means.

[0011] To achieve the first object, the document search processing apparatus according to claim 4 of the present invention is the document search processing apparatus according to claim 3, wherein the document vector and the reference similarity of each document to be searched are generated and held before the search.

[0012] To achieve the second object, the storage medium according to claim 5 of the present invention is a storage medium storing a computer-readable control program for controlling a document search processing apparatus that searches document data, wherein the control program comprises: a document vector generation module that generates a document vector that characterizes a document; a document similarity generation module that generates a document similarity between documents from their document vectors; a reference similarity generation module that generates a similarity between a document vector and each of a plurality of reference vectors that exist independently of the documents; and a document search module that searches for similar documents by the document similarity generated by the document similarity generation module using the document vectors generated by the document vector generation module, while also using the reference similarity generated by the reference similarity generation module.

[0013] To achieve the second object, the storage medium according to claim 6 of the present invention is the storage medium according to claim 5, wherein the document vector and the reference similarity of each document to be searched are generated and held before the search.

【００１４】[0014]

BEST MODE FOR CARRYING OUT THE INVENTION Embodiments of the present invention will be described below with reference to the drawings.

[0015] (First Embodiment) First, a first embodiment of the present invention will be described with reference to FIGS. 1 to 14.

[0016] FIG. 1 is a block diagram showing the system configuration of the document search processing apparatus according to this embodiment. In the figure, reference numeral 101 denotes a CPU (microprocessor: central processing unit), which performs computation, logical decisions, and the like for document search, and controls the components described later that are connected to a bus (BUS) 102 via that bus. The CPU 101 also operates as document search and display means.

[0017] Reference numeral 102 denotes the bus (BUS), which transfers address signals and control signals that designate the components, described later, that the CPU 101 controls. It also transfers data between those components.

[0018] Reference numeral 103 denotes a ROM (read-only memory), a fixed read-only memory that stores the control programs and the like executed by the CPU 101. Reference numeral 104 denotes a RAM (random access memory), used for temporary storage of various data from the components. Reference numeral 105 denotes an input device comprising a keyboard, a mouse, and the like. Reference numeral 106 denotes a display device comprising a CRT (cathode ray tube), a liquid crystal display, or the like.

[0019] Reference numeral 107 denotes a storage device comprising a hard disk, which stores a document file database (document DB) 107a to be searched, a dictionary DIC 107b, and the like. Reference numeral 108 denotes an external storage device, such as a drive for accessing external storage media such as a floppy (registered trademark) disk, a writable CD (compact disc), or a DVD (digital video disc). The external storage device 108 can be used in the same way as the storage device 107, and exchanges data with other document search processing apparatuses through those storage media. Reference numeral 109 denotes a communication device comprising a modem, a LAN (local area network) controller, or the like, which exchanges data with the outside via a communication line.

[0020] The document search processing apparatus according to this embodiment, composed of the above components, operates in response to various inputs from the input device 105. When an input is supplied from the input device 105, an interrupt signal is first sent to the CPU 101, the CPU 101 reads the various control signals stored in the ROM 103, and various kinds of control are performed in accordance with those control signals.

[0021] In the document search processing apparatus according to this embodiment configured as described above, narrowing down the documents to be searched by the reference similarity greatly reduces the load of document similarity determination, which makes high-speed document search processing possible.

[0022] An example of this document search processing is described below.

[0023] FIG. 2 is a diagram showing an example of the display screen layout for document search results in the document search processing apparatus according to this embodiment. In the figure, reference numeral 201 denotes a search condition panel, which displays the content of the current search instruction. Examples of what is displayed as the search instruction include a natural-language sentence written by the user, a list of keywords entered by the user, and the content of an existing document designated by the user. Reference numeral 202 denotes a search result panel, which displays the results of the document search performed under the above search conditions. For each document listed as a search result, an ID 202a, a document title 202b, and a similarity 202c are displayed.

[0024] Next, the basic operation of narrowing down the search targets using the reference similarity in the document search processing apparatus according to this embodiment will be described. This basic operation is broadly divided into three parts: creating the reference similarities, expanding the document vector of the search query into reference vectors, and determining the search targets by the reference similarity.

[0025] First, the process of generating the reference similarity will be described.

[0026] When a document is registered in the document DB 107a of the storage device 107, a document vector that characterizes the document is generated first. The document vector is calculated from the words appearing in the document, using the dictionary DIC 107b in the storage device 107.

[0027] FIG. 3 is a diagram showing the structure of the dictionary DIC 107b. As shown in the figure, the dictionary DIC 107b stores, for each word, the feature amount corresponding to each dimension (Dim.) of the vector representation. The dimensions are, for example, criteria classified by the inherent meaning of the word or criteria classified by the field in which the word is used. In the figure, the feature amount of word 1 is 0 for Dim. 1 and 23 for Dim. 2.

[0028] In this way, the feature amount of each dimension (Dim.) for a single word can be obtained from the dictionary DIC 107b. A feature amount can be interpreted as a value indicating how strongly the use of that word is likely to characterize the document with respect to the classification criterion (= dimension). From the feature amounts per classification criterion (per dimension) obtained from all the words making up the document, the features of the whole document are expressed as a vector whose dimensions are the classification criteria. The vector obtained is normalized to norm = 1 and stored as the document vector.
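The document vector construction described in [0028] can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the function name `build_document_vector`, the dictionary layout (a word mapped to a list of per-dimension feature amounts), and the handling of unknown words and empty documents are assumptions. Only the values 0 and 23 for word 1 come from the description of FIG. 3; the second word is invented for the example.

```python
import math


def build_document_vector(words, dictionary, n_dims):
    """Sum the per-dimension feature amounts of all words, then scale to norm 1."""
    totals = [0.0] * n_dims
    for word in words:
        features = dictionary.get(word)
        if features is None:
            continue  # assumption: words missing from the dictionary contribute nothing
        for dim, value in enumerate(features):
            totals[dim] += value
    norm = math.sqrt(sum(v * v for v in totals))
    if norm == 0.0:
        return totals  # assumption: a document with no known words keeps the zero vector
    return [v / norm for v in totals]


# Hypothetical two-dimensional dictionary (Dim. 1, Dim. 2).
toy_dictionary = {"word1": [0, 23], "word2": [10, 0]}
doc_vector = build_document_vector(["word1", "word2", "word1"], toy_dictionary, 2)
print(doc_vector)
```

Normalizing at registration time is what later lets the cosine measure reduce to a plain inner product.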

[0029] FIG. 4 is a diagram showing an example of stored document vectors. As shown in the figure, for example, the document vector of document ID = 6947 has a feature amount of 0.183 for Dim. 1 and 0.214 for Dim. 2.

[0030] A reference similarity is generated from the generated document vector. The reference similarity is generated from the similarity between the document vector and a prepared reference vector (Base Vector = BV) that has fixed values. Each reference vector is held in the form of a vector BV, obtained by normalizing to norm = 1 a vector P that gives the feature amount ratio of each dimension as an integer ratio. The reference vectors can be given arbitrary values chosen to be more effective for the target document DB.

[0031] FIG. 5 is a diagram showing an example of reference vectors P whose feature amounts are given as integer ratios, and FIG. 6 is a diagram showing an example of the reference vectors BV obtained by normalizing the reference vectors of FIG. 5 to norm = 1. As with the document vectors, the figures show the feature amount of each dimension of each reference vector.

[0032] In FIG. 6, for example, the reference vector BV1 has a feature amount of 0 for each of Dim. 1 to 3 and 0.408 for each of Dim. 4 and 5.
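The step from an integer-ratio vector P to a normalized reference vector BV can be illustrated with a short sketch. The layout of P1 below is an assumption, since the figures are not reproduced here: a 0/1 vector with zeros in Dim. 1 to 3 and six dimensions set to 1 is one layout consistent with the stated value, because 1/sqrt(6) is approximately 0.408. The helper name `normalize` is hypothetical.

```python
import math


def normalize(vector):
    """Scale a vector so that its Euclidean norm is 1."""
    norm = math.sqrt(sum(v * v for v in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [v / norm for v in vector]


# Assumed layout of P1: zeros in Dim. 1-3, ones in six further dimensions.
p1 = [0, 0, 0, 1, 1, 1, 1, 1, 1]
bv1 = normalize(p1)
print(round(bv1[0], 3), round(bv1[3], 3))  # 0.0 0.408
```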

[0033] In the document search processing apparatus according to this embodiment, the reference similarity is generated on the basis of the cosine-measure similarity between the document vector and the reference vector described above.

[0034] FIG. 7 is a diagram showing an example of a method of calculating the reference similarity. As shown in the figure, the document vector X is an n-dimensional vector with values x1 to xn in its dimensions, and likewise the reference vector P is an n-dimensional vector with values p1 to pn in its dimensions.

[0035] Here, the cosine-measure similarity is denoted SD(X, P), and the reference similarity is denoted S(X, P).

[0036] The cosine-measure similarity SD(X, P) is the inner product of the two vectors divided by the product of their norms. In this embodiment, where both vectors are normalized to norm = 1, SD(X, P) is equal to the inner product itself, so it can be obtained as the sum of the products of the same-dimension values of the two vectors. The reference similarity is obtained by comparing this sum with a threshold α and converting it to one of the two values 1 or 0. Simplifying the value in this way improves the efficiency of the determination based on the reference similarity and further increases the processing speed. In this embodiment, the threshold α is set to 0.302.
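The calculation in [0036] can be sketched as below. The function names are hypothetical, and the text does not say whether a value exactly equal to α maps to 1 or 0; this sketch assumes "greater than or equal to". Both input vectors are assumed to be already normalized to norm 1, so the cosine measure is just the inner product.

```python
ALPHA = 0.302  # threshold value given for this embodiment


def cosine_similarity_normalized(x, p):
    """SD(X, P) for unit-norm vectors: the sum of products of same-dimension values."""
    return sum(xi * pi for xi, pi in zip(x, p))


def reference_similarity(x, bv, alpha=ALPHA):
    """S(X, P): binarize SD(X, BV) against the threshold (>= is an assumption)."""
    return 1 if cosine_similarity_normalized(x, bv) >= alpha else 0


# Illustrative unit-norm vectors, not taken from the figures.
document_vector = [0.6, 0.8]
print(reference_similarity(document_vector, [1.0, 0.0]))
```

Storing only a 0/1 flag per reference vector keeps the later filtering step to simple integer checks.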

[0037] By the above means, a reference similarity is generated for each reference vector, document by document.

[0038] FIG. 8 is a diagram showing an example of the reference similarities for each document. In the figure, the reference similarity based on reference vector 1 (BV1), obtained from SD(X, BV1), is written as BSV1. For document ID = 6947, the values of BSV1 and BSV2 are 1 and the values of BSV3 to BSV5 are 0.

[0039] Next, the expansion of the document vector of the search sentence into a subset of the reference vectors will be described.

[0040] For an input sentence used as a search query, a document vector of the input sentence is generated by the same means as in document vector generation. The generated document vector of the query is then expanded into a subset of the reference vectors.

[0041] FIG. 9 is a diagram showing the method for expanding the document vector of the query into a subset of the reference vectors.

[0042] A vector Y' is calculated by converting the feature amount of each dimension of the feature vector Y of the input sentence to 0 or 1 using a threshold β. In this embodiment, the threshold β is set to 0.302, the same value as the threshold for the reference similarity. Because this method converts the values to 0/1, Y' can be expanded into a subset of the reference vectors P, which are prepared as 0/1 integer ratios, by adding and subtracting vectors.
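One way to picture the expansion in [0042] is the sketch below. The text only says that the expansion is possible by adding and subtracting vectors, so the greedy subtraction used here is an assumption, as are the function names, the comparison "greater than or equal to β", and the six-dimensional reference vectors, which are invented for the example.

```python
BETA = 0.302  # same value as the reference similarity threshold


def binarize(y, beta=BETA):
    """Convert each feature amount of Y to 0 or 1 (>= is an assumption)."""
    return [1 if value >= beta else 0 for value in y]


def expand_into_reference_subset(y_bin, reference_vectors):
    """Greedily subtract every 0/1 reference vector P_i that still fits inside Y'.

    Returns the chosen indices and whatever part of Y' was left uncovered.
    """
    remaining = list(y_bin)
    chosen = []
    for index, p in reference_vectors.items():
        if any(p) and all(r >= v for r, v in zip(remaining, p)):
            remaining = [r - v for r, v in zip(remaining, p)]
            chosen.append(index)
    return chosen, remaining


# Hypothetical 0/1 reference vectors P1..P4 with disjoint dimensions.
reference_vectors = {
    1: [1, 1, 0, 0, 0, 0],
    2: [0, 0, 1, 0, 0, 0],
    3: [0, 0, 0, 1, 1, 0],
    4: [0, 0, 0, 0, 0, 1],
}
y = [0.05, 0.10, 0.00, 0.50, 0.60, 0.61]  # illustrative query vector
chosen, leftover = expand_into_reference_subset(binarize(y), reference_vectors)
print(chosen, leftover)  # [3, 4] [0, 0, 0, 0, 0, 0]
```

With these invented vectors the query expands as Y' = P3 + P4. A nonzero leftover would mean the prepared reference vectors cannot represent the query exactly, a case the text does not discuss.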

[0043] Next, the process of determining the search targets by the reference similarity will be described.

[0044] When Y' is expanded as Y' = Pi + Pj + Pk, the target documents can be narrowed down by the reference similarities BSVi, BSVj, and BSVk corresponding to the expanded reference vectors. Suppose that a search query for which Y' = P3 + P4 is input. In that case, the search can be narrowed down to the documents whose reference similarity BSV3 or BSV4 is 1.

[0045] Checking against the example of FIG. 8, the document IDs for which BSV3 = 1 or BSV4 = 1 are the three documents 6954, 6955, and 6959. These three documents are selected as the target documents, and the similarity between the feature vector Y of the input sentence and each of their document vectors is actually calculated. Conversely, no similarity is calculated for documents other than these three, so a high-speed search is possible.
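The narrowing step can be sketched as a simple filter over the stored 0/1 flags. The function name and table layout are hypothetical. Only the row for document 6947 follows the values stated in the text (BSV1 and BSV2 are 1, BSV3 to BSV5 are 0); the flag values of the other rows are invented so that the same three document IDs are selected.

```python
def select_candidates(reference_similarities, expanded_indices):
    """Keep documents with at least one BSV flag set among the expanded reference vectors."""
    return [
        doc_id
        for doc_id, bsv in reference_similarities.items()
        if any(bsv[i] == 1 for i in expanded_indices)
    ]


# Document ID -> {reference vector index: BSV flag}; rows other than 6947 are illustrative.
bsv_table = {
    6947: {1: 1, 2: 1, 3: 0, 4: 0, 5: 0},
    6954: {1: 0, 2: 0, 3: 1, 4: 0, 5: 0},
    6955: {1: 0, 2: 0, 3: 0, 4: 1, 5: 1},
    6959: {1: 0, 2: 1, 3: 1, 4: 1, 5: 0},
}
print(select_candidates(bsv_table, [3, 4]))  # [6954, 6955, 6959]
```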

[0046] The matters described above are explained below using the flowcharts of FIGS. 10 to 14.

[0047] FIG. 10 is a flowchart showing the processing procedure of the CPU 101 in the document search processing apparatus according to this embodiment. In the figure, first, in step S1001, system initialization is performed, that is, processing such as initializing various parameters and displaying the initial screen. Next, in step S1002, the CPU 101 waits for key input, that is, it waits for some key of the input device 105 to be pressed and an interrupt to occur. When a key is input, the CPU 101 identifies the input key in step S1003 and, in step S1004, branches to one of various processes according to the type of key. The multiple branch-destination processes corresponding to the various keys are collectively represented in step S1004 as "various corresponding processes". The document registration processing and search processing described with reference to FIGS. 11 and 12 are some of these branch destinations.

[0048] Next, the process proceeds to step S1005, where display processing is performed to display the parts changed as a result of the above processing. This display processing is the commonly performed processing of expanding the display content into a display pattern and outputting it to a buffer. After the display processing in step S1005 is completed, the process returns to step S1002.

[0049] FIG. 11 is a flowchart showing the detailed flow of document registration processing, which is part of step S1004 in FIG. 10. In the figure, first, in step S1101, word extraction processing, which extracts words from the document, is performed; that is, morphological analysis is carried out using a morphological analysis dictionary. Next, in step S1102, document vector generation processing is performed. That is, the dictionary DIC 107b is searched for the words extracted in step S1101, the feature amount of each dimension is obtained for each word, and the document vector is generated from their sums.

[0050] Next, in step S1103, reference similarity generation processing is performed. That is, the reference similarities are calculated from the document vector obtained in step S1102 and the reference vectors BV. An example of this calculation method is shown in FIG. 7.

[0051] Next, in step S1104, document DB registration processing, which registers the document in the document DB 107a, is performed. That is, the content of the document, the document vector obtained in step S1102, and the reference similarities obtained in step S1103 are registered, and the index of the document DB 107a is updated. After the document DB registration processing in step S1104 is completed, the process returns.
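The registration flow of FIG. 11 (steps S1101 to S1104) can be sketched end to end as follows. Everything about the data layout is an assumption: whitespace splitting stands in for morphological analysis, a plain `dict` stands in for the document DB 107a and its index, and the function name, the toy dictionary, and the toy reference vectors are invented for illustration.

```python
import math


def register_document(doc_id, text, dictionary, reference_vectors, database, alpha=0.302):
    """Register one document together with its document vector and BSV flags."""
    # S1101: word extraction (whitespace split is a stand-in for morphological analysis).
    words = text.split()

    # S1102: sum per-word feature amounts per dimension, then normalize to norm 1.
    n_dims = len(next(iter(reference_vectors.values())))
    totals = [0.0] * n_dims
    for word in words:
        for dim, value in enumerate(dictionary.get(word, [0.0] * n_dims)):
            totals[dim] += value
    norm = math.sqrt(sum(v * v for v in totals)) or 1.0
    doc_vector = [v / norm for v in totals]

    # S1103: one binarized reference similarity per (unit-norm) reference vector.
    bsv = {}
    for index, bv in reference_vectors.items():
        inner_product = sum(x * b for x, b in zip(doc_vector, bv))
        bsv[index] = 1 if inner_product >= alpha else 0

    # S1104: store content, vector, and reference similarities in the database.
    database[doc_id] = {"text": text, "vector": doc_vector, "bsv": bsv}
    return database[doc_id]


toy_dictionary = {"printer": [3, 0], "camera": [0, 4]}
toy_reference_vectors = {1: [1.0, 0.0], 2: [0.0, 1.0]}
document_db = {}
record = register_document(7, "printer printer camera", toy_dictionary,
                           toy_reference_vectors, document_db)
print(record["bsv"])
```

Doing this work at registration time is what claims 2, 4, and 6 rely on: at search time the vectors and flags only need to be read back.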

[0052] FIG. 12 is a flowchart showing the detailed flow of document search execution processing, which is part of step S1004 in FIG. 10. In the figure, first, in step S1201, search condition input processing is performed, in which the user gives an instruction by entering a natural-language sentence or multiple keywords, or by designating an existing document. Next, in step S1202, search condition information generation processing is performed, that is, processing to obtain the document vector of the search condition sentence, which is needed to generate similarities, and the subset of reference vectors, which is needed to narrow down the search targets.

[0053] Next, in step S1203, similarity generation and storage processing is performed. That is, the target documents are narrowed down on the basis of the reference similarities corresponding to the subset of reference vectors obtained in step S1202, similarities are generated from the document vector also obtained in step S1202 and the document vectors of the target documents, and the similarities are stored in the RAM 104. The generated values can also be registered in the document DB 107a of the storage device 107.

[0054] Next, in step S1204, ordering by similarity is performed, that is, processing that orders the documents by the per-document similarities stored in step S1203. Next, in step S1205, search result display processing is performed, that is, the documents ordered in step S1204 are listed as search results and displayed on the display device 106. At that time, the similarity values registered in step S1203 are displayed as well. After the search result display processing in step S1205 is completed, the process returns.
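Steps S1204 and S1205 amount to sorting the stored similarities and listing ID, title, and similarity, as in the result panel of FIG. 2. The sketch below assumes descending order (most similar first), which the text does not state explicitly; the function name, titles, and scores are invented for illustration.

```python
def rank_results(similarities, titles):
    """Order documents by similarity, highest first, as (ID, title, similarity) rows."""
    ordered = sorted(similarities.items(), key=lambda item: item[1], reverse=True)
    return [(doc_id, titles[doc_id], score) for doc_id, score in ordered]


# Illustrative values only.
scores = {6954: 0.41, 6955: 0.77, 6959: 0.0}
titles = {6954: "Title A", 6955: "Title B", 6959: "Title C"}
for doc_id, title, score in rank_results(scores, titles):
    print(doc_id, title, f"{score:.3f}")
```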

[0055] FIG. 13 is a flowchart showing the detailed flow of the search condition information generation processing in step S1202 of FIG. 12. In the figure, first, in step S1301, the user's search conditions obtained in step S1201 of FIG. 12 are read. Next, in step S1302, words are extracted from the user-specified search condition sentence read in step S1301, that is, morphological analysis is performed using the morphological analysis dictionary.

[0056] Next, in step S1303, document vector generation processing for the search sentence is performed. That is, the dictionary DIC 107b of the storage device 107 is searched for the words extracted in step S1302, the feature amount of each dimension is obtained for each word, and the document vector is generated from their sums.

[0057] The processing in steps S1302 and S1303 is equivalent to the processing in steps S1101 and S1102 of FIG. 11.

[0058] Next, in step S1304, processing to generate the arrangement vector of the search sentence is performed. Next, in step S1305, reference vector set expansion processing is performed, that is, the arrangement vector obtained in step S1304 is expanded into reference vectors. After the reference vector set expansion processing in step S1305 is completed, the process returns.

[0059] FIG. 14 is a flowchart showing the detailed flow of the similarity generation and storage processing in step S1203 of FIG. 12. In the figure, first, in step S1401, a counter N that designates a document in the document DB 107a to be searched is set to an initial value of 1. Next, in step S1402, the document vector and reference similarities of the N-th document are read from the document DB 107a.

[0060] Next, in step S1403, only the reference similarities corresponding to the reference vector set expanded in step S1305 are extracted from the reference similarities read in step S1402. Then, in step S1404, it is determined from the extracted reference similarities whether the document is a target of similarity calculation.

[0061] In the example of the reference similarity shown in FIG. 8, this determination is made according to whether the sum of the reference similarities extracted in step S1403 is 0.
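The determination in steps S1403 and S1404 can be sketched as follows, assuming each document stores its reference similarities as a mapping from a reference vector identifier to a value of 0 or 1. The identifiers and the function name `is_similarity_target` are hypothetical.

```python
def is_similarity_target(doc_ref_sims, expanded_refs):
    """Return True when the document should go on to full similarity calculation."""
    # S1403: keep only the reference similarities for the expanded reference vectors.
    selected = [doc_ref_sims.get(ref_id, 0) for ref_id in expanded_refs]
    # S1404: with the FIG. 8 style criterion, a sum of 0 means "not a target".
    return sum(selected) != 0

# Hypothetical stored reference similarities for one document.
doc_ref_sims = {"B1": 0, "B2": 1, "B3": 0}
```

With these values, a search sentence expanded to `B2` makes the document a target, while one expanded only to `B1` and `B3` does not.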

[0062] If it is determined in step S1404 that the document is a target of similarity calculation, the similarity is calculated in step S1405 from the document vector of the search sentence and the document vector of the document that has been read. If it is determined in step S1404 that the document is not a target of similarity calculation, the similarity calculation performed in step S1405 is skipped and, in step S1406, the similarity is set to a fixed value of 0.

[0063] After the processing in step S1405 or step S1406 ends, similarity storage processing is performed in step S1407. That is, the similarity calculated in step S1405, or the similarity (= 0) set in step S1406, is stored in the document DB 107a or the RAM 104 so that it can be referenced in step S1204 of FIG. 12.

[0064] Next, in step S1408, it is determined whether any documents to be searched remain in the document DB 107a. If documents remain, the counter N is incremented in step S1409, the process returns to step S1402, and the calculation of the search condition matching value is repeated. If no documents remain, the process returns.
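The overall loop of FIG. 14 might look like the sketch below. It assumes sparse vectors stored as dictionaries and uses a plain inner product as the similarity; the record layout (`vector`, `ref_sims`) and the function names are hypothetical, and the actual similarity measure may differ.

```python
def inner_product(u, v):
    """Inner product of two sparse vectors stored as {dimension: value}."""
    return sum(value * v.get(dim, 0.0) for dim, value in u.items())

def generate_similarities(query_vec, expanded_refs, doc_db):
    """Return one similarity per document, skipping the costly step for rejected documents."""
    similarities = []
    for doc in doc_db:  # S1401, S1408, S1409: visit every document in turn
        # S1402 and S1403: read the stored reference similarities and keep the relevant ones.
        selected = [doc["ref_sims"].get(ref_id, 0) for ref_id in expanded_refs]
        if sum(selected) != 0:  # S1404: target of similarity calculation
            sim = inner_product(query_vec, doc["vector"])  # S1405
        else:
            sim = 0.0  # S1406: fixed value, no vector arithmetic
        similarities.append(sim)  # S1407: store for later ranking
    return similarities

# Hypothetical two-document database.
doc_db = [
    {"vector": {"optics": 1.0}, "ref_sims": {"B1": 1, "B2": 0}},
    {"vector": {"electronics": 2.0}, "ref_sims": {"B1": 0, "B2": 1}},
]
query_vec = {"optics": 0.5, "electronics": 0.5}
result = generate_similarities(query_vec, ["B1"], doc_db)
```

In this example the second document would score 1.0 under the inner product, but it is assigned 0 without any vector arithmetic because its reference similarity for `B1` is 0. This illustrates that the narrowing trades some recall for speed.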

[0065] As described in detail above, the document search processing method and apparatus according to the present embodiment prepare a plurality of reference vectors having specific values and generate reference similarities between the reference vectors and the document vectors of the documents to be searched. The search query sentence is expanded into a subset of the prepared reference vectors, the documents to be searched are narrowed down by the reference similarities corresponding to the expanded reference vectors, and the similarity between each remaining document vector and the document vector of the search sentence is then calculated. This greatly reduces the load of document similarity determination by the document similarity generation means, which has a high processing cost, and therefore enables high-speed document search with improved processing speed.

[0066] (Other Embodiments) In the first embodiment described above, the reference similarity is calculated so as to take one of two values (0 or 1) using a single threshold α. The present invention is not limited to this; a plurality of thresholds may be provided so that the reference similarity takes values at multiple levels. The same applies to the threshold β used in arranging the reference vectors. In that case, a threshold may also be provided as the criterion for judging the reference similarity. In addition, when a technique that obtains the number of search targets in advance is adopted, the criterion may be changed dynamically according to that number.
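A multi-level reference similarity can be sketched as below, assuming a raw similarity is mapped to the number of thresholds it reaches. The threshold values and the function name are hypothetical, and treating a value equal to a threshold as having reached it is an assumption.

```python
from bisect import bisect_right

def reference_similarity_level(raw_similarity, thresholds):
    """Return 0 below the lowest threshold, up to len(thresholds) at or above the highest."""
    return bisect_right(sorted(thresholds), raw_similarity)

# A single threshold reproduces the two-level (0 or 1) case of the first embodiment.
two_level = [reference_similarity_level(s, [0.5]) for s in (0.2, 0.5, 0.9)]
# Two thresholds give three levels.
three_level = [reference_similarity_level(s, [0.3, 0.6]) for s in (0.2, 0.5, 0.9)]
```

With graded levels, the sum-is-zero test would no longer be the only natural criterion, which is presumably why a separate judgment threshold is mentioned.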

[0067] In the first embodiment described above, the thresholds (α and β) used for calculating the reference similarity and for arranging the reference vectors are both fixed values. The present invention is not limited to this; the thresholds may instead be changed dynamically based on a statistically processed value, such as an average, derived from the feature values of the vectors in the document DB at that time and of the reference vectors of the search sentence.
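As a rough illustration of a dynamically chosen threshold, the sketch below takes the mean of the raw similarities currently observed and uses it as α. Which values are averaged, and the use of the mean rather than another statistic, are assumptions; the names `dynamic_threshold` and `binarize` are hypothetical.

```python
from statistics import mean

def dynamic_threshold(raw_similarities):
    """Derive a threshold from the current data; here simply the arithmetic mean."""
    return mean(raw_similarities)

def binarize(raw_similarities, threshold):
    """Two-level reference similarity: 1 at or above the threshold, otherwise 0."""
    return [1 if s >= threshold else 0 for s in raw_similarities]

# Hypothetical raw similarities between one reference vector and four documents.
raw = [0.125, 0.25, 0.75, 0.875]
alpha = dynamic_threshold(raw)
levels = binarize(raw, alpha)
```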

[0068] The present invention is applicable not only to a single apparatus but also to a system composed of a plurality of apparatuses. Needless to say, it can also be realized by supplying software to such an apparatus or system using a storage medium or a communication device.

[0069] In this case, the system or apparatus can enjoy the effects of the present invention either by reading a storage medium that stores a control program, expressed as software for achieving the present invention, into the system or apparatus, or by loading the control program into the system or apparatus via a network.

[0070] Needless to say, the storage medium for storing the control program may be a hard disk, a floppy (registered trademark) disk, an optical disk, a magneto-optical disk, a CD-R, a DVD, a magnetic tape, a non-volatile memory card, a CD-ROM, or the like.

[0071]

[Effects of the Invention] As described in detail above, the document search processing method and apparatus of the present invention improve the processing speed of document search.

[0072] Further, the storage medium of the present invention makes it possible to control the document search processing apparatus of the present invention described above smoothly.

[Brief Description of the Drawings]

FIG. 1 is a block diagram showing the system configuration of a document search processing apparatus according to a first embodiment of the present invention.

FIG. 2 is a diagram showing an example of the display screen configuration for document search results in the document search processing apparatus according to the first embodiment of the present invention.

FIG. 3 is a diagram showing an example of the configuration of the dictionary DIC in the document search processing apparatus according to the first embodiment of the present invention.

FIG. 4 is a diagram showing an example of the state of the document vector of a document in the document search processing apparatus according to the first embodiment of the present invention.

FIG. 5 is a diagram showing an example of the state of a reference vector P in the document search processing apparatus according to the first embodiment of the present invention.

FIG. 6 is a diagram showing an example of the state of a reference vector BV in the document search processing apparatus according to the first embodiment of the present invention.

FIG. 7 is a diagram showing an example of a method of calculating the reference similarity in the document search processing apparatus according to the first embodiment of the present invention.

FIG. 8 is a diagram showing an example of the state of the reference similarity BSV in the document search processing apparatus according to the first embodiment of the present invention.

FIG. 9 is a diagram showing a method of expanding a search sentence in the document search processing apparatus according to the first embodiment of the present invention.

FIG. 10 is a flowchart showing the overall flow of the document search processing operation in the document search processing apparatus according to the first embodiment of the present invention.

FIG. 11 is a flowchart showing the flow of the processing operation for registering a document in the document DB in the document search processing apparatus according to the first embodiment of the present invention.

FIG. 12 is a flowchart showing the detailed flow of the document search execution processing operation in the document search processing apparatus according to the first embodiment of the present invention.

FIG. 13 is a flowchart showing the detailed flow of the search condition information generation processing operation in the document search processing apparatus according to the first embodiment of the present invention.

FIG. 14 is a flowchart showing the detailed flow of the similarity generation and storage processing operation in the document search processing apparatus according to the first embodiment of the present invention.

[Explanation of symbols]

101 CPU (microprocessor: central processing unit)
102 Bus (BUS)
103 ROM (read-only memory)
104 RAM (random access memory)
105 Input device
106 Display device
107 Storage device
107a Document file database (document DB)
107b Dictionary DIC
108 External storage device
109 Communication device

Claims

[Claims]

1. A document search processing method comprising: a document vector generation step of generating a document vector characterizing a document; a document similarity generation step of generating a document similarity between documents from the document vectors; a reference similarity generation step of generating a reference similarity between the document vectors and a plurality of reference vectors that exist independently of the documents; and a document search step of searching for similar documents based on the document similarity generated in the document similarity generation step, using the document vectors generated in the document vector generation step, and of also performing the search using the reference similarity generated in the reference similarity generation step.

2. The document search processing method according to claim 1, wherein the document vector and the reference similarity of each document to be searched are generated and held before the search.

3. A document search processing apparatus comprising: document vector generation means for generating a document vector characterizing a document; document similarity generation means for generating a document similarity between documents from the document vectors; reference similarity generation means for generating a reference similarity between the document vectors and a plurality of reference vectors that exist independently of the documents; and document search means for searching for similar documents based on the document similarity generated by the document similarity generation means, using the document vectors generated by the document vector generation means, and for also performing the search using the reference similarity generated by the reference similarity generation means.

4. The document search processing apparatus according to claim 3, wherein the document vector and the reference similarity of each document to be searched are generated and held before the search.

5. A storage medium storing a computer-readable control program for controlling a document search processing apparatus that searches document data, the control program comprising: a document vector generation module that generates a document vector characterizing a document; a document similarity generation module that generates a document similarity between documents from the document vectors; a reference similarity generation module that generates a reference similarity between the document vectors and a plurality of reference vectors that exist independently of the documents; and a document search module that searches for similar documents based on the document similarity generated by the document similarity generation module, using the document vectors generated by the document vector generation module, and that also performs the search using the reference similarity generated by the reference similarity generation module.

6. The storage medium according to claim 5, wherein the document vector and the reference similarity of each document to be searched are generated and held before the search.