JP2002251412A

JP2002251412A - Document retrieving device, method, and storage medium

Info

Publication number: JP2002251412A
Application number: JP2001047027A
Authority: JP
Inventors: Masanobu Funakoshi; 正伸船越
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2001-02-22
Filing date: 2001-02-22
Publication date: 2002-09-06

Abstract

PROBLEM TO BE SOLVED: To reduce processing time in document retrieval that uses multi-dimensional vector matching, even when a large amount of document data is searched. SOLUTION: Document retrieval is performed using a document database that stores pairs, each consisting of document data and a semantic vector that represents its content as a multi-dimensional vector. The document data stored in the database is classified into groups of similar semantic vectors, and the groups are stored in a group table. When a query written as a natural language sentence is entered, the query is linguistically analyzed and its semantic vector is generated (S21-S23). A group is selected from the group table based on the result of the linguistic analysis (S24, S25). For the document data belonging to the selected group, the stored semantic vectors are matched against the semantic vector of the query to retrieve documents, and the result is output (S26-S28).

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001] [Technical Field of the Invention] The present invention relates to a document retrieval apparatus and method based on linguistic analysis, and to a medium storing the same.

[0002] [Related Art] With the explosive spread of the Internet, the number of personal computers (PCs) and personal digital assistants (PDAs) that serve as its entry points has also grown, and communication through electronic documents such as e-mail is becoming part of daily life. Under these circumstances, the number of electronic documents continues to increase. Techniques for quickly finding a needed document in a huge body of document data are therefore regarded as increasingly important, and document search systems with easier-to-use interfaces have appeared.

[0003] One such document search system uses semantic concepts. In its document database, each piece of document data is stored in one-to-one correspondence with a multidimensional vector representing the semantic concept of its content. When the user enters a simple question in everyday natural language, the question is linguistically analyzed and converted into a multidimensional vector representing its semantic concept, and the search is performed by matching this vector against the multidimensional vectors in the document database. With this method, the user can search for the desired document through a natural-language interface.

[0004] [Problems to Be Solved by the Invention] However, multidimensional vector matching requires more computation than keyword matching and similar methods, so when the number of documents to be searched becomes enormous, the search takes a great deal of time.

[0005] In addition, a search based on multidimensional vector matching sometimes returns completely unrelated documents among its results, so the accuracy of the search results does not improve.

[0006] The present invention has been made in view of the above problems, and its object is to keep processing time short in document search processing that uses multidimensional vector matching, even when an enormous amount of document data is to be searched.

[0007] [Means for Solving the Problems] A document search apparatus according to the present invention that achieves the above object has the following configuration. That is, it comprises: storage means for storing document data paired with a semantic vector that represents its content as a multidimensional vector; management means for classifying the document data stored in the storage means into groups based on the semantic vectors and managing the groups; generation means for linguistically analyzing a natural language sentence input as a search condition and generating a semantic vector of the search condition sentence; selection means for selecting, based on the result of the linguistic analysis, a group managed by the management means; and search means for searching for a document by matching, for the document data belonging to the group selected by the selection means, the semantic vectors stored in the storage means against the semantic vector of the search condition sentence.

[0008] A document search method according to the present invention that achieves the above object is a document search method using storage means that stores document data paired with a semantic vector representing its content as a multidimensional vector. The method comprises: a management step of classifying the document data stored in the storage means into groups based on the semantic vectors and managing the groups; a generation step of linguistically analyzing a natural language sentence input as a search condition and generating a semantic vector of the search condition sentence; a selection step of selecting, based on the result of the linguistic analysis, a group managed in the management step; and a search step of searching for a document by matching, for the document data belonging to the group selected in the selection step, the semantic vectors stored in the storage means against the semantic vector of the search condition sentence.

[0009] [Embodiments of the Invention] Preferred embodiments of the present invention will be described below with reference to the accompanying drawings.

[0010] In this embodiment, when document data is registered, a multidimensional vector representing the semantic content of the document is created, and this vector is used to classify the documents automatically by semantic concept in advance and to manage them accordingly. When a word matching one of these categories appears in the user query at search time, the search target is first narrowed to that category, and multidimensional vector matching is then performed. Narrowing the search target raises the accuracy of the search results and at the same time reduces the amount of computation at search time, so that the document search is both accurate and fast. The embodiment is described in detail below.
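The narrowing strategy described above might be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the function names `cosine` and `search`, the toy three-dimensional vectors, the rule of matching query words against group names by exact string equality, and the fallback to a full scan when no group matches are all hypothetical.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def search(query_words, query_vec, groups, doc_vectors, count=10):
    """Narrow to a group named in the query, then rank by vector similarity.

    groups: {group_name: [doc_id, ...]}   (hypothetical layout)
    doc_vectors: {doc_id: vector}
    """
    candidates = None
    for word in query_words:
        if word in groups:           # the query mentions a group name
            candidates = groups[word]
            break
    if candidates is None:           # assumed fallback: scan every document
        candidates = list(doc_vectors)
    scored = [(cosine(query_vec, doc_vectors[d]), d) for d in candidates]
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [d for _, d in scored[:count]]

groups = {"printer": [1, 2], "camera": [3]}
doc_vectors = {1: [1.0, 0.0, 0.0], 2: [0.6, 0.8, 0.0], 3: [0.9, 0.1, 0.0]}
hits = search(["new", "printer"], [1.0, 0.0, 0.0], groups, doc_vectors)
```

In this toy data, document 3 is close to the query vector but lies outside the selected group, so it is never scored. That is the source of both the reduced computation and the filtering of unrelated results that the paragraph describes.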

[0011] FIG. 1 is a block diagram showing the configuration of a computer apparatus 100 that executes the document search processing of this embodiment. In FIG. 1, the CPU 101 is a microprocessor that performs the calculations, logical decisions, and the like required for document search processing and controls the components connected via the PCI bus 102. The PCI bus 102 carries address signals that designate the components to be controlled by the CPU 101, carries the control signals for those components, and transfers data between the components.

[0012] The ROM 103 is a read-only fixed memory and stores the basic I/O program of this embodiment. The RAM 104 is a writable random access memory. It temporarily stores various data from each component and holds the program describing the various processes of this embodiment; the CPU 101 performs the processes according to this program.

[0013] The DVDD 105 is a DVD drive. Programs and data recorded on a DVD medium (DVD-MEDIA) 106 are loaded into the system through this DVD drive 105. Various data stored on the DISK 107 can also be written to the DVD-MEDIA 106 through the DVD drive 105. DVD-MEDIA 106 is a generic term for media of the DVD standards, specifically DVD-ROM, DVD-RAM, DVD-R, DVD-RW, DVD-VIDEO, DVD-AUDIO, and the like. In this embodiment, the DVD-MEDIA 106 is used for reading and writing large volumes of data such as document data and its related data, or programs.

[0014] The INPUTC 108 is an input controller. Input signals sent from the keyboard (KB) 109 and the pointing device (PD) 110 are converted into appropriate signals by this controller and then transmitted to the CPU 101 via the PCI bus 102.

[0015] The KB 109 has character and symbol input keys such as alphabet keys, hiragana keys, and katakana keys, as well as various function keys such as cursor movement keys for instructing cursor movement. The PD 110 is a pointing device such as a mouse or trackball and is used to point at cursors, buttons, and the like on the display screen.

[0016] The DISK 107 is an external memory for storing data, programs, and the like. Data and programs are saved as needed, and the saved data and programs are called up when required in response to keyboard instructions. The document database of this embodiment is implemented mainly on the DISK 107.

[0017] The VIDEO 111 is a video controller. Display data sent via the PCI bus 102 is stored here, converted into display signals, and output to the display device (DISP) 112. The DISP 112 uses a cathode ray tube, a liquid crystal panel, or the like, and displays the results of various processes, the state of the apparatus, messages to the user, and so on.

[0018] The DEVC 113 is a device controller. In accordance with instructions from the CPU 101 transmitted via the PCI bus 102, it controls the devices connected to it, and passes signals and data output by those devices to the CPU 101 and the DISK 107 via the PCI bus 102 as appropriate. The SCAN 114 is a scanner. In accordance with instructions from the DEVC 113, it optically scans the original placed on it, reads the original image, and outputs the image to the DEVC 113.

[0019] The NI 115 is a network interface, a device for connecting the document search system of this embodiment to external systems via a LAN, the Internet 116, or the like. Through this connection, the document search system of this embodiment can exchange signals and data with external systems.

[0020] The MIX 117 is a mixer. When audio output data is sent to it via the PCI bus 102, the MIX 117 mixes the signals, converts them into audio output signals, and outputs them to the speaker (SPK) 118. The SPK 118 outputs processing results, the state of the apparatus, messages to the user, music, and the like as sound.

[0021] In the computer apparatus 100 of this embodiment, made up of the above components, various processes are executed in response to inputs from the keyboard 109 and the pointing device 110. That is, when an input signal is supplied from the keyboard 109 or the pointing device 110, an interrupt signal is sent to the CPU 101 via the INPUTC 108, the CPU 101 reads out the control signals stored in the ROM 103, and various controls are performed according to those control signals.

[0022] In this embodiment, the computer operates as a document search apparatus when the CPU 101 executes the basic I/O program, the OS, and the document search processing program. The basic I/O program is written in the ROM 103, and the OS is written on the DISK 107. When the system is powered on, the IPL (initial program loading) function in the basic I/O program loads the OS from the DISK 107 into the RAM 104, and the CPU 101 starts running the OS. The document search processing program is program code that causes the CPU 101 to carry out the document search processing procedures shown in the flowcharts of FIGS. 9 to 15.

[0023] FIG. 2 shows the layout of the contents of the DVD-MEDIA 106 when the document search processing program and its related data are recorded on it. In this embodiment, the document search processing program and related data are recorded on the DVD-MEDIA 106. As illustrated, volume information 201 and directory information 202 of the DVD-MEDIA are recorded in the leading area of the DVD-MEDIA 106, followed by its contents: the document search processing program (executable file) 203 of this embodiment and the document search processing related data 204.

[0024] FIG. 3 is a schematic diagram of the computer apparatus 100 and the DVD-MEDIA 106 on which the document search processing program is recorded. The document search processing program 203 and related data 204 recorded on the DVD-MEDIA 106 can be loaded into the system through the DVD-MEDIA drive (DVDD) 105, as shown in FIG. 3. That is, when the DVD-MEDIA 106 is set in the DVDD 105, the document search processing program and related data are read from the DVD-MEDIA 106 under the control of the OS and the basic I/O program, loaded into the RAM 104, and become operable.

[0025] FIG. 4 shows the memory map in the state where the document search processing program has been loaded into the RAM 104 and is executable. The map holds the basic I/O program 411 and the OS 412, which are loaded from the ROM 103 and the DISK 107 when the computer apparatus 100 starts up. The document search processing program 203 and its related data 204 loaded from the DVD-MEDIA 106 are stored as the document search processing program 413 and the related data 414, respectively. The work area 415 contains the semantic concept dictionary MDIC 421, the search result buffer RBUF 422, the result output count COUNT 424, and the group table GTBL 423.

[0026] FIG. 5 illustrates an example configuration of the semantic concept dictionary MDIC in this embodiment.

[0027] As illustrated, the semantic concept dictionary (MDIC) 421 of this embodiment consists of a list of word IDs 501, word notations 502, and multidimensional vectors 503 that are their semantic representations. In FIG. 5, each row of the table corresponds to the data for one word. The word ID 501 is a number uniquely assigned to each word so that the document search system of this embodiment can identify and manage it. The word notation 502 is a character string giving the written form of the word. The semantic vector 503 expresses the semantic concept of the word as a multidimensional vector and is determined in advance.

[0028] The dimensions of the semantic vector in this embodiment are set by extracting, as appropriate, words that correspond to superordinate concepts, with reference to a noun thesaurus or the like. The semantic vector of each word is created by linguistically analyzing its definition sentence in a Japanese dictionary or the like and reducing it to a combination of the concepts selected as the dimensions. Such techniques are common and well known in semantic processing and are not described in detail here.

[0029] With the above configuration, the semantic concept dictionary (MDIC) 421 can be searched using either the word ID 501 or the word notation 502 as a key, and the stored information can be referenced. In this embodiment, the MDIC 421 is searched and referenced as needed in the processes described later.
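By way of illustration only, a dictionary that can be looked up by either word ID or word notation could be organized as below. The class and field names (`WordEntry`, `SemanticDictionary`, `by_id`, `by_notation`) and the vector values are hypothetical; the source specifies only the three fields 501 to 503 and the two lookup keys.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

@dataclass(frozen=True)
class WordEntry:
    word_id: int                 # word ID 501
    notation: str                # word notation 502
    vector: Tuple[float, ...]    # semantic vector 503

class SemanticDictionary:
    """Toy MDIC with two indexes so that either key finds the same entry."""

    def __init__(self, entries):
        self._by_id: Dict[int, WordEntry] = {e.word_id: e for e in entries}
        self._by_notation: Dict[str, WordEntry] = {e.notation: e for e in entries}

    def by_id(self, word_id: int) -> Optional[WordEntry]:
        return self._by_id.get(word_id)

    def by_notation(self, notation: str) -> Optional[WordEntry]:
        return self._by_notation.get(notation)

mdic = SemanticDictionary([
    WordEntry(1, "printer", (0.9, 0.1, 0.0)),   # illustrative values only
    WordEntry(2, "camera", (0.1, 0.9, 0.0)),
])
```

Keeping two indexes over the same entries costs extra memory but makes both lookups constant-time on average; a word missing from the dictionary simply yields `None`.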

[0030] FIG. 6 illustrates an example configuration of the search result buffer (RBUF) in this embodiment. The search result buffer (RBUF) 422 stores the data retrieved by the document search processing in descending order of semantic matching degree, up to the number given by the result output count COUNT 424. FIG. 6 shows an example configuration of the RBUF 422 for COUNT = 10.

[0031] As shown in FIG. 6, one search result entry consists of a rank 601, a semantic similarity 602, a semantic vector 603, a document ID 604, and a group ID 605. The rank 601 is the position of the entry, in order of decreasing semantic similarity, among the data stored in the RBUF 422. The semantic similarity 602 is the degree of matching between the semantic vector 603 associated with the document data and the semantic vector created by linguistically analyzing the query; in this embodiment it is stored as a percentage. The degree of matching between two semantic vectors is determined mainly by calculating the angle between them.
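A possible shape for one buffer entry and for filling the buffer is sketched below. The source states only that the similarity is derived mainly from the angle between the two vectors and is stored as a percentage; the specific mapping used here (100 times the cosine, clamped to the range 0 to 100) is an assumption, as are the names `ResultEntry`, `similarity_percent`, and `fill_result_buffer`.

```python
import math
from dataclasses import dataclass
from typing import Sequence

@dataclass
class ResultEntry:
    rank: int                 # rank 601
    similarity: float         # semantic similarity 602, in percent
    vector: Sequence[float]   # semantic vector 603
    doc_id: int               # document ID 604
    group_id: int             # group ID 605

def similarity_percent(a, b):
    """Assumed mapping: 100 * cos(angle), clamped to [0, 100]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        return 0.0
    return max(0.0, min(100.0, 100.0 * dot / norm))

def fill_result_buffer(query_vec, docs, count):
    """docs: iterable of (doc_id, group_id, vector). Keep the best `count`."""
    scored = sorted(
        ((similarity_percent(query_vec, v), d, g, v) for d, g, v in docs),
        key=lambda t: (-t[0], t[1]),     # highest similarity first
    )
    return [ResultEntry(i + 1, s, v, d, g)
            for i, (s, d, g, v) in enumerate(scored[:count])]

docs = [(10, 1, [1.0, 0.0]), (11, 1, [0.0, 1.0]), (12, 2, [1.0, 1.0])]
rbuf = fill_result_buffer([1.0, 0.0], docs, count=2)
```

With `count=2`, the buffer keeps only the two entries with the highest similarity and numbers them from rank 1.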

[0032] The semantic vector 603 is obtained by linguistically analyzing the text in each piece of document data, extracting its semantic information, and expressing it as a multidimensional vector. It is stored in the database in advance for every registered document. The semantic vector 603 is normally created by breaking the text into parts of speech with morphological analysis, picking out the common nouns and proper nouns, retrieving their semantic vectors from the semantic concept dictionary (MDIC) 421, and then combining them with weights based on syntactic information. Such techniques are well known in the field of natural language processing and are not described in detail here.
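The final combining step can be illustrated with a short sketch. It assumes that morphological analysis has already produced a list of nouns and that a syntax-based weight has been assigned to each; both of those stages are outside the sketch. The function name `compose_document_vector`, the weights, skipping of unknown nouns, and normalizing the result to unit length are all assumptions.

```python
import math

def compose_document_vector(weighted_nouns, dictionary):
    """weighted_nouns: [(noun, weight)], dictionary: {noun: vector}.

    Nouns missing from the dictionary are skipped (an assumption).
    The dictionary must contain at least one entry.
    """
    dims = len(next(iter(dictionary.values())))
    total = [0.0] * dims
    for noun, weight in weighted_nouns:
        vec = dictionary.get(noun)
        if vec is None:
            continue
        for i, x in enumerate(vec):
            total[i] += weight * x
    norm = math.sqrt(sum(x * x for x in total))
    return [x / norm for x in total] if norm else total

dictionary = {"printer": [1.0, 0.0], "ink": [0.0, 1.0]}
# Hypothetical weights, e.g. one noun judged more central than another.
doc_vec = compose_document_vector(
    [("printer", 3.0), ("ink", 4.0), ("xyz", 1.0)], dictionary)
```

Because matching depends on the angle between vectors, normalizing the sum does not change the ranking; it only keeps stored vectors on a common scale.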

[0033] The document ID 604 is the ID of a document stored in the document database of this embodiment, and the actual document data can be read out using this ID. The group ID 605 is the ID of the semantically similar group to which the document belongs; by using it to look up the group table (GTBL) 423, the group information for the document can be obtained. Semantically similar groups are described below with reference to FIG. 7.

[0034] FIG. 7 illustrates an example configuration of the group table (GTBL) in this embodiment. As shown in FIG. 7, the group table (GTBL) 423 consists of a group ID 701, a group name 702, a data count 703, a document ID list 704, and a group semantic vector 705. The group ID 701 is a number uniquely assigned to each group created by semantically classifying the document data registered in the document database, using a method described later. Each group identified by a group ID constitutes one semantically similar group. The group ID 701 corresponds to the group ID 605 in the RBUF 422. The group name 702 is a unique name given to each semantically similar document group; the process that determines this name is described later with reference to FIG. 15. The data count 703 is the number of pieces of document data belonging to the group.

[0035] The document ID list 704 is a list of the IDs of the document data belonging to the group. The group semantic vector 705 is the multidimensional vector whose angles with all the semantic vectors of the document data belonging to the group are smallest. The group table (GTBL) 423 is created by the semantically similar group creation process described later and is used mainly to narrow down the search target data at document search time. FIG. 7 shows an example of the group table GTBL obtained when PC-related news articles are classified automatically.
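One way to picture a group table entry is the sketch below. How the embodiment actually computes the group semantic vector 705 is not given in this passage, so the sketch substitutes a simple stand-in: the normalized sum of the members' unit vectors, which is the direction that maximizes the sum of cosines with the members. That substitution, and the names `GroupEntry`, `build_group`, and `unit`, are assumptions.

```python
import math
from dataclasses import dataclass, field
from typing import List

def unit(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else list(v)

@dataclass
class GroupEntry:
    group_id: int                                       # group ID 701
    name: str                                           # group name 702
    doc_ids: List[int] = field(default_factory=list)    # document ID list 704
    vector: List[float] = field(default_factory=list)   # group semantic vector 705

    @property
    def count(self) -> int:                             # data count 703
        return len(self.doc_ids)

def build_group(group_id, name, members):
    """members: {doc_id: vector}. Group vector = normalized sum of the
    members' unit vectors (assumed stand-in, see the lead-in)."""
    units = [unit(v) for v in members.values()]
    summed = [sum(col) for col in zip(*units)]
    return GroupEntry(group_id, name, sorted(members), unit(summed))

g = build_group(1, "printer", {7: [2.0, 0.0], 3: [0.0, 5.0]})
```

Deriving the data count from the length of the ID list, rather than storing it separately, keeps fields 703 and 704 from drifting apart in this sketch.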

[0036] FIG. 8 illustrates an example of the structure of one data item stored in the document database of this embodiment. As shown in FIG. 8, one data item in the document database consists of a document ID 801, a document data pointer 802, keywords 803, and a semantic vector 804.

[0037] Here, the document ID 801 is the index number of this data structure itself and, at the same time, the index number of the document data itself in the database. It is the same document ID that is used as the document ID 604 in the RBUF 422 and in the document ID list 704 of the GTBL 423. The document data pointer 802 points to the actual document data, which can be accessed through it. The keywords 803 are a list of representative words that appear in the document data. The semantic vector 804 expresses the meaning of the document data as a multidimensional vector and is the same as the semantic vector 603 stored in the RBUF 422. With this data structure, searches can be performed at high speed without accessing the actual document data.
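The point about searching without touching the document body can be illustrated as follows. The record layout mirrors fields 801 to 804; representing the pointer as a file path, the path strings themselves, and the name `best_match` are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class DocumentRecord:
    doc_id: int                  # document ID 801
    data_pointer: str            # document data pointer 802 (here: a path)
    keywords: List[str]          # keywords 803
    vector: Tuple[float, ...]    # semantic vector 804

def best_match(query_vec, records: Dict[int, DocumentRecord]) -> DocumentRecord:
    """Rank on the stored vectors only; nothing is read via data_pointer."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    # Vectors are assumed to be unit length, so the dot product is the cosine.
    return max(records.values(), key=lambda r: dot(query_vec, r.vector))

records = {
    1: DocumentRecord(1, "docs/0001.txt", ["printer"], (1.0, 0.0)),   # hypothetical paths
    2: DocumentRecord(2, "docs/0002.txt", ["camera"], (0.0, 1.0)),
}
hit = best_match((0.8, 0.6), records)
```

Only after a record has been selected would the pointer be followed to load the document itself.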

[0038] The operation of the document search apparatus of this embodiment, configured as above, is described in detail below.

[0039] FIG. 9 is a flowchart of the overall document search processing in this embodiment. Step S1 is a user command input process that accepts the user's operation of the KB 109 or the PD 110 and interprets it as a command. This is ordinary, well-known user interface processing in PC-based systems and is not described in detail here. When the process finishes, processing proceeds to step S2.

[0040] Step S2 is a user command determination process that examines the command interpreted in step S1 and branches to the corresponding process. If the user command instructs document registration, processing proceeds to step S3. If it instructs document search, processing proceeds to step S4. If it instructs any other process, processing proceeds to step S5.

[0041] In step S3, a document registration process is performed, which registers new document data in the document database to be searched. This process is described later with reference to the flowchart of FIG. 10. When document registration finishes, processing returns to step S1. In step S4, a document search process is executed, which accepts a user query and searches the database for document data matching it. This process is described later with reference to the flowchart of FIG. 11. When it finishes, processing returns to step S1. In step S5, other processes are performed, such as customizing the display, changing system settings, and registering user information. These processes are common and well known in systems of this kind and are not described in detail. When they finish, processing returns to step S1.
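The loop formed by steps S1 to S5 amounts to a simple command dispatcher, sketched below. The command names (`"register"`, `"search"`), the handler bodies, and feeding commands from a list instead of the keyboard or pointing device are hypothetical simplifications.

```python
def run_commands(commands, handlers):
    """Process commands in order, mimicking the S1 -> S2 -> S3/S4/S5 -> S1 loop.

    commands: iterable of (name, argument) pairs standing in for user input.
    handlers: {"register": fn, "search": fn, "other": fn}
    """
    log = []
    for name, arg in commands:                             # S1: next command
        handler = handlers.get(name, handlers["other"])    # S2: branch
        log.append(handler(arg))                           # S3 / S4 / S5
    return log

handlers = {
    "register": lambda doc: f"registered {doc}",   # stands in for S3
    "search": lambda q: f"searched {q}",           # stands in for S4
    "other": lambda arg: f"other {arg}",           # stands in for S5
}
log = run_commands(
    [("register", "a.txt"), ("search", "printer"), ("config", "x")], handlers)
```

Any command that is neither a registration nor a search falls through to the `"other"` handler, matching the three-way branch of step S2.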

[0042] In the system of this embodiment, when processing proceeds to step S1 and certain conditions are satisfied, for example the number of newly registered documents has exceeded a certain value or an update of the group information has been instructed, the system is interrupted and processing proceeds to step S6. Step S6 performs the semantically similar document group creation process, which regroups the documents registered in the document database at that point according to their meaning and creates a group table GTBL such as that shown in FIG. 7. Details of this process are described later with reference to the flowchart of FIG. 12. When the process finishes, processing returns to step S1 and the interrupt ends.
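The conditions that trigger regrouping could be tracked with a small helper like the one below. The source names two example conditions (the count of newly registered documents exceeding a certain value, and an explicit update instruction); the class name `RegroupTrigger`, its methods, and the threshold value are assumptions.

```python
class RegroupTrigger:
    """Decides whether group rebuilding (S6) should run before returning to S1."""

    def __init__(self, threshold):
        self.threshold = threshold      # hypothetical limit on new documents
        self.new_docs = 0
        self.update_requested = False

    def on_register(self):
        """Call once for each newly registered document."""
        self.new_docs += 1

    def request_update(self):
        """Call when an update of the group information is instructed."""
        self.update_requested = True

    def should_regroup(self):
        return self.update_requested or self.new_docs > self.threshold

    def reset(self):
        """Call after the group table has been rebuilt."""
        self.new_docs = 0
        self.update_requested = False

trigger = RegroupTrigger(threshold=2)
for _ in range(3):
    trigger.on_register()
```

Batching the rebuild this way means the group table may lag behind recent registrations until the next trigger fires, which is the trade-off for not regrouping on every registration.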

[0043] FIG. 10 is a flowchart showing the details of the document registration process in step S3 of FIG. 9. Through this process, a document is registered in the document database in the form shown in FIG. 8.

[0044] In step S11, a registered-document designation process is executed, in which the user designates the documents to register. In practice this may be a drag and drop of document files in a GUI, or the designation of a text file containing a list of the file names to register. This kind of processing is very common as a user interface for systems such as this embodiment, so it is not described in detail. When it ends, the process proceeds to step S12.

[0045] In step S12, a main sentence analysis process is performed, which extracts the main part of the text from the document data designated in step S11 and analyzes it linguistically. Here, the main part of the text is the part most likely to state the gist of the text; it is extracted heuristically using the logical structure and layout structure of the document. When the main sentence analysis process ends, the process proceeds to step S13.

[0046] In step S13, a keyword extraction process is performed, which uses the result of the language analysis (main sentence analysis) obtained in step S12 to extract, with weights, the keywords that characterize the document. Keyword extraction of this kind is common in the field of language processing and well known, so it is not described in detail. When it ends, the process proceeds to step S14.

[0047] In step S14, a document semantic vector creation process is performed, which creates a multidimensional vector representing the semantic concept of the document data from the keyword information obtained in step S13. In this embodiment, the document semantic vector is created as follows. First, each keyword obtained in step S13 is looked up in the semantic concept dictionary MDIC 421 (FIG. 5) to obtain the semantic vector of that keyword. Next, each semantic vector is weighted with the weight obtained in step S13, and all the weighted semantic vectors are then combined. When the process ends, the process proceeds to step S15.
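
The vector construction of step S14 can be sketched as follows. This is a minimal illustration, not the patented implementation: the dictionary contents, the function name `document_vector`, and the choice of a weighted sum followed by L2 normalization as the "combining" operation are all assumptions, since the text does not specify how the weighted vectors are combined.

```python
import math

# Hypothetical stand-in for the semantic concept dictionary (MDIC): word -> vector.
MDIC = {
    "pc": [1.0, 0.0, 0.0],
    "price": [0.0, 1.0, 0.0],
    "decline": [0.0, 0.5, 0.5],
}

def document_vector(weighted_keywords, dictionary):
    """Combine weighted keyword vectors into one document semantic vector."""
    dim = len(next(iter(dictionary.values())))
    total = [0.0] * dim
    for word, weight in weighted_keywords:
        vec = dictionary.get(word)
        if vec is None:
            continue  # keyword not in the dictionary: contributes nothing
        for i, x in enumerate(vec):
            total[i] += weight * x
    # Assumption: "combining" is a weighted sum followed by L2 normalization.
    norm = math.sqrt(sum(x * x for x in total))
    return [x / norm for x in total] if norm > 0 else total

# Keywords with the weights produced by step S13 (values are made up).
doc_vec = document_vector([("pc", 2.0), ("price", 1.0), ("unlisted", 5.0)], MDIC)
```

Normalizing the result keeps later angle-based matching independent of how many keywords a document has; a plain unnormalized sum would work equally well for angle comparisons.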

[0048] In step S15, the document data shown in FIG. 8 is created from the keywords obtained in step S13 and the semantic vector obtained in step S14, and is registered in the document database. The document ID is assigned automatically by the system so that no duplicates occur. At the same time, the actual document data itself is registered in a separate area of the database; the document data pointer obtained at that time is also registered in the database. When the data registration process ends, the process proceeds to step S16.

[0049] In step S16, the semantic vector created in step S14 is used to determine the semantically similar document group to which the document being registered should belong. The group is determined by matching the semantic vector of the document against the semantic vector 705 of each group in the group table (GTBL) and choosing the closest group. Once the group is determined, the document ID assigned in step S15 is stored in the document ID list of that group in the group table (GTBL) shown in FIG. 7. When the process ends, the process proceeds to step S17.
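
The group assignment of step S16 might look like the sketch below. The field names (`group_id`, `name`, `vector`, `doc_ids`) are a hypothetical rendering of the group table of FIG. 7, and cosine similarity is used as the closeness measure, which ranks groups the same way as the angle between vectors.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is a zero vector."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def assign_to_group(doc_id, doc_vec, group_table):
    """Append doc_id to the group whose semantic vector is closest to doc_vec."""
    best = max(group_table, key=lambda g: cosine(doc_vec, g["vector"]))
    best["doc_ids"].append(doc_id)
    return best["group_id"]

# Hypothetical group table (GTBL) with two groups.
gtbl = [
    {"group_id": 1, "name": "computer", "vector": [1.0, 0.0], "doc_ids": [10, 11]},
    {"group_id": 2, "name": "economy", "vector": [0.0, 1.0], "doc_ids": [12]},
]
gid = assign_to_group(13, [0.2, 0.9], gtbl)
```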

[0050] Step S17 is a registration result notification process that notifies the user of the result of the registration process. For example, a dialog box describing the result is displayed or an alert sound is output. This kind of processing is common as a user interface and well known, so it is not described in detail. When it ends, the document registration process ends.

[0051] FIG. 11 is a flowchart showing the details of the document search process of step S4 shown in FIG. 9.

[0052] In step S21, a query input process is executed, in which the user enters a query (question sentence) for searching for documents. The query is a short sentence expressing the content of the document the user is looking for, entered in natural language, for example "About the fall in PC prices". Such processing is well known in general search processing and is not described in detail here. When the query input process ends, the process proceeds to step S22.

[0053] In step S22, a query analysis process is executed, which analyzes the query entered in step S21 linguistically. In this process, language processing such as morphological analysis and syntactic analysis is applied to the query, converting it into linguistic information in a form that is easy to use in later processing. Such processing is well known in general language processing and is not described in detail here. When the query analysis process ends, the process proceeds to step S23.

[0054] In step S23, a semantic vector creation process is performed, which creates the semantic vector of the query using the linguistic information obtained in step S22. As when creating the semantic vector of a document (step S14), the semantic vectors of the words appearing in the query are looked up in the semantic concept dictionary (MDIC) 421, weighted according to the syntactic information, and then combined. When the process ends, the process proceeds to step S24.

[0055] Step S24 uses the linguistic information obtained in step S22 to check whether the query contains the group name of any semantically similar group stored in the group table (GTBL) 423. If the query contains one or more group names, the process proceeds to step S25. If it contains none, the process proceeds to step S26.

[0056] In step S25, based on the result of step S24, the documents to be searched are narrowed down to those belonging to the semantically similar groups whose group names appear in the query. Narrowing the search targets at this point reduces the amount of computation in the subsequent semantic matching process and speeds up the search. When the narrowing process ends, the process proceeds to step S26.
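
Steps S24 and S25 together amount to a lookup of query words against group names, with a fallback to the whole database. The sketch below assumes the query analysis of step S22 yields a list of words; the function name and table fields are hypothetical.

```python
def narrow_candidates(query_words, group_table, all_doc_ids):
    """Return the document IDs to match against (steps S24 and S25)."""
    words = set(query_words)
    # S24: find groups whose name appears in the query.
    matched = [g for g in group_table if g["name"] in words]
    if not matched:
        return sorted(all_doc_ids)  # no group name in the query: search everything
    # S25: restrict the search to documents of the matched groups.
    ids = set()
    for g in matched:
        ids.update(g["doc_ids"])
    return sorted(ids)

# Hypothetical group table and document IDs.
gtbl = [
    {"group_id": 1, "name": "computer", "doc_ids": [10, 11, 14]},
    {"group_id": 2, "name": "economy", "doc_ids": [12, 13]},
]
all_ids = [10, 11, 12, 13, 14]
narrowed = narrow_candidates(["computer", "price", "decline"], gtbl, all_ids)
unrestricted = narrow_candidates(["weather"], gtbl, all_ids)
```

The saving comes from the later matching step, which then computes vector angles only for the narrowed list instead of for every document.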

[0057] In step S26, a semantic matching process is executed, which searches for the documents the user is looking for by matching the semantic vector of the query obtained in step S23 against the semantic vectors in the document database. As described above, two semantic vectors are matched by calculating the angle between the two multidimensional vectors. If the documents to be searched were narrowed down in step S25, matching is performed only against the semantic vectors of those documents. As a result of this process, the search result buffer (RBUF) 422 holds search result data for the COUNT entries of the database with the highest degree of matching, in descending order of matching, where COUNT is the number of results to output. Such processing is well known in general database search processing and is not described in detail here. When the semantic matching process ends, the process proceeds to step S27.
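
A minimal sketch of the matching in step S26 follows, assuming the degree of matching is simply the angle between vectors (smaller is better) and that the candidate documents are given as a mapping from document ID to semantic vector. The names `angle` and `semantic_match` are illustrative only.

```python
import heapq
import math

def angle(a, b):
    """Angle in radians between two vectors; pi if either is a zero vector."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        return math.pi
    # Clamp to [-1, 1] to guard against rounding error before acos.
    return math.acos(max(-1.0, min(1.0, dot / (na * nb))))

def semantic_match(query_vec, doc_vectors, count):
    """Return the IDs of the `count` documents with the smallest angle to the query."""
    scored = ((angle(query_vec, vec), doc_id) for doc_id, vec in doc_vectors.items())
    return [doc_id for _, doc_id in heapq.nsmallest(count, scored)]

# Hypothetical candidate documents (after any narrowing in step S25).
doc_vectors = {1: [1.0, 0.0], 2: [0.0, 1.0], 3: [1.0, 1.0]}
top = semantic_match([1.0, 0.1], doc_vectors, count=2)
```

Here `count` plays the role of COUNT, and the returned list corresponds to the ordered contents of the result buffer.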

[0058] In step S27, a group ID acquisition process is executed, which obtains the ID of the group to which each document data in the search result buffer (RBUF) output by step S26 belongs by searching the group table (GTBL) 423. The group ID obtained by this process is stored as the group ID 605 in the search result buffer (RBUF) 422. When the process ends, the process proceeds to step S28.
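
The lookup of step S27 can be illustrated with a reverse index from document ID to group ID. The buffer and table layouts below are hypothetical simplifications of FIG. 6 and FIG. 7.

```python
def attach_group_ids(results, group_table):
    """Fill in the group ID of every entry in the result buffer."""
    # Build a reverse index once instead of scanning the table per result.
    doc_to_group = {
        doc_id: g["group_id"] for g in group_table for doc_id in g["doc_ids"]
    }
    for entry in results:
        entry["group_id"] = doc_to_group.get(entry["doc_id"])  # None if ungrouped
    return results

# Hypothetical group table and result buffer.
gtbl = [
    {"group_id": 1, "name": "computer", "doc_ids": [10, 11]},
    {"group_id": 2, "name": "economy", "doc_ids": [12]},
]
rbuf = [{"doc_id": 11}, {"doc_id": 12}, {"doc_id": 99}]
attach_group_ids(rbuf, gtbl)
```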

[0059] Step S28 is a search result display process that outputs document data to the display based on the search result data in the search result buffer RBUF completed in step S27. This kind of processing is common in systems that display results and well known, so it is not described in detail. When it ends, the document search process ends.

[0060] FIG. 12 is a flowchart showing the details of the semantically similar document group creation process shown in step S6 of FIG. 9. This process uses the multidimensional vector of each document to classify the documents automatically by semantic concept in advance and to manage them; as a result, the group table (GTBL) 423 shown in FIG. 7 is generated automatically.

[0061] In step S31, self-organizing map learning is performed using the semantic vectors stored in the document database of this embodiment. Self-organizing map learning is a common, well-known technique for classifying a set of multidimensional vectors by mapping it onto a two-dimensional map; its outline is described briefly here with reference to FIG. 13.

[0062] FIG. 13 shows an example of self-organizing map learning. In FIG. 13, reference numeral 1 denotes a unit U and reference numeral 2 denotes the neighboring units Uc of U. As shown in FIG. 13, the self-organizing map is square and consists of N × N units (N is a positive integer). Each unit has a pattern vector of the same dimension as the semantic vectors.

[0063] The self-organizing map is trained by the following procedure. First, the pattern vectors of all units in the map are initialized to zero vectors. Next, for each semantic vector in the document database, the unit U whose pattern vector makes the smallest angle with that semantic vector is determined. Then the pattern vectors of U and its neighboring units Uc are moved closer to the semantic vector; the closer a unit is to U, the more strongly its pattern vector is moved toward the semantic vector. When these operations, other than the initialization, have been repeated a predetermined number of times, the pattern vector of each unit on the two-dimensional map converges to a fixed value and the self-organizing map learning ends. The process then proceeds to step S32.
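
The training loop described above can be sketched as follows. This is a simplified illustration under several assumptions that are not in the text: the learning rate, the neighborhood radius, and the `rate / (1 + distance)` falloff are arbitrary choices, and the pattern vectors are initialized to small random values rather than zero vectors, because the angle to a zero vector is undefined and a zero-initialized map gives no basis for choosing U on the first input.

```python
import math
import random

def cosine(a, b):
    """Cosine of the angle between a and b; -1.0 if either is a zero vector."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else -1.0

def best_unit(som, vec):
    """Return (row, col) of the unit whose pattern vector makes the smallest angle with vec."""
    n = len(som)
    return max(
        ((r, c) for r in range(n) for c in range(n)),
        key=lambda rc: cosine(som[rc[0]][rc[1]], vec),
    )

def train_som(vectors, n, epochs=50, rate=0.5, radius=1, seed=0):
    """Train an n x n map; returns a grid of pattern vectors."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    # Assumption: random initialization instead of the zero vectors in the text.
    som = [[[rng.random() for _ in range(dim)] for _ in range(n)] for _ in range(n)]
    for _ in range(epochs):
        for vec in vectors:
            br, bc = best_unit(som, vec)  # unit U
            # Update U and its neighbors Uc within `radius` grid steps.
            for r in range(max(0, br - radius), min(n, br + radius + 1)):
                for c in range(max(0, bc - radius), min(n, bc + radius + 1)):
                    dist = max(abs(r - br), abs(c - bc))  # 0 for U itself
                    step = rate / (1 + dist)  # closer units move more
                    unit = som[r][c]
                    for i in range(dim):
                        unit[i] += step * (vec[i] - unit[i])
    return som

# Two loose clusters of made-up semantic vectors on a 3 x 3 map.
som = train_som([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]], n=3)
```

Practical implementations usually shrink both the learning rate and the neighborhood over time so that the map settles; the fixed values here keep the sketch short.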

[0064] In step S32, the semantic vector of each document registered in the document database is placed on a unit of the self-organizing map trained in step S31. As during map learning, each semantic vector is placed on the unit whose pattern vector makes the smallest angle with it. When the process ends, the process proceeds to step S33.

[0065] In step S33, the document data placed on the two-dimensional map in step S32 is grouped by the 8-connectivity method to determine the groups. The 8-connectivity method puts units that are adjacent vertically, horizontally, or diagonally on the two-dimensional map into the same group. FIG. 14 shows an example of determining groups on the two-dimensional map by the 8-connectivity method. In FIG. 14, reference numeral 3 denotes the group with group ID = 1, 4 the group with group ID = 2, 5 the group with group ID = 3, and 6 the group with group ID = 4. The numbers on the map indicate the number of document data placed on each unit. When the process ends, the process proceeds to step S34. In step S34, for each group determined in step S33, the multidimensional vector that makes the smallest angle with all the semantic vectors of the document data belonging to the group is obtained and used as the semantic vector of the group.
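
Steps S33 and S34 can be sketched as connected-component labeling on the grid followed by a representative vector per group. The grid layout below is invented for illustration (only the counts 15, 28, and 9 echo the example in the text). The text does not state the exact criterion for the vector closest to all members; the sketch uses the normalized sum of the members' unit vectors, which maximizes the sum of cosines to the members, as one plausible reading.

```python
import math
from collections import deque

def group_units(counts):
    """Group occupied units of an N x N map by 8-connectivity.

    counts[r][c] is the number of documents placed on unit (r, c).
    Returns a list of groups, each a list of (row, col) units.
    """
    n = len(counts)
    seen = [[False] * n for _ in range(n)]
    groups = []
    for r in range(n):
        for c in range(n):
            if counts[r][c] == 0 or seen[r][c]:
                continue
            seen[r][c] = True
            queue, members = deque([(r, c)]), []
            while queue:
                cr, cc = queue.popleft()
                members.append((cr, cc))
                # Visit the 8 surrounding units (vertical, horizontal, diagonal).
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        nr, nc = cr + dr, cc + dc
                        if (0 <= nr < n and 0 <= nc < n
                                and counts[nr][nc] > 0 and not seen[nr][nc]):
                            seen[nr][nc] = True
                            queue.append((nr, nc))
            groups.append(members)
    return groups

def group_vector(vectors):
    """Assumed group vector: normalized sum of the members' unit vectors."""
    total = [0.0] * len(vectors[0])
    for v in vectors:
        norm = math.sqrt(sum(x * x for x in v))
        if norm == 0:
            continue
        for i, x in enumerate(v):
            total[i] += x / norm
    norm = math.sqrt(sum(x * x for x in total))
    return [x / norm for x in total] if norm else total

# Hypothetical 4 x 4 map of document counts.
counts = [
    [15, 28, 0, 0],
    [0, 9, 0, 3],
    [0, 0, 0, 0],
    [7, 0, 0, 2],
]
groups = group_units(counts)
gvec = group_vector([[2.0, 0.0], [0.0, 3.0]])
```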

[0066] In step S35, a group name determination process is performed, which determines the group name of each semantically similar document group obtained through step S34. This process is described later with reference to FIG. 15. When it ends, the process proceeds to step S36. Step S36 creates the group table (GTBL) 423 by referring to the group information created in steps S33 to S35. When it ends, the semantically similar document group creation process ends. In this example, group IDs in the group table (GTBL) 423 are assigned from 1 upward in descending order of the total number of document data in the group. For example, in FIG. 14 the group denoted by reference numeral 3 contains 15 + 28 + 9 = 52 documents, the largest number, and is therefore assigned ID = 1 as shown in FIG. 7. It is assumed that the search results of FIG. 6 have been discarded when the group creation process is performed.
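
The ID numbering rule of step S36 reduces to a sort by document count. A small sketch, with hypothetical group names and field names:

```python
def assign_group_ids(groups):
    """Number groups from 1 in descending order of document count."""
    ordered = sorted(groups, key=lambda g: len(g["doc_ids"]), reverse=True)
    return [dict(g, group_id=i) for i, g in enumerate(ordered, start=1)]

# Hypothetical groups; "b" holds 52 documents, like the largest group in the example.
groups = [
    {"name": "a", "doc_ids": list(range(3))},
    {"name": "b", "doc_ids": list(range(100, 152))},
    {"name": "c", "doc_ids": list(range(200, 207))},
]
table = assign_group_ids(groups)
```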

[0067] FIG. 15 is a flowchart explaining the details of the group name determination process of step S35. In step S41 of FIG. 15, the semantic vector 705 of the group to be named is matched against the semantic vectors in the semantic concept dictionary (MDIC), and multiple words whose similarity is at or above a certain threshold are retrieved as group name candidates for the group. The threshold used here is assumed to be fixed in advance in the system of this embodiment.

[0068] Next, in step S42, the position of each group name candidate word selected in step S41 on the thesaurus tree is determined, and a word located in the middle part of the thesaurus tree is selected as the group name. If a word located in the upper part (near the root) of the thesaurus tree were selected as the group name, its semantic concept would be too broad and the concept of the group would become blurred. Conversely, if a word located in the lower part (the leaves) were selected, its semantic concept would be too narrow, making it more likely that document data not suited to the group name would end up in the group; words located at the leaves are therefore excluded from the candidates.

[0069] In this process, among the words located in the middle part of the thesaurus tree, the word whose semantic vector has the highest similarity to the semantic vector 705 of the group is determined as the group name. When the process ends, the group name determination process ends.
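
Steps S41 and S42 can be sketched as a threshold filter followed by a position filter on the thesaurus tree. The text does not define "middle part" precisely; the sketch assumes it means any word that is neither the root nor a leaf. The tree, the dictionary vectors, and the threshold value are made up.

```python
import math

def cosine(a, b):
    """Cosine similarity; 0.0 if either vector is a zero vector."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def choose_group_name(group_vec, mdic, parent, threshold):
    """Pick the most similar candidate word that is neither root nor leaf."""
    has_children = {p for p in parent.values() if p is not None}
    # S41: candidate words with similarity at or above the threshold.
    candidates = [(cosine(group_vec, vec), word) for word, vec in mdic.items()]
    candidates = [(s, w) for s, w in candidates if s >= threshold]
    # S42: keep only words in the middle of the thesaurus tree (assumed definition).
    middle = [(s, w) for s, w in candidates
              if parent.get(w) is not None and w in has_children]
    return max(middle)[1] if middle else None

# Hypothetical thesaurus tree as child -> parent (None marks the root).
parent = {"thing": None, "machine": "thing", "computer": "machine",
          "pc": "computer", "money": "thing", "price": "money"}
# Hypothetical dictionary vectors.
mdic = {"thing": [1.0, 1.0, 1.0], "machine": [1.0, 0.2, 0.0],
        "computer": [1.0, 0.1, 0.0], "pc": [1.0, 0.0, 0.0],
        "money": [0.0, 1.0, 0.0], "price": [0.0, 1.0, 0.1]}
name = choose_group_name([1.0, 0.0, 0.0], mdic, parent, threshold=0.9)
```

In this example "pc" is the closest word to the group vector but is a leaf, so the next closest mid-level word, "computer", is chosen.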

[0070] As described above, according to this embodiment, when document data is registered, a multidimensional vector representing the semantic content of the document is created; using this vector, documents are classified automatically by semantic concept in advance, and a word suited to representing each group of semantically similar documents is attached as its group name. When that word appears in a user query at search time, the search candidates are narrowed down to the semantically similar group having the word as its group name, which reduces the amount of semantic vector matching and enables fast search. In particular, selecting a word located in the middle of the thesaurus tree as the group name makes it possible to improve search accuracy.

[0071] The present invention is not limited to the embodiment described above. For example, although the embodiment uses a PCI bus as the bus of the document search apparatus, an equivalent document search apparatus can be configured with an ISA bus, a VL bus, or the like. Also, although the OS is stored on the DISK in the embodiment, the same processing can be performed with the OS stored in ROM.

[0072] The embodiment above shows an example in which the document search processing program and related data are loaded directly from the DVD-MEDIA into the RAM and executed. Alternatively, the document search processing program and related data may first be stored (installed) on the DISK from the DVD-MEDIA and loaded from the DISK into the RAM when the program is run.

[0073] The embodiment above uses DVD-MEDIA as the medium storing the document search processing program, but CD-MEDIA, MO, DF, an IC memory card, a magneto-optical card, or the like may be used instead. The document search processing program may also be stored in ROM, configured to form part of the memory map, and executed directly by the CPU.

[0074] Needless to say, the object of the present invention is also achieved by supplying a system or apparatus with a storage medium (or recording medium) on which the program code of software realizing the functions of the embodiment described above is recorded, and having the computer (or CPU or MPU) of that system or apparatus read and execute the program code stored on the storage medium. In this case, the program code itself read from the storage medium realizes the functions of the embodiment, and the storage medium storing the program code constitutes the present invention. Needless to say, this covers not only the case where the functions of the embodiment are realized by the computer executing the program code it has read, but also the case where an operating system (OS) or the like running on the computer performs part or all of the actual processing based on the instructions of the program code, and the functions of the embodiment are realized by that processing.

[0075] It further goes without saying that this also covers the case where the program code read from the storage medium is written to memory provided on a function expansion card inserted into the computer or in a function expansion unit connected to the computer, after which a CPU or the like provided on the function expansion card or function expansion unit performs part or all of the actual processing based on the instructions of the program code, and the functions of the embodiment are realized by that processing.

[0076] In addition, the present invention can be implemented with various modifications without departing from its gist.

[0077]

[Effects of the Invention] As described above, according to the present invention, the processing time of document search using multidimensional vector matching can be kept short even when a huge amount of document data is to be searched.

[Brief description of the drawings]

FIG. 1 is a block diagram showing the configuration of the computer apparatus 100 that executes the document search process according to this embodiment.

FIG. 2 is a diagram showing the structure of the contents of the DVD-MEDIA 106 when the document search processing program and related data are recorded on it.

FIG. 3 is a schematic diagram of the computer apparatus 100 and the DVD-MEDIA 106 on which the document search processing program is recorded.

FIG. 4 is a diagram showing the memory map in the state where the document search processing program has been loaded into the RAM 104 and is ready to execute.

FIG. 5 is a diagram explaining a configuration example of the search result buffer (RBUF) 422 when COUNT = 10.

FIG. 6 is a diagram explaining a configuration example of the search result buffer (RBUF) in this embodiment.

FIG. 7 is a diagram explaining a configuration example of the group table (GTBL) in this embodiment.

FIG. 8 is a diagram explaining a configuration example of one data structure stored in the document database in this embodiment.

FIG. 9 is a flowchart explaining the overall document search process in this embodiment.

FIG. 10 is a flowchart detailing the document registration process of step S3 in FIG. 9.

FIG. 11 is a flowchart showing the details of the document search process of step S4 shown in FIG. 9.

FIG. 12 is a flowchart showing the details of the semantically similar document group creation process shown in step S6 of FIG. 9.

FIG. 13 is a diagram showing an example of self-organizing map learning.

FIG. 14 is a diagram showing an example of determining groups on the two-dimensional map by the 8-connectivity method in self-organizing map learning.

FIG. 15 is a flowchart explaining the details of the group name determination process of step S35.

Claims


1. A document search apparatus comprising: storage means for storing pairs each consisting of document data and a semantic vector expressing the content of the document data as a multidimensional vector; management means for classifying the document data stored in the storage means into groups based on the semantic vectors and managing the groups; generation means for performing language analysis on a natural language sentence input as a search condition and generating a semantic vector of the search condition sentence; selection means for selecting a group managed by the management means based on the result of the language analysis; and search means for searching for a document by matching, for the document data belonging to the group selected by the selection means, the semantic vector stored in the storage means against the semantic vector of the search condition sentence.

2. The document search apparatus according to claim 1, wherein the management means generates a self-organizing map using the semantic vectors of the document data stored in the storage means, arranges each document data on the self-organizing map based on its semantic vector, and forms one group from document data arranged close to one another on the self-organizing map.

3. The document search apparatus according to claim 1, wherein the management means assigns to each group a group name consisting of a word representing the group, and the selection means selects a group whose group name matches a word contained in the search condition sentence obtained as a result of the language analysis.

4. The document search apparatus according to claim 3, wherein the management means sets a semantic vector of the group based on the semantic vectors of the document data belonging to the group, and uses a word acquired based on the semantic vector of the group as the group name.

5. The document search apparatus according to claim 4, wherein the management means acquires a plurality of words based on the semantic vector of the group and selects, from the acquired words, a word located in the middle part of a thesaurus tree as the group name.

6. The document search apparatus according to claim 1, further comprising output means for outputting, as the search result of the search means, information indicating a predetermined number of document data in descending order of degree of matching.

7. The document search apparatus according to claim 6, wherein the output means attaches to each document data information identifying the group to which the document data belongs and outputs it.

8. A document search method using storage means that stores pairs each consisting of document data and a semantic vector expressing the content of the document data as a multidimensional vector, the method comprising: a management step of classifying the document data stored in the storage means into groups based on the semantic vectors and managing the groups; a generation step of performing language analysis on a natural language sentence input as a search condition and generating a semantic vector of the search condition sentence; a selection step of selecting a group managed in the management step based on the result of the language analysis; and a search step of searching for a document by matching, for the document data belonging to the group selected in the selection step, the semantic vector stored in the storage means against the semantic vector of the search condition sentence.

9. The document search method according to claim 8, wherein the management step generates a self-organizing map using the semantic vectors of the document data stored in the storage means, arranges each document data on the self-organizing map based on its semantic vector, and forms one group from document data arranged close to one another on the self-organizing map.

10. The document search method according to claim 8, wherein the management step assigns to each group a group name consisting of a word representing the group, and the selection step selects a group whose group name matches a word contained in the search condition sentence obtained as a result of the language analysis.

11. The document search method according to claim 10, wherein the management step sets a semantic vector of the group based on the semantic vectors of the document data belonging to the group, and uses a word acquired based on the semantic vector of the group as the group name.

12. The document search method according to claim 11, wherein the management step acquires a plurality of words based on the semantic vector of the group and selects, from the acquired words, a word located in the middle part of a thesaurus tree as the group name.

13. The document search method according to claim 8, further comprising an output step of outputting, as the search result of the search step, information indicating a predetermined number of document data in descending order of degree of matching.

14. The document search method according to claim 13, wherein the output step attaches to each document data information identifying the group to which the document data belongs and outputs it.

15. A control program for causing a computer to implement the document search method according to claim 8.

16. A storage medium storing a control program for causing a computer to implement the document search method according to claim 8.