JP2002132789A

JP2002132789A - Document search method

Info

Publication number: JP2002132789A
Application number: JP2000318787A
Authority: JP
Inventors: Katsumi Tada; 勝己多田; Takuya Okamoto; 卓哉岡本; Natsuko Sugaya; 菅谷　　奈津子; Tadataka Matsubayashi; 忠孝松林; Yasuhiko Inaba; 靖彦稲場; Yasushi Kawashita; 靖司川下
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2000-10-19
Filing date: 2000-10-19
Publication date: 2002-05-10

Abstract

(57) [Abstract]

[Problem] To provide, at low cost, a system capable of high-speed full-text search of a large-scale document database written in phonetic scripts such as English, German, and French, even when a phrase containing a high-frequency word is specified as the search term.

[Solution] A list (1210) of high-frequency words (a, the, of, etc.) is provided. When a document is registered in the database, words are extracted (1120), and a search index holding the document identifier and the position of the word within the document is created for each word (1140). Each extracted word is checked against the list, and if the word is a high-frequency word (1130), a search index is created for the pair consisting of that word and the one word that follows it (1140). At search time, if a word in the search term is a high-frequency word, the pair consisting of the high-frequency word and the one word that follows it is treated as a single word of the search term, and the search is performed with the search term containing this word pair.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a full-text search method that searches a large-scale document database for documents containing a specified character string. In particular, it applies to databases, document management systems, document filing systems, DTP (Desk Top Publishing) systems, and the like that handle documents written in phonetic scripts such as English, German, and French.

[0002]

[Prior Art] In recent years, with the rapid progress of the information society, the volume of electronic document information created with word processors, personal computers, and the like has been growing explosively. Under these circumstances, there is growing demand to search a huge accumulated collection of electronic documents for the documents that contain needed information, at high speed and with high accuracy. Full-text search is a technique that meets this demand. In full-text search, the entire text of each document is input to a computer system and stored in a database when the document is registered. At search time, all documents containing a character string specified by the user (hereinafter called a search term) are located in that database. This makes it possible to retrieve the target documents without omission and without assigning keywords at registration time.

[0003] In general, a method called the word index method is used for full-text search of documents written in phonetic scripts such as English, German, and French. The word index method is outlined briefly below. In documents written in such scripts, words are generally separated by characters such as " " (space), "," (comma), and "." (period). Words can therefore be extracted from a document to be registered by treating these characters as word boundaries. For each of these words, the document identification information that identifies the document in the document database and the position at which the word appears in the document are extracted and stored as search index data. At search time, the search index is consulted for each word specified in the search condition, and documents are extracted in which the words specified as the search term all occur in the same document and appear in the same order as in the search condition. This achieves full-text search free of search noise.
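The word index method outlined above can be sketched as follows. This is a minimal illustration, not the implementation of any particular system: the function names, the lower-casing of words, and the regular expression used for the delimiters are assumptions made for the example.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Split on spaces, commas, and periods and drop empty pieces."""
    return [w for w in re.split(r"[ ,.]+", text) if w]

def build_word_index(docs):
    """Map each lower-cased word to a list of (doc_id, position) pairs."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for pos, word in enumerate(tokenize(text), start=1):
            index[word.lower()].append((doc_id, pos))
    return index

def phrase_search(index, phrase):
    """Return ids of documents in which the phrase words occur consecutively."""
    words = [w.lower() for w in tokenize(phrase)]
    if not words:
        return set()
    # Candidate phrase starts come from the postings of the first word.
    candidates = set(index.get(words[0], []))
    for offset, word in enumerate(words[1:], start=1):
        postings = set(index.get(word, []))
        candidates = {(d, p) for (d, p) in candidates if (d, p + offset) in postings}
    return {d for d, _ in candidates}

docs = {
    1: "Full text search finds every document.",
    2: "Search the full archive, text included.",
}
index = build_word_index(docs)
print(phrase_search(index, "full text"))  # {1}: only document 1 has the words adjacent
```

Both documents contain "full" and "text", but only document 1 has them at adjacent positions, which is why comparing positions removes the noise an unordered match would return.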

[0004] However, a full-text search system based on this method has the problem that search response degrades when the search term contains a word that appears frequently in the registered documents (hereinafter called a high-frequency word). The index data for a high-frequency word is far larger than that for an ordinary word, so reading it from secondary storage such as a magnetic disk takes longer, and the search slows down accordingly. High-frequency words in English include articles such as "a" and "the", conjunctions such as "and" and "but", and prepositions such as "of" and "to". These words occur in most registered documents and rarely carry meaning as search terms. For this reason, as described on page 113 and following of "Information Retrieval" (William B. Frakes and Ricardo Baeza-Yates, published by Prentice Hall; hereinafter called known example 1), a method has been proposed in which such words are defined as stop words (excluded words) and no search index is created for them.

[0005]

[Problem to Be Solved by the Invention] However, the English-text search method of known example 1 cannot search for a phrase that contains a stop word. Articles such as "a" and "the", conjunctions such as "and" and "but", and prepositions such as "of" and "to" seldom carry meaning as search terms by themselves, but phrases containing them often do. For example, "the Times", denoting the newspaper, conveys its intended meaning only together with "the"; "Times" without "the" carries only a meaning such as "number of times". If a phrase containing "the" cannot be searched, the only option is to search for "Times" alone, and unwanted documents that use the word in the sense of "number of times" are retrieved as noise. "The White House", "the West" (the Occident), and "the East" (the Orient) are further examples of expressions that carry their specific meaning only together with "the".

[0006] As an example involving the conjunction "and", the phrase "bread and butter", meaning buttered bread, cannot be searched. The search must instead be performed as an AND condition on "bread" and "butter", so unwanted documents that contain both words but do not use them in the phrase "bread and butter" are retrieved as noise. The same holds for the preposition "of": the phrase "Bank of America" cannot be searched, so the search is performed as an AND condition on "Bank" and "America", and unwanted documents that contain "Bank" and "America" but do not use them in the phrase "Bank of America" are retrieved as noise.
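The noise problem described for stop-word indexes can be reproduced with a small sketch. The stop-word set, the two sample documents, and the function names below are invented for illustration; the fallback AND query is the behavior the text attributes to the conventional method.

```python
STOP_WORDS = {"a", "the", "and", "but", "of", "to"}  # illustrative subset

def build_index_without_stop_words(docs):
    """Map each non-stop word to the set of documents containing it."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.lower().split():
            if word not in STOP_WORDS:
                index.setdefault(word, set()).add(doc_id)
    return index

def and_search(index, phrase):
    """Fallback query: documents containing every non-stop word of the phrase."""
    words = [w for w in phrase.lower().split() if w not in STOP_WORDS]
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

docs = {
    1: "she ate bread and butter",
    2: "melt the butter before kneading bread",
}
stop_index = build_index_without_stop_words(docs)
print(and_search(stop_index, "bread and butter"))  # {1, 2}: document 2 is noise
```

Because "and" has no index entry, the query cannot require the three words to be adjacent, so document 2 is returned even though it never uses the phrase.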

[0007] To solve this problem with the conventional method, search indexes would also have to be created for articles such as "a" and "the", conjunctions such as "and" and "but", and prepositions such as "of" and "to". In that case, searching for a phrase such as "the East" requires reading the very large index data for "the" from secondary storage such as a magnetic disk. The word positions must then be compared against the index data for "East" to extract the occurrences that are in the same document and whose positions are adjacent, one word apart. Search response is therefore far worse than when searching for the single word "East". Accordingly, the problem to be solved by the present invention is to provide, at low cost, a system capable of high-speed full-text search of a large-scale document database written in phonetic scripts such as English, German, and French, even when a phrase containing a high-frequency word, such as an article ("a", "the"), a conjunction ("and", "but"), or a preposition ("of", "to"), is specified as the search term.

[0008]

[Means for Solving the Problem] To solve the above problem, the document search method according to the present invention comprises the following steps.

The document registration process comprises: a registration word extraction step of extracting words from a document to be registered by identifying delimiter characters such as spaces, commas, and periods; a high-frequency word determination step of determining whether each extracted word is a word defined as a high-frequency word; and a search index creation and registration step of creating and registering search index data. In the last step, when the determination result is a high-frequency word, the one word adjoining the high-frequency word is extracted and search index data is created for the word string consisting of the high-frequency word and the extracted word; when the determination result is not a high-frequency word, search index data is created for the extracted word itself.

The process of searching registered documents comprises: a search word extraction step of extracting words from the search term by identifying delimiter characters such as spaces, commas, and periods; a high-frequency word determination step of determining whether each extracted word is a word defined as a high-frequency word; and a search execution step of retrieving the documents that contain the specified search term. In the last step, when the determination result is a high-frequency word, the one word adjoining the high-frequency word is extracted and the search index data for the word string consisting of the high-frequency word and the extracted word is referred to; when the determination result is not a high-frequency word, the search index data for the extracted word is referred to.
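The search-side steps can be sketched as below. This is only one possible reading: the index is written out by hand for two invented documents, the high-frequency word list is an illustrative subset, and two points are assumptions that this passage does not state, namely that adjacent keys are matched by comparing word positions (as in the word index method) and that a high-frequency word at the very end of the search term is looked up on its own.

```python
HIGH_FREQUENCY_WORDS = {"a", "the", "and", "but", "of", "to"}  # illustrative list

# Hand-written index for two invented documents, keyed as the registration
# steps describe. Each posting is (document identifier, word position).
#   1: "He reads the Times daily"    2: "Three Times daily"
INDEX = {
    "He": [(1, 1)], "reads": [(1, 2)], "the Times": [(1, 3)],
    "Times": [(1, 4), (2, 2)], "daily": [(1, 5), (2, 3)], "Three": [(2, 1)],
}

def search_keys(term):
    """Turn a search term into index keys, pairing each high-frequency
    word with the word that follows it."""
    words = term.split()
    keys = []
    for i, word in enumerate(words):
        if word.lower() in HIGH_FREQUENCY_WORDS and i + 1 < len(words):
            keys.append(f"{word} {words[i + 1]}")
        else:
            keys.append(word)  # ordinary word (or trailing word: assumption)
    return keys

def search(index, term):
    """Return documents in which the keys occur at consecutive positions."""
    keys = search_keys(term)
    if not keys:
        return set()
    candidates = set(index.get(keys[0], []))
    for offset, key in enumerate(keys[1:], start=1):
        postings = set(index.get(key, []))
        candidates = {(d, p) for d, p in candidates if (d, p + offset) in postings}
    return {d for d, _ in candidates}

print(search_keys("the Times"))    # ['the Times', 'Times']
print(search(INDEX, "the Times"))  # {1}
```

The query never touches a posting list for "the" alone; it reads only the short list for the pair "the Times", which is the source of the speed-up the method aims at.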

[0009]

[Embodiments of the Invention] A first embodiment of the present invention will be described with reference to the drawings. The system configuration of this embodiment is described first. FIG. 1 shows the overall configuration of the first embodiment of the document search system according to the present invention. As shown in FIG. 1, the document search system of this embodiment comprises a document registration subsystem 1000, a document search server 2000, search clients 3000 and 4000, and a network 5000.

The document registration subsystem 1000 takes the documents to be registered as input and creates the search index needed at search time. This index data is transferred over the network 5000 to the document search server 2000, which later uses it to perform full-text search. The document search server 2000 receives search commands from the search clients 3000 and 4000, uses the search index created by the document registration subsystem 1000 to find the documents that satisfy the conditions specified in the command, and returns the search result data to the requesting search client. The search clients 3000 and 4000 display a screen on which the user specifies search conditions interactively, convert the conditions specified on this screen into a search command that the document search server 2000 can interpret, and send the command to the document search server 2000 over the network 5000. When the document search server 2000 has performed the corresponding search processing as described above and returned the search result data, the search clients 3000 and 4000 present the received data to the user as a search result screen.

Although FIG. 1 shows a configuration in which two computers, 3000 and 4000, are connected as search clients, a configuration with only one search client, or with three or more, is also possible. Finally, the network 5000 is a local area network (LAN) or a wide area network (WAN) and is used by the document registration subsystem 1000, the document search server 2000, and the search clients 3000 and 4000 to exchange data and commands.

[0010] In FIG. 1, the network 5000 is used to transfer the search index data from the document registration subsystem 1000 to the document search server 2000, but a portable medium such as a floppy (registered trademark) disk or a magneto-optical disk may be used instead. Alternatively, the document registration subsystem 1000 and the document search server 2000 may be implemented on a single computer so that no data transfer takes place. Furthermore, although FIG. 1 places the search clients 3000 and 4000 on computers separate from the document search server 2000, one or more search clients may run on the same computer as the document search server 2000. This concludes the description of the system configuration of this embodiment.

[0011] Next, the document registration subsystem of this embodiment, that is, 1000 in FIG. 1, will be described with reference to the drawings. FIG. 2 shows the configuration of the document registration subsystem 1000 in this embodiment. The document registration subsystem 1000 shown in the figure comprises a display 1010 that shows the execution status of processing and the like, a keyboard 1020 for entering registration commands and the like, a central processing unit (CPU) 1030 that executes the registration processing, a floppy disk driver 1040 that reads data from a floppy disk, a floppy disk 1050 that stores the document data to be registered in the database, a main memory 1060 that temporarily holds the registration programs and data, a magnetic disk 1070 that stores various data and programs, and a bus 1080 that connects these components.

A system control program 1100, a registration control program 1110, a registration word extraction program 1120, a registration high-frequency word determination program 1130, and a search index creation and registration program 1140 are read from the magnetic disk 1070 into the main memory 1060, where a work area 1150 is also reserved. On the magnetic disk 1070, a delimiter table storage area 1200, a high-frequency word list storage area 1210, a search index storage area 1220, and a program storage area 1230 are reserved. In this embodiment these storage areas are reserved on the magnetic disk 1070, but another secondary storage device such as a magneto-optical disk device may be used instead. This concludes the description of the configuration of the document registration subsystem 1000.

[0012] Next, the procedure of the document registration process in this embodiment is described. First, in response to a registration command entered from the keyboard 1020, the system control program 1100 starts the registration control program 1110, which begins the document registration process. The processing at registration time is described below using the PAD (Problem Analysis Diagram) shown in FIG. 3. The registration control program 1110 repeats the series of processes in steps 1310 through 1350 for every document to be registered that is stored on the floppy disk 1050 (step 1300).

In step 1310, one unprocessed document is selected from the documents to be registered stored on the floppy disk 1050 and read, through the floppy disk driver 1040, into the work area 1150 in the main memory 1060. In step 1320, the document read in step 1310 is assigned a document identifier, a number that uniquely identifies the document within the document database. In step 1330, the registration word extraction program 1120 is executed to extract words from the document in the main memory 1060. Each character of the document, from beginning to end, is looked up in the delimiter table 1200 on the magnetic disk 1070 to identify the delimiters, and each run of characters enclosed between delimiters is extracted as a word. In step 1340, the registration high-frequency word determination program 1130 is executed to determine whether each word extracted in step 1330 is a high-frequency word, by checking whether the word is defined in the high-frequency word list stored in the high-frequency word list storage area 1210 on the magnetic disk 1070. Finally, in step 1350, the search index creation and registration program 1140 is executed to create and register search index data for the words extracted in step 1330.

[0013] The processing in step 1350, that is, the processing of the search index creation and registration program 1140, is described in more detail with reference to FIG. 4. The program performs the series of processes in steps 1410 through 1430 for every word extracted in step 1330 (step 1400). In step 1410, the index creation process branches according to the result of the registration high-frequency word determination program 1130 in step 1340. If the word is a high-frequency word, step 1420 creates a search index entry for a high-frequency word; otherwise, step 1430 creates a search index entry for an ordinary word. In step 1420, search index data is created for the pair consisting of the word and the one word that follows it, by storing the pair of the document identifier and the word position within the document in the work area 1150 in the main memory 1060. In step 1430, search index data is created for the word itself, by storing the same pair of document identifier and word position in the work area 1150. When this processing has been completed for all the words extracted in step 1330, step 1440 updates the search index: the search index data for each word held in the work area 1150 in the main memory 1060 is added to the search index storage area 1220 on the magnetic disk 1070. This concludes the document registration procedure of this embodiment.
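Steps 1410 through 1440 can be sketched as follows. The dictionaries standing in for the work area 1150 and the search index storage area 1220, the function names, and the sample input are assumptions made for illustration. The text does not say what happens when a high-frequency word is the last word of a document; this sketch indexes it on its own.

```python
def create_index_entries(doc_id, flagged_words):
    """Steps 1410-1430: build the entries for one document in a work area.

    flagged_words is a list of (word, flag) pairs, flag 1 meaning
    high-frequency word, as produced by the determination step.
    """
    work_area = {}
    for i, (word, flag) in enumerate(flagged_words):
        position = i + 1  # word positions are counted from 1
        if flag == 1 and i + 1 < len(flagged_words):
            # Step 1420: key is the high-frequency word plus the next word.
            key = f"{word} {flagged_words[i + 1][0]}"
        else:
            # Step 1430: key is the word itself.
            key = word
        work_area.setdefault(key, []).append((doc_id, position))
    return work_area

def update_search_index(search_index, work_area):
    """Step 1440: append the work-area entries to the stored index."""
    for key, postings in work_area.items():
        search_index.setdefault(key, []).extend(postings)

search_index = {}  # stands in for the search index storage area 1220
flagged = [("the", 1), ("Times", 0), ("reported", 0)]
update_search_index(search_index, create_index_entries(1, flagged))
print(search_index)
```

Note that "the" never becomes a key by itself, so the stored index holds no posting list that grows with every occurrence of the high-frequency word.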

[0014] Next, the document registration process of this embodiment is described concretely, taking as an example the registration of a document reading "The president of Bank of America has decided …". In the loop of step 1300, step 1310 first reads the document "The president of Bank of America has decided …" from the floppy disk 1050 and stores it in the work area 1150 in the main memory 1060. In step 1320, the initial value 1 is set as the document identifier of this document. In step 1330, words are extracted with reference to the delimiter table stored in the delimiter table storage area 1200 on the magnetic disk 1070. FIG. 5 shows the structure of the delimiter table in this embodiment. The table has one entry per character code; the delimiter flag is set to "1" for delimiter characters, which mark the boundaries between words, and to "0" for all other characters. In the example of FIG. 5, " " (space), "!" (exclamation mark), "/", and so on are defined as delimiters. Each character of the document text "The president of Bank of America has decided …" is looked up in the delimiter table, and the characters whose flag is 1 (in this example, the space) are identified as boundaries. As shown in FIG. 6, the words "The", "president", "of", "Bank", "of", "America", "has", "decided", and so on are extracted in this way.
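A delimiter table of the kind described for FIG. 5 can be sketched as a flag array indexed by character code. The 128-entry ASCII table and the function names are assumptions; the space, "!" and "/" come from the FIG. 5 description, while the comma and period are taken from the earlier description of word boundaries.

```python
DELIMITERS = " !/,."

# One flag per character code: 1 marks a delimiter, 0 any other character.
DELIMITER_TABLE = [0] * 128
for ch in DELIMITERS:
    DELIMITER_TABLE[ord(ch)] = 1

def is_delimiter(ch):
    """Look the character up in the table; codes outside it are not delimiters."""
    code = ord(ch)
    return code < len(DELIMITER_TABLE) and DELIMITER_TABLE[code] == 1

def extract_words(text):
    """Scan the text once and emit each run of characters between delimiters."""
    words, current = [], []
    for ch in text:
        if is_delimiter(ch):
            if current:
                words.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        words.append("".join(current))
    return words

print(extract_words("The president of Bank of America has decided"))
# ['The', 'president', 'of', 'Bank', 'of', 'America', 'has', 'decided']
```

A table lookup per character keeps the scan to a single pass and lets the set of delimiters be changed without touching the extraction code.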

[0015] In step 1340, each word extracted in step 1330 is checked against the high-frequency word list stored in the high-frequency word list storage area 1210 on the magnetic disk 1070. This embodiment assumes that the high-frequency word list contains words corresponding to the stop list described in known example 1. If, for example, the words shown in FIG. 7 are stored as high-frequency words, "the", "of", and "has" are output with a high-frequency word flag of "1". Finally, in step 1350, search index data is generated for the words flagged in step 1340. The value of the high-frequency word flag is tested. If it is "1", the word is treated as a high-frequency word, and the pair of the document identifier and the word position counted from the beginning of the document is generated as search index data for the word together with the one word that follows it. If it is "0", the word is treated as an ordinary word, and the same pair of document identifier and word position is generated as search index data for that single word.
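The flagging in step 1340 amounts to a membership test against the high-frequency word list. In the sketch below only "the", "of", and "has" are taken from the text; the remaining list entries, the function name, and the case-insensitive comparison (suggested by "The" being flagged through the entry "the") are assumptions.

```python
# Illustrative list; the actual contents of FIG. 7 are not reproduced here.
HIGH_FREQUENCY_WORDS = {"a", "the", "and", "but", "of", "to", "has"}

def flag_high_frequency(words):
    """Attach flag 1 to high-frequency words and flag 0 to all others."""
    return [(w, 1 if w.lower() in HIGH_FREQUENCY_WORDS else 0) for w in words]

words = ["The", "president", "of", "Bank", "of", "America", "has", "decided"]
for word, flag in flag_high_frequency(words):
    print(flag, word)
```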

[0016] That is, in the example shown in FIG. 8, “The” is input first. Since the high-frequency word flag for this word is “1”, it is treated as a high-frequency word, and search index data is generated for “The president”, which combines “The” with the one word “president” that follows it. Specifically, D1 is set as the document identifier (“D” indicates a document identifier) and, because the word is at the head of the document, the initial value P1 is set as the word position (“P” indicates a word position); with this pair, search index data is generated for “The president”. Next, “president” is input. Since its high-frequency word flag is “0”, it is treated as a general word other than a high-frequency word, and search index data is generated for the word “president” itself. Specifically, D1 is set as the document identifier and P2 as the word position, and with this pair search index data is generated for “president”. By repeating the same processing, search index data for the document is built up in the work area 1150 in the main memory 1060, and when all input data has been processed, it is added to the search index storage area 1220 on the magnetic disk 1070. Then, in step 1310, the next document to be registered is read into the work area 1150 in the main memory 1060, and the same processing is repeated for all documents to be registered (step 1300). The above is an example of the document registration processing in this embodiment. In this embodiment, search index data is generated in the work area 1150 in the main memory 1060 and the search index storage area 1220 on the magnetic disk 1070 is updated one registered document at a time, but this processing may also be performed in units of several documents.
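The registration step described above can be illustrated with a minimal Python sketch. It is not the patented implementation: the function name `build_search_index`, the in-memory dictionary layout, the lowercase normalization, the sample sentence, and the handling of a high-frequency word at the very end of a document are all assumptions made for illustration.

```python
from collections import defaultdict

# Assumed sample list; the text names "a", "the", "of" and "has" as examples.
HIGH_FREQUENCY_WORDS = {"a", "the", "of", "has"}


def build_search_index(doc_id, words, high_frequency_words=HIGH_FREQUENCY_WORDS):
    """Return {index term: [(doc_id, word position), ...]} for one document."""
    index = defaultdict(list)
    normalized = [w.lower() for w in words]  # assumption: case-insensitive terms
    for i, word in enumerate(normalized):
        position = i + 1  # P1 is the first word of the document
        if word in high_frequency_words and i + 1 < len(normalized):
            # Flag "1": index the high-frequency word together with the next word.
            term = f"{word} {normalized[i + 1]}"
        else:
            # Flag "0" (or, by assumption, a trailing high-frequency word
            # with no follower): index the single word.
            term = word
        index[term].append((doc_id, position))
    return dict(index)


# Hypothetical document text; only "The president" is taken from the example.
index = build_search_index("D1", ["The", "president", "of", "America"])
```

In this sketch “the president” is recorded at (D1, 1) and “president” at (D1, 2), matching the P1 and P2 values in the example, and no entry is created for “the” on its own.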

[0017] Next, the document search server of this embodiment, that is, 2000 in FIG. 1, will be described with reference to the drawings. FIG. 9 shows the configuration of the document search server 2000 in this embodiment. The document search server 2000 shown in the figure comprises a display 2010 that shows the execution status of processing and the like, a keyboard 2020 for entering commands such as server start and stop, a central processing unit (CPU) 2030 that executes search processing, a main memory 2040 that temporarily stores search programs and data, a magnetic disk 2050 that stores various data and programs, and a bus 2060 connecting them. A system control program 2100, a search control program 2110, a search word extraction program 2120, a search high-frequency word determination program 2130, and a search execution program 2140 are read from the magnetic disk 2050 and stored in the main memory 2040, and a work area 2150 is also secured there. The search word extraction program 2120 and the search high-frequency word determination program 2130 are described as programs separate from the registration word extraction program (1120 in FIG. 2) and the registration high-frequency word determination program 1130, respectively, but in this embodiment the processing they perform is the same. On the magnetic disk 2050, a delimiter table storage area 2200, a high-frequency word list storage area 2210, a search index storage area 2220, and a various-program storage area 2230 are secured.

[0018] As stated in the overview of the whole system, the contents of the delimiter table storage area 2200, the high-frequency word list storage area 2210, and the search index storage area 2220 are transferred from the document registration subsystem 1000 via the network 5000. In this embodiment they hold the same data as the delimiter table storage area 1200, the high-frequency word list storage area 1210, and the search index storage area 1220 in FIG. 2, respectively. In this embodiment these storage areas are secured on the magnetic disk 2050, but another secondary storage device such as a magneto-optical disk device may be used instead. The above is the configuration of the document search server 2000.

[0019] Next, the procedure of the document search processing in this embodiment will be described. First, when a search term is entered as a search command from the document search client 3000 or 4000 in FIG. 1, the search command is passed to the document search server 2000 via the network 5000. When the document search server 2000 receives the search command, the system control program 2100 starts the search control program 2110, and document search processing begins. The processing at search time is described below using the PAD shown in FIG. 10. In step 2300, the search control program 2110 executes the search word extraction program 2120, which checks each character of the search term, from beginning to end, against the delimiter table 2200 on the magnetic disk 2050 to identify the delimiters in the search term. Each run of characters enclosed between two delimiters is then extracted as a word. Next, in step 2310, the search high-frequency word determination program 2130 is executed to determine whether each word extracted in step 2300 is a high-frequency word. That is, by referring to the high-frequency word list stored in the high-frequency word list storage area 2210 on the magnetic disk 2050, it is determined whether each word is defined in the list, and hence whether it is a high-frequency word. Finally, in step 2320, the search execution program 2140 is executed, which refers to the search index stored in the search index storage area 2220 on the magnetic disk 2050 and performs the document search.
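Steps 2300 and 2310 can be sketched roughly as follows. The delimiter set, the word list, and the function names `extract_words` and `flag_high_frequency` are hypothetical; the actual contents of the delimiter table (FIG. 5) and the high-frequency word list (FIG. 7) are not reproduced here.

```python
# Assumed delimiter table and high-frequency word list, for illustration only.
DELIMITERS = {" ", ",", ".", "\n"}
HIGH_FREQUENCY_WORDS = {"a", "the", "of", "has"}


def extract_words(text, delimiters=DELIMITERS):
    """Step 2300 (sketch): scan character by character and cut words at delimiters."""
    words, current = [], []
    for ch in text:
        if ch in delimiters:
            if current:  # close the word that ended at this delimiter
                words.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:  # last word when the text does not end with a delimiter
        words.append("".join(current))
    return words


def flag_high_frequency(words, high_frequency_words=HIGH_FREQUENCY_WORDS):
    """Step 2310 (sketch): attach flag 1 to listed words and 0 to all others."""
    return [(w, 1 if w.lower() in high_frequency_words else 0) for w in words]


words = extract_words("Bank of America")
flags = flag_high_frequency(words)
```

Comparing in lowercase is an assumption of this sketch; the text does not say how letter case is handled.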

[0020] The processing in step 2320, that is, the processing of the search execution program 2140, will now be described in more detail with reference to FIG. 11. The search execution program 2140 first determines, in step 2400, whether the first word is a high-frequency word, based on the determination result of step 2310. If the result in step 2400 is that the word is a high-frequency word, the search index data for the pair consisting of that word and the one word following it is referred to in step 2410, and in step 2420 the adjacency determination word count, which is used later in the adjacency determination between indexes, is set to 2. If the result in step 2400 is that the word is a general word other than a high-frequency word, the search index data for that word is referred to in step 2430, and the adjacency determination word count is set to 1 in step 2440. Then, in step 2450, it is determined whether the word of interest is at the end of the search term. If it is not, the series of steps from 2460 to 2540 is executed, and the search term is matched while adjacency is determined against the index data for the words that follow in the search term. That is, in step 2460, in preparation for the adjacency determination between indexes, the search index data extracted in step 2410 or step 2430 is moved to an adjacency determination target area secured in the work area 2150 in the main memory 2040. Then, in step 2470, the series of steps from 2480 to 2530 is repeated until the end of the search term is reached. First, in step 2480, it is determined whether the word of interest is a high-frequency word. If it is, the search index data for that word and the one word following it is referred to in step 2490, and adjacency determination is performed against the index data stored in the adjacency determination target area. Because the term resulting from this adjacency determination grows by two words, 2 is added to the adjacency determination word count in step 2500. If the result in step 2480 is a general word that is not a high-frequency word, the search index data for that word is referred to in step 2510, and adjacency determination is performed against the index data stored in the adjacency determination target area. Because the resulting term grows by one word, 1 is added to the adjacency determination word count in step 2520. Then, in preparation for the adjacency determination with the next word, the adjacency determination result is moved in step 2530 to the adjacency determination target area held in the work area 2150 in the main memory 2040. This processing is repeated to the end of the search term; when the repetition of step 2470 is complete, the data in the adjacency determination target area is output in step 2540 as the search result for the search term, and processing ends. If the determination in step 2450 finds that the word is at the end of the search term, step 2550 is executed: the search index data extracted in step 2410 or step 2430 is output as the search result for the search term, and processing ends. The above is the document search procedure in this embodiment.
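The flow of FIG. 11 can be approximated by the following Python sketch. It assumes an in-memory index of the form {term: [(document identifier, word position), ...]}, lowercased words, and a hypothetical function name `search_phrase`; the sample index contents are invented for illustration. The variable `offset` plays the role of the adjacency determination word count.

```python
def search_phrase(index, words, high_frequency_words):
    """Return sorted (doc_id, start position) pairs where the phrase occurs."""

    def unit_at(i):
        # A high-frequency word is looked up together with the next word
        # (steps 2410/2490); any other word is looked up alone (2430/2510).
        # Assumption: a high-frequency word at the very end is looked up alone.
        if words[i] in high_frequency_words and i + 1 < len(words):
            return f"{words[i]} {words[i + 1]}", 2
        return words[i], 1

    term, consumed = unit_at(0)
    candidates = set(index.get(term, []))  # adjacency determination target area
    offset = consumed                      # adjacency determination word count
    while offset < len(words):             # loop of step 2470
        term, consumed = unit_at(offset)
        postings = set(index.get(term, []))
        # Keep a candidate only if the next unit occurs in the same document
        # exactly `offset` words after the start of the phrase.
        candidates = {(d, p) for (d, p) in candidates if (d, p + offset) in postings}
        offset += consumed                 # steps 2500/2520
    return sorted(candidates)


# Hypothetical index contents for two documents.
sample_index = {
    "bank": [("D1", 5), ("D2", 1)],
    "of america": [("D1", 6)],
    "america": [("D1", 7)],
    "has decided": [("D1", 8)],
    "decided": [("D1", 9)],
}
hf = {"of", "has"}
hits = search_phrase(sample_index, ["bank", "of", "america"], hf)
```

With this sample data only one index lookup is needed for “of america”, instead of separate lookups and an adjacency check for “of” and “america”, which is the saving the text describes.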

[0021] Next, the document search processing in this embodiment will be described with reference to the drawings, taking as an example the case where the search term “Bank of America” is specified. First, in the repeated processing of step 2300 in FIG. 10, words are extracted from the search term “Bank of America” by reference to the delimiter table. The delimiter table in this embodiment has the configuration shown in FIG. 5, the same as the one used at registration. That is, from the search term “Bank of America”, the words “Bank”, “of”, and “America” are extracted with “ ” (space) as the delimiter, as shown in FIG. 12. In step 2310, each word extracted in step 2300 is checked against the high-frequency word list stored in the high-frequency word list storage area 1210 on the magnetic disk 1070. As with the delimiter table, the high-frequency word list referred to is the same as the one used at registration (FIG. 7). As a result, as shown in FIG. 13, a high-frequency word flag of “1” is attached to “of” and a flag of “0” to “Bank” and “America”, and the result is output. Finally, in step 2320, the search is executed on the word sequence flagged in step 2310, following the procedure shown in FIG. 11. An example of this processing is described using FIG. 14 and FIG. 15. First, in FIG. 14, attention is placed on “Bank”, the first word of the sequence “Bank”, “of”, “America”. In the determination of step 2400 in FIG. 11, the high-frequency word flag for “Bank” is “0”, so it is processed as a general word other than a high-frequency word. That is, in step 2430 the search index data for “Bank” is read from the search index storage area 2220 on the magnetic disk 2050, and in step 2440 the adjacency determination word count is set to “1”. Next, in step 2450 in FIG. 11, it is determined whether “Bank” is the last word of the search term. In this example “of America” follows, so “Bank” is not the last word and the series of steps from 2460 onward is executed. That is, in step 2460, the search index data for “Bank” is moved to the adjacency determination target area secured in the work area 2150 in the main memory 2040. Next, as shown in FIG. 15, it is determined in step 2480 whether “of”, the word following “Bank”, is a high-frequency word. The high-frequency word flag for “of” is “1”, so it is judged to be a high-frequency word. Step 2490 is therefore executed, and the search index data for the pair consisting of that word and the one word following it, that is, “of America”, is referred to. This is subjected to adjacency determination, with an adjacency determination word count of “1”, against the search index data for “Bank” stored in the adjacency determination target area (that is, entries are extracted where “Bank” and “of America” occur in the same document and their word positions differ by exactly 1), which yields the search result for “Bank of America”. In step 2500, “2” is added to the current adjacency determination word count of “1”, making it 3 (since the end of the search term has been reached at this point, this value is not actually used). Because “America”, which follows “of”, is at the end of the search term, the repetition of step 2470 ends here. Finally, in step 2540, the adjacency determination result, that is, the search result for “Bank of America”, is output as the search result for the search term, and the search ends. The above is an example of the search processing in this embodiment.

[0022] In this embodiment, if the single word “Bank” is specified as the search term, the determination in step 2450 finds that the word is at the end of the search term. In that case step 2550 is executed, and the search index data referred to for “Bank” is itself output as the search result. If further words follow “Bank of America” in the search term, as in “Bank of America has decided”, the processing from step 2480 is repeated further: the search index is referred to and adjacency is determined for “has decided” and so on following “Bank of America”, and the search result is obtained. The above is the processing at document search time in this embodiment.

[0023] As described above, according to the present invention, when a document is registered, a search index is created for each high-frequency word in the document by pairing it with the one word that follows it. At search time, for a phrase containing a high-frequency word, the search index data for the pair consisting of the high-frequency word and the following word is referred to. This reduces the amount of search index data read at search time and, in turn, greatly improves phrase search performance. This embodiment has described creating search index data for a high-frequency word paired with the one word that follows it, but search performance can be improved further by grouping it with a sequence of two or more consecutive words instead. Also, in this embodiment the high-frequency words registered in the high-frequency word list were defined as stop words (excluded words) such as those given as a stop list in Known Example 1, but they may instead be defined from statistical information, such as occurrence frequencies in a predetermined number of sample documents. They may also be selected, using search history information, from among the search terms entered by end users that took a long time to search. Furthermore, this embodiment has been described for English data as the search target, but it can also be applied to documents in other languages, including German and French.
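As one possible reading of the statistical alternative mentioned above, a high-frequency word list could be derived from sample documents by simple occurrence counts. The function name `frequent_words`, the whitespace tokenization, the `top_n` cutoff, and the sample texts are assumptions; the text does not specify how the statistics are to be computed.

```python
from collections import Counter


def frequent_words(sample_documents, top_n):
    """Return the top_n most frequent lowercase words across the sample documents."""
    counts = Counter(
        word.lower() for document in sample_documents for word in document.split()
    )
    return {word for word, _ in counts.most_common(top_n)}


# Hypothetical sample texts.
samples = ["the step of the method", "the apparatus of claim"]
derived_list = frequent_words(samples, top_n=2)
```

A frequency threshold, rather than a fixed `top_n`, would be an equally plausible design under the same description.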

[0024] As described above, the first embodiment of the present invention assumed that words corresponding to the stop list described in Known Example 1 are registered in the high-frequency word list, and described a method of extracting high-frequency words from a document to be registered by reference to a high-frequency word list defined in advance. However, the document search method of that embodiment has the problem that sufficient search performance cannot be obtained for phrases containing high-frequency words specific to a database. For example, in a database of English-language patent specifications, words such as “step”, “method”, and “apparatus” can be expected to occur very frequently. Because a stop-list-based high-frequency word list does not define these as high-frequency words, for phrases containing them, such as “step rocket” (multistage rocket) or “step function”, the index data for the frequent word “step” and the index data for the adjacent word must both be read from secondary storage such as a magnetic disk and subjected to adjacency determination. The search time therefore becomes markedly longer. In the second embodiment of the present invention, the document registration subsystem therefore determines, when the search index is created and registered, whether the index data for each word has exceeded a predetermined index size. For a word that has exceeded the predetermined index size, an index is generated and registered for the two consecutive words formed by that word and the one word adjacent to it, and the word is registered in the high-frequency word list. This makes it possible to realize a document search system that dynamically registers words occurring frequently in registered documents in the high-frequency word list and can search at high speed for phrases containing those words.

[0025] FIG. 16 shows the configuration of the document registration subsystem in the second embodiment of the present invention. It has nearly the same configuration as the document registration subsystem of the first embodiment shown in FIG. 2, except that the search index creation and registration program 1140 of FIG. 2 is replaced by a high-frequency-word-extracting search index creation and registration program 1141, and a search index extension program 1160 is newly added. Next, the procedure of document registration in this embodiment will be described using the PAD shown in FIG. 17. Since most of the procedure is the same as in the first embodiment shown in FIG. 3, only steps 1351 and 1360, which differ from the first embodiment, are explained here. First, in step 1351, the search index creation and registration program 1141 is executed and, as in step 1350 of FIG. 3, search index data is created and registered for the words extracted in step 1330. In addition, it is determined whether the size of the index data for each word registered here is larger than a predetermined value; if so, the word is extracted as a high-frequency word and stored in the work area 1150 in the main memory 1060. Then, in step 1360, the work area 1150 in the main memory 1060 is consulted to determine whether any newly extracted high-frequency word exists, and if one does, steps 1370 and 1380 are executed. That is, in step 1370, the registered text (not shown) is searched for the high-frequency words stored in the work area 1150 in the main memory 1060. The one word adjacent to each high-frequency word obtained in step 1351 is extracted, and full-text search index data is created for the resulting two consecutive words (the index data is created here by the method shown in step 1420 and elsewhere in FIG. 4). Finally, in step 1380, the high-frequency words obtained in step 1351 are added to the high-frequency word list storage area 1210, and document registration is complete. The above is an overview of the registration processing in the second embodiment of the present invention.
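A rough Python sketch of steps 1351 to 1380 follows. It measures “index size” as the number of index entries for a word, keeps the registered texts in memory, and pairs each promoted word with the word that follows it, as in the first embodiment. These choices, together with the name `register_document` and the sample data, are assumptions for illustration only.

```python
def register_document(index, texts, high_frequency_words, doc_id, words, max_entries):
    """Index one document and promote words whose index grows past max_entries."""
    texts[doc_id] = words  # stands in for the registered text store
    newly_frequent = set()
    for i, word in enumerate(words):
        position = i + 1
        if word in high_frequency_words and i + 1 < len(words):
            term = f"{word} {words[i + 1]}"
        else:
            term = word
        index.setdefault(term, []).append((doc_id, position))
        # Step 1351 (sketch): compare the index size with the threshold.
        if word not in high_frequency_words and len(index[term]) > max_entries:
            newly_frequent.add(word)
    for word in newly_frequent:
        # Step 1370 (sketch): rescan registered texts and index two-word pairs.
        for registered_id, registered_words in texts.items():
            for i, candidate in enumerate(registered_words):
                if candidate == word and i + 1 < len(registered_words):
                    pair = f"{word} {registered_words[i + 1]}"
                    index.setdefault(pair, []).append((registered_id, i + 1))
        # Step 1380 (sketch): add the word to the high-frequency word list.
        high_frequency_words.add(word)
    return newly_frequent


index, texts, hf_list = {}, {}, {"of"}
promoted = register_document(
    index, texts, hf_list, "D1", ["step", "function", "step", "rocket"], max_entries=1
)
later = register_document(index, texts, hf_list, "D2", ["step", "motor"], max_entries=1)
```

After the first call “step” has been promoted, so the second document is indexed under “step motor” rather than under “step” alone.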

[0026] Thus, according to this embodiment, it is determined at document registration whether the index size for each word is larger than a predetermined index size. A word that exceeds the predetermined index size is treated as a high-frequency word: index data for the two consecutive words formed by that word and the one word adjacent to it is created from the registered text and registered, and the word is additionally registered in the high-frequency word list. As a result, for a phrase containing such a word, the search index for the two-word pair containing it is referred to at search time, so that phrase searches containing any high-frequency word can be performed at high speed.

[0027] The reference index size used in step 1351 as the criterion for deciding whether a word is a high-frequency word has been described in this embodiment as a predetermined value, but it may instead be set as configuration information of the system. Also, in this embodiment, the extraction of high-frequency words in step 1351 and the search index extension and high-frequency word registration in steps 1360 to 1380 are executed each time a document is registered. However, the processing of steps 1360 to 1380 may be deferred for a while and executed at a time suitable for database maintenance. Further, in this embodiment the series of steps 1360 to 1380 is executed for high-frequency words extracted at document registration, but with the configuration shown in FIG. 18 it can also be executed for words specified from outside as high-frequency words.

[0028]

[Effects of the Invention] According to the present invention, when a phrase containing a high-frequency word is specified as a search term, the search index data created in advance at document registration for the pair consisting of the high-frequency word and the one word following it is referred to. This reduces the amount of search index data read at search time and, in turn, greatly improves phrase search performance.

[Brief description of the drawings]

FIG. 1 is a diagram showing the configuration of the first embodiment of the present invention.

FIG. 2 is a diagram showing the configuration of the document registration subsystem in the first embodiment.

FIG. 3 is a diagram showing the processing flow at document registration in the first embodiment.

FIG. 4 is a diagram showing the processing flow of the search index creation and registration program in the first embodiment.

FIG. 5 is a diagram showing the configuration of the delimiter table in the first embodiment.

FIG. 6 is a diagram showing an outline of the word extraction processing at document registration in the first embodiment.

FIG. 7 is a diagram showing an outline of the high-frequency word determination processing at document registration in the first embodiment.

FIG. 8 is a diagram showing an outline of the search index creation and registration processing at document registration in the first embodiment.

FIG. 9 is a diagram showing the configuration of the search server in the first embodiment.

FIG. 10 is a diagram showing the processing flow at document search in the first embodiment.

FIG. 11 is a diagram showing the processing flow of the search execution program in the first embodiment.

FIG. 12 is a diagram showing an outline of the word extraction processing at search time in the first embodiment.

FIG. 13 is a diagram showing an outline of the high-frequency word determination processing at search time in the first embodiment.

FIG. 14 is a diagram showing the processing at search time in the first embodiment.

FIG. 15 is a diagram showing the processing at search time in the first embodiment.

FIG. 16 is a diagram showing the configuration of the document registration subsystem in the second embodiment of the present invention.

FIG. 17 is a diagram showing the processing flow at document registration in the second embodiment.

FIG. 18 is a diagram showing the configuration of a document registration subsystem for adding high-frequency words specified by external input.

[Explanation of symbols]

1000 Document registration subsystem
2000 Document search server
3000 Search client 1
4000 Search client 2
5000 Network
1010 Display
1020 Keyboard
1030 Central processing unit (CPU)
1040 Floppy disk driver
1050 Floppy disk
1060 Main memory
1070 Magnetic disk
1100 System control program
1110 Registration control program
1120 Registration word extraction program
1130 Registration high-frequency word determination program
1140 Search index creation registration program
1141 High-frequency word extraction type search index creation registration program
1150 Work area
1160 Search index expansion program
1170 High-frequency word input designation program
1200 Delimiter table storage area
1210 High-frequency word list storage area
1220 Search index storage area
1230 Various program storage area
2010 Display
2020 Keyboard
2030 Central processing unit (CPU)
2040 Main memory
2050 Magnetic disk
2100 System control program
2110 Search control program
2120 Search word extraction program
2130 Search high-frequency word determination program
2140 Search execution program
2150 Work area
2200 Delimiter table storage area
2210 High-frequency word list storage area
2220 Search index storage area
2230 Various program storage area

Continuation of front page

(72) Inventor: Natsuko Sugaya, c/o Business Solution Development Headquarters, Hitachi, Ltd., 890 Kashimada, Saiwai-ku, Kawasaki-shi, Kanagawa
(72) Inventor: Tadataka Matsubayashi, c/o Business Solution Development Headquarters, Hitachi, Ltd., 890 Kashimada, Saiwai-ku, Kawasaki-shi, Kanagawa
(72) Inventor: Yasuhiko Inaba, c/o Business Solution Development Headquarters, Hitachi, Ltd., 890 Kashimada, Saiwai-ku, Kawasaki-shi, Kanagawa
(72) Inventor: Yasushi Kawashita, c/o Software Division, Hitachi, Ltd., 5030 Totsuka-cho, Totsuka-ku, Yokohama-shi, Kanagawa

F-term (reference): 5B075 NK02 NK12 NK24 NK32; 5B082 EA05 GC04

Claims

[Claims]

1. A document search method in a document search system that searches a set of previously registered documents for documents containing a character string specified as a search term, wherein the process of registering a document comprises: a registration word extraction step of extracting a sequence of one or more characters as a word from the document to be registered; a high-frequency word determination step of determining whether the extracted word is a word defined as a high-frequency word; and a search index creation registration step of, when the determination result of the high-frequency word determination step is that the word is a high-frequency word, extracting one word connected to the high-frequency word, and creating and registering search index data for the word string consisting of the high-frequency word and the extracted one word.
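
The registration steps of claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names, the regular expression standing in for the delimiter table, the in-memory dictionary used as the index, and the three-word high-frequency list are all assumptions made for illustration. Following the abstract, the sketch also indexes each single word with its document identifier and position.

```python
import re
from collections import defaultdict

# Assumed high-frequency word list (the abstract names "a", "the", "of" as examples).
HIGH_FREQUENCY_WORDS = {"a", "the", "of"}


def extract_words(text):
    """Registration word extraction: split text into lowercase words.

    A regular expression stands in for the delimiter table (an assumption).
    """
    return re.findall(r"[a-z0-9]+", text.lower())


def register_document(index, doc_id, text, high_frequency_words=HIGH_FREQUENCY_WORDS):
    """Add one document to the index (a dict of term -> [(doc_id, position)])."""
    words = extract_words(text)
    for pos, word in enumerate(words):
        # Every word is indexed with its document identifier and position.
        index[word].append((doc_id, pos))
        # High-frequency word determination, then index the word string made of
        # the high-frequency word and the one word that follows it.
        if word in high_frequency_words and pos + 1 < len(words):
            index[word + " " + words[pos + 1]].append((doc_id, pos))
    return index


index = defaultdict(list)
register_document(index, 1, "The history of the world")
```

In this sketch the posting list for "the history" contains only the positions where that exact pair occurs, so it stays far shorter than the posting list for "the" alone.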

2. A document search method in a document search system that searches a set of previously registered documents for documents containing a character string specified as a search term, wherein the process of searching for a document comprises: a search word extraction step of extracting a sequence of one or more characters as a word from the character string specified as the search term; a high-frequency word determination step of determining whether the extracted word is a word defined as a high-frequency word; and a search execution step of, when the determination result of the high-frequency word determination step is that the word is a high-frequency word, extracting one word connected to the high-frequency word, and performing the search by referring to the search index data for the word string consisting of the high-frequency word and the extracted one word.
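
The search-side steps of claim 2 can be sketched in Python as follows. This is a hedged illustration only: the index layout (term mapped to a list of document identifier and position pairs), the names `query_terms` and `search_phrase`, the example index, and the rule that a word already covered by a preceding pair is not looked up again are assumptions, not details taken from the claims.

```python
HIGH_FREQUENCY_WORDS = {"a", "the", "of"}  # assumed list

# Assumed index contents for two documents:
#   1: "the history of the world"    2: "the world of history"
EXAMPLE_INDEX = {
    "the": [(1, 0), (1, 3), (2, 0)],
    "history": [(1, 1), (2, 3)],
    "of": [(1, 2), (2, 2)],
    "world": [(1, 4), (2, 1)],
    "the history": [(1, 0)],
    "of the": [(1, 2)],
    "the world": [(1, 3), (2, 0)],
    "of history": [(2, 2)],
}


def query_terms(query, high_frequency_words):
    """Turn a query into (offset, term) pairs.

    A high-frequency word followed by another word becomes a two-word term.
    """
    words = query.lower().split()
    terms = []
    covered = False  # True when the previous two-word term already covers this word
    for i, word in enumerate(words):
        if word in high_frequency_words and i + 1 < len(words):
            terms.append((i, word + " " + words[i + 1]))
            covered = True
        else:
            if not covered:
                terms.append((i, word))
            covered = False
    return terms


def search_phrase(index, query, high_frequency_words=HIGH_FREQUENCY_WORDS):
    """Return the ids of documents containing the query as a contiguous phrase."""
    candidates = None
    for offset, term in query_terms(query, high_frequency_words):
        # Normalise each hit to the position where the whole phrase would start.
        starts = {(doc_id, pos - offset) for doc_id, pos in index.get(term, [])}
        candidates = starts if candidates is None else candidates & starts
        if not candidates:
            return set()
    return {doc_id for doc_id, _ in (candidates or set())}
```

For the query "history of the world", the sketch looks up "history", "of the" and "the world" rather than the long posting lists for "of" and "the", which is the saving the method aims at.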

3. The document search method according to claim 1, wherein a word group considered to have a high frequency of appearance in the document database of the document search system is defined in advance, and the high-frequency word determination step determines whether the extracted word is a word included in the word group.

4. The document search method according to claim 1, wherein a word group considered to have a high frequency of appearance is defined in advance on the basis of statistical information, such as appearance frequency information, on each word registered in the document database of the document search system, and the high-frequency word determination step determines whether the extracted word is a word included in the word group.
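
One way to derive such a word group from appearance frequency statistics is sketched below in Python. The claim does not specify the statistic or a threshold, so the share-of-all-occurrences criterion, the `min_share` parameter, and the function name are assumptions.

```python
from collections import Counter


def select_high_frequency_words(documents, min_share=0.05):
    """Return words whose share of all word occurrences is at least min_share.

    documents is an iterable of text strings; whitespace splitting stands in
    for the real word extraction (an assumption).
    """
    counts = Counter(word for text in documents for word in text.lower().split())
    total = sum(counts.values())
    if total == 0:
        return set()
    return {word for word, count in counts.items() if count / total >= min_share}


docs = ["the cat sat on the mat", "the dog and the cat"]
frequent = select_high_frequency_words(docs, min_share=0.3)
```

The resulting set can then serve as the predefined word group consulted by the high-frequency word determination step.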

5. A document search method in a document search system that searches a set of previously registered documents for documents containing a character string specified as a search term, wherein the process of registering a document comprises: a registration word extraction step of extracting a sequence of one or more characters as a word; a high-frequency word determination step of determining whether the word extracted in the registration word extraction step is a high-frequency word that appears frequently in the set of registered documents; a high-frequency word extraction type search index creation registration step of, when the determination result of the high-frequency word determination step is that the word is a high-frequency word, extracting one word connected to the high-frequency word and creating and registering search index data for the word string consisting of the high-frequency word and the extracted one word, and, when the word is not a high-frequency word, determining whether the index capacity of the index data for the word is larger than a predetermined index capacity and, if it is larger, newly extracting the word as a high-frequency word; and a search index expansion step of, for the high-frequency word newly extracted in the high-frequency word extraction type search index creation registration step, referring to the set of registered documents, extracting the word string consisting of the word and one word connected to the word, creating and registering search index data for the word string, and newly registering the word as a high-frequency word.
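
The capacity check and index expansion described in claim 5 can be sketched in Python as follows. The sketch is an illustration under stated assumptions: the number of postings stands in for "index capacity", `max_postings` plays the role of the predetermined capacity, and the function name and data layout (index as a dict of term to posting list, documents as a dict of id to text) are hypothetical.

```python
def expand_index(index, documents, high_frequency_words, max_postings):
    """Promote oversized single-word entries to high-frequency words.

    For every promoted word, the registered documents are re-scanned and an
    entry is added for the word plus the one word that follows it.
    Returns the set of newly promoted words.
    """
    promoted = set()
    for term, postings in index.items():
        is_single_word = " " not in term
        if is_single_word and term not in high_frequency_words:
            # Posting count is used as a proxy for index capacity (assumption).
            if len(postings) > max_postings:
                promoted.add(term)

    # Search index expansion: refer back to the registered documents.
    for doc_id, text in documents.items():
        words = text.lower().split()
        for pos in range(len(words) - 1):
            if words[pos] in promoted:
                pair = words[pos] + " " + words[pos + 1]
                index.setdefault(pair, []).append((doc_id, pos))

    # Newly register the promoted words as high-frequency words.
    high_frequency_words.update(promoted)
    return promoted
```

After expansion, later registrations and searches treat the promoted words exactly like the words on the original high-frequency list.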

6. The document search method according to claim 5, further comprising configuration information for setting the index capacity that serves as the criterion for determining a high-frequency word.

7. A document search method in a document search system that searches a set of previously registered documents for documents containing a character string specified as a search term, wherein the process of registering a document comprises: a registration word extraction step of extracting a sequence of one or more characters as a word; a high-frequency word determination step of determining whether the word extracted in the registration word extraction step is a high-frequency word that appears frequently in the set of registered documents; a search index creation registration step of, when the determination result of the high-frequency word determination step is that the word is a high-frequency word, extracting one word connected to the high-frequency word, and creating and registering search index data for the word string consisting of the high-frequency word and the extracted one word; a high-frequency word input designation step of inputting and designating a word group to be newly registered as high-frequency words; and a search index expansion step of, for each word input in the high-frequency word input designation step, referring to the set of registered documents, extracting the word string consisting of the word and one word connected to the word, creating and registering search index data for the word string, and newly registering the word as a high-frequency word.

8. A computer-readable recording medium recording a document registration program for performing document registration processing in a document search system that searches a set of previously registered documents for documents containing a character string specified as a search term, the document registration program comprising: a registration word extraction procedure for extracting a sequence of one or more characters as a word; a high-frequency word determination procedure for determining whether the extracted word is a word defined as a high-frequency word; and a search index creation registration procedure for, when the determination result of the high-frequency word determination procedure is that the word is a high-frequency word, extracting one word connected to the high-frequency word, and creating and registering search index data for the word string consisting of the high-frequency word and the extracted one word.

9. A computer-readable recording medium recording a document search program for performing document search processing in a document search system that searches a set of previously registered documents for documents containing a character string specified as a search term, the document search program comprising: a search word extraction procedure for extracting a sequence of one or more characters as a word from the character string specified as the search term; a high-frequency word determination procedure for determining whether the extracted word is a word defined as a high-frequency word; and a search execution procedure for, when the determination result of the high-frequency word determination procedure is that the word is a high-frequency word, extracting one word connected to the high-frequency word, and performing the search by referring to the search index data for the word string consisting of the high-frequency word and the extracted one word.