JP2002183144A

JP2002183144A - System and method for document retrieval and recording medium

Info

Publication number: JP2002183144A
Application number: JP2000384439A
Authority: JP
Inventors: Sakiko Honma; 咲子本間
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 2000-12-18
Filing date: 2000-12-18
Publication date: 2002-06-28

Abstract

PROBLEM TO BE SOLVED: To provide a document retrieval system, a document retrieval method, and a recording medium that improve retrieval precision for hyphen-related notational variants by expanding search terms with reference to patterns collected from the documents to be searched, without normalizing the character strings registered in the index. SOLUTION: The system retrieves a document desired by a user from among a plurality of documents and comprises: document storage means that stores document data written in a language in which word boundaries are explicit; character string extraction means that extracts, from the stored document data, character strings delimited by specific delimiter characters together with information such as their appearance positions; and normalization information storing means that, when an extracted character string matches a specific pattern, stores the character string and its normalized character string in association with each other in normalization information storage means.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a document retrieval system, a document retrieval method, and a recording medium for retrieving document data desired by a user from a document database, and more particularly to a retrieval system for document data written in a language, such as English, in which word boundaries are explicit.

[0002]

[Prior Art] In a conventional document retrieval system, when full-text search is performed on document data written in a language such as English, in which word boundaries are explicit, each document is split at index registration time into word-level character strings, using delimiter characters such as spaces, periods, and commas as boundaries. Each extracted character string is stored as index information together with the identifier of the document in which it appears and its position in that document (the number of words from the beginning of the document). At search time, the system extracts word character strings from the query entered by the user as text, in the same way as at index registration, creates a search condition from the extracted character strings, and executes the search.
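
The conventional indexing and search flow described above can be sketched roughly as follows. This is a minimal illustration, not the system's actual implementation: the delimiter set, the lowercasing, the 1-based word positions, the AND semantics of the query, and all function names are assumptions made for the example.

```python
import re
from collections import defaultdict

# Assumed delimiter set: whitespace, periods, and commas.
DELIMITERS = re.compile(r"[\s.,]+")


def split_words(text):
    """Split text into lowercase word strings at the delimiter characters."""
    return [w.lower() for w in DELIMITERS.split(text) if w]


def build_index(documents):
    """Map each word to (document ID, word position) pairs, positions counted from 1."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        for position, word in enumerate(split_words(text), start=1):
            index[word].append((doc_id, position))
    return index


def search_all_terms(index, query):
    """Return the IDs of documents that contain every word of the query (assumed AND search)."""
    doc_sets = [{doc_id for doc_id, _ in index.get(w, [])} for w in split_words(query)]
    return set.intersection(*doc_sets) if doc_sets else set()


# Hypothetical sample documents, keyed by document ID.
documents = {1: "The client, the server.", 2: "A server system."}
index = build_index(documents)
```

The query text goes through the same `split_words` function as the documents, which mirrors the statement that words are extracted from the query "in the same way as at index registration".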

[0003] Besides periods and commas, the hyphen is one of the symbols treated as a delimiter. A hyphen is used to write several consecutive words as a single unit (a compound word). For example, in "client-server system" and "phosphate-rich water", the two words joined by a hyphen (hereinafter, a "hyphenated word") together modify the following word as a single unit. In such cases, the individual words that make up the hyphenated word (for example, "phosphate") may themselves be search targets, so splitting at the hyphen and registering each part as a separate index term avoids missed matches.

[0004]

[Problem to Be Solved by the Invention] However, a hyphen is also sometimes used to split a character string that should be, or can be, written as a single word. Examples are a line break in the middle of a word (e.g., "edu-cation") and a mark separating a prefix from a stem (e.g., "pre-election"). If the index is built with the hyphen as a delimiter, then in the former case the index terms are "edu" and "cation", so the query "education" does not match. In the latter case the index terms are "pre" and "election", so the query "preelection" does not match.
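
The mismatch can be reproduced in a few lines. The sketch below assumes a delimiter set of whitespace, period, comma, and hyphen, and lowercases terms for simplicity; the function name is hypothetical.

```python
import re

# Assumed delimiter set, including the hyphen.
DELIMITERS = re.compile(r"[\s.,\-]+")


def index_terms(text):
    """Return the set of lowercase index terms produced by delimiter splitting."""
    return {w.lower() for w in DELIMITERS.split(text) if w}


line_break_terms = index_terms("research and edu-\ncation system")
prefix_terms = index_terms("The pre-election group")

print("education" in line_break_terms)  # False: only "edu" and "cation" are indexed
print("preelection" in prefix_terms)    # False: only "pre" and "election" are indexed
```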

[0005] One way to address this problem is to join the alphabetic string immediately before a hyphen at the end of a line with the alphabetic string at the start of the next line, delete the hyphen, and register the result in the index as a single word. However, a hyphen at the end of a line does not always indicate a line break in the middle of a word. For example, if the hyphen of "phosphate-rich" happens to fall at the end of a line, this method registers "phosphaterich" in the index.
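
A rough sketch of this unconditional joining, and of the failure it causes, is given below. The regular expression is an assumption about what counts as a line-end hyphen (an alphabetic run, a hyphen, optional spaces or tabs, a line break, optional indentation, another alphabetic run).

```python
import re

# Assumed shape of a line-end hyphen: letters, "-", line break, letters.
LINE_END_HYPHEN = re.compile(r"([A-Za-z]+)-[ \t]*\r?\n[ \t]*([A-Za-z]+)")


def join_line_end_hyphens(text):
    """Unconditionally join the strings on either side of every line-end hyphen."""
    return LINE_END_HYPHEN.sub(r"\1\2", text)


print(join_line_end_hyphens("research and edu-\ncation system"))
# research and education system
print(join_line_end_hyphens("phosphate-\nrich water"))
# phosphaterich water  (a compound word is wrongly fused)
```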

[0006] In systems that use a large dictionary, such as machine translation systems, this case is handled by looking up the string joined by line-end hyphen processing in the dictionary and adopting it only if it is registered there. In the example above, a string such as "phosphaterich" is unlikely to be in the dictionary and is therefore rejected, so only strings such as "education" are registered in the index. Retrieval systems, however, usually do not use a large dictionary, for reasons of processing efficiency, so this approach is not practical for them.

[0007] The technique of Japanese Patent Application Laid-Open No. 7-65013, on the other hand, does not unify notational variation in the target documents and user queries at index registration time. Instead, it expands search terms with variant notations at search time, by consulting a variant-notation dictionary that lists variant candidates. However, mid-word line breaks can occur in the great majority of English words, and a single word can have several possible break points (for example, "education" can be broken at three places, "ed-u-ca-tion"). Building such a dictionary is therefore difficult, and the number of expanded search terms becomes so large that search efficiency drops. Prefix-plus-stem combinations are also highly productive in forming new words, so a dictionary-based solution has its limits.

[0008] The present invention addresses the above problems. Its object is to provide a document retrieval system, a document retrieval method, and a recording medium that improve retrieval accuracy for hyphen-related notational variants by expanding search terms with reference to patterns collected from the documents to be searched, without normalizing the character strings registered in the index.

[0009]

[Means for Solving the Problem] To solve the above problems, the document retrieval system of claim 1 of the present invention is a document retrieval system for retrieving a document desired by a user from among a plurality of documents, comprising: document storage means for storing a plurality of items of document data written in a language in which word boundaries are explicit; character string extraction means for extracting, from the document data stored in the document storage means, character strings delimited by predetermined delimiter characters, together with information such as the position at which each character string appears; index storing means for categorizing each extracted character string and storing, in index storage means, the document data in which the character string appears and its appearance position information in association with the categorized character string; and normalization information storing means for, when a character string extracted by the character string extraction means matches a specific pattern, storing that character string in normalization information storage means in association with its normalized character string.

[0010] The document retrieval system of claim 2 is the system of claim 1, wherein the normalization information storing means treats a hyphen at the end of a line, together with the alphabetic strings before and after it, as the specific pattern, and generates as the normalized character string the string obtained by deleting the hyphen, the line-break code, and any whitespace from this pattern and joining the preceding and following alphabetic strings.
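
One way to picture this step: the document text and the index are left untouched, and only a side table of (normalized string, original hyphenated string) pairs is collected. The sketch below is an illustration under assumptions; the regular expression, the "edu-cation" form used to record the original, and the function name are not taken from the source.

```python
import re
from collections import defaultdict

# Assumed line-end hyphen pattern: letters, "-", optional blanks, line break, letters.
LINE_END_HYPHEN = re.compile(r"([A-Za-z]+)-[ \t]*\r?\n[ \t]*([A-Za-z]+)")


def collect_line_end_normalizations(text):
    """Return {normalized string: set of original hyphenated strings} for one document."""
    table = defaultdict(set)
    for match in LINE_END_HYPHEN.finditer(text):
        before, after = match.group(1).lower(), match.group(2).lower()
        normalized = before + after          # hyphen, line break, and blanks removed
        table[normalized].add(f"{before}-{after}")
    return dict(table)


print(collect_line_end_normalizations("research and edu-\ncation system"))
# {'education': {'edu-cation'}}
```

Because nothing is rewritten, a wrongly fused entry such as "phosphaterich" only sits unused in the side table unless a user actually searches for it, while "phosphate" and "rich" remain searchable through the index.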

[0011] The document retrieval system of claim 3 is the system of claim 1, wherein the normalization information storing means treats a hyphenated prefix and the alphabetic string that follows it as the specific pattern, and generates as the normalized character string the string obtained by deleting the hyphen from a string of this pattern and joining the prefix and the following alphabetic string.
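
The prefix case might look like the following sketch. The contents of `PREFIX_TABLE` are hypothetical (the source does not list them), and the function name is an assumption.

```python
# Hypothetical prefix list; the source does not enumerate the prefixes.
PREFIX_TABLE = {"pre", "co", "non", "anti"}


def normalize_prefixed(token):
    """Return the joined form of a hyphenated prefix word, or None if the pattern does not match."""
    prefix, hyphen, rest = token.partition("-")
    # The pattern requires a hyphen, a known prefix, and a purely alphabetic remainder.
    if hyphen and prefix.lower() in PREFIX_TABLE and rest.isascii() and rest.isalpha():
        return prefix + rest
    return None


print(normalize_prefixed("pre-election"))    # preelection
print(normalize_prefixed("phosphate-rich"))  # None ("phosphate" is not a listed prefix)
```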

[0012] The document retrieval system of claim 4 is the system of any one of claims 1 to 3, further comprising: search text input means for inputting search text used to search for documents; character string extraction means for extracting, from the search text, character strings delimited by predetermined delimiter characters, together with information such as their appearance positions; extracted character string expansion means for expanding each extracted character string by referring to the character strings stored in the normalization information storage means; search condition creation means for creating a search condition from the character strings extracted by the character string extraction means and the character strings produced by the extracted character string expansion means; and search processing means for searching the documents in the document storage means using the search condition and the index storage means, whereby a document desired by the user is retrieved from among a plurality of documents.
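
A hedged sketch of the expansion and condition-building steps follows. The source does not give the format of the search condition in this section, so the nested-tuple form with `"OR"` and `"ADJ"` (adjacent words) operators is purely an assumption, as are the function names and the layout of the normalization table.

```python
def expand_term(term, normalization_table):
    """Return the query term followed by the hyphenated variants recorded for it."""
    variants = [term]
    for original in sorted(normalization_table.get(term, ())):
        if original not in variants:
            variants.append(original)
    return variants


def to_condition(variants):
    """Turn variants into an assumed operator form: OR over terms, ADJ for hyphenated parts."""
    alternatives = []
    for variant in variants:
        parts = variant.split("-")
        alternatives.append(parts[0] if len(parts) == 1 else ("ADJ", *parts))
    return alternatives[0] if len(alternatives) == 1 else ("OR", *alternatives)


# Hypothetical table contents, as they might be collected from documents B and D.
table = {"education": {"edu-cation", "educa-tion"}}
condition = to_condition(expand_term("education", table))
print(condition)
# ('OR', 'education', ('ADJ', 'edu', 'cation'), ('ADJ', 'educa', 'tion'))
```

Only variants that were actually observed in the registered documents are added, which keeps the expansion small compared with enumerating every possible break point of a word.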

[0013] The document retrieval system of claim 5 is the system of any one of claims 2 to 4, wherein, when the character string extracted by the character string extraction means consists of a hyphenated prefix followed by an alphabetic string, the extracted character string expansion means expands the extracted character string with the string obtained by deleting the hyphen and joining the prefix and the following alphabetic string.

[0014] The document retrieval method of claim 6 is a method for retrieving a document desired by a user from among a plurality of documents, comprising: extracting, from document data stored in document storage means that stores a plurality of items of document data written in a language in which word boundaries are explicit, character strings delimited by predetermined delimiter characters, together with information such as their appearance positions; categorizing each extracted character string and storing, in index storage means, the document data in which the character string appears and its appearance position information in association with the categorized character string; and, when an extracted character string matches a specific pattern, storing that character string in normalization information storage means in association with its normalized character string.

[0015] The document retrieval method of claim 7 is the method of claim 6, wherein a hyphen at the end of a line, together with the alphabetic strings before and after it, is treated as the specific pattern, and the string obtained by deleting the hyphen, the line-break code, and any whitespace from this pattern and joining the preceding and following alphabetic strings is generated as the normalized character string.

[0016] The document retrieval method of claim 8 is the method of claim 6, wherein a hyphenated prefix and the alphabetic string that follows it are treated as the specific pattern, and the string obtained by deleting the hyphen from a string of this pattern and joining the prefix and the following alphabetic string is generated as the normalized character string.

[0017] The document retrieval method of claim 9 is the method of any one of claims 1 to 8, comprising: inputting search text used to search for documents; extracting, from the search text, character strings delimited by predetermined delimiter characters, together with information such as their appearance positions; expanding each extracted character string by referring to the character strings stored in the normalization information storage means; creating a search condition from the extracted character strings and the expanded character strings; and searching the documents in the document storage means using the search condition and the index storage means, whereby a document desired by the user is retrieved from among a plurality of documents.

[0018] The document retrieval method of claim 10 is the method of any one of claims 6 to 9, wherein, when a character string extracted from the search text consists of a hyphenated prefix followed by an alphabetic string, the extracted character string is expanded with the string obtained by deleting the hyphen and joining the prefix and the following alphabetic string.

[0019] The recording medium of claim 11 records a program for causing a computer to function as a document retrieval system that retrieves a document desired by a user from among a plurality of documents, the system comprising: document storage means for storing a plurality of items of document data written in a language in which word boundaries are explicit; character string extraction means for extracting, from the document data stored in the document storage means, character strings delimited by predetermined delimiter characters, together with information such as their appearance positions; index storing means for categorizing each extracted character string and storing, in index storage means, the document data in which the character string appears and its appearance position information in association with the categorized character string; and normalization information storing means for, when a character string extracted by the character string extraction means matches a specific pattern, storing that character string in normalization information storage means in association with its normalized character string.

[0020] The recording medium of claim 12 is the recording medium of claim 11, wherein the system further comprises: search text input means for inputting search text used to search for documents; character string extraction means for extracting, from the search text, character strings delimited by predetermined delimiter characters, together with information such as their appearance positions; extracted character string expansion means for expanding each extracted character string by referring to the character strings stored in the normalization information storage means; search condition creation means for creating a search condition from the character strings extracted by the character string extraction means and the character strings produced by the extracted character string expansion means; and search processing means for searching the documents in the document storage means using the search condition and the index storage means.

[0021]

[Embodiments of the Invention] The configuration and operation of an embodiment of the present invention are described in detail below with reference to the drawings.

[0022] (1) Configuration of the Embodiment of the Present Invention
FIGS. 1, 2, and 3 are block diagrams schematically showing the functional configuration of a document retrieval system according to one embodiment of the present invention. The document retrieval system 100 comprises a control unit 110, a registration processing unit 120, a search processing unit 130, document storage means 140, index storage means 150, normalization information storage means 160, and a prefix table 170. The registration processing unit 120 comprises character string extraction means 121, index storing means 122, and normalization information storing means 123. The search processing unit 130 comprises search text input means 131, character string extraction means 132, extracted character string expansion means 133, search condition creation means 134, search processing means 135, and output means 136.

[0023] The control unit 110 controls the system as a whole, including the registration processing unit 120, which registers document data in the document retrieval system 100, and the search processing unit 130, which searches the registered documents for the documents the user wants and outputs the result. The registration processing unit 120 stores document data in the document storage means 140 and builds the contents of the index storage means 150 and the normalization information storage means 160. The search processing unit 130 searches the document storage means 140 in response to a search instruction from the user and outputs the result. The document storage means 140 holds a large number of items of document data in a storage device as a document database, storing at least each item of document data in association with its identifier. The index storage means 150 stores, as an index, the character strings contained in each item of document data in association with document information. The character strings are categorized before being stored. The document information consists of the identifier of the document data containing the character string and the position at which the character string appears in the document. When a character string extracted from the document data matches a specific pattern, the normalization information storage means 160 stores the normalized character string of that string in association with the original string. Character strings expanded from the normalized character string are also stored in association with it. The prefix table 170 stores the prefixes that the registration processing unit 120 and the search processing unit 130 refer to when processing prefixed words in document data or search text.
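
The four stores described in this paragraph could be modelled, very roughly, as plain in-memory containers. The sketch below is only one possible reading: the class name, field names, container types, and the prefix list are all assumptions, and the reference numerals in the comments indicate which component each field is meant to stand for.

```python
from dataclasses import dataclass, field


@dataclass
class RetrievalStores:
    # Document storage means 140: document ID -> document text.
    documents: dict[int, str] = field(default_factory=dict)
    # Index storage means 150: character string -> list of (document ID, word position).
    index: dict[str, list[tuple[int, int]]] = field(default_factory=dict)
    # Normalization information storage means 160: normalized string -> original strings.
    normalization: dict[str, set[str]] = field(default_factory=dict)
    # Prefix table 170: hypothetical contents, not taken from the source.
    prefixes: frozenset[str] = frozenset({"pre", "co"})

    def add_posting(self, term, doc_id, position):
        """Record one occurrence of a character string in the index."""
        self.index.setdefault(term, []).append((doc_id, position))

    def add_normalization(self, normalized, original):
        """Associate an original (hyphenated) string with its normalized string."""
        self.normalization.setdefault(normalized, set()).add(original)


stores = RetrievalStores()
stores.documents[2] = "research and edu-\ncation system"
stores.add_posting("edu", 2, 3)
stores.add_posting("cation", 2, 4)
stores.add_normalization("education", "edu-cation")
```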

[0024] (A) Outline of the Registration Processing Unit 120
The registration processing unit 120 reads document data from the document storage means 140 one document at a time and sends it to the character string extraction means 121. The character string extraction means 121 processes the data, extracts the character strings that become index terms, and returns them to the registration processing unit 120 together with their appearance position information. When the character string extraction means 121 extracts a character string, it calls the normalization information storing means 123. If the extracted string matches a predetermined specific pattern, a normalized character string is generated for it, and the extracted string and the normalized string are stored together in the normalization information storage means 160. If the specific pattern involves a prefix, the prefix table 170 is consulted. The registration processing unit 120 sends the result received from the character string extraction means 121 to the index storing means 122. The index storing means 122 categorizes the character strings extracted by the character string extraction means 121 and stores, in the index storage means 150, the identifier of the document in which each string appears and the position of the string, in association with the extracted string.

[0025] (B) Outline of the Search Processing Unit 130
The search processing unit 130 has the user enter the text of a search query through the search text input means 131. The search processing unit 130 sends the entered text to the search condition creation means 134, which passes the query text to the character string extraction means 132. The character string extraction means 132 processes the query text in the same way as the registration processing unit 120, extracts the search-term character strings, and returns them to the search condition creation means 134. When these strings are extracted, the extracted character string expansion means 133 is called. It consults the normalization information storage means 160 and, if there is a string that matches an extracted search term, also returns the normalized character string corresponding to that string to the search condition creation means 134. It further consults the prefix table 170 and, if an expanded string is obtained for a hyphenated prefixed word, returns that expanded string as well. The search condition creation means 134 converts the search terms it receives into operator form, creates a search condition, and returns it to the search processing unit 130. Based on this search condition, the search processing unit 130 calls the search processing means 135, which searches the index stored in the index storage means 150 according to the condition and identifies the documents that satisfy it. The output means 136 displays information about the documents found via the index on a display device such as a monitor. If necessary, it accesses the document storage means 140 and outputs the document data.
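
To make the last steps concrete, the sketch below evaluates an expanded query against a positional index. It assumes that a hyphenated variant such as "edu-cation" is matched as two index terms at consecutive word positions; the source only says the terms are converted to "operator form", so this adjacency reading, the function names, and the sample index are assumptions. Document 5 is a hypothetical document that contains the unhyphenated word.

```python
def docs_with_term(index, term):
    """IDs of documents in which the term occurs."""
    return {doc_id for doc_id, _ in index.get(term, [])}


def docs_with_adjacent(index, first, second):
    """IDs of documents in which `second` occurs immediately after `first`."""
    second_postings = set(index.get(second, []))
    return {doc_id for doc_id, pos in index.get(first, [])
            if (doc_id, pos + 1) in second_postings}


def search(index, normalization, query_term):
    """Search for the term itself plus every hyphenated variant recorded for it."""
    hits = docs_with_term(index, query_term)
    for original in normalization.get(query_term, ()):
        left, _, right = original.partition("-")
        hits |= docs_with_adjacent(index, left, right)
    return hits


# Hypothetical index: term -> [(document ID, word position)].
index = {
    "edu": [(2, 3)], "cation": [(2, 4)],
    "educa": [(4, 3)], "tion": [(4, 4)],
    "education": [(5, 1)],
}
normalization = {"education": {"edu-cation", "educa-tion"}}

print(sorted(search(index, normalization, "education")))  # [2, 4, 5]
```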

[0026] Configuring the embodiment in this way provides the following effects.
・Character strings that match a specific pattern are collected from the target document data and stored, so they can be consulted at search time to expand search terms, which improves search efficiency.
・At search time, the line-end hyphen patterns that actually occur in the target documents can be consulted, so a search term entered by the user can be expanded into the hyphenated strings present in the target documents, giving a search that reflects the actual data.
・At search time, the hyphenated prefixed words that actually occur in the target documents can be consulted, so an unhyphenated prefixed word entered by the user can be expanded into its hyphenated form, giving a search that reflects the actual data.

[0027] (2) Operation of the Document Retrieval System
This embodiment is described using the concrete example of documents to be registered shown in FIG. 4. As shown in FIG. 4, document A is English document data that has "The pre-election group" at its beginning and contains the string "cooperation" in its body; it is assigned the document ID "1", an identifier that uniquely identifies document A. Document B is English document data containing the string "research and edu-cation system" (with a line break immediately after "edu-"); it is assigned the document ID "2". Document C is English document data containing the string "phosphate-rich water" (with a line break immediately after "phosphate-"); it is assigned the document ID "3". Document D is English document data containing the string "the professional educa-tion system" (with a line break immediately after "educa-"); it is assigned the document ID "4".

[0028] (A) Operation of the Document Registration Process

FIG. 5 is a flowchart schematically showing the document registration process performed by the registration processing unit 120. First, the unit checks whether the document data stored in the document storage means 140 includes any document for which no index has been created (step S1). If there is no such document, an index already exists for all of the document data to be searched, so the registration process ends.

[0029] If, on the other hand, there is a document for which no index has been created, that document data is read (step S2), and the document ID that uniquely identifies it among the documents stored in the document storage means 140 is acquired (step S3). The document data is then sent to the character string extracting means 121 (step S4). When the character string extracting means 121 returns character string information (Y in step S5), that information is registered in the index storage means 150 together with the document ID (step S6), and the process returns to step S4 to repeat the extraction. At registration, the character string information is categorized before it is stored (for example, identical character strings are assigned to the same group, and the groups are sorted by character code). As a result, the character string information extracted from the four documents in FIG. 4 is stored in the index storage means 150 as shown in FIG. 9. When the character string extracting means 121 returns no character string information (N in step S5) and the response is an end indication (Y in step S7), the document being processed has been exhausted, so the process returns to step S1 and repeats the registration process for the next document. If the response is not an end indication (N in step S7), the process returns to step S4 and repeats the character string extraction.
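The registration loop of FIG. 5 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the regular-expression tokenizer standing in for the character string extracting means 121, and the nested-dictionary layout of the index are all assumptions, since FIG. 9 is not reproduced here.

```python
import re
from collections import defaultdict

def extract_strings(text):
    """Hypothetical stand-in for extracting means 121: yield (string, position)."""
    for position, match in enumerate(re.finditer(r"[A-Za-z0-9]+", text)):
        yield match.group().lower(), position

def register_documents(document_store, index=None, indexed_ids=None):
    """Index every document that has no index yet (steps S1 to S7, simplified)."""
    index = defaultdict(lambda: defaultdict(list)) if index is None else index
    indexed_ids = set() if indexed_ids is None else indexed_ids
    for doc_id, text in document_store.items():         # S1 to S3: next unindexed document
        if doc_id in indexed_ids:
            continue
        for string, position in extract_strings(text):  # S4, S5: extract strings
            index[string][doc_id].append(position)      # S6: register with document ID
        indexed_ids.add(doc_id)                         # S7: this document is finished
    return index

documents = {1: "The pre-election group", 2: "research and edu-\ncation system"}
index = register_documents(documents)
print(dict(index["system"]))  # {2: [4]}
```

Using the character string as the dictionary key plays the role of "assigning identical character strings to the same group"; iterating over `sorted(index)` would give the character-code ordering mentioned in the text.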

[0030] FIG. 6 is a flowchart schematically showing how the character string extracting means 121 extracts character strings. First, it checks whether the registration mode is specified (step S10); if not (N in step S10), it performs the search-mode processing. If the registration mode is specified, the appearance position is initialized to zero (step S11). Next, the start position is set (step S12). If the start position has reached the end of the document data (Y in step S13), an end indication is returned (step S14) and the process ends. If the end of the document data has not yet been reached (N in step S13), and the character at the start position is a delimiter (Y in step S15) and is a hyphen (Y in step S16), the hyphenated-word processing described later is performed (step S21) and the process proceeds to step S17. The delimiters are the space, tab, and line feed, plus the symbols shown in FIG. 7. If the character at the start position is a delimiter (Y in step S15) but not a hyphen (N in step S16), the process proceeds directly to step S17. In step S17, the run of consecutive delimiters beginning at the current start position is skipped. After the delimiters have been skipped, or if the character at the start position was not a delimiter (N in step S15), the run of consecutive non-delimiter characters is extracted (step S18), and the extracted character string and its appearance position are returned to the caller (step S19). The character string information extracted in this way from documents A, B, C, and D (see FIG. 4) for index registration is, for example, as shown in FIG. 8. Note that all uppercase letters in the documents are converted to lowercase before registration. The appearance position is then advanced by one (step S20), and the process returns to step S12 and repeats.
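The extraction loop of FIG. 6 might look roughly like the generator below. It is a sketch under stated assumptions: the delimiter set is a guess (FIG. 7 is not reproduced here, so all ASCII punctuation is used), and `on_hyphen` is a hypothetical callback standing in for the hyphenated-word processing of step S21.

```python
import string

# Assumed delimiter set; the actual symbols of FIG. 7 are not available here.
DELIMITERS = set(" \t\n") | set(string.punctuation)

def extract_strings(text, on_hyphen=None):
    """Yield (lowercased string, appearance position) pairs, following FIG. 6."""
    position = 0                                      # S11: position starts at zero
    start = 0                                         # S12: start position
    while start < len(text):                          # S13: stop at end of document
        if text[start] in DELIMITERS:                 # S15: delimiter at start?
            if text[start] == "-" and on_hyphen:      # S16: is it a hyphen?
                on_hyphen(text, start)                # S21: hyphenated-word processing
            while start < len(text) and text[start] in DELIMITERS:
                start += 1                            # S17: skip the run of delimiters
            if start >= len(text):
                break
        end = start
        while end < len(text) and text[end] not in DELIMITERS:
            end += 1                                  # S18: run of non-delimiters
        yield text[start:end].lower(), position       # S19: return string and position
        position += 1                                 # S20: advance the position
        start = end

print(list(extract_strings("The pre-election group")))
# [('the', 0), ('pre', 1), ('election', 2), ('group', 3)]
```

Because the hyphen is itself a delimiter, "pre-election" is indexed as the two adjacent strings "pre" and "election"; the index never stores a normalized form.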

[0031] FIG. 10 is a flowchart schematically showing the hyphenated-word processing (step S21 in FIG. 6). First, if what immediately precedes the hyphen is not a character string consisting only of alphabetic characters (an alphabetic string) (N in step S30), the process ends. If an alphabetic string immediately precedes the hyphen (Y in step S30) and a line feed immediately follows it (Y in step S31), the process ends unless the next line begins with an alphabetic string (N in step S32). If the next line does begin with an alphabetic string (Y in step S32), the character strings before and after the hyphen are joined with the hyphen and the line feed removed (step S33); the joined character string is associated with the hyphenated character string (with the line feed removed) and stored in the normalization information storage means 160 (step S34); and the process ends. At this point, the patterns into which the normalized character string can be expanded are generated and stored together with the normalized character string.

[0032] If an alphabetic string immediately precedes the hyphen but a line feed does not immediately follow it (N in step S31), the prefix table 170 (see, for example, FIG. 11) is consulted. If the preceding alphabetic string is a prefix (Y in step S35) and an alphabetic string immediately follows the hyphen (Y in step S36), the character strings before and after the hyphen are joined with the hyphen removed (step S33), the joined character string is associated with the hyphenated character string and stored in the normalization information storage means 160 (step S34), and the process ends. If the alphabetic string preceding the hyphen is not a prefix (N in step S35), or if what follows the hyphen is not an alphabetic string (N in step S36), the process ends.
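The two branches of the hyphenated-word processing (end-of-line hyphen, and hyphenated prefix) can be sketched as below. The prefix set is an assumed sample, because FIG. 11 is not reproduced here; the function name, the dictionary-of-sets store, and the document texts (stand-ins built around the phrases quoted for FIG. 4) are likewise illustrative assumptions.

```python
import re
from collections import defaultdict

# Assumed sample of prefix table 170; the real contents of FIG. 11 are unknown here.
PREFIXES = {"co", "pre", "re", "non"}

def process_hyphen(text, hyphen_pos, store):
    """Record normalized string -> hyphenated variants for one hyphen (FIG. 10)."""
    before = re.search(r"(?<![A-Za-z0-9])[A-Za-z]+$", text[:hyphen_pos])  # S30
    if not before:
        return
    rest = text[hyphen_pos + 1:]
    word = r"[A-Za-z]+(?![A-Za-z0-9])"
    if rest.startswith("\n"):                         # S31: end-of-line hyphen
        after = re.match(word, rest[1:])              # S32: next line starts with letters?
    elif before.group().lower() in PREFIXES:          # S35: preceding string is a prefix
        after = re.match(word, rest)                  # S36: letters follow the hyphen?
    else:
        return
    if not after:
        return
    head, tail = before.group().lower(), after.group().lower()
    store[head + tail].add(f"{head}-{tail}")          # S33, S34: join and store

# Stand-ins for documents A to D; only the quoted phrases come from the text.
documents = {
    1: "The pre-election group cooperation",
    2: "research and edu-\ncation system",
    3: "phosphate-\nrich water",
    4: "the professional educa-\ntion system",
}
store = defaultdict(set)
for text in documents.values():
    for match in re.finditer("-", text):
        process_hyphen(text, match.start(), store)

print(sorted(store["education"]))  # ['edu-cation', 'educa-tion']
```

Note that "education" collects two different hyphenation points because documents B and D break the word differently, which is the situation the expansion patterns are meant to cover.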

[0033] FIG. 12 shows the contents of the normalization information storage means 160: the normalized character strings extracted from the four documents of FIG. 4 by the hyphenated-word processing (see FIG. 10), each paired with the character strings from which it was derived. For example, the hyphenated prefixed word "pre-election" is extracted from document A, and an expansion pattern keyed by its normalized character string "preelection" is stored. From document B, the character string "edu-cation" spanning an end-of-line hyphen is extracted, and an expansion pattern keyed by its normalized character string "education" is stored. From document C, the character string "phosphate-rich" spanning an end-of-line hyphen is extracted, and an expansion pattern keyed by its normalized character string "phosphaterich" is stored. From document D, the character string "educa-tion" spanning an end-of-line hyphen is extracted, and an expansion pattern keyed by its normalized character string "education" is stored.

[0034] (B) Operation of the Search Processing Unit

FIG. 13 is a flowchart schematically showing the document search process performed by the search processing unit 130. First, the text entered through the search text input means 131 is sent to the search condition creating means 134. If the instruction it receives is an end instruction (Y in step S40), the search condition creating means 134 ends the process. If it is not an end instruction (N in step S40) and a search text has been entered (Y in step S41), the search text is sent to the character string extracting means 132 (step S42). When the character string extracting means 132 returns character string information (Y in step S43) and the returned character string is not a hyphenated word (N in step S47), it is converted into a term of an OR operation (step S48). If the returned character string is a hyphenated word (Y in step S47), it is first converted into an adjacency operation (step S49) and then added as a term of the OR operation (step S48). The process then returns to step S42 and repeats.

[0035] When the character string extraction for the search text has finished (step S44), the search is executed using the search condition that has been created (step S45), the search result is output (step S46), and the process returns to step S40 to accept the next instruction from the user.
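The conversion in steps S47 to S49 can be illustrated with a short sketch. The operator names #OR and #NEXT are the ones the specification uses in its query examples; the comma-separated rendering and the function names are assumptions, since FIG. 17 is not reproduced here.

```python
def to_term(returned_string):
    """Render one returned string; a hyphenated string becomes an adjacency term."""
    if "-" in returned_string:                                    # S47: hyphenated word?
        return "#NEXT(" + ", ".join(returned_string.split("-")) + ")"  # S49
    return returned_string

def build_condition(returned_strings):
    """Combine all returned strings into a single OR operation (S48)."""
    return "#OR(" + ", ".join(to_term(s) for s in returned_strings) + ")"

print(build_condition(["education", "edu-cation", "educa-tion", "system"]))
# #OR(education, #NEXT(edu, cation), #NEXT(educa, tion), system)
```

The adjacency term is needed because the index stores "edu" and "cation" as separate strings at consecutive positions, so a hyphenated variant can only be matched positionally.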

[0036] FIG. 14 is a flowchart schematically showing the operation of the character string extracting means 132. First, it checks whether the registration mode is specified; if so (Y in step S60), it performs the registration-mode processing. If the registration mode is not specified (N in step S60), the start position is set (step S61). If the end of the search text has been reached (Y in step S62), an end indication is returned (step S63) and the process ends. If the end of the search text has not been reached (N in step S62), and the character at the start position is a delimiter (Y in step S64) but not a hyphen (N in step S65), the run of consecutive delimiters is skipped from the start position (step S66) and the process proceeds to step S67. If the character at the start position is not a delimiter (N in step S64), the process proceeds directly to step S67. In step S67, the run of consecutive non-delimiter characters is extracted, and the extracted character string is returned (step S68). Furthermore, if the extracted character string is stored as a normalized character string in the normalization information storage means 160 (Y in step S69), the expansion patterns associated with that normalized character string are returned (step S70), and the process returns to step S61 and repeats. If it is not stored as a normalized character string (N in step S69), the process returns to step S61 and repeats.
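The lookup in steps S69 and S70 amounts to a dictionary probe after each extracted string. The sketch below assumes a plain dictionary for the normalization information storage means 160, populated with the entries the specification describes for FIG. 12; the regular-expression tokenizer and the lowercasing of the search text are simplifying assumptions, and the hyphen branch (step S71) is left out.

```python
import re

# Assumed contents of storage means 160, following the description of FIG. 12.
NORMALIZATION_STORE = {
    "education": ["edu-cation", "educa-tion"],
    "preelection": ["pre-election"],
    "phosphaterich": ["phosphate-rich"],
}

def extract_and_expand(search_text, store=NORMALIZATION_STORE):
    """Yield each extracted string, followed by its expansion patterns if any."""
    for match in re.finditer(r"[A-Za-z0-9]+", search_text):  # S64 to S67, simplified
        extracted = match.group().lower()
        yield extracted                                      # S68: return the string
        yield from store.get(extracted, [])                  # S69, S70: expansions

print(list(extract_and_expand("education system")))
# ['education', 'edu-cation', 'educa-tion', 'system']
```

Only variants that were actually observed in the registered documents are returned, so the search condition does not grow with hyphenations that never occur in the collection.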

[0037] If the character at the start position is a delimiter (Y in step S64) and is a hyphen (Y in step S65), the hyphenated-word expansion processing described later (step S71) is performed. If that processing returns a result, the resulting character strings are returned (step S73), and the process returns to step S61 and repeats; in this case the start position is advanced to the end of the hyphenated word that was processed. If the hyphenated-word expansion processing returns no result, the process proceeds to step S66.

[0038] FIG. 15 is a flowchart schematically showing the hyphenated-word expansion processing (step S71 in FIG. 14). First, the prefix table 170 is consulted. If the alphabetic string immediately preceding the hyphen is a prefix (Y in step S80) and an alphabetic string immediately follows the hyphen (Y in step S81), the character string obtained by deleting the hyphen and joining the preceding and following character strings is returned (step S82), the hyphenated character string is also returned (step S83), and the process ends. If the preceding alphabetic string is not a prefix (N in step S80), or if what follows is not an alphabetic string (N in step S81), the input is outside the scope of this processing, so the process ends immediately.
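A minimal sketch of the search-time expansion in FIG. 15 follows. The prefix set is an assumed sample of the prefix table 170 (FIG. 11 is not reproduced here), and the function name and return convention (a two-element list, or `None` when the pattern does not apply) are illustrative choices.

```python
import re

# Assumed sample of prefix table 170.
PREFIXES = {"co", "pre", "re", "non"}

def expand_hyphenated(text, hyphen_pos):
    """Return [joined, hyphenated] for a prefix-hyphen-word pattern, else None."""
    before = re.search(r"[A-Za-z]+$", text[:hyphen_pos])
    after = re.match(r"[A-Za-z]+", text[hyphen_pos + 1:])
    if not before or before.group().lower() not in PREFIXES:  # S80: not a prefix
        return None
    if not after:                                             # S81: no letters follow
        return None
    head, tail = before.group().lower(), after.group().lower()
    return [head + tail, f"{head}-{tail}"]                    # S82, then S83

print(expand_hyphenated("co-operation", 2))    # ['cooperation', 'co-operation']
print(expand_hyphenated("phosphate-rich", 9))  # None
```

Unlike the registration-time processing, this step does not consult the documents at all: a hyphenated prefixed word in the query is always expanded to its joined form, whether or not that form was observed.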

[0039] (3) Examples of Search Condition Creation

How search condition expressions are created from search texts for the four documents described with FIG. 4 is explained with reference to FIGS. 16 and 17. FIG. 16 shows examples of search texts directed at the document storage means 140.

[0040] In query 1, the text "education system" is entered. The character string extracting means 132 first extracts "education". Looking this character string up as a key in the normalization information storage means 160 (see FIG. 12) finds "edu-cation" and "educa-tion" as expansion patterns of "education", so the three character strings "education", "edu-cation", and "educa-tion" are returned. Next, the search condition creating means 134 converts the three returned character strings into an OR operation (written here in the form #OR( )). Because "edu-cation" and "educa-tion" are hyphenated character strings, each is first converted into an adjacency operation (written here in the form #NEXT( )) and then made a term of the OR operation. The character string extracting means then returns "system" from the remainder of the search text, the search condition creating means adds it to the OR operation, and search condition 1 shown in FIG. 17 is finally created. Searching the document storage means 140 with this search condition expression identifies document IDs 2 and 4 as the documents matching search condition 1, and documents B and D are output as the search result.

[0041] In query 2, the text "preelection campaign" is entered. The character string extracting means 132 first extracts "preelection". Looking this character string up as a key in the normalization information storage means 160 (see FIG. 12) finds "pre-election" as an expansion pattern of "preelection", so the two character strings "preelection" and "pre-election" are returned. Next, the search condition creating means 134 converts the two returned character strings into an OR operation (written here in the form #OR( )). Because "pre-election" is a hyphenated character string, it is first converted into an adjacency operation (written here in the form #NEXT( )) and then made a term of the OR operation. The character string extracting means then returns "campaign" from the remainder of the search text, the search condition creating means adds it to the OR operation, and search condition 2 shown in FIG. 17 is finally created. Searching the document storage means 140 with this search condition expression identifies document ID 1 as the document matching search condition 2, and document A is output as the search result.

[0042] In query 3, the text "co-operation" is entered. The character string extracting means 132 extracts "co-operation", and because it contains a hyphen, the hyphenated-word expansion processing is applied to it. That processing consults the prefix table 170 (see FIG. 11), determines that the alphabetic string "co" immediately before the hyphen is a prefix, and returns "cooperation", obtained by joining the character strings before and after the hyphen, together with the hyphenated character string "co-operation". The search condition creating means 134 converts the returned character strings into an OR operation; because "co-operation" is a hyphenated character string, it is first converted into an adjacency operation and then made a term of the OR operation, and search condition 3 (see FIG. 17) is finally created. Searching the document storage means 140 with this search condition expression identifies document ID 1 as the document matching search condition 3, and document A is output as the search result.
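The outcomes stated for queries 1 to 3 can be checked against a toy positional index. Everything below is an illustrative assumption rather than the patented implementation: the document texts are stand-ins built around the phrases quoted for FIG. 4, the index layout is a nested dictionary, and a hyphenated term is evaluated as an adjacency (#NEXT) match over its parts.

```python
import re
from collections import defaultdict

# Stand-ins for documents A to D; only the quoted phrases come from the text.
DOCUMENTS = {
    1: "The pre-election group cooperation",
    2: "research and edu-\ncation system",
    3: "phosphate-\nrich water",
    4: "the professional educa-\ntion system",
}

def build_index(documents):
    """Positional index: string -> {document ID -> [positions]}; hyphens delimit."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for pos, m in enumerate(re.finditer(r"[A-Za-z0-9]+", text.lower())):
            index[m.group()].setdefault(doc_id, []).append(pos)
    return index

def match_next(index, words):
    """Document IDs in which the words occur at consecutive positions (#NEXT)."""
    hits = set()
    for doc_id, positions in index.get(words[0], {}).items():
        for p in positions:
            if all(p + i in index.get(w, {}).get(doc_id, [])
                   for i, w in enumerate(words)):
                hits.add(doc_id)
    return hits

def match_or(index, terms):
    """Union of matches (#OR); a hyphenated term is matched as #NEXT of its parts."""
    hits = set()
    for term in terms:
        hits |= match_next(index, term.split("-"))
    return hits

index = build_index(DOCUMENTS)
print(sorted(match_or(index, ["education", "edu-cation", "educa-tion", "system"])))  # [2, 4]
print(sorted(match_or(index, ["preelection", "pre-election", "campaign"])))          # [1]
print(sorted(match_or(index, ["cooperation", "co-operation"])))                      # [1]
```

In query 3 the match comes from the joined form "cooperation", while in query 2 it comes from the adjacency term, which shows the expansion working in both directions between hyphenated and unhyphenated spellings.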

[0043] (4) Computer Implementation of the Present Invention

The present invention is not limited to the embodiment described above. FIG. 18 is a block diagram showing the hardware configuration used when the document retrieval system of the present invention is implemented on a computer. An input device 1, an output device 2, a CPU (Central Processing Unit) 3, a memory 4, a storage device 5, and a medium drive device 6 are connected by a bus 8.

[0044] The input device 1 consists of a keyboard, a mouse, a digital still camera, a scanner, or the like, and is used to enter information. The output device 2 consists of a display or a printer that outputs various output information as well as information entered through the input device 1. The CPU 3 runs the various programs. The memory 4 holds the programs themselves and the information created temporarily while the CPU 3 executes them. The storage device 5 holds data, programs, and temporary information produced during program execution. The medium drive device 6 accepts a recording medium on which programs, data, and the like are stored, reads them, and stores them in the memory 4 or the storage device 5. It may also be used to input and output data directly or to execute programs directly.

[0045] In such a computer, each function constituting the document retrieval system (see FIGS. 1, 2, and 3) is implemented as a program and written in advance to a recording medium such as a CD-ROM. The CD-ROM is loaded into a computer at each site equipped with a medium drive device 6 such as a CD-ROM drive, the programs are stored in the memory 4 or the storage device 5, and executing them realizes the functions of the document retrieval system. In this case, the document storage means 140, the index storage means 150, the normalization information storage means 160, the prefix table 170, and so on are held in the storage device 5, and the keyboard, mouse, and other parts of the input device 1 are used to enter search texts and instructions to the document retrieval system. The display of the output device 2 is mainly used to show input and intermediate results, while the printer of the output device 2 or a disk of the storage device 5 is used for final results.

[0046] The recording medium may be any of a semiconductor medium (for example, a ROM or an IC memory card), an optical medium (for example, a DVD-ROM, MO, MD, or CD-R), or a magnetic medium (for example, a magnetic tape or a flexible disk).

[0047] The functions of the embodiment described above are realized not only by executing the loaded program. The invention also covers the case in which an operating system or the like performs part or all of the actual processing based on instructions from the program, and that processing realizes the functions of the embodiment.

[0048] (5) Operation over a Network

In the embodiment described above, the document retrieval system was shown as a system in a stand-alone environment, but it is not limited to this. As shown in FIG. 19, it may be operated in a network environment consisting of terminals 200 and a document retrieval server 210. In that case, the hardware of the terminal 200 and the document retrieval server 210 is a general-purpose computer (see FIG. 18) to which a network connection device 7 is added for connection to the network 9. Multiple terminals 200 and document retrieval servers 210 may be installed as needed.

[0049] The network 9 is generally implemented with cables, and TCP/IP is used as the communication protocol. The transmission path is not limited to cables, however; wireless, wired, or broadcast transmission may be used as long as the communication protocols at both ends match. For example, a LAN (Local Area Network), a WAN (Wide Area Network), the Internet, an analog telephone network, a digital telephone network (ISDN: Integrated Services Digital Network), PHS (Personal Handy-phone System), a mobile telephone network, or a satellite communication network can be used.

[0050] In such a network environment, the functions of the document retrieval system of the embodiment described above are provided in the document retrieval server 210. Search conditions and other input from the input device 1 of a terminal 200 are transmitted to the document retrieval server 210, the search result is returned to the requesting terminal 200, and the terminal 200 receives the result and shows it on an output device 2 such as a display. The search result may also be sent to a terminal 200 other than the one that made the request.

[0051] (6) Distribution of Media

The program that realizes the functions of the document retrieval system of the present invention can be distributed on a medium such as a CD-ROM. Alternatively, the program may be stored on a storage device such as a magnetic disk and provided by download over a wired or wireless network, or by distribution via broadcast waves.

[0052]

[Effects of the Invention] As described above, according to the present invention, for variant spellings involving hyphens, retrieval precision is improved by expanding search words with reference to patterns collected from the documents to be searched, without normalizing the character strings registered in the index.

[Brief description of the drawings]

FIG. 1 is a block diagram schematically showing the overall functional configuration of a document retrieval system according to an embodiment of the present invention.

FIG. 2 is a block diagram schematically showing the functional configuration of the registration processing unit of the document retrieval system.

FIG. 3 is a block diagram schematically showing the functional configuration of the search processing unit of the document retrieval system.

FIG. 4 shows a specific example of documents to be registered.

FIG. 5 is a flowchart schematically showing the operation of the registration processing unit.

FIG. 6 is a flowchart schematically showing the operation of the character string extracting means at registration.

FIG. 7 shows examples of the symbols used as delimiters.

FIG. 8 shows an example of the character string information extracted by the character string extracting means.

FIG. 9 shows an example of the contents stored in the index storage means.

FIG. 10 is a flowchart schematically showing the hyphenated-word processing performed at document registration.

FIG. 11 shows an example of the contents of the prefix table.

FIG. 12 shows an example of the normalized character strings extracted by the hyphenated-word processing and their expansion patterns.

FIG. 13 is a flowchart schematically showing the operation of the search condition creating means.

FIG. 14 is a flowchart schematically showing the operation of the character string extracting means at search time.

FIG. 15 is a flowchart schematically showing the hyphenated-word expansion processing performed when a search condition is created.

FIG. 16 shows examples of search texts used for searching.

FIG. 17 shows examples of the search conditions created by the search condition creating means 134 for each query text.

FIG. 18 is a diagram showing the hardware configuration of a computer that implements the present invention.

FIG. 19 is a diagram for explaining operation of the present invention in a network environment.

[Explanation of symbols]

100 document retrieval system
110 control unit
120 registration processing unit
121 color gamut selection means
122 color gamut setting means
123 representative color data storage means
130 search processing unit
131 search text input means
132 character string extracting means
133 extracted character expanding means
134 search condition creating means
135 search processing means
136 output means
140 document storage means
150 index storage means
160 normalization information storage means
170 affix table
1 input device
2 output device
3 CPU
4 memory
5 storage device
6 medium drive device
7 network connection device
8 bus
9 network
200 terminal
210 document retrieval server

Claims

[Claims]

1. A document retrieval system for retrieving a document desired by a user from a plurality of documents, comprising: document storage means for storing a plurality of pieces of document data described in a language in which words are clearly separated; character string extracting means for extracting, from the document data stored in the document storage means, character strings delimited by predetermined delimiters together with appearance position information of the character strings; index storing means for categorizing the extracted character strings and storing, in index storage means, each categorized character string in association with the document data stored in the document storage means in which that character string appears and with the appearance position information of that character string; and normalization information storing means for storing, when a character string extracted by the character string extracting means matches a specific pattern, that character string in normalization information storage means in association with the normalized character string of that character string.

2. The document retrieval system according to claim 1, wherein the normalization information storing means treats, as the specific pattern, a hyphen at the end of a line together with the alphabetic character strings before and after the hyphen, and generates, as the normalized character string, the character string obtained by removing the hyphen, the line feed code, and any spaces from the specific pattern and joining the preceding and following alphabetic character strings.
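As a rough illustration of the end-of-line hyphen pattern in claim 2, the following sketch uses a regular expression. The exact pattern, including which whitespace characters count as spaces, is an assumption, and `EOL_HYPHEN` and `eol_hyphen_pairs` are hypothetical names.

```python
import re

# Letters, a hyphen at the end of a line, optional spaces or tabs around the
# line feed, then letters at the start of the next line (assumed pattern).
EOL_HYPHEN = re.compile(r"([A-Za-z]+)-[ \t]*\r?\n[ \t]*([A-Za-z]+)")


def eol_hyphen_pairs(text):
    """Return (matched string, normalized string) pairs for end-of-line hyphens."""
    return [(m.group(0), m.group(1) + m.group(2)) for m in EOL_HYPHEN.finditer(text)]


pairs = eol_hyphen_pairs("an infor-\n  mation system")
```

A hyphen inside a line, as in "well-known", is not matched by this pattern because no line feed follows the hyphen.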

3. The document retrieval system according to claim 1, wherein the normalization information storing means treats, as the specific pattern, a prefix accompanied by a hyphen together with the alphabetic character string that follows it, and generates, as the normalized character string, the character string obtained by removing the hyphen and joining the prefix and the following alphabetic character string.
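The prefix pattern of claim 3 might be checked against a table of known prefixes, in the spirit of the affix table (170) in the reference signs list. The sketch below is an assumption about how such a check could look; the table entries and the names `AFFIX_TABLE` and `normalize_prefixed` are illustrative only.

```python
import re

# Illustrative entries only; the real affix table contents are not given here.
AFFIX_TABLE = {"non", "pre", "multi", "co"}
HYPHENATED = re.compile(r"([A-Za-z]+)-([A-Za-z]+)")


def normalize_prefixed(token):
    """Return prefix + following string with the hyphen removed, or None."""
    m = HYPHENATED.fullmatch(token)
    if m and m.group(1).lower() in AFFIX_TABLE:
        return m.group(1) + m.group(2)
    return None  # not a prefix-with-hyphen pattern
```

For example, `normalize_prefixed("multi-lingual")` yields `"multilingual"`, while `"well-known"` is left alone because "well" is not in the assumed table.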

4. The document retrieval system according to claim 1, further comprising: search text input means for inputting a search text used to search for a document; character string extraction means for extracting, from the search text, character strings delimited by a predetermined delimiter together with appearance position information of each character string and the like; extracted character string expansion means for expanding each extracted character string by referring to the character strings stored in the normalization information storage means; search condition creation means for creating a search condition based on the character strings extracted by the character string extraction means and the character strings expanded by the extracted character string expansion means; and search processing means for searching the documents in the document storage means by means of the index storage means, wherein the system retrieves the document desired by the user from the plurality of documents.

5. The document retrieval system according to claim 2, wherein, when a character string extracted by the character string extraction means consists of a prefix accompanied by a hyphen and an alphabetic character string following the prefix, the extracted character string expansion means expands the extracted character string with the character string obtained by removing the hyphen and joining the prefix and the following alphabetic character string.
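The search-side expansion of claims 4 and 5 can be sketched as a lookup in the normalization table followed by an index search. This is a minimal sketch under stated assumptions: query terms are whitespace-delimited, variants of one term are combined with OR, and different terms are combined with AND. The functions `expand` and `search` and the sample data are hypothetical.

```python
def expand(term, norm_table):
    """Return the term plus the notational variants recorded in norm_table."""
    # Normalized form of the query term, if the term itself was recorded.
    target = norm_table.get(term, term)
    variants = {term, target}
    # Add every recorded surface string that shares the same normalized form.
    variants.update(s for s, n in norm_table.items() if n == target)
    return variants


def search(query, index, norm_table):
    """Return ids of documents containing every query term in some variant."""
    result = None
    for term in query.split():
        docs = {doc for v in expand(term, norm_table) for doc, _pos in index.get(v, [])}
        result = docs if result is None else result & docs  # AND across terms (assumption)
    return result or set()


# Sample data in the shape used for registration: string -> [(doc_id, position)].
sample_index = {
    "non-linear": [("d1", 1)],
    "nonlinear": [("d2", 0)],
    "model": [("d1", 2), ("d2", 1), ("d3", 0)],
}
sample_norm = {"non-linear": "nonlinear"}
hits = search("nonlinear model", sample_index, sample_norm)
```

Because expansion happens at query time, a query written either as "nonlinear" or as "non-linear" reaches documents that use the other notation, without any change to the indexed strings.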

6. A document retrieval method for retrieving a document desired by a user from a plurality of documents, comprising: extracting, from document data stored in document storage means that stores a plurality of document data items described in a language in which words are clearly separated, character strings delimited by a predetermined delimiter together with appearance position information of each character string and the like; classifying the extracted character strings and storing, in index storage means, the document data in the document storage means in which each character string appears and the appearance position information of that character string, in association with the classified character string; and, when an extracted character string matches a specific pattern, storing that character string in association with its normalized character string in normalization information storage means.

7. The document retrieval method according to claim 6, wherein a hyphen at the end of a line together with the alphabetic character strings before and after the hyphen is treated as the specific pattern, and the character string obtained by removing the hyphen, the line feed code, and any spaces from the specific pattern and joining the alphabetic character strings is generated as the normalized character string.

8. The document retrieval method according to claim 6, wherein a prefix accompanied by a hyphen together with the alphabetic character string following the prefix is treated as the specific pattern, and the character string obtained by removing the hyphen from the character string of the specific pattern and joining the prefix and the following character string is generated as the normalized character string.

9. The document retrieval method according to claim 1, further comprising: inputting a search text used to search for a document; extracting, from the search text, character strings delimited by a predetermined delimiter together with appearance position information of each character string; expanding each extracted character string by referring to the character strings stored in the normalization information storage means; creating a search condition based on the extracted character strings and the expanded character strings; and searching the documents in the document storage means using the search condition and the index storage means, thereby retrieving the document desired by the user from the plurality of documents.

10. The document retrieval method according to claim 6, wherein, when a character string extracted from the search text consists of a prefix accompanied by a hyphen and an alphabetic character string following the prefix, the extracted character string is expanded with the character string obtained by removing the hyphen and joining the prefix and the following alphabetic character string.

11. A computer-readable recording medium on which is recorded a program for causing a computer to function as a document retrieval system that retrieves a document desired by a user from a plurality of documents, the system comprising: document storage means for storing a plurality of document data items described in a language in which words are clearly separated; character string extraction means for extracting, from the document data stored in the document storage means, character strings delimited by a predetermined delimiter together with appearance position information of each character string; index storing means for classifying the extracted character strings and storing, in index storage means, the document data in the document storage means in which each character string appears and the appearance position information of that character string, in association with the classified character string; and normalization information storing means for storing, when a character string extracted by the character string extraction means matches a specific pattern, that character string in association with its normalized character string in normalization information storage means.

12. The recording medium according to claim 11, wherein the program further causes the computer to function as: search text input means for inputting a search text used to search for a document; character string extraction means for extracting, from the search text, character strings delimited by a predetermined delimiter together with appearance position information of each character string; extracted character string expansion means for expanding each extracted character string by referring to the character strings stored in the normalization information storage means; search condition creation means for creating a search condition based on the character strings extracted by the character string extraction means and the character strings expanded by the extracted character string expansion means; and search processing means for searching for the document.