JPH0954781A

JPH0954781A - Document retrieving system

Info

Publication number: JPH0954781A
Application number: JP7231915A
Authority: JP
Inventors: Kumiko Wada; 久美子和田; Emi Ikeda; 恵美池田
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 1995-08-17
Filing date: 1995-08-17
Publication date: 1997-02-25

Abstract

PROBLEM TO BE SOLVED: To retrieve a target character string at high speed by extracting, from the output of a morphological analysis unit, only the meaningful words that can be search targets as main index words, and by generating an index file that associates each main index word with its occurrence position information in the document. SOLUTION: A digitized document to be searched is passed from a search target document input unit 1 to a morphological analysis unit 2, which obtains information on the words that make up the document. A main index word extraction unit 3 extracts, from the output of the morphological analysis unit 2, only the meaningful words that can be search targets as main index words. An auxiliary index word generation unit 4 generates auxiliary index words for each main index word. A leading character hash table generation/addition unit 5 and an index item table generation/addition unit 6 then generate an index file 21 that associates each index word with its occurrence position information in the document. A character string to be searched can thus be retrieved at high speed.

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention relates to a document retrieval system for rapidly retrieving a designated character string from a large volume of digitized documents.

[0002]

[Prior Art] With the remarkable spread of electronic media, large volumes of documents are being digitized in fields such as news reporting, publishing, and patent applications. The following techniques have been introduced as methods for searching such large document collections simply and quickly as needed. One is the keyword search method, in which keywords are assigned to each document in advance at registration time and are then used to retrieve documents. With this method, documents can be retrieved at high speed using an inverted file that looks up the relevant documents from a keyword. Retrieval accuracy is also good, because some degree of linguistic or subject analysis is performed when the keywords are assigned. However, manual keyword assignment requires specialized knowledge and labor, the assignment criteria differ from worker to worker so quality is inconsistent, and maintenance is laborious. Automatic keyword extraction has also been attempted, but obtaining high-quality keywords requires a high-quality dictionary for linguistic analysis.

[0003] The full-text search method has therefore attracted attention as an alternative to the keyword search method. In full-text search, the search is performed on keywords freely specified by the searcher by referring directly to the entire body text (called primary information), rather than to data derived from the body text such as bibliographic information and keywords (called secondary information). However, because full-text search scans the entire body text, it is effective for small document collections but search time becomes a problem for large ones. Dedicated hardware has been developed for this purpose, but it has its own problems: transferring documents held in secondary storage into memory takes time, which makes it difficult to exploit the full performance of the hardware; the hardware is highly model-dependent; and the hardware itself is expensive and not easy to introduce.

[0004] Fast software-based full-text search methods, which are cheaper and do not depend on the hardware model, have therefore been attracting attention. Many of these methods automatically generate an index file in advance to speed up the search, and various techniques have been developed, such as storing occurrence position information for each character that appears in the body text.

[0005]

[Problems to be Solved by the Invention] In conventional full-text search methods such as those above, the index file tends to become far larger than the body text in order to allow any character string specified by the user to be searched at high speed. The volume of documents to be searched is growing dramatically, and an index file that is excessively large compared with the body text cannot keep up with this growth.

[0006] On the other hand, keeping the index file small risks insufficient search speed or missed matches. An object of the present invention is to store more meaningful information in a smaller index file, so that the character strings users generally search for are adequately covered and can be retrieved at high speed.

[0007]

[Means for Solving the Problems] To solve the above problems, the present invention adopts the following configurations.

(Configuration 1) The system comprises a morphological analysis unit that accepts a digitized document to be searched and obtains information on the words that make up the document; a main index word extraction unit that extracts, from the output of the morphological analysis unit, only the meaningful words that can be search targets as main index words; and an index file generation unit that generates an index file associating each main index word with its occurrence position information in the document.

(Explanation) A digitized document is a document that has been encoded as character codes in a form that an information processing device can process. The morphological analysis unit obtains, through analysis, information that includes the part of speech of each word. Meaningful words are the words that remain after excluding those that carry no meaning on their own and would not normally be search targets, such as a particle standing alone. This keeps words that are unlikely to be used in practice out of the index file and so reduces its size. Occurrence position information is data representing the position of an index word in the document.

[0008] When a user actually performs a search, not every word in the target documents is equally likely to be searched for. Some words in a document are likely to be searched for and others are not. For example, searches keyed only on function words such as particles and conjunctions are rare. Words that users are likely to search for are therefore cut out of the body text by suitable processing, and an index file storing their occurrence position information is generated so that these words can be searched at high speed.

[0009] In addition, occurrence position information is stored per word rather than per character, which keeps the index file small while still storing a large amount of information. Storing the partial words derived from each word in the index file tends to enlarge it, so unnecessary words are removed by morphological analysis when the words are cut out.

[0010] (Configuration 2) The system comprises a word segmentation unit that accepts a digitized document to be searched and, based on the character types of the characters that make up the document, cuts out the words that make up the document as main index words; and an index file generation unit that generates an index file associating each main index word with its occurrence position information in the document.

(Explanation) Attending to character type includes detecting special character types that act as delimiters in a document, such as 「、」, 「。」, and 「→」. It also includes, for example, cutting out words by treating a kanji that appears immediately after hiragana as a delimiter: from 「コンピュータの技術における、……」 ("in computer technology, ..."), the segments 「コンピュータの」 and 「技術における」 are cut out. Detecting delimiters in this way is simpler and faster than morphological analysis, so an index file can be generated quickly from a large volume of documents. Because main index words are cut out at delimiter boundaries, index words that include particles and the like may be generated.
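The character-type segmentation of Configuration 2 can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the delimiter set, the Unicode range checks, the function names, and the use of character offsets as positions are all assumptions made for the example.

```python
# Sketch of character-type segmentation (Configuration 2).
# DELIMITERS and the two Unicode range checks are assumptions of this example.
DELIMITERS = set("、。→")


def is_hiragana(ch: str) -> bool:
    return "\u3041" <= ch <= "\u309f"


def is_kanji(ch: str) -> bool:
    # Basic CJK Unified Ideographs block only.
    return "\u4e00" <= ch <= "\u9fff"


def segment(text: str) -> list[tuple[int, str]]:
    """Return (start offset in characters, segment) pairs."""
    segments = []
    start = 0
    for i, ch in enumerate(text):
        if ch in DELIMITERS:
            # A punctuation delimiter closes the current segment and is dropped.
            if i > start:
                segments.append((start, text[start:i]))
            start = i + 1
        elif i > start and is_kanji(ch) and is_hiragana(text[i - 1]):
            # A kanji directly after hiragana starts a new segment.
            segments.append((start, text[start:i]))
            start = i
    if start < len(text):
        segments.append((start, text[start:]))
    return segments


print(segment("コンピュータの技術における、"))
# [(0, 'コンピュータの'), (7, '技術における')]
```

As the text notes, the particle 「の」 stays attached to the first segment, because no part-of-speech information is used.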

[0011] (Configuration 3) Preferably, substrings of a main index word that begin at its second or a later character are treated as auxiliary index words, and each auxiliary index word is stored in the index file in association with its occurrence position information in the text.

(Explanation) Making substrings of a main index word searchable has the effect that, when the main index word is a compound containing a prefix or the like, searches for its parts are also fast.

[0012] (Configuration 4) When a main index word or auxiliary index word consists of a non-hiragana string followed by a hiragana string, the trailing hiragana string is preferably compressed so that the length of the index word is reduced.

(Explanation) Compressing the trailing hiragana portion, which matters little in a search, limits the word length and allows the index file to be realized in a small capacity.

[0013]

[Embodiments of the Invention] Embodiments of the present invention are described below using specific examples.

<Specific Example 1> FIG. 1 is a block diagram of the index file generation/addition processing device of the document retrieval system of the present invention. Specific Example 1 describes the processing when morphological analysis is used to extract index words during index file generation. The device of FIG. 1 comprises a search target document input unit 1, a morphological analysis unit 2, a main index word extraction unit 3, an auxiliary index word generation unit 4, a leading character hash table generation/addition unit 5, and an index item table generation/addition unit 6. The search target document input unit 1 receives digitized documents; it is provided with a memory or the like (not shown) in which the documents to be searched are temporarily stored. The morphological analysis unit 2 divides a document character by character, checks it against a dictionary prepared in advance, and analyzes the words used in the document, the syntax of the document, and so on, also obtaining data such as the dependency relations between words. Its configuration is exactly the same as that of devices conventionally used for automatic parsing of documents. That is, the morphological analysis unit 2 applies a predetermined morphological analysis to the target document received from the search target document input unit 1 and outputs the result to the main index word extraction unit 3.

[0014] The main index word extraction unit 3 extracts, based on the word information obtained from the morphological analysis, only words whose part of speech makes them likely to be used in searches. Function words, punctuation marks, and the like are rarely search targets on their own, so they are discarded as unnecessary words and excluded from index generation. Here, for example, independent words (content words), unknown words, and alphanumeric strings are indexed. For convenience, the words extracted for indexing are called "main index words".
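The part-of-speech filter performed by the main index word extraction unit can be illustrated with a short sketch. The tuple layout of the analyzer output and the part-of-speech labels are hypothetical; a real morphological analyzer has its own tag set and output format.

```python
# Part-of-speech labels below are hypothetical; a real analyzer has its own tag set.
INDEXABLE_POS = {"independent", "unknown", "alphanumeric"}


def extract_main_index_words(morphemes):
    """morphemes: iterable of (surface, pos_label, position) tuples.

    Returns (surface, position) for the words kept as main index words;
    everything else is treated as an unnecessary word and dropped.
    """
    return [(surface, position)
            for surface, pos_label, position in morphemes
            if pos_label in INDEXABLE_POS]


# Hand-written stand-in for the analyzer output on 「コンピュータの技術、」.
analysis = [
    ("コンピュータ", "independent", 0),
    ("の", "particle", 6),
    ("技術", "independent", 7),
    ("、", "punctuation", 9),
]
print(extract_main_index_words(analysis))
# [('コンピュータ', 0), ('技術', 7)]
```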

[0015] The auxiliary index word generation unit 4 generates auxiliary index words for every main index word. For a given main index word W of length L, an auxiliary index word is the substring of W from its i-th character (counting from 0) through its last character, for 0 < i < L. For one main index word of length L, at most L − 1 auxiliary index words are generated. Below, main index words and auxiliary index words are collectively called index words for convenience. They are explained with reference to FIG. 3. The leading character hash table generation/addition unit 5 and the index item table generation/addition unit 6 generate the index file 21 from the main index words extracted by the main index word extraction unit 3 and the auxiliary index words generated by the auxiliary index word generation unit 4.
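The definition of auxiliary index words amounts to taking every proper suffix of the main index word. A minimal sketch (the function name is an assumption of this example):

```python
def auxiliary_index_words(w: str) -> list[str]:
    """Return the suffixes of w that start at offset 1 .. len(w) - 1.

    A word of length L therefore yields L - 1 suffixes, longest first.
    """
    return [w[i:] for i in range(1, len(w))]


print(auxiliary_index_words("文書検索"))
# ['書検索', '検索', '索']
```

With these suffixes indexed, a search for 「検索」 can find the compound 「文書検索」 without scanning the body text.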

[0016] FIG. 2 is a block diagram of the search processing device that uses the generated index file to search for a search pattern character string specified by the user. The device comprises a search pattern character string input unit 11, a morphological analysis unit 12, an index word extraction unit 13, an independent index word search unit 14, an unknown word search unit 15, a text matching unit 16, and a result output unit 17. The search pattern character string is entered by an operator or the like, using a keyboard or similar device, in order to search documents. The search pattern character string input unit 11 consists of such a keyboard or other input device. The morphological analysis unit 12 has exactly the same configuration as the morphological analysis unit 2 of FIG. 1, and the same unit can be used both for index file generation and for search processing. The morphological analysis unit 12 performs morphological analysis of the search pattern character string entered for the document search.

[0017] The index word extraction unit 13 removes unnecessary words from the word list obtained from the morphological analysis and extracts the index words that are actually looked up in the index. The independent index word search unit 14, the unknown word search unit 15, and the text matching unit 16 perform the actual search. First, the independent index word search unit 14 searches the index file for every index word extracted by the index word extraction unit 13 whose part of speech is independent word or alphanumeric, and obtains candidate occurrence positions. Next, if any of the index words from the index word extraction unit 13 were judged to be unknown words, the unknown word search unit 15 searches the index file for them as well and obtains candidate occurrence positions. An unknown word is a word that is not in the word dictionary used by the morphological analysis unit 12. Finally, if the search pattern character string contains unnecessary words, the text matching unit 16 checks them against the body text, and the result output unit 17 outputs the search results obtained. For example, if the search pattern character string 「コンピュータの技術」 ("computer technology") is entered, 「コンピュータ」 and 「技術」 are looked up in the index file; 「の」 is an unnecessary word. The body text of the target document is then examined directly, near the occurrence positions of 「コンピュータ」 and 「技術」, to check whether the string 「コンピュータの技術」 is present, and the search result is obtained.

[0018] FIG. 3 illustrates the structure of the index file. The index file 21 of FIG. 1 consists of one leading character hash table 22 and N index item tables 23, where N is the total number of distinct characters that appear at the beginning of any index word. One index item table 23 exists for each leading character of the index words. Each index item table stores the occurrence position information for all index words that begin with a given character, and is organized so that the occurrence position information can be looked up using the index word as the key.

[0019] FIG. 4 shows the storage data format of the leading character hash table 22. The leading character hash table 22 is a hash table for finding the corresponding index item table using the leading character of an index word as the key. Each entry in the leading character hash table 22 is a set of three data items. Item 31 in FIG. 4 is a single character that appears at the beginning of an index word, and item 32 is a pointer to the corresponding index item table. Item 33 is a pointer to the next bucket of the leading character hash table; if there is no next bucket, it is null (invalid data).
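The three-item entry format suggests a chained hash table. The following sketch models it in memory under several assumptions that the text does not specify: the hash function (character code modulo table size), the table size, the class and method names, and the use of a plain `dict` to stand in for each index item table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Bucket:
    char: str                         # item 31: leading character
    table: dict                       # item 32: stands in for the pointer to an index item table
    next: Optional["Bucket"] = None   # item 33: next bucket in the chain, or None (null)


class LeadingCharHashTable:
    def __init__(self, size: int = 64):
        self.slots: list[Optional[Bucket]] = [None] * size

    def _slot(self, ch: str) -> int:
        # Hash function is an assumption of this sketch.
        return ord(ch) % len(self.slots)

    def find(self, ch: str) -> Optional[dict]:
        """Follow the bucket chain; return the index item table or None."""
        bucket = self.slots[self._slot(ch)]
        while bucket is not None:
            if bucket.char == ch:
                return bucket.table
            bucket = bucket.next
        return None

    def find_or_create(self, ch: str) -> dict:
        """Return the table for ch, creating an empty one if none exists."""
        table = self.find(ch)
        if table is None:
            table = {}
            slot = self._slot(ch)
            self.slots[slot] = Bucket(ch, table, self.slots[slot])
        return table


hash_table = LeadingCharHashTable(size=1)   # size 1 forces every character into one chain
hash_table.find_or_create("技")["技術"] = [7]
hash_table.find_or_create("コ")["コンピュータ"] = [0]
print(hash_table.find("技"), hash_table.find("索"))
```

Setting `size=1` in the usage lines is only a way to exercise the chain; any realistic table would use more slots.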

[0020] FIG. 5 shows the storage data format of the index item table. Each entry is a set of four data items. Item 41 is a fixed-length character string representing the index word. Item 42 is a pointer to the remaining characters when the index word does not fit in item 41; if there are no remaining characters, it is null. Item 43 is the total number of occurrence position entries for this index word. Occurrence position information takes one of two forms: a value giving a specific occurrence position in the body text, or a pointer to another index item in the index item table. If the occurrence position information for an index word is stored only in the latter form, the count in item 43 is stored as a negative number; otherwise it is stored as a positive number. Item 44 is a pointer to the list of occurrence position information.
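The four-item entry can be modeled as a small record. In this sketch the fixed length of item 41 (`FIXED_LEN = 4` characters), the field names, and the convention of representing a pointer to another index item by that item's key string are all assumptions for illustration; the text gives no concrete sizes or encodings.

```python
from dataclasses import dataclass
from typing import Optional, Union

FIXED_LEN = 4  # assumed width of item 41, in characters


@dataclass
class IndexItem:
    head: str                     # item 41: first FIXED_LEN characters of the index word
    rest: Optional[str]           # item 42: remaining characters, or None (null)
    count: int                    # item 43: number of entries; negative if pointers only
    refs: list[Union[int, str]]   # item 44: int = text position, str = key of another item


def make_item(word: str, positions=(), pointers=()) -> IndexItem:
    refs = list(positions) + list(pointers)
    only_pointers = bool(pointers) and not positions
    count = -len(refs) if only_pointers else len(refs)
    return IndexItem(word[:FIXED_LEN], word[FIXED_LEN:] or None, count, refs)


print(make_item("コンピュータ", positions=[301]))
print(make_item("術", pointers=["技術"]))
```

The sign of `count` lets a reader of the table tell, without walking the list, whether an index word occurs in the text in its own right or only as part of longer index words.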

[0021] <Operation> The specific operation of the document retrieval system of the present invention is described below using flowcharts. First, the index file is generated in advance. FIGS. 6 and 7 are flowcharts of the index file generation process. A search target document is input in step S1, and morphological analysis is performed in step S2. Next, unnecessary words are removed from the words obtained by the morphological analysis, and the main index words are extracted (step S3). The processing from step S4 onward is repeated for each extracted main index word.

[0022] First, let W be the main index word (step S5), and look up the leading character a of W in the leading character hash table (step S6). If an index item table for character a exists (step S7), the occurrence position information of W is registered in that index item table (step S9). In step S9, the index item table is searched with key W; if there is no entry for W, the occurrence position information of W is stored under the new key W. If an entry for W already exists, the occurrence position information of W is added to the head of its occurrence position information list. If no index item table exists in step S7, a new index item table for character a is created in step S8 before step S9 is performed.

[0023] Next, the auxiliary index words are registered. In step S10, L is set to the length of W minus 1, and X0 is set to W (step S11). While L > 0 holds (step S12), the processing from step S13 onward is repeated. In step S13, the last L characters of W are taken to form the substring X, which becomes an auxiliary index word. Information about X is then stored in the index file. First, the leading character a of X is looked up in the leading character hash table (step S14). If an index item table for character a exists (step S15), that index item table is searched for X (step S16). If a matching entry exists (step S17), a pointer to X0 is appended to the end of the occurrence position information list for X (step S18). Because the remaining, shorter auxiliary index words are already registered in the index file, registration of the auxiliary index words for main index word W stops here, and processing moves on to the next main index word (step S4).

[0024] If there is no entry for X in step S17, a pointer to X0 is stored in the occurrence position information list under key X (step S20). Then X0 := X (step S21) and L := L − 1 (step S22) are set, and processing returns to step S12.

[0025] If no index item table for a exists in step S15, a new index item table for character a is created (step S19), a new entry for X is created, and a pointer to X0 is stored as its occurrence position information (step S20). Then X0 := X (step S21) and L := L − 1 (step S22) are set, and the generation and registration of auxiliary index words from step S12 onward is repeated. When registration has been completed for all main index words and auxiliary index words in this way, the process ends (step S4).
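The registration loop of steps S4 to S22 can be sketched as follows. This is an in-memory illustration only, not the on-disk format: nested `dict`s stand in for the leading character hash table and the index item tables, an `int` entry is a position in the text, and a `str` entry stands in for a pointer to another index item (named by its key). The function name and these representations are assumptions. Because the text does not describe any duplicate check in step S18, this sketch may append the same pointer more than once when a word recurs.

```python
def build_index(main_words, index=None):
    """main_words: iterable of (main index word, position) pairs.

    Returns {leading char: {index word: [entries]}}.
    """
    if index is None:
        index = {}
    for w, position in main_words:                     # S4: next main index word
        table = index.setdefault(w[0], {})             # S6-S8: find or create table
        table.setdefault(w, []).insert(0, position)    # S9: new position goes to the head
        length = len(w) - 1                            # S10
        x0 = w                                         # S11
        while length > 0:                              # S12
            x = w[-length:]                            # S13: last `length` characters
            table = index.setdefault(x[0], {})         # S14, S15, S19
            if x in table:                             # S16, S17
                table[x].append(x0)                    # S18: pointer appended at the end
                break                                  # shorter suffixes already registered
            table[x] = [x0]                            # S20
            x0 = x                                     # S21
            length -= 1                                # S22
    return index


index = build_index([("検索", 10), ("文書検索", 0)])
for char, table in index.items():
    print(char, table)
```

In this run the second word stops early: when 「文書検索」 reaches its suffix 「検索」, that entry already exists, so only a pointer to 「書検索」 is appended and 「索」 is not touched again.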

[0026] FIGS. 8 and 9 are flowcharts of the search process performed using the generated index file. First, a search pattern character string is input (step S1) and morphologically analyzed (step S2). Next, in step S3, the index words Wi (0 ≤ i < n) are extracted from the word list obtained by the morphological analysis. If any words were discarded as unnecessary words at this point (step S4), flag := 1 is set in step S5; otherwise flag := 0 is set in step S6. The index file is then searched for the index words obtained. First, the candidate set H of occurrence position information is initialized to empty (step S7).

[0027] From the Wi, only the independent words and alphanumeric strings are taken as Xj (0 ≤ j < m) (step S8), and the distances dj between the Xj are calculated (step S9). For example, in the search pattern character string 「コンピュータの技術」, 「コンピュータ」 and 「技術」 are independent words with 「の」 between them. The distance between the two is then the number of bytes in the character code of 「の」, that is, 2 bytes. Alternatively, the occurrence position of a word can be defined as the location of the first character of its string. In that case, if the occurrence position of 「コンピュータ」 in 「コンピュータの技術」 is, say, 301 (byte offset), the occurrence position of 「技術」 is 315, obtained by adding 14 bytes for 7 characters. The distance between 「コンピュータ」 and 「技術」 can thus also be expressed as 14 bytes. Next, these index words are searched for in steps S10 to S17, as follows. In step S10, j := 0 is set, and the following processing is repeated while j < m (step S11). Process A is executed to search the index file for Xj (step S12); process A is described later with reference to FIG. 10. If matching data exists (step S13), the occurrence position information of Xj is registered as the elements of H0 (step S14), and, by comparing the elements of H and H0, only those that satisfy the distance dj are kept as the new elements of H (step S15). For the search pattern character string 「コンピュータの技術」, for example, this step excludes occurrences in which 「コンピュータ」 and 「技術」 are separated by more than 2 bytes. In step S16, j := j + 1 is set, and processing returns to step S11.
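The candidate narrowing of steps S10 to S17 can be sketched as below, using the second convention from the text (a word's position is the byte offset of its first character, so the distance from 「コンピュータ」 to 「技術」 is 14 bytes). Several points are assumptions of this sketch rather than statements of the source: a fixed two-byte character code, the `lookup` callback standing in for process A, and the handling of the first word, where H is simply initialized from H0.

```python
BYTES_PER_CHAR = 2  # assumption: fixed two-byte character code, as in the 301 -> 315 example


def start_distance(preceding: str) -> int:
    """Bytes from the start of one index word to the start of the next,
    given the text that lies between the two starting points."""
    return len(preceding) * BYTES_PER_CHAR


def filter_by_distance(h, h0, d):
    """Keep positions in h0 that lie exactly d bytes after a position in h (step S15)."""
    earlier = set(h)
    return [p for p in h0 if p - d in earlier]


def candidate_positions(words, lookup):
    """words: list of (index word, distance in bytes from the previous word's start).
    lookup: function mapping an index word to its list of byte positions."""
    h = None
    for word, d in words:
        h0 = lookup(word)
        if not h0:          # steps S13 and S17: one missing word means no match
            return []
        h = list(h0) if h is None else filter_by_distance(h, h0, d)
    return h or []


positions = {"コンピュータ": [301, 500], "技術": [315, 700]}
d = start_distance("コンピュータの")   # 7 characters -> 14 bytes
print(candidate_positions([("コンピュータ", 0), ("技術", d)],
                          lambda w: positions.get(w, [])))
# [315]
```

Only the occurrence of 「技術」 at 315 survives, because it is the only one that starts exactly 14 bytes after an occurrence of 「コンピュータ」.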

[0028] If there is no matching data in step S13, the search pattern character string cannot be found, so H is emptied in step S17 and processing proceeds to process D. Once the above processing has been completed for all Xj, any unknown words among the Wi are searched for next. In step S18 of FIG. 9, the unknown words among the Wi are taken as Uj (0 ≤ j < r), and in step S19 the distance dj between each Uj and its adjacent W is calculated. In step S20, j := 0 is set, and the following processing is repeated while j < r (step S21). Process C is executed to search the index file for Uj (step S22). Process C may, for example, search the index file for Uj in the same way as process A; it is also described in detail with reference to FIG. 10. If matching data exists (step S23), the occurrence position information of Uj is registered as the elements of H0 (step S24), and, by comparing the elements of H and H0, only those that satisfy the distance dj are kept as the new elements of H (step S25).

[0029] In step S26, j := j + 1, and processing returns to step S21. If no matching data is found in step S23, the search pattern character string cannot be found, so H is emptied in step S27 and processing proceeds to process D. Once the above processing has been completed for all Uj, the final step is text collation that includes any unnecessary words found when Wi was extracted. If flag is 1 in step S28, unnecessary words were present, so each element of H is collated against the body text and only those that match are kept as elements of H (step S29). Otherwise, text collation is not needed. Finally, H is output in step S30, and processing ends (step S31).
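The text collation of step S29 can be illustrated with a short sketch. It assumes, purely for illustration, that the document body is held in memory as a string and that the elements of H are character offsets of candidate matches; the patent itself works with byte positions, and the name `verify_in_text` is hypothetical.

```python
def verify_in_text(text, pattern, candidates):
    """Return the candidate offsets at which `pattern` really occurs in `text`.

    Index lookups ignore unnecessary words, so a candidate offset is only
    a hint; comparing against the body text removes the false candidates.
    """
    return [pos for pos in candidates if text.startswith(pattern, pos)]


text = "the index of the index file"
print(verify_in_text(text, "the index", [0, 4, 13]))  # [0, 13]
```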

[0030] FIG. 10 is a flowchart of the index file search processing (process A) for the case where the index word is an independent word. First, W is taken as the index word (step S1), and the candidate set M of appearance positions is set to empty (step S2). Next, the leading character hash table is searched with the leading character a of W (step S3). If an index item table corresponding to the character a exists (step S4), the index item table is searched by prefix match on W (step S5). If matching data exists (step S6), the following processing is performed.

[0031] If the data matched by prefix match (step S7), it is checked whether it also matches when the index words following W are used; if it does (step S8), its appearance position information is added to M (step S9), and processing returns to step S5. In this way, every entry in the index item table that satisfies the condition is retrieved and the processing is repeated. Once all matching data has been retrieved, M is output (step S11) and processing ends. If there is no index item table corresponding to the character a in step S4, M is emptied and the search ends.
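A minimal sketch of process A follows, assuming the index file is modeled as a dictionary that maps a leading character to an index item table, itself a dictionary from index word to a list of appearance positions. This in-memory layout and the name `search_prefix` are assumptions for illustration, and the additional check against the index words following W (steps S7 and S8) is omitted.

```python
def search_prefix(index, word):
    """Collect appearance positions of all index entries that start with `word`."""
    m = []                                   # candidate set M (step S2)
    table = index.get(word[0])               # leading character hash table (step S3)
    if table is None:                        # no index item table (step S4)
        return m
    for key, positions in table.items():     # prefix match on W (steps S5, S6)
        if key.startswith(word):
            m.extend(positions)              # add positions to M (step S9)
    return sorted(m)


index = {"c": {"computer": [301], "computing": [40], "cat": [7]}}
print(search_prefix(index, "comput"))  # [40, 301]
```

A real index item table would more likely be kept sorted so that the prefix range can be found by binary search instead of a full scan; the linear scan here only keeps the sketch short.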

[0032] <Effects of Specific Example 1> In Specific Example 1 described above, main index words are extracted from the target document data by morphological analysis. Using part-of-speech information, words that users are likely to search for, such as independent words, are actively extracted, and words that are unlikely to be search targets on their own are actively excluded. As a result, only meaningful character sequences are indexed, and appearance position information for other character sequences is not stored in the index file, so the index file can be made much smaller while still containing a large amount of meaningful information. Furthermore, assuming that user searches are mostly for noun phrases, Specific Example 1 returns results for such searches particularly quickly. Searches for compound words and unknown words are somewhat slower than this, but storing information on auxiliary index words makes them fast enough for practical use.

[0033] <Specific Example 2> Specific Example 2 describes the processing when index words are extracted by simple phrase analysis. FIG. 11 is a block diagram of the index file generation/addition processing device of the document search system of Specific Example 2. The device of FIG. 11 comprises a search target document input unit 51, a simple index word cutout unit 52, an auxiliary index word compression/expansion unit 53, a leading character hash table generation/addition unit 54, and an index item table generation/addition unit 55. A digitized document is input to the search target document input unit 51. A memory or the like (not shown) is provided there, and the document to be searched is temporarily stored. The simple index word cutout unit 52 divides the document in a simple, approximate manner and extracts phrases. For convenience, the phrases cut out here are called index words.

[0034] There are various ways to divide Japanese text into phrases in a simple manner. Here, as an example, consider simple cutout by character type, in which punctuation marks and character types such as kanji, alphabet, and katakana serve as delimiters. Punctuation marks indicate word breaks. In addition, by using rules of thumb, such as that the first character of a Japanese independent word is often written in kanji and that most nouns are written as non-hiragana strings, simple cutout by character type can identify the beginning of a phrase with reasonable accuracy.
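The character-type cutout described above can be sketched as follows. The rule used here is an assumption chosen for illustration, not the patent's exact procedure: a punctuation mark or other unclassified character ends a phrase, and a non-hiragana character that follows a hiragana character starts a new one. The names `char_type` and `cut_phrases` are hypothetical; the Unicode ranges are those of the Hiragana, Katakana, and CJK Unified Ideographs blocks.

```python
def char_type(ch):
    """Classify a character by script using Unicode block ranges."""
    code = ord(ch)
    if 0x3040 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    if ch.isascii() and ch.isalpha():
        return "alphabet"
    return "delimiter"  # punctuation, spaces, digits, and anything else


def cut_phrases(text):
    """Return (offset, phrase) pairs cut out by character type."""
    phrases, start, prev = [], None, "delimiter"
    for i, ch in enumerate(text):
        kind = char_type(ch)
        if kind == "delimiter":
            if start is not None:            # a delimiter closes the phrase
                phrases.append((start, text[start:i]))
            start = None
        elif start is None:                  # first character after a delimiter
            start = i
        elif prev == "hiragana" and kind != "hiragana":
            phrases.append((start, text[start:i]))  # hiragana -> other: new phrase
            start = i
        prev = kind
    if start is not None:
        phrases.append((start, text[start:]))
    return phrases


print(cut_phrases("文書を検索する。"))  # [(0, '文書を'), (3, '検索する')]
```

Each phrase keeps its trailing hiragana, which matches the later step in which the hiragana ending of an index word is compressed.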

[0035] Next, the auxiliary index word compression/expansion unit 53 generates auxiliary index words for each index word cut out by the simple cutout unit 52 and compresses them for storage in the index file. They are generated in the same way as in Specific Example 1. An index word W of word length L has at most L-1 auxiliary index words, each being the substring consisting of the i-th through the last characters of the index word (0 < i < L). The original index word and the auxiliary index words are compressed before being stored in the index file. Various string compression methods are conceivable; here, as an example, consider the following. When the ending of an index word or auxiliary index word consists of hiragana characters, the trailing hiragana portion is compressed by folding it at a suitable number of characters. For example, if the compression unit is 4 characters, a 9-character hiragana substring is folded 4 characters at a time, and the compressed result is 4 characters long. Various folding-based compression methods are known; for example, the compressed data can be obtained by taking the logical OR of the corresponding bits of multiple character codes. This compresses the endings of auxiliary index words to shrink the index file and, at the same time, limits the number of auxiliary index words generated from a single index word, making the file smaller still.
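The folding compression can be illustrated with a small sketch. It assumes that characters are represented by their Unicode code points (the patent refers to character codes without fixing an encoding) and that the tail starting at position p is all hiragana; the name `fold_tail` is hypothetical.

```python
def fold_tail(word, p, unit=4):
    """Fold the tail word[p:] into at most `unit` characters by bitwise OR.

    The tail is split into chunks of `unit` characters, and the code points
    in the same slot of every chunk are OR-ed together.
    """
    head, tail = word[:p], word[p:]
    slots = [0] * min(unit, len(tail))
    for i, ch in enumerate(tail):
        slots[i % unit] |= ord(ch)           # OR into the slot for this column
    return head + "".join(chr(code) for code in slots)


word = "検索" + "あいうえおかきくけ"         # 2 kanji followed by 9 hiragana
folded = fold_tail(word, 2, 4)
print(len(word), len(folded))                # 11 6
```

Because OR is lossy, different endings can fold to the same key, which is why the search side described later collates candidates against the body text.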

[0036] The leading character hash table generation/addition unit 54 and the index item table generation/addition unit 55 generate the index file from the main index words and auxiliary index words. The structure of the index file is substantially the same as in the first embodiment, except that the index words stored in the index item table are stored in compressed form.

[0037] FIG. 12 is a block diagram of a search processing device that uses the generated index file to search for a search pattern character string specified by the user. The device comprises a search pattern character string input unit 61, a simple index word cutout unit 62, an index word compression unit 63, an index file search unit 64, a text collation unit 65, and a result output unit 66. The search pattern character string is entered by an operator or the like, using a keyboard or similar device, in order to search documents. The search pattern character string input unit 61 consists of such a keyboard or other input device. The simple index word cutout unit 62 cuts out index words. The index word compression unit 63 then compresses the resulting index words, using the same compression method as when the index file was generated. The index file search unit 64 searches the index file with the compressed index words. Because the search uses compressed index words, the solution set obtained here may contain false matches. Therefore, the text collation unit 65 finally collates the candidates against the body text, and the result output unit 66 outputs the result.

[0038] <Operation> First, the index file is generated in advance. FIG. 13 is a flowchart of the index file generation processing. In step S1 the document to be searched is input, and in step S2 index words are cut out by simple cutout. The processing from step S3 onward is repeated for each index word that was cut out. First, W is taken as the index word (step S4) and the word length of W as L (step S5). The compression start position of the character string W is taken as p (step S6). Specifically, if, for example, the first few characters of W are kanji and the rest are hiragana, p is the position at which a hiragana character first appears. If p < L (step S7), W1 is set to W with its p-th and subsequent characters fold-compressed (step S8), and q is set to p plus the compression unit character count (step S9).

[0039] q is used as a generation suppression parameter when generating auxiliary index words. In theory, an index word W of word length L has L-1 auxiliary index words, namely the L-1 substrings each consisting of the i-th through the last characters of W. Here, however, character sequences are folded and compressed, so for the auxiliary index words that start at the p-th character or later, once as many auxiliary index words as the compression unit character count have been generated, the rest need not be generated. If p ≥ L in step S7, the latter part of W is not hiragana, so no compression is performed. In that case, W1 := W in step S10 and q := L in step S11.
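The effect of the suppression parameter q can be sketched as follows. The text does not fully pin down the indexing convention, so this sketch makes assumptions that should be read as such: offsets are 0-based, the auxiliary index word at offset i is `word[i:]`, and the loop bound is clamped to L-1 so that no empty suffix is produced. The names `first_hiragana` and `auxiliary_suffixes` are hypothetical.

```python
def first_hiragana(word):
    """Return p, the offset of the first hiragana character, or len(word) if none."""
    for i, ch in enumerate(word):
        if "\u3041" <= ch <= "\u3096":       # hiragana letters
            return i
    return len(word)


def auxiliary_suffixes(word, unit=4):
    """Generate suffixes of `word`, stopping at offset q = p + unit."""
    length = len(word)
    p = first_hiragana(word)
    q = p + unit if p < length else length   # steps S9 and S11
    last = min(q, length - 1)                # assumed clamp: skip the empty suffix
    return [word[i:] for i in range(1, last + 1)]


print(auxiliary_suffixes("検索されました", unit=2))
# ['索されました', 'されました', 'れました', 'ました']
```

With unit=2 the 7-character word yields 4 suffixes instead of the theoretical 6, while an all-kanji word such as "文書検索" still yields all L-1 = 3 suffixes.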

[0040] Next, the compressed form W1 of the index word W is registered in the index file. First, the leading character hash table is searched with the leading character a of W1 (step S12). If an index item table corresponding to the character a exists (step S13), the data for W1 is registered in that index item table (step S15). To do so, the index item table is searched with the key W1; if there is no data for W1, the appearance position information of W is newly stored under the key W1, and if data for W1 already exists, the appearance position information of W is added to the head of the appearance position information list. If no index item table exists in step S13, a new index item table corresponding to the character a is generated (step S14) before step S15 is executed.
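Steps S12 to S15 can be sketched with nested dictionaries. The in-memory layout (leading character to index item table, index item table to position list) and the name `register` are assumptions for illustration; the patent describes an index file rather than Python objects.

```python
def register(index, key, position):
    """Register `position` under `key`, creating tables as needed."""
    # Steps S12-S14: look up the index item table for the leading character,
    # creating an empty one if it does not exist yet.
    table = index.setdefault(key[0], {})
    # Step S15: store the position, adding it at the head of the list
    # when the key is already present.
    table.setdefault(key, []).insert(0, position)


index = {}
register(index, "検索", 10)
register(index, "検索", 42)
print(index)  # {'検': {'検索': [42, 10]}}
```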

[0041] Next, the auxiliary index words corresponding to the index word W are generated and registered in the index file. First, X0 := W1 in step S16 and i := 1 in step S17. Moving to FIG. 14, the following processing is repeated while i ≤ q (step S18). A substring X consisting of the i-th through the last characters of W is generated (step S19), and X1 is set to X with its hiragana string from the p-th character onward compressed (step S20). The leading character hash table is searched with the leading character a of X1 (step S21); if an index item table corresponding to the character a exists (step S22), that index item table is searched with the key X1 (step S23). If data for X1 already exists (step S24), a pointer to X0 is added to the end of its appearance position information list (step S25). Because the subsequent auxiliary index words are already registered in the index file, generation of auxiliary index words for the index word W stops here, and processing returns to step S3 to register the next index word.

[0042] If there is no data for X1 in step S24, a pointer to X0 is stored in the appearance position information list under the key X1 (step S27). Then X0 := X1 in step S28 and i := i + 1 in step S29, and processing returns to step S18. If there is no index item table corresponding to the character a in step S22, a new index item table corresponding to the character a is generated (step S26), and step S27 onward is executed. When all index words have been registered in the index file in this way, processing ends (step S30).

[0043] FIG. 15 is a flowchart of the processing for searching with the generated index file. First, the search pattern character string is input (step S1), and index words are cut out by simple cutout (step S2). The candidate set H of matching appearance position information is first set to empty (step S3). The search processing from step S4 onward is then repeated for every index word cut out in step S2, narrowing down the candidate set of solutions.

[0044] First, W is taken as the index word (step S5), and the compressed character string of W as W1 (step S6). The compression method is the same as when the index file was generated. Next, the leading character hash table is searched with the leading character a of W1 (step S7). If an index item table corresponding to the character a exists (step S8), that index item table is searched by prefix match on the key W1 (step S9). If there is data that prefix-matches the key W1 (step S10), the appearance position information lists of all such data are retrieved as H0 (step S11), and H := H ∩ H0 (step S12). In step S13, i := i + 1, and processing returns to step S4. When the search processing has been completed for all index words in this way, each item of appearance position information in H is collated against the body text (step S15), the final solution is output (step S16), and processing ends (step S17). If there is no index item table corresponding to the character a in step S8, or no data that prefix-matches the key W1 in step S10, H is emptied (step S14) and the search ends.
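The overall search flow of FIG. 15 can be sketched as follows, under several simplifying assumptions that are not taken from the patent: appearance position information is reduced to document identifiers so that the intersection is meaningful, the first index word initializes H instead of being intersected with an empty set, index words are used uncompressed, and the documents are held in a dictionary. The name `search` and its parameters are hypothetical.

```python
def search(index, texts, pattern, keys):
    """Intersect prefix-match results for each key, then verify against the text."""
    h = None                                  # candidate set H
    for key in keys:
        table = index.get(key[0])             # leading character hash table (S7, S8)
        h0 = set()
        if table is not None:
            for entry, docs in table.items(): # prefix match on the key (S9, S10)
                if entry.startswith(key):
                    h0.update(docs)           # collect H0 (S11)
        if not h0:                            # nothing matched: empty result (S14)
            return []
        h = h0 if h is None else h & h0       # H := H ∩ H0 (S12)
    if h is None:                             # no index words were given
        return []
    # Final collation against the body text removes false candidates (S15).
    return sorted(doc for doc in h if pattern in texts[doc])


texts = {1: "document retrieval system", 2: "document index",
         3: "retrieval of files", 4: "retrieval document"}
index = {"d": {"document": {1, 2, 4}}, "r": {"retrieval": {1, 3, 4}}}
print(search(index, texts, "document retrieval", ["document", "retrieval"]))  # [1]
```

Document 4 contains both words and survives the intersection, but it is rejected by the final collation because the words appear in the wrong order.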

[0045] <Effects of Specific Example 2> In Specific Example 2, search words are cut out by simple cutout based on character type. One of its features is therefore that index words can be extracted with much lighter processing than in Specific Example 1. Even so, because the cutout uses known rules of thumb about Japanese text, the beginning of a word is identified correctly with reasonably high probability. Auxiliary index words are then generated for the words cut out in this way and stored in the index file. Conventional techniques that build an index file from extracted words often had the drawback that character sequences that do not form a word could not be searched at high speed; with this example, searches in such cases are also sufficiently fast.

[0046] In Specific Example 2, appearance position information is stored for each word produced by dividing the body text in an approximate manner, and information on auxiliary index words is stored in addition. Therefore, compared with Specific Example 1, arbitrary character strings, apart from special characters such as punctuation marks, can be searched uniformly at high speed. On the other hand, unlike in Specific Example 1, information on attached words such as auxiliary verbs cannot be omitted even though such words are rarely search targets. Consequently, prefix-match operations must access multiple data items, and the index file tends to grow. However, Specific Example 2 can adopt an efficient compression method for each index word, such as compression of word endings, that also suppresses the generation of auxiliary index words, so the size of the index file can ultimately be kept small.

[Brief description of drawings]

FIG. 1 is a block diagram of an index file generation/addition processing device.

FIG. 2 is a block diagram of a search processing device.

FIG. 3 is an explanatory diagram of the structure of an index file.

FIG. 4 is an explanatory diagram of the stored data format of the leading character hash table.

FIG. 5 is an explanatory diagram of the stored data format of the index item table.

FIG. 6 is a flowchart of the index file generation processing (part 1).

FIG. 7 is a flowchart of the index file generation processing (part 2).

FIG. 8 is a flowchart of the search processing (part 1).

FIG. 9 is a flowchart of the search processing (part 2).

FIG. 10 is a flowchart of the index file search operation for independent words (process A).

FIG. 11 is a block diagram of an index file generation/addition processing device.

FIG. 12 is a block diagram of a search processing device.

FIG. 13 is a flowchart of the index file generation processing (part 1).

FIG. 14 is a flowchart of the index file generation processing (part 2).

FIG. 15 is a flowchart of the search processing.

[Explanation of symbols]

1 Search target document input unit
2 Morphological analysis unit
3 Main index word extraction unit
4 Auxiliary index word generation unit
5 Leading character hash table generation/addition unit
6 Index item table generation/addition unit

Claims

[Claims]

1. A document search system comprising: a morphological analysis unit that accepts a digitized document to be searched and obtains information about the words and phrases making up the document; a main index word extraction unit that extracts, from the output of the morphological analysis unit, only meaningful words and phrases that can be search targets as main index words; and an index file generation unit that generates an index file in which each main index word is associated with its appearance position information in the document.

2. A document search system comprising: a phrase cutout unit that accepts a digitized document to be searched and, focusing on the character types of the characters making up the document, cuts out the phrases making up the document as main index words; and an index file generation unit that generates an index file in which each main index word is associated with its appearance position information in the document.

3. The document search system according to claim 1, wherein substrings of a main index word that begin at its second or a later character are used as auxiliary index words, and each auxiliary index word is stored in the index file in association with appearance position information in the document.

4. The document search system according to claim 2, wherein, when a main index word or an auxiliary index word consists of a non-hiragana character string followed by a hiragana character string, the trailing hiragana character string is compressed so as to reduce the word length of the main index word or the auxiliary index word.