JPH1196184A

JPH1196184A - Method and system for retrieving whole sentence

Info

Publication number: JPH1196184A
Application number: JP9270628A
Authority: JP
Inventors: Emi Ikeda; 恵美池田; Kumiko Wada; 久美子和田; Kohaku Morita; 幸伯森田
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 1997-09-17
Filing date: 1997-09-17
Publication date: 1999-04-09

Abstract

PROBLEM TO BE SOLVED: To provide a full-text search method capable of shortening the search time. SOLUTION: When a query is divided into a plurality of search character strings, it is judged whether an index word that exactly matches the first search character string exists (step S13). When such an index word exists, its expanded word position list and its extracted word position list are read in (steps S15, S16). For the second and subsequent search character strings, only the extracted word position list is read in (step S16). For the final search character string, it is checked whether an index word that prefix-matches the search character string exists (step S18). When such an index word exists, only its extracted word position list is read in (step S21), the position lists read so far are merged (step S22), and the merged result is output.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a full-text search method and a full-text search system that search for character strings in documents by using an index file.

[0002]

[Description of the Related Art] In recent years, computers and word processors have become widespread, electronic publishing on CD-ROM has advanced, and large volumes of documents are being digitized. With the improvement of communication network environments, exchanges of electronic documents over networks have also increased, and the volume of such documents and the opportunities to handle them keep growing.

[0003] Various means for handling these documents efficiently and easily have therefore become necessary. One of them is information retrieval technology, which quickly finds only the necessary items in an enormous volume of documents. Because it allows the resources of digitized documents to be put to better use and further raises work efficiency, the need for this technology is great.

[0004] In view of portability and cost, there is also a strong need for applications that implement this technology in software, and various products have been developed. Many of them generate a search auxiliary file (index file) for high-speed search from the documents (data) in advance and use this index file for the actual search.

[0005] In general, an index file is created in a form that associates index words with reference information for the data. Methods such as signature files and character component tables are used to make storage in, and operations on, the recording device efficient. A search of an index file created in this way is performed by prefix matching, in which an index word and the search character string (query) are compared character by character from the beginning.

[0006] However, for a search using a regular expression, for example a search for character strings that end with “理論” ("theory"), the above search structure ultimately has to compare the query against every index word. This is because the comparison is a suffix match rather than a prefix match. The same applies to a query that requires a partial match, in which the characters “理論” appear in the middle of a string.

[0007] The simplest remedy is the partial expansion method: when the index file is created, the partial character strings obtained by shifting the start of an index word one character at a time are also registered as index words. For example, for 「言語理論」 ("language theory"), the partial character strings 「語理論」, 「理論」, and 「論」 are also registered as index words. Because index words can be generated mechanically by shifting one character at a time, the implementation is very simple, and with this configuration results can be obtained quickly and easily even for search requests such as “〜理論” (any string ending in “理論”).
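The partial expansion described above amounts to enumerating the suffixes of a word. A minimal sketch, assuming the index word is available as a Python string (the function name `partial_expansion` is a hypothetical helper, not a name from the patent):

```python
def partial_expansion(word: str) -> list[str]:
    """Return the word itself followed by every suffix obtained by
    dropping one more leading character each time."""
    return [word[k:] for k in range(len(word))]


print(partial_expansion("言語理論"))
# ['言語理論', '語理論', '理論', '論']
```

With every suffix registered, a query for strings ending in 「理論」 becomes an ordinary lookup of the index word 「理論」.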

[0008]

[Problems to be Solved by the Invention] However, when the index is implemented as described above, the larger the document data becomes, the larger the number of pieces of position information registered for a single index word becomes. For a large-scale index, it is practically difficult to expand the entire index in memory, and the index is generally built on secondary storage, so stored position information often has to be read out from secondary storage. If a considerable number of pieces of position information must be read, the read time is also considerable.

[0009] In addition, when a query is divided into a plurality of index words, the offsets must be checked against one another for the results (position information) of the index words obtained by the division. The computational complexity of this is n log n; the merge processing time is strongly affected by the number of pieces of position information retrieved and is reflected in the search time.
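As an illustration of the offset check, here is a minimal sketch of merging two position lists. It assumes that positions are character offsets into the text and that the second index word must start exactly `offset` characters after the first; the patent does not specify the position format, and `merge_by_offset` is a hypothetical name. The lists are sorted first and then scanned once.

```python
def merge_by_offset(acc: list[int], nxt: list[int], offset: int) -> list[int]:
    """Keep each start position p in acc for which p + offset occurs in nxt."""
    acc_sorted = sorted(acc)      # sorting dominates the cost
    nxt_sorted = sorted(nxt)
    out: list[int] = []
    j = 0
    for p in acc_sorted:          # single linear scan over both lists
        target = p + offset
        while j < len(nxt_sorted) and nxt_sorted[j] < target:
            j += 1
        if j < len(nxt_sorted) and nxt_sorted[j] == target:
            out.append(p)
    return out


print(merge_by_offset([0, 10, 20], [1, 21, 30], 1))  # [0, 20]
```

The work grows with the number of positions read, which is why reading fewer position lists shortens both the read and the merge.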

[0010] For these reasons, a full-text search method that can shorten the search time has been desired.

[0011]

[Means for Solving the Problems] To solve the above problems, the present invention adopts the following configurations. <Configuration of Claim 1> A full-text search method in which index words are set, for character strings appearing in a document, from extracted words, each representing the original form of a character string extracted by a predetermined index word extraction rule, and expanded words, each representing a partial character string obtained by shifting the start of an extracted word one character at a time, and in which a search for an arbitrary search character string in the document is performed using these index words, the method being characterized in that, when an arbitrary search character string is searched for, the search character string is divided by the index word extraction rule; for the first of the resulting character strings, both the extracted words and the expanded words of the index words are searched; for the second and subsequent character strings, only the extracted words are searched; the search results are merged; and for the last character string, prefix matching against the index words is performed.

[0012] <Explanation of Claim 1> To shorten the search time, it is desirable that the set of position information obtained from the index words be accurate as a set of solution candidates, that is, that it have high precision. In the invention of claim 1, therefore, when the search character string is divided, both the extracted words and the expanded words are searched only for the first character string, and only the extracted words are searched for the second and subsequent character strings. This reduces the number of position lists to be read and yields solution candidates with higher precision.

[0013] The predetermined index word extraction rule may be of any kind; the same rule is used both to generate the index words and to divide the search character string. An extracted word is a character string extracted by this index word extraction rule, and an expanded word is a partial character string obtained by partially expanding an extracted word. When the search character string is divided, the second and subsequent character strings are character strings extracted by this index word extraction rule, so they can only be extracted words. Searching only the extracted words for them therefore reduces the number of position lists to be read.

[0014] As a result, in the merge processing, whose computational cost is proportional to the length of the arrays being merged, the arrays merged with the accumulated result become shorter, so the computation is faster. The time required for reading is also shortened, and the search processing as a whole is faster. The more search character strings are cut out of the query, the greater this effect.

[0015] <Configuration of Claim 2> The full-text search method according to claim 1, characterized in that, when dividing the search character string by the predetermined index word extraction rule yields only one character string, prefix matching against the index words is performed and both the extracted words and the expanded words of the index words are searched.

[0016] <Explanation of Claim 2> In the invention of claim 2, when the division yields only one search character string, both the extracted words and the expanded words are searched, as for the first character string of a query divided into several, and the index words are compared by prefix matching, as for the last character string. Thus, even when the index word extraction rule yields only one character string, the search method for queries divided into several character strings can be applied.

[0017] <Configuration of Claim 3> A full-text search system characterized by comprising: an index file that stores index words composed, for character strings appearing in a document, of extracted words, each representing the original form of a character string extracted by a predetermined index word extraction rule, and expanded words, each representing a partial character string obtained by shifting the start of an extracted word one character at a time; a search character string extraction unit that, when an arbitrary search character string is input as a search target in the document, divides the search character string by the index word extraction rule; and a search processing unit that, for the character strings divided by the search character string extraction unit, searches both the extracted words and the expanded words of the index words in the index file for the first character string, searches only the extracted words for the second and subsequent character strings, merges the search results, and performs prefix matching against the index words for the last character string.

[0018] <Explanation of Claim 3> The invention of claim 3 is a full-text search system that implements the full-text search method of claim 1. The index file is a file that stores the index words extracted by the predetermined index word extraction rule. The search character string extraction unit is a functional unit that divides the search character string using the same index word extraction rule that was used to extract the index words. The search processing unit is a functional unit that performs the actual search using the index words in the index file.

[0019] With this configuration, in the merge processing, whose computational cost is proportional to the length of the arrays being merged, the arrays merged with the accumulated result become shorter, so the computation is faster. The time required for reading is also shortened, and the search processing as a whole is faster. The more search character strings are cut out of the query, the greater this effect.

[0020] <Configuration of Claim 4> The full-text search system according to claim 3, characterized by comprising a search processing unit that, when the search character string extraction unit yields only one character string, performs prefix matching against the index words and searches both the extracted words and the expanded words of the index words.

[0021] <Explanation of Claim 4> The invention of claim 4 adds to the search processing unit a function for the case where the division yields only one search character string: both the extracted words and the expanded words are searched, as for the first character string of a query divided into several, and the index words are compared by prefix matching, as for the last character string. This makes it possible to handle the case where the index word extraction rule yields only one character string.

[0022]

[Embodiments of the Invention] Embodiments of the present invention are described in detail below with reference to the drawings.

[0023] <<Principle of the Present Invention>> When a query is divided by the predetermined index word extraction rule into n search character strings (S1, S2, ..., Sn), the search result for the query is obtained as the logical AND of the search results of the individual search character strings. However, the search result of each search character string is the set of occurrence positions of the corresponding index word itself, so it is quite likely to contain positions that are not appropriate for the query.

[0024] In terms of the relationship between search character strings and index words, the search character strings S2 to Sn are character strings extracted by the index word extraction processing, so they can only be extracted words. The search character string S1, on the other hand, may be either an extracted word or an expanded word. This specific example exploits that fact: for the search character strings S2 to Sn, only the position list of the extracted word of the corresponding index word needs to be read. This reduces the number of position lists to be read and yields solution candidates with higher precision.

[0025] <<Specific Example>> <Configuration> FIG. 1 is a flowchart showing a specific example of the full-text search method of the present invention. Before describing it, the full-text search system that implements the method is described.

[0026] FIG. 2 is a configuration diagram of the full-text search system. The illustrated system is implemented on a computer and consists of an index file 1, text data 2, a search expression processing unit 3, an input device 6, and an output device 7.

[0027] The index file 1 stores pairs of an index word and a set of occurrence positions, as described later. The text data 2 is the body text used when comparison against the body text is required. The index file 1 and the text data 2 are stored in an external storage device.

[0028] The search expression processing unit 3 is implemented by the processor and memory of the computer, the programs executed by the processor, and so on, and consists of a search character string extraction unit 4 and a search processing unit 5. The search character string extraction unit 4 has a function of extracting index words from the search expression taken in through the input device 6. The search processing unit 5 has a function of searching the index file 1 for each search character string extracted by the search character string extraction unit 4.

[0029] The input device 6 is an input unit that accepts input through a user interface such as a keyboard or mouse, or through communication means. The output device 7 is a device for outputting the results extracted from the reference information of the data, and consists of a display device such as a CRT, a printer, or an output unit that outputs through communication means.

[0030] Next, the configuration of the index file 1 is described. In this system, the index is generated in a form that allows an arbitrary character string to be searched for efficiently by prefix matching against the index words. As described above, the index file 1 stores pairs of an index word and a set of occurrence positions. An index word is generated from a character string extracted from the text data 2, and its set of occurrence positions is the set of all positions at which the index word character string occurs in the text data 2.

[0031] In this system, not only the character strings extracted from the text data 2 by the index word extraction processing (performed, for example, by morphological analysis) but also the remaining partial character strings obtained by shifting the start of each such string one character at a time (the partially expanded strings) are registered as index words. Index words are thus distinguished by how they are generated: an original index word extracted by the index word extraction processing is called an extracted word, and an index word obtained by partially expanding an extracted word is called an expanded word.

[0032] FIG. 3 is an explanatory diagram of extracted words and expanded words. In the illustrated example, when the character string “情報検索” ("information retrieval") is extracted by the index word extraction processing, four index words are generated: “情報検索”, “報検索”, “検索”, and “索”. They are classified into one extracted word (“情報検索”) and three expanded words (“報検索”, “検索”, and “索”).

[0033] Each such index word is paired with its set of occurrence positions to make up the index file. FIG. 4 is an explanatory diagram showing pairs of an index word and a set of occurrence positions. As illustrated, for each index word, a list of its occurrence positions as an extracted word and a list of its occurrence positions as an expanded word are stored, together with the number of positions in each list.

[0034] In the example of FIG. 4, the extracted-word entries for the index word “情報検索” are occurrences where the index word extraction processing extracted “情報検索” itself, and the expanded-word entries are occurrences generated by partially expanding phrases such as “ネットワーク情報検索” ("network information retrieval") or “特許情報検索” ("patent information retrieval"). That is, when “情報検索” is an expanded word, the extracted word is, for example, “ネットワーク情報検索”, and “情報検索” was obtained as a partial character string of it. As illustrated, the occurrence positions are sorted into those of the extracted word and those of the expanded word and stored in the index file 1.
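The index layout described above, with one extracted-word position list and one expanded-word position list per index word, can be sketched as follows. This is only an illustration under stated assumptions: the tokenizer output is given as (start offset, extracted word) pairs, a position is the character offset at which the index word string itself begins, and `Entry` and `build_index` are hypothetical names. The counts stored alongside the lists in FIG. 4 correspond here to the list lengths.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Entry:
    """Position lists of one index word, split by how the word was generated."""
    extracted: list[int] = field(default_factory=list)  # extracted as-is by the rule
    expanded: list[int] = field(default_factory=list)   # suffix of a longer extracted word


def build_index(tokens: list[tuple[int, str]]) -> dict[str, Entry]:
    """Build the index from (start offset, extracted word) pairs."""
    index: defaultdict[str, Entry] = defaultdict(Entry)
    for pos, word in tokens:
        index[word].extracted.append(pos)
        for k in range(1, len(word)):                    # partial expansion
            index[word[k:]].expanded.append(pos + k)
    return dict(index)


# Hypothetical tokenizer output: two extracted words and their start offsets.
index = build_index([(0, "特許情報検索"), (10, "情報検索")])
entry = index["情報検索"]
print(entry.extracted, entry.expanded)  # [10] [2]
```

Here “情報検索” occurs once as an extracted word (offset 10) and once as an expanded word (offset 2, inside “特許情報検索”), mirroring the split shown for FIG. 4.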

[0035] <Operation> FIG. 5 is a flowchart showing the overall flow of the full-text search processing. First, when a query is input from the input device 6 (step S1), the search character string extraction unit 4 divides the query into the units in which the index file 1 is searched (step S2). Next, the search processing unit 5 searches the index file 1 for each of the divided search units and merges the search results of the search units. Unless the search result becomes 0 partway through, this is done for all search units (step S3). When the search processing ends, the search processing unit 5 outputs the result from the output device 7 (step S4) and ends the processing.

[0036] Next, the search for each divided search unit and the merging of the accumulated results in step S3 are described in detail with reference to FIG. 1.

[0037] FIG. 6 is an explanatory diagram of the search conditions for a query. As illustrated, for the first search character string (S1), the extracted word position list and the expanded word position list are both read, whereas for the second and subsequent search character strings (S2 to Sn), only the extracted word position list is read. As for the matching method, the last search character string is compared by prefix matching and all other search character strings are compared by exact matching.
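The search conditions of FIG. 6 depend only on where a search character string sits in the divided query. A minimal sketch, assuming zero-based indices and using hypothetical labels for the match modes and list names:

```python
def search_conditions(i: int, n: int) -> tuple[str, tuple[str, ...]]:
    """Return (match mode, position lists to read) for the i-th of n strings."""
    match = "prefix" if i == n - 1 else "exact"      # only the last string is prefix-matched
    lists = ("extracted", "expanded") if i == 0 else ("extracted",)
    return match, lists


for i in range(3):
    print(i, search_conditions(i, 3))
# 0 ('exact', ('extracted', 'expanded'))
# 1 ('exact', ('extracted',))
# 2 ('prefix', ('extracted',))
```

When the query yields a single string (n = 1), the same rule returns prefix matching with both lists, which corresponds to the single-string case of claims 2 and 4.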

[0038] Suppose the query given as the search target is “Ｃ言語プログラミング” ("C language programming"). First, suppose the search character string extraction unit 4 divides this query, by the same extraction rule as the index word extraction processing, into three search character strings: “Ｃ”, “言語” ("language"), and “プログラミング” ("programming").

[0039] The search processing unit 5 first resets the counter i of the search target (step S11) and starts the search with the first search character string (i = 0), “Ｃ”. In step S12 it determines whether i < n − 1, that is, whether the current search character string comes before the last one. Since “Ｃ” is not the last character string, the processing proceeds to step S13.

[0040] In step S13, the index is searched for an index word Ind that exactly matches “Ｃ”. If no such index word is registered in the index, it is regarded that there is no matching location, that result is output, and the search processing ends.

[0041] If step S13 finds an index word Ind that exactly matches “Ｃ”, the processing proceeds to step S14, which determines whether the current search character string is the first character string of the query. Since “Ｃ” is the first search character string, the expanded word position list of the index word Ind is read into the buffer Buf(i) (step S15), and the extracted word position list of the index word Ind is also read into the buffer Buf(i) (step S16). The counter i is then incremented (step S17), and the processing returns to step S12.

[0042] Next, the index is searched for the search character string “言語”. Since “言語” is not the last search character string either, step S13 looks for an index word that exactly matches “言語”. If there is none, a result to that effect is output. If step S13 finds an exactly matching index word Ind, the processing proceeds to step S14. Since “言語” is not the first search character string, the processing moves to step S16, and only the extracted word position list of the index word Ind is read into the buffer Buf(i). The counter i is then incremented (step S17), and the processing returns to step S12.

[0043] Next, the index is searched for the search character string “プログラミング”. Since “プログラミング” is the last search character string, the processing moves from step S12 to step S18. Step S18 finds all index words that prefix-match “プログラミング”. If there is no such index word Ind at all, a result to that effect is output.

[0044] Next, step S19 determines whether the search character string “プログラミング” is the first character string. Since it is not, the processing proceeds to step S21, and only the extracted word position list of the index word Ind is read into the buffer Buf(i). The buffers Buf(0) to Buf(n − 1) are then merged (step S22), and the result is output. The search character string is the first one in step S19 only when the query consists of a single character string; in that case the processing proceeds to step S20, and the expanded word position list of the index word Ind is also read into the buffer Buf(i).
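Steps S11 to S22 can be sketched end to end as below. This is a simplified illustration, not the patented implementation, and it rests on several assumptions: the index is an in-memory dictionary rather than a file on secondary storage, positions are character offsets, the divided strings are contiguous in the text, the prefix match scans all index words linearly, and the merge uses set membership instead of a sorted-list merge. `Entry`, `search`, and the sample `index` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    extracted: list[int] = field(default_factory=list)  # extracted word position list
    expanded: list[int] = field(default_factory=list)   # expanded word position list


def search(index: dict[str, Entry], parts: list[str]) -> list[int]:
    """Return the start offsets at which the divided query occurs."""
    n = len(parts)
    if n == 0:
        return []
    bufs: list[set[int]] = []
    for i, s in enumerate(parts):                        # S11 / S17: counter i
        if i < n - 1:                                    # S12: not the last string
            hits = [index[s]] if s in index else []      # S13: exact match
        else:                                            # S18: prefix match
            hits = [e for w, e in index.items() if w.startswith(s)]
        if not hits:
            return []                                    # no matching location
        buf: set[int] = set()
        for e in hits:
            buf.update(e.extracted)                      # S16 / S21
            if i == 0:                                   # S14 / S19: first string
                buf.update(e.expanded)                   # S15 / S20
        bufs.append(buf)
    result, offset = bufs[0], 0                          # S22: merge by offset
    for i in range(1, n):
        offset += len(parts[i - 1])
        result = {p for p in result if p + offset in bufs[i]}
    return sorted(result)


# Hypothetical index contents for illustration only.
index = {
    "Ｃ": Entry(extracted=[0, 40]),
    "言語": Entry(extracted=[1, 20], expanded=[31]),
    "プログラミング": Entry(extracted=[3]),
    "プログラミング入門": Entry(extracted=[50]),
}
print(search(index, ["Ｃ", "言語", "プログラミング"]))  # [0]
print(search(index, ["言語"]))                         # [1, 20, 31]
```

In the three-string query the expanded position 31 of “言語” is never read, while the single-string query reads both lists, as in step S20.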

[0045] In this specific example, therefore, for the second and subsequent character strings of the query, the position list of the expanded word is not read; only the position list of the extracted word is read. This avoids wasted merging in the merge processing of step S22. That is, because the index word extraction rule divides a character string of the form “Ｃ言語…” into “Ｃ” and “言語”, “言語” can only be an extracted word.

[0046] In the search processing of the above specific example, the position information lists of the character strings obtained by dividing the query are merged together at the end, but the search processing may instead merge the occurrence position lists of the character strings as it goes.
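The merge-as-you-go variant can be sketched as follows. The sketch assumes a hypothetical callback `read_positions(i, s)` that returns the positions for the i-th search character string under the conditions already described (both lists for the first string, prefix matching for the last), that positions are character offsets, and that the divided strings are contiguous. Stopping as soon as the accumulated result is empty means later position lists are never read.

```python
from typing import Callable, Iterable


def search_incremental(parts: list[str],
                       read_positions: Callable[[int, str], Iterable[int]]) -> list[int]:
    """Merge each position list into the accumulated result as it is read."""
    acc: set[int] | None = None
    offset = 0
    for i, s in enumerate(parts):
        positions = set(read_positions(i, s))
        if acc is None:
            acc = positions                              # first string: start the buffer
        else:
            acc = {p for p in acc if p + offset in positions}
        if not acc:
            return []                                    # accumulated result is 0: stop early
        offset += len(s)
    return sorted(acc) if acc else []


# Hypothetical position lists; `reads` records which lists were actually read.
lists = {"Ｃ": [0, 40], "言語": [20], "プログラミング": [3]}
reads: list[str] = []


def reader(i: int, s: str) -> list[int]:
    reads.append(s)
    return lists[s]


print(search_incremental(["Ｃ", "言語", "プログラミング"], reader))  # []
print(reads)  # ['Ｃ', '言語']
```

In this run the list for “プログラミング” is never requested, because the accumulated result is already empty after “言語”.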

【００４７】図７は、このような出現位置情報リストを
マージしながら処理する場合のフローチャートである。
ここで、クエリを上記の例と同様に、“Ｃ言語プログラ
ミング”であるとする。そして、クエリ“Ｃ言語プログ
ラミング”は、索引語抽出処理により、三つの検索文字
列“Ｃ”“言語”“プログラミング”に分割されたとす
る。FIG. 7 is a flowchart of the processing performed when the appearance position information lists are merged as they are read. Here, as in the above example, the query is assumed to be "C language programming", and the index word extraction processing is assumed to divide it into three search character strings: "C", "language", and "programming".
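The division of the query into search character strings can be sketched as below. The actual index word extraction rule is not spelled out in this passage, so this example assumes, purely for illustration, that a string is cut wherever the character class (Latin letters, kanji, katakana, hiragana) changes. The function names `char_class`, `split_query`, and `expanded_words` are hypothetical.

```python
import unicodedata
from itertools import groupby

def char_class(ch):
    """Classify a character by script, using its Unicode name (assumed rule)."""
    name = unicodedata.name(ch, "")
    if name.startswith("CJK UNIFIED IDEOGRAPH"):
        return "kanji"
    if name.startswith("KATAKANA"):
        return "katakana"
    if name.startswith("HIRAGANA"):
        return "hiragana"
    if "LATIN" in name:          # covers both ASCII and full-width Latin letters
        return "latin"
    return "other"

def split_query(query):
    """Cut the query wherever the character class changes."""
    return ["".join(group) for _, group in groupby(query, key=char_class)]

def expanded_words(extracted):
    """Expanded words: the extracted word with its first 1, 2, ... characters removed."""
    return [extracted[k:] for k in range(1, len(extracted))]

print(split_query("Ｃ言語プログラミング"))
print(expanded_words("言語"))
```

Under this assumed rule, the Japanese query corresponding to "C language programming" yields the three strings used in the walk-through, and each extracted word contributes its shifted substrings as expanded words.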

【００４８】図７のフローチャートにおいて、ステップ
Ｓ３１〜Ｓ３５の処理は、上述した図１のフローチャー
トのステップＳ１１〜Ｓ１５の処理と同様である。そし
て、ステップＳ３４において、検索文字列が最初の文字
列であった場合は、ステップＳ３５に移行し、索引語Ｉ
ｎｄの展開語位置リストと抽出語位置リストとをバッフ
ァＢｕｆに読み込む。その後、カウンタｉをインクリメ
ントして（ステップＳ３８）、ステップＳ３２に戻る。In the flowchart of FIG. 7, the processing of steps S31 to S35 is the same as that of steps S11 to S15 in the flowchart of FIG. 1 described above. If the search character string is the first character string in step S34, the process proceeds to step S35, where the expanded word position list and the extracted word position list of the index word Ind are read into the buffer Buf. The counter i is then incremented (step S38), and the process returns to step S32.

【００４９】次に、検索文字列“言語”について索引を
検索する。検索文字列“言語”も最後の検索文字列では
ないので、ステップＳ３３において、索引中に“言語”
と完全一致するものを探す。ここで、該当するものがな
ければ、その旨を出力する。一方、ステップＳ３３にお
いて、完全一致する索引語Ｉｎｄがあった場合は、ステ
ップＳ３４に進む。検索文字列“言語”は最初の検索文
字列ではないので、今度はステップＳ３６に移行し、索
引語Ｉｎｄの抽出語位置リストをバッファＢｕｆとマー
ジしながら読み込む。そして、バッファＢｕｆの累積結
果が０か、即ち、“言語”の出現位置リストと、“Ｃ”
の出現位置リストとで、文書中のオフセットした位置に
基づく同一位置になかった場合は、該当する場所がない
として結果を出力する。また、ステップＳ３７におい
て、累積結果が０でなかった場合は、ステップＳ３８に
移行する。Next, the index is searched for the search character string "language". Since "language" is not the last search character string either, in step S33 the index is searched for an index word that exactly matches "language". If there is none, that fact is output. If an exactly matching index word Ind is found in step S33, the process proceeds to step S34. Since "language" is not the first search character string, the process this time proceeds to step S36, where the extracted word position list of the index word Ind is read while being merged with the buffer Buf. It is then determined whether the accumulated result in the buffer Buf is 0 (step S37). If it is 0, that is, if the appearance position list of "language" and that of "C" have no position in common after adjusting for the offset within the document, the result is output as having no matching location. If the accumulated result is not 0 in step S37, the process proceeds to step S38.

【００５０】次に、検索文字列“プログラミング”につ
いて索引を探索する。ここで、検索文字列“プログラミ
ング”は最後の検索文字列であるため、ステップＳ３２
からステップＳ３９に移行する。ステップＳ３９では、
上述した最後にマージする場合のステップＳ１８と同様
に、索引中に“プログラミング”と前方一致するものを
全て探し出す。該当する索引語Ｉｎｄが一つもなけれ
ば、その旨を結果出力する。Next, the index is searched for the search character string "programming". Since "programming" is the last search character string, the process moves from step S32 to step S39. In step S39, as in step S18 of the merge-at-the-end case described above, the index is searched for all index words that forward-match "programming". If there is no such index word Ind, that fact is output as the result.

【００５１】次に、ステップＳ４０において、検索文字
列“プログラミング”は最初の文字列であるかを判定す
るが、この検索文字列“プログラミング”は最初の文字
列ではないため、ステップＳ４２に進み、索引語Ｉｎｄ
の抽出語位置リストをバッファＢｕｆとマージしながら
読み込み、その結果を出力する。また、ステップＳ４０
において、検索文字列が最初の文字列であった場合は、
ステップＳ４１において、索引語Ｉｎｄの展開語位置リ
ストと抽出語位置リストをバッファＢｕｆに読込み、こ
れを該当場所として出力する。Next, in step S40, it is determined whether the search character string "programming" is the first character string. Since it is not, the flow advances to step S42, where the extracted word position list of the index word Ind is read while being merged with the buffer Buf, and the result is output. If the search character string is the first character string in step S40, then in step S41 the expanded word position list and the extracted word position list of the index word Ind are read into the buffer Buf and output as the matching locations.
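The merge-while-reading flow of FIG. 7 can be sketched as follows, using a single accumulating buffer and stopping as soon as the accumulated result becomes empty. This is only an illustration under assumed data structures: the index layout, the names `build_index` and `search_merge_while_reading`, and the toy document `TOKENS` are hypothetical, and step numbers in the comments are a loose mapping to the flowchart.

```python
# Hypothetical toy document "Ｃ言語プログラミングと自然言語プログラミング",
# already split into (position, extracted word) pairs.
TOKENS = [(0, "Ｃ"), (1, "言語"), (3, "プログラミング"),
          (10, "と"), (11, "自然言語"), (15, "プログラミング")]

def build_index(tokens):
    """Map each index word to its extracted ("ext") and expanded ("exp") position lists."""
    index = {}
    for pos, word in tokens:
        for k in range(len(word)):
            kind = "ext" if k == 0 else "exp"   # k > 0: word with first k chars removed
            entry = index.setdefault(word[k:], {"ext": [], "exp": []})
            entry[kind].append(pos + k)
    return index

def search_merge_while_reading(index, parts):
    """Merge each position list into one buffer as it is read."""
    buf = None     # single accumulated buffer Buf
    offset = 0     # total length of the query strings already processed
    for i, part in enumerate(parts):
        is_first, is_last = i == 0, i == len(parts) - 1
        if is_last:
            hits = [w for w in index if w.startswith(part)]   # S39: forward match
        else:
            hits = [part] if part in index else []            # S33: exact match
        if not hits:
            return []
        # S35/S41 read both lists for the first string; S36/S42 read extracted only.
        kinds = ("ext", "exp") if is_first else ("ext",)
        positions = {p - offset for w in hits for kind in kinds
                     for p in index[w][kind]}
        buf = positions if is_first else buf & positions      # merge while reading
        if not buf:
            return []                                         # S37: accumulated result is 0
        offset += len(part)
    return sorted(buf) if buf else []

INDEX = build_index(TOKENS)
print(search_merge_while_reading(INDEX, ["Ｃ", "言語", "プログラミング"]))
```

For a query such as "と", "言語", "プログラミング", the buffer becomes empty after the second string, so the list for the third string is never read.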

【００５２】このように、マージしながら検索処理を行
う場合は、上記図１に示したような最後にまとめてマー
ジする場合に比べて、バッファの容量が小さいもので済
むという効果がある。As described above, when the search processing is performed while merging, a smaller buffer capacity suffices than when the merging is performed all together at the end as shown in FIG. 1.

【００５３】また、検索式処理部３における上記のよう
な全ての処理は、このような検索式処理部３の役割を実
現するコンピュータのプログラムによる制御で実現する
ことができる。従って、そのプログラムをフロッピーデ
ィスクやＣＤ−ＲＯＭ等の記録媒体に記録してから、一
般の該当するコンピュータにインストールするといった
方法や、ネットワークを経由してプログラムをダウンロ
ードするといった方法を用いることで本発明の全文検索
方法および全文検索システムを実現することができる。All of the above-described processing in the search formula processing unit 3 can be realized under the control of a computer program that implements the role of the search formula processing unit 3. Therefore, the full-text search method and full-text search system of the present invention can be realized by recording the program on a recording medium such as a floppy disk or a CD-ROM and then installing it on a suitable general-purpose computer, or by downloading the program via a network.

【００５４】〈効果〉以上のように、本具体例によれ
ば、検索文字列で分割された場合の２番目以降の文字列
に関しては、抽出語の位置リストのみ読み込むようにし
たので、計算量が対象の配列の長さに比例する（ｎｌｏ
ｇｎ：但し、ｎは配列の長さ）マージ処理において、累
積結果とのマージ処理の配列長が小さくなるため、計算
の高速化を図ることができる。また、読み出しにかかる
時間を短縮することができ、検索処理全体の高速化を図
ることができる。これは、検索文字列から切り出される
検索文字列の個数が多ければ多いほど、大きな効果を得
ることができる。<Effects> As described above, according to this specific example, only the extracted word position list is read for the second and subsequent character strings obtained by dividing the search character string. In the merge processing, whose amount of calculation grows with the length of the arrays involved (n log n, where n is the array length), the arrays merged with the accumulated result therefore become shorter, so the calculation can be sped up. The time required for reading is also reduced, which speeds up the search processing as a whole. The effect becomes greater as the number of character strings cut out from the search character string increases.
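The reduction in data read can be illustrated by counting the position entries fetched under the two policies. This is a rough sketch with an assumed toy index; the names `build_index` and `positions_read` and the document `TOKENS` are hypothetical, and the count stands in for array length rather than measuring actual merge time.

```python
# Hypothetical toy document "Ｃ言語プログラミングと自然言語プログラミング",
# already split into (position, extracted word) pairs.
TOKENS = [(0, "Ｃ"), (1, "言語"), (3, "プログラミング"),
          (10, "と"), (11, "自然言語"), (15, "プログラミング")]

def build_index(tokens):
    """Map each index word to its extracted ("ext") and expanded ("exp") position lists."""
    index = {}
    for pos, word in tokens:
        for k in range(len(word)):
            kind = "ext" if k == 0 else "exp"
            entry = index.setdefault(word[k:], {"ext": [], "exp": []})
            entry[kind].append(pos + k)
    return index

def positions_read(index, parts, extracted_only_after_first):
    """Count position entries read for a query under the chosen policy."""
    total = 0
    for i, part in enumerate(parts):
        if i == len(parts) - 1:
            hits = [w for w in index if w.startswith(part)]   # forward match
        else:
            hits = [part] if part in index else []            # exact match
        if i == 0 or not extracted_only_after_first:
            kinds = ("ext", "exp")      # read both lists
        else:
            kinds = ("ext",)            # read the extracted word list only
        total += sum(len(index[w][kind]) for w in hits for kind in kinds)
    return total

INDEX = build_index(TOKENS)
QUERY = ["Ｃ", "言語", "プログラミング"]
print(positions_read(INDEX, QUERY, extracted_only_after_first=True))
print(positions_read(INDEX, QUERY, extracted_only_after_first=False))
```

Even in this tiny example the extracted-only policy reads fewer entries, because the expanded-word occurrence of "言語" inside "自然言語" is skipped; with long documents and many query strings the difference would be expected to grow.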

[Brief description of the drawings]

【図１】本発明の全文検索方法における最後にマージす
る場合の検索処理のフローチャートである。FIG. 1 is a flowchart of a search process when merging is performed last in a full-text search method according to the present invention.

【図２】本発明の全文検索方法を実現するための全文検
索システムの構成図である。FIG. 2 is a configuration diagram of a full-text search system for realizing the full-text search method of the present invention.

【図３】本発明の全文検索方法における抽出語と展開語
の説明図である。FIG. 3 is an explanatory diagram of extracted words and expanded words in the full-text search method of the present invention.

【図４】本発明の全文検索方法における索引語と出現位
置集合の組を示す説明図である。FIG. 4 is an explanatory diagram showing a set of an index word and an appearance position set in the full-text search method of the present invention.

【図５】本発明の全文検索方法における検索処理の全体
の流れを示すフローチャートである。FIG. 5 is a flowchart showing an overall flow of a search process in the full-text search method of the present invention.

【図６】本発明の全文検索方法におけるクエリの検索条
件を示す説明図である。FIG. 6 is an explanatory diagram showing search conditions of a query in the full-text search method of the present invention.

【図７】本発明の全文検索方法におけるマージしながら
検索処理する場合のフローチャートである。FIG. 7 is a flowchart in the case of performing search processing while merging in the full-text search method of the present invention.

[Explanation of symbols]

１索引ファイル２本文データ３検索式処理部４検索文字列抽出部５検索処理部 1: Index file; 2: Body data; 3: Search formula processing unit; 4: Search character string extraction unit; 5: Search processing unit

Claims

1. A full-text search method in which index words are set for character strings appearing in a document, the index words comprising extracted words, each representing the original character string extracted according to a predetermined index word extraction rule, and expanded words, each representing a partial character string obtained by shifting the first character of an extracted word one character at a time, and in which a search for an arbitrary search character string in the document is performed using the index words, wherein, when the search is performed, the search character string is divided according to the index word extraction rule; among the resulting character strings, both the extracted words and the expanded words of the index words are searched for the first character string, and only the extracted words are searched for the second and subsequent character strings; the search results are merged; and, for the last character string, prefix matching against the index words is performed.

2. The full-text search method according to claim 1, wherein, if dividing the search character string according to the predetermined index word extraction rule yields only one character string, prefix matching against the index words is performed and both the extracted words and the expanded words of the index words are searched.

3. A full-text search system comprising: an index file that stores index words for character strings appearing in a document, the index words comprising extracted words, each representing the original character string extracted according to a predetermined index word extraction rule, and expanded words, each representing a partial character string obtained by shifting the first character of an extracted word one character at a time; a search character string extraction unit that, when an arbitrary search character string is input as a search target in the document, divides the search character string according to the index word extraction rule; and a search processing unit that, for the character strings divided by the search character string extraction unit, searches both the extracted words and the expanded words of the index words in the index file for the first character string, searches only the extracted words for the second and subsequent character strings, merges these search results, and performs prefix matching against the index words for the last character string.

4. The full-text search system according to claim 3, wherein, when the search character string extraction unit produces only one character string, the search processing unit performs prefix matching against the index words and searches both the extracted words and the expanded words of the index words.