JP3099683B2 - Information retrieval device

Info

Publication number: JP3099683B2
Application number: JP07168457A
Authority: JP
Inventors: 智子田邊; 忠一菊池
Original assignee: Panasonic Corp; Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Corp; Panasonic Holdings Corp
Priority date: 1995-06-09
Filing date: 1995-07-04
Publication date: 2000-10-16
Anticipated expiration: 2015-07-04
Also published as: JPH0954777A

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

【Field of Industrial Application】 The present invention relates to an information retrieval device used to search a large volume of documents with an electronic computer.

[0002]

【Prior Art】 In recent years, as documents of many kinds have been digitized, demand for searching large volumes of documents has grown. To meet this demand, many conventional retrieval devices extract keywords from each document, attach those keywords to the document at registration time, and then run searches against the keywords.

[0003] Such a conventional retrieval device is described below with reference to FIG. 24. In FIG. 24, 2401 is a search target data storage unit, 2402 is a morphological analysis processing means, 2403 is a morphological analysis dictionary, 2404 is word data, 2405 is a keyword extraction means, 2406 is a keyword extraction dictionary, 2407 is keyword data, 2408 is a search processing means, 2409 is an input means, and 2410 is an output means.

[0004] The operation of the retrieval device configured as above is as follows. First, when a registration start signal is entered from the input means, the morphological analysis processing means 2402 performs morphological analysis on one piece of search target data stored in the search target data storage unit 2401, referring to the morphological analysis dictionary 2403, and creates word data 2404.

[0005] When the morphological analysis is finished, the keyword extraction means 2405 creates keyword data 2407 using the keyword extraction dictionary 2406. After this series of operations has been carried out on all the search target data stored in the search target data storage unit 2401, and a search condition is entered from the input means 2409, the search processing means 2408 performs a keyword search on the keyword data 2407 and, based on the matching result, outputs the search target data stored in the search target data storage unit 2401 to the output means 2410.

[0006]

【Problems to Be Solved by the Invention】 With the conventional configuration described above, however, a search returns no hit if the entered search condition is not present in the keyword data. Because the keywords are extracted by morphological analysis, whether a document can be found depends on whether the morphological analysis was performed correctly. In other words, incorrect morphological analysis causes keywords to be missed, which in turn causes search omissions.

[0007] Full-text search is another way to prevent such search omissions, but because it also scans character strings that are irrelevant to the search, search speed degrades as the volume of data grows.

[0008] Some methods use an index file to raise search speed, but building the index for a large volume of data takes a very long time.

[0009] The present invention solves these problems of the prior art. Its object is to provide an information retrieval device that uses full-text search, which limits search omissions, while reducing the size of the search target data without losing the information that a large volume of data carries, so that searches run fast and the data storage unit uses fewer resources.

[0010]

【Means for Solving the Problems】 To achieve this object, the information retrieval device of the present invention comprises: a search target data storage unit that stores search target data; an unnecessary word dictionary that stores words not to be searched; an unnecessary word deletion means that uses the unnecessary word dictionary to delete, from the search target data stored in the search target data storage unit, words that are not search targets; a duplicate character string deletion means that likewise deletes duplicated portions of character strings in the search target data; a compressed data storage unit that stores compressed data created from the search target data by the unnecessary word deletion means and the duplicate deletion processing; an input means for entering a search condition; a search processing means that performs a full-text search on the compressed data according to the search condition; an output means that outputs the search result of the search processing means; and an unnecessary word storage table that stores the unnecessary words that the unnecessary word deletion means deleted from the search target data using the unnecessary word dictionary. The device is characterized in that the unnecessary word storage table is made a search target of the search processing means.

[0011]

【Operation】 In the present invention, data that is not a search target is removed by an unnecessary word deletion means that performs no morphological analysis and deletes only the unnecessary words registered in a dictionary, and a duplicate character string deletion means then removes duplicated character strings, reducing the data size further. A search processing means that performs a full-text search on the data compressed in this way prevents search omissions and searches at high speed, and the resource savings in the data storage unit make the device easier to build.

[0012]

【Example】

(Embodiment 1) A first embodiment of the present invention is described below with reference to the drawings. FIG. 1 is a block diagram of an information retrieval device according to one embodiment of the present invention. In FIG. 1, 101 is a search target data storage unit, 102 is an unnecessary word deletion means, 103 is an unnecessary word dictionary, 104 is a duplicate character string deletion means, 105 is a compressed data storage unit, 106 is a search processing means, 107 is an input means, and 108 is an output means.

[0013] First, the search conditions, search target data, unnecessary word dictionary, and compressed data of this embodiment are described.

[0014] A search condition is expressed as match strings combined with logical symbols that denote logical relations such as sum (OR), product (AND), and negation (NOT), and is entered through the input means 107.

[0015] The search target data in this embodiment are documents, stored in the search target data storage unit 101. One piece of document data consists of text, images, and the like that represent the content of the document. Each piece of search target data has a header for data identification, such as a search target data number, and a single file holds multiple pieces of search target data separated by these headers.

[0016] The unnecessary word dictionary 103 stores unnecessary words. An unnecessary word here is a word that, within a particular document, carries no meaning by itself and is not a search target, for example a symbol, a tag, a suffix, a conjunction, or a noun peculiar to that document. Unnecessary words are also chosen to be words that rarely need updating.

[0017] Compressed data is obtained from the search target data in three steps: the words that are not search targets are removed as unnecessary words, as described above; duplicates are removed from the character strings left between unnecessary words, that is, from the necessary words that are search targets; and the result is registered as a sequence separated by delimiter characters. A one-byte code that is never a search target, or the like, is used as the delimiter.

[0018] One piece of compressed data is created from one piece of search target data and carries the same header as the corresponding search target data. A single file holds multiple pieces of compressed data separated by these headers. FIG. 2 shows an example of the search target data, the unnecessary word dictionary 103, and the compressed data described above.

[0019] The operation of the information retrieval device configured as above is now described. FIG. 3 shows the overall flow, which divides broadly into data registration processing and search processing. Data registration processing is in turn divided into unnecessary word deletion processing and duplicate character string deletion processing. Data registration is performed one search target file at a time and continues until all search target data have been processed.

[0020] Data registration processing is described first. In data registration, the unnecessary word deletion means 102 and the duplicate character string deletion means 104 create in advance, from the search target data, the compressed data that is actually searched.

[0021] First, when a data registration start command is issued from the input means 107, the unnecessary word deletion means 102 starts unnecessary word deletion processing on one piece of search target data in the file stored in the search target data storage unit 101. FIG. 4 shows the flow of unnecessary word deletion processing.

[0022] As shown in FIG. 5, the unnecessary word deletion means 102 consists of an unnecessary word calculation unit 102a, a necessary word calculation unit 102b, and a necessary word storage table 102c.

[0023] In unnecessary word deletion processing, the unnecessary word position extraction unit 102a refers to the unnecessary word dictionary 103, compares the unnecessary words stored there against the search target data, and extracts the position and length of each unnecessary word in the search target data. Unnecessary words are extracted by the longest match method.

[0024] FIG. 6 shows the flow of the longest match method. First, the search target data is matched against the unnecessary word dictionary 103 one character at a time from the beginning (step 1), advancing one character at a time (step 3) until a character that prefix-matches a dictionary entry is found (step 2). When a prefix-matching character is found (step 2), it is joined with the next character (step 4) and checked again for a prefix match (step 5). If there is no prefix match, the preceding character string is the unnecessary word, and its start position and length are obtained (step 6).

[0025] If there is a prefix match (step 5), the above operation is repeated until the prefix match fails (step 7). If a string that prefix-matches also matches an entry exactly, that string is a word registered in the unnecessary word dictionary 103; but if the prefix match still holds after the next character is joined, a longer character string containing that word is also registered in the unnecessary word dictionary 103. FIG. 7 shows a concrete example.
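The longest match procedure of paragraphs [0024] and [0025] can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `find_unnecessary_words`, the use of a plain set for the dictionary, and the naive prefix test are all hypothetical and are not taken from the patent, which does not specify a data structure. The sketch records the longest exact dictionary entry seen while the prefix match still holds.

```python
def find_unnecessary_words(text, dictionary):
    """Return (start, length) pairs of dictionary entries found by longest match.

    `dictionary` is an assumed in-memory set of unnecessary words.
    """
    spans = []
    i = 0
    while i < len(text):
        best = 0  # length of the longest exact entry starting at position i
        j = i + 1
        # Keep joining the next character while the candidate is still
        # a prefix of at least one dictionary entry (steps 2, 4, 5, 7).
        while j <= len(text) and any(w.startswith(text[i:j]) for w in dictionary):
            if text[i:j] in dictionary:
                best = j - i  # exact match; a longer one may still follow
            j += 1
        if best:
            spans.append((i, best))  # step 6: start position and length
            i += best
        else:
            i += 1  # step 3: shift by one character
    return spans
```

With the hypothetical dictionary `{"of", "of the", " "}` and the text `"transmission of the image"`, the entry `"of the"` wins over the shorter `"of"` because the prefix match continues past it. A production implementation would more likely use a trie than a linear scan over the dictionary.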

[0026] Next, the necessary word calculation unit 102b calculates, from the positions and lengths of the unnecessary words, the character strings other than unnecessary words, that is, the necessary words that are search targets, and stores them in the necessary word storage table 102c. FIG. 8 shows an example. This completes unnecessary word deletion processing.
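As a rough illustration of this step, the necessary words are simply the gaps between the unnecessary word spans. The function name `necessary_words` and the list-of-tuples representation of the spans are assumptions made for this sketch, not details given in the patent.

```python
def necessary_words(text, spans):
    """Return the substrings of `text` not covered by any (start, length) span."""
    words = []
    pos = 0  # end of the last unnecessary word handled so far
    for start, length in sorted(spans):
        if start > pos:
            words.append(text[pos:start])  # gap before this unnecessary word
        pos = max(pos, start + length)
    if pos < len(text):
        words.append(text[pos:])  # trailing necessary word, if any
    return words
```

For example, with the text `"transmission of the image"` and the spans `[(12, 1), (13, 6), (19, 1)]`, the function returns `["transmission", "image"]`.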

[0027] Next, the flow of duplicate character string deletion processing is described with reference to FIG. 9. In this processing, the duplicate character string deletion means 104 refers to the necessary word position storage table 102c, checks the necessary words for duplicates, and extracts the necessary words that are not duplicated. Partial matches between necessary words count as duplicates.

[0028] FIG. 10 shows an example. When character strings A and B exist and character string B is contained in character string A (written A ⊇ B), character string B is treated as a duplicate character string. For example, comparing 「テレビ」 ("TV") with 「テレビジョン」 ("television"), 「テレビ」 is contained in 「テレビジョン」 and is therefore a duplicate character string.
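A minimal Python sketch of this containment rule follows. The function name `remove_duplicates` and the longest-first ordering are assumptions of the sketch; the patent states only that a string contained in another is treated as a duplicate. Note that the sketch does not preserve the original order of the necessary words.

```python
def remove_duplicates(words):
    """Drop exact duplicates and any word contained in another kept word."""
    kept = []
    # Visit longer words first so every shorter word is checked
    # against all candidates that could contain it.
    for word in sorted(set(words), key=lambda w: (-len(w), w)):
        if not any(word in longer for longer in kept):
            kept.append(word)
    return kept
```

Because a substring search for the dropped word still succeeds inside the longer word that contains it, dropping it does not cause a search omission.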

[0029] The header of the search target data being processed is stored in the compressed data storage unit 105. Then compressed data, in which the non-duplicated necessary words extracted by the above processing are arranged separated by the delimiter described earlier, is created and stored. When duplicate checking and storage have finished for all necessary words of one piece of search target data, duplicate deletion processing ends. When duplicate character string deletion processing has finished for all search target data, data registration processing ends.
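One possible on-disk layout for such a file is sketched below. The specific control characters (`\x1f` between words, `\x1e` between records) and the function names are assumptions chosen for illustration; the patent says only that a one-byte code that is never a search target is used as the delimiter and that headers separate the records.

```python
WORD_DELIM = "\x1f"    # assumed delimiter between a header and its necessary words
RECORD_DELIM = "\x1e"  # assumed delimiter between compressed records

def serialize(records):
    """records: list of (header, [necessary words]) -> one string for the file."""
    return RECORD_DELIM.join(
        WORD_DELIM.join([header, *words]) for header, words in records
    )

def parse(blob):
    """Inverse of serialize: recover (header, words) pairs from the file string."""
    if not blob:
        return []
    out = []
    for chunk in blob.split(RECORD_DELIM):
        header, *words = chunk.split(WORD_DELIM)
        out.append((header, words))
    return out
```

Keeping the header as the first field of each record lets the search step map a matching record back to the original search target data.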

[0030] Search processing is described next. FIG. 11 shows the flow. When the search condition is entered from the input means 107, the search processing means 106 searches all the compressed data created by the above operation. The search method is full-text search, which uses the document itself as the search target. Specifically, a full-text search is performed on each character string between one delimiter of the compressed data and the next. Either a direct scan of the data or an index search over a prepared index file can be used. For an index search, index creation is carried out as part of the data registration processing, after the compressed data has been created.

[0031] After searching, the search processing means 106 takes the header information of each matching piece of data, uses it to retrieve the corresponding search target data from the search target file stored in the search target data storage unit 101, and sends the data together with the header information to the output means 108 for output.
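The direct-scan variant of this search can be sketched as below. Everything here is hypothetical scaffolding: the dictionaries standing in for the compressed data storage unit and the search target data storage unit, the `\x1f` delimiter, and the `all_of` / `any_of` / `none_of` parameters, which are one simple way to represent the product, sum, and negation of paragraph [0014].

```python
WORD_DELIM = "\x1f"  # assumed delimiter between necessary words

def matches(term, body):
    """Substring match against each delimited string of one compressed record."""
    return any(term in word for word in body.split(WORD_DELIM))

def search(compressed, originals, all_of=(), any_of=(), none_of=()):
    """Return (header, original data) for every record satisfying the condition.

    compressed: {header: compressed body}; originals: {header: original data}.
    """
    hits = []
    for header, body in compressed.items():
        ok = (
            all(matches(t, body) for t in all_of)                        # AND
            and (not any_of or any(matches(t, body) for t in any_of))    # OR
            and not any(matches(t, body) for t in none_of)               # NOT
        )
        if ok:
            hits.append((header, originals[header]))  # look up by header
    return hits
```

The header acts as the join key between the two stores, so the compressed data never needs to be expanded to produce the output.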

[0032] As a concrete example using FIG. 2, suppose the search condition is 「イメージデータ」 ("image data") and the search processing means 106 searches the compressed data. Because 「イメージデータ」 occurs in the character string 「イメージデータの送信」 ("transmission of image data") of document number 2, the search processing means 106 uses document number 2 to retrieve the search target data of document number 2 from the search target data storage unit 101, and sends the header information (document number 2) and the search target data to the output means 108 for output.

[0033] As described above, in this embodiment the unnecessary word deletion means 102 removes character strings that are not search targets, and the duplicate deletion means 104 then deletes character strings that the search does not need more than once. The data is thus compressed to a smaller size while the information content of the document is preserved, so searches have fewer omissions and run faster. In addition, because the search target data is smaller, a retrieval system that uses an index file can have a smaller index file.

[0034] Because the full-text search is performed on the compressed data, the search omissions of conventional keyword search are prevented, and search conditions can be specified more freely than with conventional exact-match keyword search.

[0035] In the above, multiple pieces of search target data are stored in one file, but there may be multiple files. It is also possible to use the header of a piece of search target data as a file name and store each piece of search target data in its own file. In that case, after searching the compressed data, the search processing means 106 takes the file name of the matching data, locates the corresponding search target data file by that name, retrieves the search target data stored in it, and sends it together with the file name to the output means 108 for output.

[0036] The search target data storage unit 101 can also be provided on an optical disk, which reduces the storage space needed for the search target data.

[0037] (Embodiment 2) A second embodiment of the present invention is described below with reference to the drawings. FIG. 12 is a block diagram of an information retrieval device according to one embodiment of the present invention. In FIG. 12, 101 is a search target data storage unit, 102 is an unnecessary word deletion means, 103 is an unnecessary word dictionary, 104 is a duplicate character string deletion means, 105 is a compressed data storage unit, 106 is a search processing means, 107 is an input means, 108 is an output means, and 1201 is a compression range storage unit in which the range of data to be compressed by the unnecessary word deletion means 102 and the duplicate character string deletion means 104 is registered in advance.

[0038] The compression range storage unit 1201 of the information retrieval device configured as above is described. A compression range entered from the input means 107 is registered in the compression range storage unit 1201 before data registration processing starts. The compression range is specified either as a pair of a compression start position and a compression end position, each expressed as an offset from the start of the search target data, or as tags marking the compression start and end positions. FIG. 13 shows an example.
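The two ways of specifying a compression range can be reduced to a single pair of offsets before compression starts. The sketch below assumes a dictionary-shaped range entry with hypothetical keys (`start_offset`, `end_offset`, `start_tag`, `end_tag`) and assumes that, for the tag form, the range covers the text between the two tags; the patent does not define the storage format of the compression range storage unit 1201 or whether the tags themselves are included.

```python
def resolve_range(text, spec):
    """Turn a compression range entry into (start, end) offsets into `text`.

    spec is either {"start_offset": int, "end_offset": int}
    or {"start_tag": str, "end_tag": str} (hypothetical formats).
    """
    if "start_offset" in spec:
        return spec["start_offset"], spec["end_offset"]
    # Tag form: the range begins just after the start tag
    # and ends just before the next end tag.
    start = text.index(spec["start_tag"]) + len(spec["start_tag"])
    end = text.index(spec["end_tag"], start)
    return start, end
```

`str.index` raises `ValueError` when a tag is missing, which makes a misconfigured range visible at registration time rather than producing a silently wrong range.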

[0039] The processing flow is described next. When a command to start data compression is entered from the input means 107, the unnecessary word deletion means 102 refers to the compression range storage unit 1201 and obtains the compression start and end positions in the search target data. It then performs unnecessary word deletion processing on the range of the search target data corresponding to those positions, after which the duplicate character string deletion means 104 performs duplicate character string deletion processing. The structure of the search target data, the unnecessary word deletion processing, and the duplicate character string deletion processing are the same as in the first embodiment. The search target data that is not compressed, that is, the uncompressed data, is then stored in the compressed data as is, as shown in FIG. 13, separated from the compressed data by the delimiter used in duplicate deletion processing. From then on, search processing proceeds as in the first embodiment.

[0040] When a compression range can be specified for the search target data as in this embodiment, only the desired parts are subject to compression. For example, in a document whose text has both a part that lists words and a part written as sentences, only the sentence part can be compressed. This allows a wider variety of search target data to be handled.

[0041] (Embodiment 3) A third embodiment of the present invention is described below with reference to the drawings. FIG. 14 is a block diagram of an information retrieval device according to one embodiment of the present invention. In FIG. 14, 101 is a search target data storage unit, 102 is an unnecessary word deletion means, 103 is an unnecessary word dictionary, 104 is a duplicate character string deletion means, 105 is a compressed data storage unit, 107 is an input means, 108 is an output means, 1401 is a multiple compression range storage unit in which multiple ranges of data to be compressed by the unnecessary word deletion means 102 and the duplicate character string deletion means 104 can be specified and registered, and 1402 is a search processing means that can search each compressed or uncompressed range separately.

[0042] The multiple compression range storage unit 1401 of the information retrieval device configured as above is described. Compression ranges entered from the input means 107 are registered in the multiple compression range storage unit 1401 before data registration processing starts. The method of specifying a compression range is the same as in the second embodiment; the difference from the second embodiment is that multiple compression ranges can be specified. FIG. 15 shows an example of the multiple compression range storage unit 1401.

[0043] The overall flow is described next. When a command to start data compression is entered from the input means 107, the unnecessary word deletion means 102 refers to the multiple compression range storage unit 1401 and obtains one pair of compression start and end positions in the search target data. It then performs unnecessary word deletion processing on the range of the search target data corresponding to those positions, after which the duplicate character string deletion means 104 performs duplicate character string deletion processing. The structure of the search target data, the unnecessary word deletion processing, and the duplicate character string deletion processing are the same as in the first embodiment. When duplicate character string deletion processing finishes for one compression range, a delimiter that separates compression ranges, different from the delimiter used to separate necessary words, is stored. Like the necessary-word delimiter, the compression range delimiter is a one-byte code or the like that is never a search target.

[0044] This completes data registration processing for one compression range. Next, the search target data that is not compressed, that is, the uncompressed data, is stored as is, separated by the compression range delimiter. Processing repeats until the above data registration has finished for all specified compression ranges and all uncompressed data has been stored. FIG. 15 shows an example of compressed data created in this way.
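The per-range registration described in paragraphs [0043] and [0044] might look like the following sketch. The two delimiter characters, the function names, and the `compress` callback are assumptions. In particular, `toy_compress` is a deliberately simplified stand-in (whitespace splitting with a fixed stop list) for the unnecessary word deletion and duplicate character string deletion of the first embodiment; it is not the patent's method.

```python
WORD_DELIM = "\x1f"   # assumed delimiter between necessary words
RANGE_DELIM = "\x1e"  # assumed, distinct delimiter between ranges

def compress_by_ranges(text, ranges, compress):
    """Compress only the given (start, end) ranges; keep the rest as is.

    ranges must be sorted and non-overlapping; compress maps a
    text segment to a list of necessary words.
    """
    blocks = []
    pos = 0
    for start, end in ranges:
        if start > pos:
            blocks.append(text[pos:start])  # uncompressed data, stored unchanged
        blocks.append(WORD_DELIM.join(compress(text[start:end])))
        pos = end
    if pos < len(text):
        blocks.append(text[pos:])  # trailing uncompressed data
    return RANGE_DELIM.join(blocks)

def toy_compress(segment):
    """Simplified stand-in: drop a few stop words and exact repeats."""
    stop = {"of", "the", "a"}
    words = []
    for w in segment.split():
        if w not in stop and w not in words:
            words.append(w)
    return words
```

Using a range delimiter that differs from the word delimiter is what later allows the search step to tell range boundaries apart from word boundaries.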

[0045] When data registration processing ends, the search processing means 1402 starts search processing, following the same flow as in the first embodiment. The search method is the same as in the first embodiment, except that each search is confined to a range delimited by the compression range delimiters written during data registration. In other words, the compression range delimiter serves as the search range delimiter during search processing. As soon as the search condition is satisfied within one such range, string matching for that piece of search target data ends. The subsequent operation is the same as in the first embodiment: the header information of the matching data is extracted and the search result is sent to the output means 108. This is repeated for all search target data.

[0046] By setting search ranges in this way and searching each piece of data separately as a block, as if it were tagged, the logical expression used as the search condition can be put to effective use. For example, a search condition can be made to apply only to strings that lie close to one another.

[0047] An example is described with reference to FIG. 16. Suppose the user wants to find out about "a search device that can handle image data online", and the search condition is "online & image data & search device". Suppose also that the search target data is document number 1 shown in the figure, a single piece of data consisting of "Claim 1", "Claim 2", and "Description of the drawings". If no search range is set, document number 1 is hit as satisfying the search expression. If ranges are set, document number 1 is not hit, because no single range satisfies the whole expression, and the intended result is obtained.
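The difference between an unscoped search and a range-scoped search can be sketched as follows. This is a minimal illustration under stated assumptions, not the apparatus itself: the delimiter character `|`, the function names, and the sample text standing in for document number 1 are all hypothetical, and plain substring matching stands in for the collation process of the first embodiment.

```python
RANGE_DELIMITER = "|"  # assumed delimiter; the patent does not name a character


def matches_all(text, terms):
    """Return True if every term occurs in text (logical AND of substrings)."""
    return all(term in text for term in terms)


def search_whole(document, terms):
    """AND search over the whole document, ignoring the range delimiters."""
    return matches_all(document, terms)


def search_by_range(document, terms):
    """AND search that must be satisfied within one delimited range.

    any() stops at the first range that satisfies the condition, mirroring
    the early termination of collation described in the text.
    """
    return any(matches_all(block, terms)
               for block in document.split(RANGE_DELIMITER))


# Hypothetical stand-in for document number 1: claim 1, claim 2, drawings.
doc1 = RANGE_DELIMITER.join([
    "a terminal connected online to a host",
    "a storage unit holding image data",
    "block diagram of the search device",
])
terms = ["online", "image data", "search device"]

print(search_whole(doc1, terms))     # True: the terms are spread over three blocks
print(search_by_range(doc1, terms))  # False: no single block contains all three
```

With ranges in place, the product (AND) operator behaves like a proximity condition: terms only count as co-occurring when they fall in the same block.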

[0048] Of course, to search the whole of each piece of data, the compression range can simply be set to the whole, so the method can be used with a wide variety of search target data.

[0049] In the above, the unnecessary word deletion process and the duplicate character string deletion process were both performed for each compression range. Alternatively, unnecessary words can first be deleted from the entire specified compression range, and the duplicate character string deletion process can then be performed for each specified compression range.

[0050] (Embodiment 4) A fourth embodiment of the present invention is described below with reference to the drawings. FIG. 17 is a configuration diagram of an information retrieval apparatus according to this embodiment. In FIG. 17, 101 is a search target data storage unit, 102 is unnecessary word deletion means, 103 is an unnecessary word dictionary, 104 is duplicate character string deletion means, 105 is a compressed data storage unit, 1401 is a compression range multiple storage unit, 1701 is search range designation means for designating the range to be searched within the search target data, 1702 is search processing means having a function of extracting a search range number from the search condition and sending it to the search range designation means 1701, 107 is input means, and 108 is output means.

[0051] First, the search target data, compressed data, and search condition of this embodiment are described with reference to FIG. 18. The structure of the search target data is the same as in the third embodiment. Every piece of search target data has the same block-by-block structure. In FIG. 18, for example, the data carries a repeating sequence of tags (purpose, description of the figure, then purpose, description of the figure again), and one purpose together with one description forms one unit (one piece of search target data).

[0052] The compressed data is also the same as in the third embodiment: it is divided by delimiters that separate the compressed and uncompressed ranges and, like the search target data, every piece of compressed data has the same structure.

[0053] The search condition consists of collation character strings and logical symbols expressing logical relationships such as sum, product, and negation, accompanied by a search range number that designates the range to be searched. The search range number is the ordinal position of the data range counted from the beginning of one piece of search target data. In FIG. 18, one piece of search target data is divided into two search ranges by one search range delimiter.

[0054] The flow of the information retrieval apparatus having the search condition, search target data, search range designation means 1701, and search processing means 1702 described above is now explained. The operation as a whole is divided into a data registration process and a search process, and the data registration process is the same as in the third embodiment.

[0055] Next, the search process performed by the search processing means 1702 is described. When the search condition is input from the input means 107, the search processing means 1702 obtains from the search condition the search range number to be searched and sends it to the search range designation means 1701.

[0056] The search range designation means 1701 then scans one piece of compressed data from the beginning, counts the delimiters of the compressed and uncompressed data ranges, and finds the start position of the data range corresponding to the search range number sent from the search processing means 1702. It then notifies the search processing means 1702 of the start position it found. The search processing means 1702 performs the search process from the designated start position to the end of that range. The search processing method is the same as in the first embodiment. This search process is performed for all compressed data.
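The delimiter-counting step can be sketched as follows. This is only an illustrative sketch: the delimiter character `|`, the 1-based numbering of ranges, and the function names `find_range` and `search_in_range` are assumptions, and substring matching again stands in for the search method of the first embodiment.

```python
RANGE_DELIMITER = "|"  # assumed delimiter; the patent does not name a character


def find_range(compressed, range_number):
    """Return (start, end) of the range_number-th range, counting from 1.

    The data is scanned from its head and delimiters are counted until the
    requested range is reached, as the search range designation means does.
    """
    start = 0
    seen = 1  # number of the range that begins at `start`
    for pos, ch in enumerate(compressed):
        if seen == range_number:
            break
        if ch == RANGE_DELIMITER:
            seen += 1
            start = pos + 1
    if seen != range_number:
        raise ValueError(f"no range number {range_number} in this data")
    end = compressed.find(RANGE_DELIMITER, start)
    return start, (len(compressed) if end == -1 else end)


def search_in_range(compressed, range_number, term):
    """Search for term only between the start and end of the given range."""
    start, end = find_range(compressed, range_number)
    return term in compressed[start:end]


# Hypothetical compressed data: range 1 is the purpose, range 2 the figure text.
data = "reduce retrieval time|block diagram of retrieval device"
print(search_in_range(data, 1, "block diagram"))  # False
print(search_in_range(data, 2, "block diagram"))  # True
```

Because every piece of compressed data shares the same structure, the same range number selects the same item (for example, the purpose) in every document.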

[0057] According to this embodiment, search target data sharing the same structure can be searched range by range, or more concretely item by item, making it possible to obtain the desired data.

[0058] (Embodiment 5) A fifth embodiment of the present invention is described below with reference to the drawings. FIG. 19 is a configuration diagram of an information retrieval apparatus according to this embodiment. In FIG. 19, 101 is a search target data storage unit, 103 is an unnecessary word dictionary, 104 is duplicate character string deletion means, 105 is a compressed data storage unit, 107 is input means, 108 is output means, 1901 is unnecessary word deletion means having a function of storing each extracted unnecessary word, together with the header information of the search target data from which it was extracted, in the unnecessary word storage table described below, 1902 is the unnecessary word storage table, which holds the unnecessary words extracted by the unnecessary word deletion means 1901 and the header information of the search target data from which they were extracted, and 1903 is search processing means having a function of searching the unnecessary word storage table 1902.

[0059] The structure of the unnecessary word storage table 1902 of this embodiment is described with reference to FIG. 20. As shown in the figure, the unnecessary word storage table 1902 stores a list of pairs, each consisting of an unnecessary word registered in the unnecessary word dictionary 103 and the header information of the search target data from which that word was deleted.

[0060] The flow of the information retrieval apparatus having the unnecessary word storage table 1902, unnecessary word deletion means 1901, and search processing means 1903 is now described. The operation as a whole is divided into a data registration process and a search process. As in the third embodiment, the data registration process consists of an unnecessary word deletion process and a duplicate character string deletion process. However, when performing the unnecessary word deletion process, the unnecessary word deletion means 1901 stores each extracted unnecessary word and the header information of the search target data from which it was extracted in the unnecessary word storage table 1902.

[0061] For example, the character string "[Purpose]" stored in the unnecessary word dictionary in FIG. 20 is an unnecessary word subject to deletion from the search target data, so the unnecessary word deletion means 1901 extracts it from document number 1 and document number 2 of the search target data. The unnecessary word deletion means 1901 then stores the character string "[Purpose]" paired with document number 1 and document number 2 in the unnecessary word storage table 1902. Pairs of unnecessary words and header information are subsequently stored in the unnecessary word storage table 1902 in the same way. Once this operation has been performed for all search target data, the data registration process ends.
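The registration step with deletion logging can be sketched as follows. This is a simplified sketch under stated assumptions: the function name `register`, the use of a dictionary keyed by document number as header information, and the data shapes are hypothetical, the duplicate character string deletion process is omitted, and trying longer words first is only a rough stand-in for the longest-match extraction of the first embodiment.

```python
from collections import defaultdict


def register(documents, unnecessary_words):
    """Delete unnecessary words from each document and record every deletion.

    documents: mapping of header (here a document number) to text.
    Returns (compressed, table): compressed maps header to the remaining
    text, and table maps each deleted word to the set of headers it was
    deleted from (the pairs held in the unnecessary word storage table).
    """
    compressed = {}
    table = defaultdict(set)
    # Longer words first, as a simple approximation of longest matching.
    ordered = sorted(unnecessary_words, key=len, reverse=True)
    for header, text in documents.items():
        for word in ordered:
            if word in text:
                table[word].add(header)        # keep the deletion record
                text = text.replace(word, "")  # remove the word itself
        compressed[header] = text
    return compressed, dict(table)


# Hypothetical data modelled on FIG. 20: "[Purpose]" occurs in documents 1 and 2.
docs = {1: "[Purpose] fast retrieval", 2: "[Purpose] small storage", 3: "block diagram"}
compressed, table = register(docs, ["[Purpose]"])
print(table)  # {'[Purpose]': {1, 2}}
```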

[0062] Next, the flow of the search process is described with reference to FIG. 21. When a search condition is input from the input means 107, the search processing means 1903 first extracts the collation character string from it. The search processing means 1903 then refers to the unnecessary word storage table 1902 to check whether that collation character string is stored there. If it is stored, the string has been deleted from the search target data as an unnecessary word (that is, it is not stored in the compressed data). If it is not stored in the unnecessary word storage table 1902, it has not been deleted from the search target data.

[0063] If the collation character string is not stored in the unnecessary word storage table 1902, the search processing means 1903 performs the search process on all compressed data, as in the first embodiment.

[0064] If it is stored, the search processing means 1903 obtains from the table the header information of the search target data stored with it. Next, it obtains the search target files from that header information. It then sends the search target files thus obtained, together with their header information, to the output means 108 as search results. Next, the search process is performed on the remaining data, that is, on the compressed data whose header information differs from that of the search target data already sent to the output means 108. The search process here is the same as in the first embodiment.
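The two branches of this search flow can be sketched together as follows. This is a minimal sketch, assuming a single collation character string, a table that maps each deleted word to a set of document numbers, and substring matching in place of the search of the first embodiment; the function name `search` and the data shapes are hypothetical, and only document numbers are returned rather than the files themselves.

```python
def search(term, compressed, table):
    """Return the sorted headers of the documents that match term.

    compressed: header -> compressed text (unnecessary words already removed).
    table: unnecessary word -> set of headers it was deleted from.
    """
    # Documents from which the term was deleted are hits by deletion record.
    # If the term is not in the table, this set is empty and every document
    # is collated below, which corresponds to the not-stored branch.
    hits = set(table.get(term, ()))
    # The remaining documents are collated against their compressed text.
    for header, text in compressed.items():
        if header not in hits and term in text:
            hits.add(header)
    return sorted(hits)


# Hypothetical data: "[Purpose]" was deleted from documents 1 and 2.
compressed = {1: "fast retrieval", 2: "small storage", 3: "retrieval device"}
table = {"[Purpose]": {1, 2}}
print(search("[Purpose]", compressed, table))  # [1, 2]
print(search("retrieval", compressed, table))  # [1, 3]
```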

[0065] According to the present invention, even when a string has been deleted from the search target data as an unnecessary word, the unnecessary word deletion means 1901 keeps a deletion record in the unnecessary word storage table 1902, and the search processing means 1903 searches with reference to that record. The search target data can therefore be compressed while keeping search omissions few.

[0066] (Embodiment 6) A sixth embodiment of the present invention is described below with reference to the drawings. FIG. 22 is a configuration diagram of an information retrieval apparatus according to this embodiment. In FIG. 22, 101 is a search target data storage unit, 103 is an unnecessary word dictionary, 104 is duplicate character string deletion means, 105 is a compressed data storage unit, 107 is input means, 108 is output means, 1901 is unnecessary word deletion means, 1902 is an unnecessary word storage table, 1903 is search processing means, and 2201 is data reproduction means that deletes a designated unnecessary word from the unnecessary word dictionary 103, refers to the unnecessary word storage table 1902 to obtain the header information of the compressed data from which that word was deleted, and stores the word back into the corresponding compressed data.

[0067] The data reproduction means 2201 of this embodiment is described with reference to FIG. 23. When an unnecessary word whose storage in the unnecessary word dictionary 103 is to be discontinued is designated through the input means 107, the data reproduction means 2201 first deletes the designated word from the unnecessary word dictionary 103. In FIG. 23, for example, "こと" (koto) is subject to deletion from the search target data (FIG. 23(a)).

[0068] Next, the unnecessary word storage table 1902 is consulted to obtain the header information of the compressed data from which the word has already been deleted (FIG. 23(b)). The corresponding compressed data is then located from that header information, and the unnecessary word is appended to its end (FIG. 23(c)). Once the designated word has been appended to all the corresponding compressed data, the data reproduction means 2201 deletes from the unnecessary word storage table 1902 the entry for that word and the header information of the search target data from which it was deleted (FIG. 23(b)).
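The reproduction steps of FIG. 23 can be sketched as follows. This is an illustrative sketch only: the function name `restore_unnecessary_word` and the in-memory data shapes (a set for the dictionary, dictionaries for the table and the compressed data) are assumptions made for the example.

```python
def restore_unnecessary_word(word, dictionary, table, compressed):
    """Cancel an unnecessary word and re-attach it to the affected data.

    dictionary: set of unnecessary words (modified in place).
    table: unnecessary word -> set of headers it was deleted from.
    compressed: header -> compressed text (modified in place).
    """
    dictionary.discard(word)            # (a) remove the word from the dictionary
    for header in table.get(word, ()):  # (b) look up the affected data
        compressed[header] += word      # (c) append the word at the end
    table.pop(word, None)               # finally, drop the deletion record


# Hypothetical state before "koto" is cancelled as an unnecessary word.
dictionary = {"koto", "[Purpose]"}
table = {"koto": {1, 3}}
compressed = {1: "abc", 2: "def", 3: "ghi"}

restore_unnecessary_word("koto", dictionary, table, compressed)
print(compressed)  # {1: 'abckoto', 2: 'def', 3: 'ghikoto'}
```

Since the word is appended at the end rather than at its original position, this restores the fact that the word occurs in the data, not where it occurred.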

[0069] The above operation starts as soon as the instruction is input from the input means 107 while neither the data registration process nor the search process is running, and it takes effect from the next data registration process and search process.

[0070] According to this embodiment, providing the data reproduction means 2201 makes it possible to cancel a registration immediately even after a word has been registered in the unnecessary word dictionary 103. In addition, for compressed data from which the word has already been deleted, the data can be reproduced by appending the designated word. An easier-to-use information retrieval apparatus can thus be realized.

【００７１】[0071]

[Effects of the Invention] As described above, the information retrieval apparatus of the present invention is provided with unnecessary word deletion means that can reduce the size of the search target data without losing its information content, an unnecessary word storage table that stores the unnecessary words deleted from the search target data, and duplicate deletion means. This prevents search omissions and allows fast searching. Moreover, because the search can be performed with reference to the deletion record, searches with few omissions can be carried out.

【００７２】[0072]

【００７３】[0073]

【００７４】[0074]

【００７５】[0075]

【００７６】[0076]

【００７７】[0077]

[Brief description of the drawings]

FIG. 1 is a configuration diagram of the information retrieval apparatus according to the first embodiment of the present invention.

FIG. 2 is a conceptual diagram of data in the first embodiment of the present invention.

FIG. 3 is a flowchart showing the overall operation in the first embodiment of the present invention.

FIG. 4 is a flowchart showing the unnecessary word deletion process in the first embodiment of the present invention.

FIG. 5 is a configuration diagram of the unnecessary word deletion means in the first embodiment of the present invention.

FIG. 6 is a flowchart showing the longest-match method used for unnecessary word extraction in the first embodiment of the present invention.

FIG. 7 is a conceptual diagram showing the longest-match method in the first embodiment of the present invention.

FIG. 8 is a conceptual diagram of the necessary word calculation process and the necessary word storage table in the first embodiment of the present invention.

FIG. 9 is a flowchart showing the duplicate character string deletion process in the first embodiment of the present invention.

FIG. 10 is a conceptual diagram showing the duplicate character string deletion process in the first embodiment of the present invention.

FIG. 11 is a flowchart showing the search process in the first embodiment of the present invention.

FIG. 12 is a configuration diagram of the information retrieval apparatus according to the second embodiment of the present invention.

FIG. 13 is a conceptual diagram of data in the second embodiment of the present invention.

FIG. 14 is a configuration diagram of the information retrieval apparatus according to the third embodiment of the present invention.

FIG. 15 is a conceptual diagram of data in the third embodiment of the present invention.

FIG. 16 is a conceptual diagram showing search conditions and search results in the third embodiment of the present invention.

FIG. 17 is a configuration diagram of the information retrieval apparatus according to the fourth embodiment of the present invention.

FIG. 18 is a conceptual diagram of data in the fourth embodiment of the present invention.

FIG. 19 is a configuration diagram of the information retrieval apparatus according to the fifth embodiment of the present invention.

FIG. 20 is a conceptual diagram of data in the fifth embodiment of the present invention.

FIG. 21 is a flowchart showing the search process in the fifth embodiment of the present invention.

FIG. 22 is a configuration diagram of the information retrieval apparatus according to the sixth embodiment of the present invention.

FIG. 23 is a processing diagram of the data reproduction means in the sixth embodiment of the present invention.

FIG. 24 is a configuration diagram of a conventional information retrieval apparatus.

[Explanation of symbols]

101 Search target data storage unit
102 Unnecessary word deletion means
102a Unnecessary word position extraction means
102b Necessary word calculation means
102c Necessary word storage table
103 Unnecessary word dictionary
104 Duplicate character string deletion means
105 Compressed data storage unit
106 Search processing means
107 Input means
108 Output means
1201 Compression range storage unit
1401 Compression range multiple storage unit
1402 Search processing means
1701 Search range designation means
1702 Search processing means
1901 Unnecessary word deletion means
1902 Unnecessary word storage table
1903 Search processing means
2201 Data reproduction means
2401 Search target data storage unit
2402 Morphological analysis processing means
2403 Morphological analysis dictionary
2404 Word data
2405 Keyword extraction means
2406 Keyword extraction dictionary
2407 Keyword data
2408 Search processing means
2409 Input means
2410 Output means

Continuation of the front page (56) References: JP-A-3-174652 (JP, A); JP-A-63-228326 (JP, A); JP-A-7-121548 (JP, A); JP-A-6-301721 (JP, A); JP-A-6-348756 (JP, A); JP-A-5-334355 (JP, A); JP-A-2-287674 (JP, A); JP-A-5-67147 (JP, A); JP-A-64-31227 (JP, A) (58) Fields searched (Int. Cl.⁷, DB name): G06F 17/30; JICST file (JOIS)

Claims

(57) [Claims]

1. An information retrieval apparatus comprising: a search target data storage unit for storing search target data; an unnecessary word dictionary storing words that are not to be searched; unnecessary word deletion means for deleting, using the unnecessary word dictionary, the words that are not to be searched from the search target data stored in the search target data storage unit; duplicate character string deletion means for likewise deleting duplicated portions of character strings in the search target data; a compressed data storage unit for storing compressed data created from the search target data by the unnecessary word deletion means and the duplicate character string deletion means; input means for inputting a search condition; search processing means for performing a full-text search on the compressed data in accordance with the search condition; output means for outputting the search result of the search processing means; and an unnecessary word storage table for storing the unnecessary words that the unnecessary word deletion means has deleted from the search target data using the unnecessary word dictionary, wherein the unnecessary word storage table is a search target of the search processing means.

2. The information retrieval apparatus according to claim 1, further comprising data reproduction means which cancels the registration of an unnecessary word in the unnecessary word dictionary, collates it against the unnecessary word storage table storing the unnecessary words that the unnecessary word deletion means has deleted from the search target data using the unnecessary word dictionary, adds the unnecessary word whose registration has been cancelled to the compressed data, and cancels the registration of that unnecessary word in the unnecessary word storage table.