JP2669601B2 - Information retrieval method and system

Info

Publication number: JP2669601B2
Application number: JP6287642A
Authority: JP
Inventors: 理恵久保田
Original assignee: International Business Machines Corp
Current assignee: International Business Machines Corp
Priority date: 1994-11-22
Filing date: 1994-11-22
Publication date: 1997-10-29
Anticipated expiration: 2012-10-29
Also published as: CN1151558A; KR960018993A; JPH08147320A

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Field of Industrial Application] The present invention relates to a system and method for searching a large number of documents stored on a disk, for example in text file format, at high speed while tolerating a desired degree of ambiguity.

[0002]

[Prior Art] There has long been a demand for high-speed searching of large numbers of documents written in natural language and stored on disk, such as newspaper articles, patent gazettes, and scientific and technical literature, and various search methods have been proposed. Such search methods can be roughly classified as follows.

[0003] (a) Keyword search method
In this method, each document is indexed in advance with keywords that represent its contents. Keywords may be determined by automatic keyword extraction using morphological analysis or the like, by manual keyword assignment, or by a combination of the two. However, this method can search only for character strings that have been indexed as keywords. In addition, the accuracy of automatic keyword extraction by morphological analysis depends on the accuracy of the word and grammar dictionaries, so maintaining the dictionaries requires a great deal of human work.

[0004] (b) Full-text search method without an index
In this method, no index is used; all documents to be searched are scanned each time a search string is specified. Some systems use special hardware to increase search speed. However, a system using special hardware is costly, and in a client-server environment it may restrict the types of machines that can be used.

[0005] (c) Full-text search method using an index
The present invention belongs to the class of full-text search methods using an index. The following techniques are known as attempts to speed up full-text search by using an index.

[0006] Japanese Unexamined Patent Publication No. 4-205560 discloses dividing the character string to be searched into search units, which are the units in which searching is performed; assigning ascending codes to the search units; assigning to each search unit an attribute code indicating its logical division; assigning to each character of the character string to be searched a character position order code indicating its position within the search unit; creating character position information consisting of the search unit identification code, the character position order code, and the attribute code; and storing this character position information in an area for each character type to create a search file.

[0007] Japanese Unexamined Patent Publication No. 4-215181 discloses, in order to reduce the number of character string comparisons needed for search processing and to enable high-speed matching on a general-purpose information processing device, creating a search file in which character set position information, indicating where in the character string each character set constituting the string to be searched is located, is grouped by character set type.

[0008] It is often necessary to retrieve not only documents containing a character string that exactly matches the search string but also documents containing a character string that partially matches it. For example, the user's memory of the search string may be vague, or the search string may appear in various modified forms that are difficult to enumerate exhaustively.

[0009] A typical prior-art way of specifying a partial character string is to use regular expressions. These make it possible to specify zero or more repetitions of any character, one or more repetitions of any character, the end of a line, the beginning of a line, any character within a specific range of character codes, and so on.

[0010] Further, Japanese Unexamined Patent Publication No. 63-99830 discloses, in a system having a function for partial matching between search string data and searched string data, providing a table that stores data on the synonym relationships of the search string data and data indicating whether the search string data appears in any of the searched string data.

[0011] Japanese Unexamined Patent Publication No. 62-221027 discloses that, when a character string cut out from the beginning of a partial target string is not found in the dictionary, a forward search is performed for the next cut-out string, whose length is increased by one, thereby reducing the number of wasted lookups and enabling relatively fast and efficient word segmentation.

[0012] Japanese Unexamined Patent Publications No. 4-326164 and No. 5-174067 disclose a database search system that stores autocorrelation information for each item to be searched and that is provided with search means for determining, for each item, the degree of match between the autocorrelation information of the search key and the autocorrelation information of the item, and outputting the item numbers in descending order of the degree of match.

[0013] However, with these prior-art character string search techniques it is difficult to specify the degree of ambiguity of the character string to be searched, and the search results therefore include many character strings that the user does not want or finds unnatural.

[0014]

[Problems to be Solved by the Invention] An object of the present invention is to provide a character string search technique in which the degree of ambiguity of the character string to be searched can be specified arbitrarily.

[0015] Another object of the present invention is to provide an index structure for realizing a character string search technique in which the degree of ambiguity of the character string to be searched can be specified arbitrarily.

[0016] Still another object of the present invention is to provide a character string search technique whose fuzzy search retrieves results close to what human intuition would expect.

[0017]

[Means for Solving the Problems] According to the present invention, in order to perform a full-text search on a database consisting of a plurality of documents, each document is assigned a unique number (or symbol), and the search file stores each run of N consecutive characters in each document together with the number of the document in which those N characters appear and their position within that document. Preferably, the index file consists of two files: a character pattern file and a position information file. The character pattern file stores the character patterns and delimiter patterns, together with the location in the position information file of the corresponding document numbers and in-document position numbers. The position information file stores the document numbers and in-document position numbers.

[0018] According to the present invention, a method is proposed that uses such an index file to search at high speed for documents containing character strings whose character sequence resembles a specified string. In this method, the user specifies the string to be searched for and a search precision (greater than 0 and at most 1). The method can then identify the documents containing "similar strings", that is, strings whose "degree of similarity" to the search string is at least the specified search precision, as well as the positions of those similar strings within the documents.

[0019] Specifically, this is a process that selects from the documents strings that resemble the search string and quantifies the "degree of similarity" from two viewpoints: how many consecutive characters match, and how many extra characters are interposed.

[0020] When the "degree of similarity" takes its maximum value of 1, the strings match exactly; conversely, whenever the strings match exactly, the "degree of similarity" is 1. When the similar string contains extra characters that are not in the search string, or when only part of the search string appears in the similar string, the "degree of similarity" is less than 1. According to the present invention, the resulting value of the "degree of similarity" is a reasonable one that agrees fairly well with ordinary human intuition.

[0021] Because the index file makes it possible to look up any run of N consecutive characters in a document at high speed, comparing successive N-character runs of the search string against the index file makes it possible to detect at high speed how many consecutive characters match and how many extra characters are interposed.

[0022]

[Embodiments] Embodiments of the present invention are described below with reference to the drawings.

[0023] A. Hardware configuration
FIG. 1 shows an overview of a system configuration for implementing the present invention. This is an ordinary configuration in which the following are connected to a bus 101: a central processing unit (CPU) 102 having arithmetic and input/output control functions; a main memory (RAM) 104 into which programs are loaded and which provides a work area for the CPU 102; a keyboard 106 for keying in commands, character strings, and the like; a hard disk 108 storing an operating system for controlling the central processing unit, database files, a search engine, index files, and the like; a display device 110 for displaying database search results; and a mouse 112 for pointing to an arbitrary position on the screen of the display device 110 and conveying that position information to the central processing unit.

[0024] As the operating system, one that supports a GUI multi-window environment as standard is desirable, such as Windows (a trademark of Microsoft), OS/2 (a trademark of IBM), or the X-WINDOW system (a trademark of MIT) on AIX (a trademark of IBM). However, the present invention can also be realized in a character-based environment such as PC-DOS (a trademark of IBM) or MS-DOS (a registered trademark of Microsoft), and is not limited to any specific operating system environment.

[0025] Although FIG. 1 shows a system in a stand-alone environment, database files generally require a large-capacity disk device. The present invention may therefore be realized as a client/server system in which the database files and the search engine are placed on a server machine, the client machines are connected to the server machine over a LAN such as Ethernet or Token Ring, and only a display control unit for viewing search results is placed on the client machines.

[0026] B. System configuration
Next, the system configuration of the present invention is described with reference to the block diagram of FIG. 2. Note that the elements shown as individual blocks in FIG. 2 are stored on the hard disk 108 of FIG. 1, individually or collectively, as data files or program files.

[0027] The database 202 primarily envisaged by the present invention is one that stores a plurality of documents, such as a newspaper article database or a patent gazette database. Note, however, that the scope of the present invention is not limited to searching a database consisting of a plurality of documents; it can also be applied to searching within a single document. The content of each document is stored in searchable form, for example in text file format. Each document is also assigned a unique document number. Document numbers are preferably sequential numbers in ascending order starting from 1, but in the case of a patent gazette database the application number or publication number can also be used as the unique document number. Symbols such as "ABC" or "&XYZ" may also be used instead of sequential numbers to identify individual documents. In general, however, representing such identification symbols requires more bytes than representing numbers, so in practice it is preferable to identify documents by sequential numbers.

[0028] In general, directly searching voluminous content such as the newspaper articles or patent gazettes stored in the database 202 takes a long processing time. An index file 204 is therefore created in advance by an index creation/update module 206, covering the content of all the newspaper articles stored in the database 202. In the embodiment of the present invention described below, the index file 204 created in this way consists of two files: a character pattern file and a position information file. The character pattern file stores the character patterns and delimiter patterns, together with the location in the position information file of the corresponding document numbers and in-document position numbers. The position information file stores the document numbers and in-document position numbers.

[0029] The database 202 may manage each document as a separate file, or it may place all the documents one after another in a single contiguous file. What is essential is that each document is assigned a unique number and that the content of each document can be accessed by that number. In the former case, the database 202 manages a table that associates the unique number of each document with the name of the file that actually stores the document. In the latter case, the database 202 manages a table that associates the unique number of each document with its offset in the single database file and with the size of the document.
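The single-file layout described above can be sketched as follows. This is a minimal illustration only: the function names `build_single_file` and `read_document`, the in-memory byte string standing in for the database file, and the use of UTF-8 are assumptions made for the example and do not appear in the specification.

```python
import io


def build_single_file(docs):
    """Concatenate documents into one blob.

    Returns (blob, table), where table[doc_no] = (offset, size) and
    document numbers are sequential, starting from 1.
    """
    blob = io.BytesIO()
    table = {}
    for doc_no, text in enumerate(docs, start=1):
        data = text.encode("utf-8")  # encoding chosen only for this sketch
        table[doc_no] = (blob.tell(), len(data))
        blob.write(data)
    return blob.getvalue(), table


def read_document(blob, table, doc_no):
    """Fetch one document's content using only its unique number."""
    offset, size = table[doc_no]
    return blob[offset:offset + size].decode("utf-8")


blob, table = build_single_file(["abc", "本日"])
```

With a table of this kind, a document number returned by the search engine is enough to locate the document's content without scanning the file.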

[0030] The search engine 208 takes as input the search string supplied by a search string input module 210, searches the index file 204, and returns the document numbers (there may be several) of the documents containing the search string and the positions (again, there may be several) at which the search string appears in each document. The search string input module 210 preferably takes the form of a dialog box in a multi-window environment, with an input box into which the desired search string is typed on the keyboard 106.

[0031] Further, according to a feature of the present invention, the search string input module 210 allows the similarity for a fuzzy search to be entered as a numerical value from 0 to 1 (or, on a percentage basis, a number from 0 to 100). For this purpose, the search string input module 210 displays a slider or scroll bar with a handle that indicates an arbitrary position between 0 and 1. The handle of the slider points to 1 by default, for example, and can be made to point to another value by dragging it with the mouse 112.

[0032] The result display module 212 accesses the database 202 based on the document number supplied as a search result by the search engine 208 and on the position at which the search string appears in that document, and displays the line corresponding to that position in that document, preferably in a separate search result window. If the search results do not fit on one screen of that window, a scroll bar appears, and the user can view the results in turn by clicking it.

[0033] C. Structure of the index file and method of creating it
In the present invention, a file is created that indexes, against the documents, every run of N consecutive characters together with its position in the document, as well as in-document division information. Here, in-document division information typically means delimiters of a document in the broad sense: sentence and clause breaks such as "。" and "、", and section markers such as "Chapter 1" or "Abstract".

[0034] C1. Normalization of character strings
The first process required to create the index file is normalization of the character strings, which works as follows. When the documents to be searched are Japanese text files in particular, half-width and full-width characters may be mixed. Normalization therefore performs processing such as replacing each half-width character with the corresponding full-width character.
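One possible way to fold half-width characters to full-width is sketched below. The specification does not say how the replacement is carried out, so everything here is an assumption: the function name `normalize` is hypothetical, and Unicode NFKC is used as a convenient stand-in. NFKC also rewrites some other compatibility characters, which a production normalizer might not want.

```python
import unicodedata


def normalize(text: str) -> str:
    """Fold half-width characters to full-width equivalents (illustrative)."""
    # NFKC converts half-width katakana to full-width katakana and composes
    # voiced sound marks; as a side effect it folds full-width ASCII to ASCII.
    folded = unicodedata.normalize("NFKC", text)
    out = []
    for ch in folded:
        code = ord(ch)
        if ch == " ":
            out.append("\u3000")            # ideographic (full-width) space
        elif 0x21 <= code <= 0x7E:
            out.append(chr(code + 0xFEE0))  # ASCII -> full-width form
        else:
            out.append(ch)                  # already full-width: keep as is
    return "".join(out)
```

After this step, a Latin letter typed in half-width and the same letter typed in full-width produce the same character, so both yield identical character patterns in the index.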

[0035] C2. Extraction of character pattern information
The next step in creating the index file is, for every character of the normalized string, to extract the N consecutive characters beginning at that character (hereinafter called a character pattern) and store them in the index file together with the document number and the in-document position number. Here N ≥ 1; for Japanese, N = 2 is appropriate.

[0036] In-document position numbers are obtained by assigning, in order, a number that is unique within the document to every character in the document that is subject to searching. The in-document position number of the first character of a character pattern is taken as the in-document position number of that character pattern. At the end of a document, where a character and its following characters together amount to fewer than N characters, a predetermined padding character such as X'00' is appended to bring the total to N.
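The extraction and padding rules above can be sketched in a few lines. The function name `extract_patterns` and the tuple layout are assumptions made for illustration; the example string is deliberately short, so its last character is padded here.

```python
PAD = "\x00"  # padding character, corresponding to X'00' in the text


def extract_patterns(text, doc_no, n=2):
    """Return (pattern, doc_no, position) for every character of text.

    Positions are 1-based, and the position of a pattern is the position
    of its first character.
    """
    padded = text + PAD * (n - 1)  # lets the final characters form full patterns
    return [(padded[i:i + n], doc_no, i + 1) for i in range(len(text))]


entries = extract_patterns("本日は晴天なり。", doc_no=1)
```

Every character starts exactly one pattern, so a text of length L always yields L entries regardless of N.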

[0037] In this embodiment, each individual document is also divided into blocks in a way that is meaningful at search time, and the division information is stored in the index file. The division information is stored in the same format as the character patterns described above. That is, in place of a character pattern extracted from the normalized string, a specially defined delimiter pattern is stored together with the document number of the document and the in-document position of the character at the block boundary.

[0038] Defining several kinds of delimiter pattern makes it possible to have several different ways of dividing a document. The delimiter patterns must, however, be chosen so that they do not coincide with any character pattern extracted from the normalized strings. In this embodiment, normalization converts 1-byte codes into 2-byte codes as well, so when 2 bytes are viewed as one word, a word whose value is 255 or less does not correspond to any ordinary character code. Any word value from 0 to 255 can therefore be assigned individually to the several kinds of delimiter pattern.
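The reservation of word values 0 to 255 for delimiter patterns can be illustrated as follows. This is a sketch under stated assumptions: the specific values 1 and 2, the big-endian byte order, the helper names, and the use of Shift JIS as the 2-byte character encoding are all choices made for the example, not details given in the specification.

```python
DELIM_1 = 1  # hypothetical word value for "delimiter pattern 1"
DELIM_2 = 2  # hypothetical word value for "delimiter pattern 2"


def delimiter_pattern(word_value: int) -> bytes:
    """Encode a reserved word value (0-255) as a 2-byte word."""
    if not 0 <= word_value <= 255:
        raise ValueError("delimiter word values must lie in 0..255")
    return word_value.to_bytes(2, "big")


def is_delimiter(pattern: bytes) -> bool:
    """True when the first 2-byte word of the pattern is a reserved value."""
    return int.from_bytes(pattern[:2], "big") <= 255


# A normalized 2-character pattern in a 2-byte encoding (Shift JIS assumed).
ordinary = "本日".encode("shift_jis")
```

Because every normalized character occupies two bytes with a non-zero leading byte, its word value exceeds 255 and can never be mistaken for a delimiter pattern.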

[0039] Storing the division information in this format, which is the same as that of the character patterns, has the following advantages.
- Creating and updating the index is simple. No special processing is needed for the division information.
- The size of the index does not increase significantly. For example, the increase in size is far smaller than in a format in which each in-document position number is accompanied by the number of the block to which it belongs.

[0040] C3. Concrete example of in-document position numbers
For example, suppose the database 202 (FIG. 2) contains a document that begins with the text "本日は晴天なり。ただいま、マイクのテスト中。" ("It is fine today. We are now testing the microphone."). Assigning an in-document position number to each character of this text gives the following.

[Table 1]
In-document position number   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22
Normalized character string  本 日 は 晴 天 な り 。 た だ い ま 、 マ イ ク の テ ス ト 中 。
Delimiter method 1                                 |                                         |
Delimiter method 2                                 |              |                          |

[0041] Suppose further that the document number of this document is 1 and that the number of characters N in a character pattern is 2. The document number and in-document position number associated with each character pattern (of length 2) are then as follows.
Then, the document number and the position number in the document associated with each character pattern (length 2) are as follows.

[Table 2]
Character pattern     Document number  In-document position number
------------------------------------------
本日                  1                1
日は                  1                2
は晴                  1                3
晴天                  1                4
天な                  1                5
なり                  1                6
り。                  1                7
。た                  1                8
Delimiter pattern 1   1                8
Delimiter pattern 2   1                8
ただ                  1                9
だい                  1                10
いま                  1                11
ま、                  1                12
、マ                  1                13
Delimiter pattern 2   1                13
マイ                  1                14
イク                  1                15
クの                  1                16
のテ                  1                17
テス                  1                18
スト                  1                19
ト中                  1                20
中。                  1                21
。                    1                22
Delimiter pattern 1   1                22
Delimiter pattern 2   1                22
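The rows of Table 2 can be reproduced with a short sketch. The rule used here, that delimiter pattern 1 marks "。" and delimiter pattern 2 marks both "。" and "、", is inferred from the positions listed in the table and is an assumption; the placeholder strings `D1` and `D2` stand in for the reserved delimiter values, and `build_entries` is a hypothetical name.

```python
TEXT = "本日は晴天なり。ただいま、マイクのテスト中。"
PAD = "\x00"             # padding character for the final pattern
D1, D2 = "<D1>", "<D2>"  # placeholders for delimiter patterns 1 and 2


def build_entries(text, doc_no, n=2):
    """List (pattern, doc_no, position) rows in the order used by Table 2."""
    padded = text + PAD * (n - 1)
    rows = []
    for i, ch in enumerate(text):
        pos = i + 1
        rows.append((padded[i:i + n], doc_no, pos))
        if ch == "。":        # sentence boundary (assumed rule for pattern 1)
            rows.append((D1, doc_no, pos))
        if ch in "。、":      # sentence or clause boundary (assumed for pattern 2)
            rows.append((D2, doc_no, pos))
    return rows


rows = build_entries(TEXT, 1)
```

The delimiter rows share the position number of the boundary character itself, which is why positions 8, 13, and 22 each appear more than once.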

[0042] C4. Role of in-document division information
Next, the usefulness of in-document division information (delimiters) in searching is described.

[0043] - Searching only specific blocks
For example, when documents are structured as title, abstract, and body, it is a common requirement to search only a specific part, such as the title alone or the abstract alone. Such a search can be realized by storing delimiter patterns and their position information for the end of the title and the end of the abstract.

[0044] - Searching for documents in which several character strings are strongly related
It is also a common requirement to search with attention to how strongly several character strings are related in context. For example, strings that appear in the same paragraph are more likely to be strongly related than strings that merely appear in the same document, and strings in the same sentence can be expected to be more strongly related still. By storing delimiter patterns and their position information for the ends of paragraphs and sentences, it becomes possible to search for documents in which several character strings lie within the same block, and thus to search with attention to the strength of the relationship.
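Given the sorted boundary positions stored for one delimiter pattern in one document, checking whether two hits fall in the same block reduces to a binary search. The sketch below assumes that a boundary character belongs to the block it closes; the function names are hypothetical.

```python
from bisect import bisect_left


def block_index(delims, pos):
    """Index of the block containing pos.

    delims is the sorted list of boundary positions for one delimiter
    pattern in one document; a boundary character is counted as part of
    the block that it ends.
    """
    return bisect_left(delims, pos)


def same_block(delims, pos_a, pos_b):
    """True when both positions lie between the same pair of boundaries."""
    return block_index(delims, pos_a) == block_index(delims, pos_b)


sentence_ends = [8, 22]      # delimiter pattern 1 in the example document
clause_ends = [8, 13, 22]    # delimiter pattern 2 in the example document
```

In the example document, positions 9 ("ただ") and 14 ("マイ") lie in the same sentence but in different clauses, so the two delimiter patterns give different answers for the same pair of hits.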

[0045] C5. Structure of the index file
The character patterns and delimiter patterns, and their document numbers and in-document position numbers, must be stored in a form that allows them to be retrieved efficiently at search time. In this embodiment, the index file therefore consists of two files: a character pattern file (which mainly stores the character patterns and delimiter patterns) and a position information file (which mainly stores the document numbers and in-document position numbers). The character pattern file stores the character patterns and delimiter patterns, together with the location in the position information file of the corresponding document numbers and in-document position numbers. The position information file stores the document numbers and in-document position numbers.

[0046] FIG. 3 shows an example of such a character pattern file and the corresponding position information file.

[0047] In FIG. 3, the entries of the character pattern file 302 are the character patterns of N consecutive characters (here N = 2) occurring in all the documents of the database 202. The entries of the character pattern file 302 are preferably sorted in ascending order of the code value of the first character of the normalized character pattern, so as to permit binary search. "Delimiter pattern 1", "delimiter pattern 2", "なり", "は晴", and so on are individual entries of the character pattern file 302. Note that "delimiter pattern 1", for example, collectively denotes sentence and clause breaks such as "，", "、", or "。", and is assigned a special 2-byte value.

【００４８】図３の位置情報ファイル３０４は、文字パ
ターン・ファイル３０２の個々のエントリに対応する少
なくとも１つの文書番号及び、その個々の文書番号毎に
関連付けられた少なくとも１つの文書内位置番号を格納
している。The position information file 304 in FIG. 3 stores, for each entry of the character pattern file 302, at least one document number and, for each such document number, at least one associated in-document position number.

【００４９】文字パターン・ファイル３０２のエントリ
と、位置情報ファイル３０４のエントリとを対応付ける
ために、図示しないが、文字パターン・ファイル３０２
の個々のエントリは、対応する位置情報ファイル３０４
のエントリの、位置情報ファイル３０４の先頭からのオ
フセット、及び、対応する位置情報ファイル３０４のエ
ントリのサイズの情報をもつ。すなわち、図３で例え
ば、文字パターン・ファイル３０２は、「区切りパター
ン２」に関連してそこに格納されているオフセットの情
報から、位置情報ファイル３０４を先頭からシークし、
サイズの情報に指定されたバイト数だけシークした位置
から読取り、これによって、「区切りパターン２」に関
連して、文書番号１における８，１３，２２・・・とい
う文書内位置番号値と、文書番号２に関連する文書内位
置番号値、・・・（もしあるなら）文書番号ｎに関連す
る文書内位置番号値を一括して読み取ることが可能とな
る。To associate entries of the character pattern file 302 with entries of the position information file 304, each entry of the character pattern file 302 holds (although this is not shown) the offset of the corresponding entry from the start of the position information file 304 and the size of that entry. For example, in FIG. 3, the offset stored with "delimiter pattern 2" is used to seek from the start of the position information file 304, and the number of bytes given by the size information is then read from that position. This makes it possible to read in one operation, for "delimiter pattern 2", the in-document position numbers 8, 13, 22, ... in document number 1, the in-document position numbers associated with document number 2, ..., and (if any) the in-document position numbers associated with document number n.

【００５０】一般に、文書番号ｉに関連する文書内位置
番号値は、例えば、（文書番号ｉ：４バイト）（文書内
位置番号の数ｋ：４バイト）（１番目の文書内位置番
号：４バイト）・・・（ｋ番目の文書内位置番号：４バ
イト）のような形式で格納されている。この例では、文
書内位置番号を格納するフィールドとして、文書の絶対
位置を格納するために４バイトをとるようにしている
が、実際上、１つ前の文書内位置番号からのオフセット
を格納するようにして、これを１〜３バイトに節約する
ようにした方がよい。文書番号及び文書内位置番号の数
を格納するフィールドについても同様である。In general, the in-document position numbers associated with document number i are stored in a format such as (document number i: 4 bytes) (number k of in-document position numbers: 4 bytes) (1st in-document position number: 4 bytes) ... (k-th in-document position number: 4 bytes). In this example, each in-document position number field takes 4 bytes so that it can hold an absolute position in the document. In practice, it is better to store the offset from the preceding in-document position number, which reduces the field to 1 to 3 bytes. The same applies to the fields that store the document number and the number of in-document position numbers.
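The record layout and the offset/size lookup described above can be sketched as follows. This is only an illustration under stated assumptions: the function names (`pack_postings`, `unpack_postings`, `read_entry`, `to_deltas`) are hypothetical, little-endian byte order is assumed because the text does not specify one, and the 4-byte fields follow the uncompressed example format rather than the 1-to-3-byte form recommended in practice.

```python
import io
import struct

def pack_postings(doc_id, positions):
    """Pack (document number, count k, k position numbers), 4 bytes each."""
    return struct.pack(f"<II{len(positions)}I", doc_id, len(positions), *positions)

def unpack_postings(buf, offset=0):
    """Decode one record at offset; return (doc_id, positions, next_offset)."""
    doc_id, k = struct.unpack_from("<II", buf, offset)
    positions = list(struct.unpack_from(f"<{k}I", buf, offset + 8))
    return doc_id, positions, offset + 8 + 4 * k

def read_entry(posfile, offset, size):
    """Seek to offset in the position information file and decode size bytes."""
    posfile.seek(offset)
    buf = posfile.read(size)
    entry, at = {}, 0
    while at < len(buf):
        doc_id, positions, at = unpack_postings(buf, at)
        entry[doc_id] = positions
    return entry

def to_deltas(positions):
    """Replace absolute positions by gaps from the preceding position."""
    return [p - q for p, q in zip(positions, [0] + positions[:-1])]

# Hypothetical file content: 12 unrelated bytes, then one entry for two documents.
record = pack_postings(1, [8, 13, 22])
demo = io.BytesIO(b"\x00" * 12 + record + pack_postings(2, [4]))
```

Storing gaps, as `to_deltas` does, keeps the stored values small, which is what makes the shorter field widths mentioned in the text feasible.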

【００５１】Ｃ６．索引ファイルの作成処理次に、図４を参照して、索引ファイルの作成処理につい
て説明する。この処理は、最初のデータベース２０２の
構築時または、データベース２０２に文書を追加あるい
はデータベース２０２から文書を削除したときに、図２
の索引作成・更新モジュール２０６によって行われる処
理である。C6. Index File Creation Process. Next, the index file creation process will be described with reference to FIG. 4. This process is performed by the index creation/update module 206 of FIG. 2 when the database 202 is first constructed, or when a document is added to or deleted from the database 202.

【００５２】図４で、先ずステップ４０２では、メモリ
領域を確保する処理が行われる。これは、例えば、オペ
レーティング・システムの機能を呼び出すことによっ
て、ＲＡＭ１０４上で、所定のサイズの作業領域を獲得
する処理である。In FIG. 4, first, at step 402, a process of securing a memory area is performed. This is a process of acquiring a work area of a predetermined size on the RAM 104 by calling a function of the operating system, for example.

【００５３】ステップ４０４では、データベース２０２
から１つの文書が、好適には上記ステップ４０２で獲得
されたメモリ領域に読み込まれる。In step 404, one document is read from the database 202, preferably into the memory area acquired in step 402.

【００５４】ステップ４０６では、ステップ４０４で読
み込まれた文書につき、前述の正規化処理が行われる。In step 406, the above-described normalization processing is performed on the document read in step 404.

【００５５】ステップ４０８では、正規化された文書を
走査することによって、文字パターン・区切りパターン
が作成され、文字パターン・区切りパターンと、当該文
書の文書番号と、文字パターン・区切りパターンの文書
内位置番号が、上記ステップ４０２で獲得されたメモリ
領域に格納される。In step 408, the normalized document is scanned to create character patterns and delimiter patterns, and each character pattern or delimiter pattern is stored, together with the document number of the document and the in-document position number of the pattern, in the memory area acquired in step 402.
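As a rough illustration of the scan in step 408, the sketch below emits one (pattern, document number, position) triple for every run of N consecutive characters. The function name `extract_patterns` is hypothetical, positions are assumed to be counted from 1, and the replacement of delimiters by special delimiter patterns is assumed to have happened during normalization and is not shown.

```python
def extract_patterns(text, doc_id, n=2):
    """Yield (pattern, doc_id, position) for every run of n consecutive characters.

    Positions are counted from 1 here, which is an assumption of this sketch.
    """
    for pos in range(len(text) - n + 1):
        yield text[pos:pos + n], doc_id, pos + 1
```

For example, `list(extract_patterns("今日は晴", 1))` yields the three patterns 「今日」, 「日は」, and 「は晴」 with positions 1, 2, and 3.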

【００５６】ステップ４０８の処理において、文字パタ
ーン、文書番号及び文書内位置番号をステップ４０２で
予め獲得されたメモリ領域に格納していくにつれて、そ
の獲得されたメモリ領域の空き領域が不足してくること
があり得る。そこで、ステップ４１０では、獲得された
メモリ領域が満杯かどうかを調べる処理が行われ、もし
そうなら、ステップ４１２で、メモリ領域に格納されて
いる文字パターン・区切りパターンと、文書の文書番号
と、文字パターン・区切りパターンの文書内位置情報と
が、例えば、文字パターン・区切りパターンの文字コー
ド値・文書番号・文書内位置番号に基づきソートされ
て、中間ファイルとしてディスク１０８（図１）に書き
出され、これによって、中間ファイルに書き出されたデ
ータが格納されていたメモリ領域は、以下の処理で使用
可能に開放される。そして、この後処理は、ステップ４
１４に進む。In the processing of step 408, as character patterns, document numbers, and in-document position numbers are stored in the memory area acquired in advance in step 402, the free space in that memory area may run short. Therefore, step 410 checks whether the acquired memory area is full. If it is, then in step 412 the character patterns and delimiter patterns stored in the memory area, together with their document numbers and in-document position information, are sorted, for example, by the character code value of the pattern, the document number, and the in-document position number, and are written to the disk 108 (FIG. 1) as an intermediate file. The memory area that held the data written to the intermediate file is thereby released for use in subsequent processing. The process then proceeds to step 414.

【００５７】ステップ４１０で、メモリ領域にまだ余裕
があると判断されたなら、処理は直ちにステップ４１４
に進む。If it is determined in step 410 that the memory area still has room, the process proceeds directly to step 414.

【００５８】ステップ４１４では、ステップ４０４でま
だ読み込んでいない文書がデータベース２０２に残って
いるかどうかが判断される。もしそうなら、処理は、ス
テップ４０４に戻る。In step 414, it is determined whether any document that has not yet been read in step 404 remains in the database 202. If so, the process returns to step 404.

【００５９】ステップ４１４で、データベース２０２の
全ての文書の読み込み処理が完了したと判断されると、
ステップ４０２で獲得されたメモリ領域に書き出されな
いで残っている文字パターン・区切りパターンと、文書
の文書番号と、文字パターン・区切りパターンの文書内
位置番号とが、やはり文字パターン・区切りパターンの
文字コード値・文書番号・文書内位置番号に基づきソー
トされて、中間ファイルとしてディスク１０８（図１）
に書き出される。When it is determined in step 414 that all documents in the database 202 have been read, the character patterns and delimiter patterns that remain in the memory area acquired in step 402 without having been written out, together with their document numbers and in-document position numbers, are likewise sorted by the character code value of the pattern, the document number, and the in-document position number, and are written to the disk 108 (FIG. 1) as an intermediate file.

【００６０】ステップ４１２とステップ４１６での中間
ファイルの書き出しによって、ディスク１０８には、複
数の中間ファイルが存在し、また、その各々の中間ファ
イルは予めソートされているので、ステップ４１８で
は、周知のマージ・ソートの技法で、それらの複数の中
間ファイルから、図３に示す文字パターン・ファイル３
０２と、位置情報ファイル３０４とを作成しディスク１
０８に格納する処理が行われる。尚、もとの複数の中間
ファイルには、文字パターンは重複して何度もあらわれ
得るので、ここでは、重複する同一の文字パターンのエ
ントリを１つにまとめて、それに関連する文書番号及び
文書内位置番号を関連付ける処理が行われる。その後、
中間ファイルは最早不要なので削除される。As a result of writing intermediate files in steps 412 and 416, a plurality of intermediate files exist on the disk 108, and each of them is already sorted. In step 418, therefore, the character pattern file 302 and the position information file 304 shown in FIG. 3 are created from these intermediate files by the well-known merge sort technique and are stored on the disk 108. Because the same character pattern can appear many times across the original intermediate files, the entries for identical character patterns are combined into one entry at this point, and the related document numbers and in-document position numbers are associated with it. The intermediate files are then deleted because they are no longer needed.
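The merge in step 418 can be pictured with the following sketch, which works on in-memory lists instead of files on disk. The function name `merge_runs` and the nested dictionary used as the result are assumptions made for illustration; the text itself produces the two files of FIG. 3.

```python
import heapq
from itertools import groupby

def merge_runs(runs):
    """Merge sorted runs of (pattern, doc_id, position) triples.

    Returns {pattern: {doc_id: [positions]}}, with identical patterns
    from different runs combined into a single entry.
    """
    index = {}
    merged = heapq.merge(*runs)  # k-way merge of runs that are already sorted
    for pattern, group in groupby(merged, key=lambda triple: triple[0]):
        docs = index.setdefault(pattern, {})
        for _, doc_id, pos in group:
            docs.setdefault(doc_id, []).append(pos)
    return index
```

Because every run is sorted by pattern, document number, and position, the merged stream presents all occurrences of one pattern contiguously, which is what allows duplicates to be combined in a single pass.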

【００６１】Ｄ．索引ファイルを使用した検索処理次に、上述のようにして作成された索引ファイルを使用
して、文字列検索を行う処理の例について、図５のフロ
ーチャートを参照して説明する。ステップ５０２では、
先ず、例えば、入力ボックスをもつダイアログ・ボック
スを表示し、ユーザーに、その入力ボックスに検索文字
列を入力するようにプロンプトする処理が行われる。D. Search Process Using Index File. Next, an example of a character string search using the index file created as described above will be described with reference to the flowchart of FIG. 5. In step 502, for example, a dialog box having an input box is first displayed, and the user is prompted to enter a search character string in the input box.

【００６２】ユーザーが入力ボックスに検索文字列を入
力し、ＯＫボタンをクリックすると、必要に応じて検索
文字列の正規化処理が行われた後で、その検索文字列の
最初のＮ文字パターンで以って上記索引ファイルを使用
した検索処理が行われる。尚、ここでいうＮ文字パター
ンの長さは、上記索引ファイルの文字列パターンの長さ
Ｎと同一であり、従って、検索文字列の部分文字列であ
るこのＮ文字パターンをキーとして、上記索引ファイル
を高速に２分探索することができる。日本語の文書に適
切なＮの値の１つの例は、２である。When the user enters a search character string in the input box and clicks the OK button, the search character string is normalized as necessary, and the index file is then searched with the first N-character pattern of the search character string. The length of this N-character pattern is the same as the length N of the character patterns in the index file. The N-character pattern, which is a substring of the search character string, can therefore be used as a key for a fast binary search of the index file. One example of a suitable value of N for Japanese documents is 2.
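A binary search over the sorted patterns might look like the sketch below. It assumes, for illustration only, that the character pattern file has been loaded into a sorted Python list `patterns` with a parallel list `postings`; the name `lookup` is hypothetical.

```python
from bisect import bisect_left

def lookup(patterns, postings, key):
    """Binary-search the sorted pattern list; return the postings for key, or None."""
    i = bisect_left(patterns, key)
    if i < len(patterns) and patterns[i] == key:
        return postings[i]
    return None
```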

【００６３】ステップ５０６で、検索文字列の最初のＮ
文字パターンが見つからなかったと判断されると、好適
には、検索文字列が見つからなかったことを示すメッセ
ージ・ボックスがステップ５０８で表示され、処理は終
了する。If it is determined in step 506 that the first N-character pattern of the search character string was not found, a message box indicating that the search character string was not found is preferably displayed in step 508, and the process ends.

【００６４】ステップ５０６で、検索文字列の最初のＮ
文字パターンが見つかったと判断されると、索引ファイ
ルからは、１つ以上の文書番号とその文書番号における
少なくとも１つの文書内位置番号が返されるので、この
情報は、ステップ５１０で後の処理のため主記憶または
ディスク上の所定のバッファ領域に一旦格納される。If it is determined in step 506 that the first N-character pattern of the search character string was found, the index file returns one or more document numbers and at least one in-document position number for each of those document numbers. In step 510, this information is temporarily stored in a predetermined buffer area in main memory or on disk for later processing.

【００６５】ステップ５１２では、検索文字列全てをＮ
文字パターンの部分文字列として検索してしまったかど
うかが判断され、もしそうなら、処理はステップ５２０
に進む。もしそうでないなら、ステップ５１８で、検索
文字列の次のＮ文字パターンで以って上記索引ファイル
を使用した検索処理が行われる。尚、検索文字列の長さ
は、一般的にＮの倍数とは限らず、従って、Ｎ文字パタ
ーンずつ検索していって検索文字列の末端付近まで部分
文字列をとる処理が進んだ時、索引ファイルのキーとな
る文字列がＮ文字よりも短いという場合もある。この場
合は、検索文字列の最後のＮ文字を部分文字列としてと
る。すると、直前にとったＮ文字と重複する文字がある
ことになる。検索文字列がＮ文字未満である場合には、
２分探索の結果は複数の候補を有し、その後の処理は、
順次探索によって複数の候補を見出すことになる。In step 512, it is determined whether the entire search character string has been searched as N-character substrings. If so, the process proceeds to step 520. If not, in step 518 the index file is searched with the next N-character pattern of the search character string. The length of the search character string is generally not a multiple of N. Consequently, as substrings are taken N characters at a time and processing approaches the end of the search character string, the remaining string to be used as the index file key may be shorter than N characters. In that case, the last N characters of the search character string are taken as the substring, so some characters overlap the N characters taken immediately before. If the search character string itself is shorter than N characters, the binary search yields multiple candidates, and subsequent processing finds those candidates by sequential search.
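The way the search character string is cut into N-character keys, including the overlapping final key, can be sketched as follows. The function name `query_chunks` and the zero-based offsets are assumptions of this sketch, and queries shorter than N are simply rejected here rather than handled by the sequential search the text describes.

```python
def query_chunks(query, n=2):
    """Return (offset, chunk) pairs that cover the query with n-character chunks."""
    if len(query) < n:
        raise ValueError("queries shorter than n need a prefix scan instead")
    offsets = list(range(0, len(query) - n + 1, n))
    if offsets[-1] + n < len(query):      # a tail shorter than n is left over
        offsets.append(len(query) - n)    # take the last n characters, overlapping
    return [(off, query[off:off + n]) for off in offsets]
```

With N = 2, the five-character query 「政治資金規」 is cut into 「政治」, 「資金」, and 「金規」, where the last key overlaps the preceding one by one character.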

【００６６】ステップ５１６では、ステップ５０６と同
様に、検索文字列の当該Ｎ文字パターンが索引ファイル
で見つかったかどうかが判断される。但し、ステップ５
１６がステップ５０６と本質的に異なるのは、ステップ
５１６では、検索文字列の直前のＮ文字パターンに関す
るどれかの文書番号における、どれかの文書内位置番号
をＮだけ増分した文書内位置番号をもつ文字パターンに
ヒットしなければ、見つかったと見なされない、という
ことである。In step 516, as in step 506, it is determined whether the N-character pattern of the search character string was found in the index file. Step 516 differs essentially from step 506, however, in that the pattern is not regarded as found unless the search hits a character pattern whose in-document position number equals, for some document number found for the immediately preceding N-character pattern of the search character string, one of that pattern's in-document position numbers incremented by N.

【００６７】ステップ５１６で、検索文字列の当該Ｎ文
字パターンが見つからなかったと判断されると、好適に
は、検索文字列が見つからなかったことを示すメッセー
ジ・ボックスがステップ５０８で表示され、処理は終了
する。If it is determined in step 516 that the N-character pattern of the search character string was not found, a message box indicating that the search character string was not found is preferably displayed in step 508, and the process ends.

【００６８】ステップ５１６で、検索文字列の当該Ｎ文
字パターンが見つかったと判断されると、索引ファイル
の検索の結果返される文書番号とその文書番号における
少なくとも１つの文書内位置番号のうち１つ前のＮ文字
パターンの情報と同一文書内で位置番号が順次的にリン
クされるもののみ、ステップ５１８で、後の処理のため
主記憶またはディスク上の所定のバッファ領域に一旦格
納される。If it is determined in step 516 that the N-character pattern of the search character string was found, then, of the document numbers and in-document position numbers returned by the index file search, only those whose position numbers link sequentially, within the same document, to the information for the preceding N-character pattern are temporarily stored in step 518 in a predetermined buffer area in main memory or on disk for later processing.

【００６９】ステップ５１２で、検索文字列が全て検索
されたと判断されると、ステップ５２０に進んで、バッ
ファに格納されている文書番号及び文書内位置番号か
ら、検索文字列が存在する文書番号とその位置が決定さ
れ、ステップ５２２では、その文書番号と文書内位置番
号から、データベース２０２の文書のコンテンツがアク
セスされ、文書検索文字列が存在する文書の該当行が、
好適には個別のウインドウ内に表示される。If it is determined in step 512 that the entire search character string has been searched, the process proceeds to step 520, where the document numbers in which the search character string exists, and its positions, are determined from the document numbers and in-document position numbers stored in the buffer. In step 522, the contents of the documents in the database 202 are accessed using those document numbers and in-document position numbers, and the lines of the documents that contain the search character string are displayed, preferably in a separate window.
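The exact search of FIG. 5 can be approximated by the sketch below. It is not the step-by-step buffer procedure of the text: instead of linking each hit to the preceding one, it converts every hit into the start position it implies for the whole query and intersects the results, which should select the same occurrences. The index is assumed to be a dictionary of the form {pattern: {document number: [positions]}} with positions counted from 1, and the name `exact_search` is hypothetical.

```python
def exact_search(index, query, n=2):
    """Return sorted (doc_id, start_position) pairs at which query occurs."""
    offsets = list(range(0, len(query) - n + 1, n))
    if not offsets:
        raise ValueError("query shorter than n")
    if offsets[-1] + n < len(query):
        offsets.append(len(query) - n)    # overlapping final chunk
    hits = None
    for off in offsets:
        postings = index.get(query[off:off + n], {})
        # A chunk found at position p implies that the query starts at p - off.
        starts = {(d, p - off) for d, ps in postings.items() for p in ps}
        hits = starts if hits is None else hits & starts
        if not hits:
            return []                     # corresponds to "not found" (step 508)
    return sorted(hits)
```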

【００７０】尚、検索文字列が文書内の特定ブロック
（例：３番目のブロック）にあらわれることを調べるた
めには、上記検索文字列が文書中にあらわれる位置まで
にあらわれる上記文書内の区切り位置を数えて、上記検
索文字列が上記文書内でどのブロック（何番目のブロッ
ク）に位置するかを調べて、指定のブロック番号と比較
すればよい。To check whether the search character string appears in a specific block of a document (for example, the third block), count the delimiter positions that appear in the document before the position at which the search character string appears, determine from that count which block of the document contains the search character string, and compare the result with the specified block number.

【００７１】Ｅ．曖昧検索処理図５で示す処理は、索引ファイルを使用して、いわば厳
密検索を行う処理であったが、本発明に従えば、索引フ
ァイルを使用して、指定された文字列と文字の並びが似
ている文字列を含む、いわゆる曖昧検索処理をも、デー
タベースの個々の文書に関して高速に実行することが可
能である。特に、この方式では、検索したい文字列と、
検索精度（０より大きく１以下）とを指定し、検索した
い文字列との"似ている度合"が指定の検索精度以上であ
る"似ている文字列"を含む文書および"似ている文字列"
の文書内位置を特定するものである。E. Ambiguous Search Process. The process shown in FIG. 5 uses the index file to perform what may be called an exact search. According to the present invention, however, a so-called ambiguous search, which finds character strings whose character sequence resembles a specified character string, can also be executed at high speed over the individual documents of the database by using the index file. Specifically, in this method, a character string to be searched for and a search precision (greater than 0 and at most 1) are specified, and the method identifies the documents containing "similar character strings" whose "degree of similarity" to the search character string is at least the specified search precision, together with the in-document positions of those "similar character strings".

【００７２】Ｅ１．文字列を似ていると感じる人間の感
覚日本語のわかる人間が見て、文字の並びが似ていてしか
も意味が近いと感じる日本語の文字列には，次のような
ものがある。E1. Human Perception of Similar Character Strings. Japanese character strings that a person who understands Japanese perceives as having a similar character sequence and also a similar meaning include the following.

【表３】[Table 3]
(1) Variant spellings of katakana words
    Small versus large kana: 「ソフトウェア」 「ソフトウエア」
    Presence or absence of the long-vowel mark 「ー」: 「コンパイラー」 「コンパイラ」
    Presence or absence of the middle dot 「・」: 「アイビーエム」 「アイ・ビー・エム」
    Other: 「ビルディング」 「ビルヂング」
(2) A particle or the like inserted between kanji compounds: 「在宅起訴」 「在宅のまま起訴」, 「政界再編」 「政界の再編」
(3) A compound of kanji compounds and compounds of the same parts with one part missing: 「国立民族博物館」 「国立博物館」 「民族博物館」
(4) A part missing because of abbreviation or the like: 「ソフトウェア開発」 「ソフト開発」
(5) Input errors: 「カリフォルニア」 「カリフォリニア」

【００７３】これらに共通しているのは、ほとんどの文
字は連続して一致しているが不足文字や余分な文字があ
る、ということである。What is common to these is that most characters match continuously but there are missing or extra characters.

【００７４】次に、どちらが似ているかという観点から
いくつかの言葉を考えてみると、「ソフトメーカー」に
似ているのは、「ソフトのメーカー」、「ソフト開発メ
ーカー」、「ソフトの開発メーカー」の順であるし、
「政治資金規正法案」と比べるならば、「政治資金規正
法」、「政治資金規正」、「政治資金」の順に似ている
と感じる。Next, consider several expressions in terms of which is more similar. The strings similar to 「ソフトメーカー」 are, in decreasing order of similarity, 「ソフトのメーカー」, 「ソフト開発メーカー」, and 「ソフトの開発メーカー」. Compared with 「政治資金規正法案」, the strings 「政治資金規正法」, 「政治資金規正」, and 「政治資金」 feel similar in that order.

【００７５】また、文字が一致するとはいっても、「ソ
フトクリーム製造機械の製造を主業務とする機械メーカ
ー」を「ソフトメーカー」を似ている文字列と言うには
無理があると感じられる。Even though its characters match, it seems unreasonable to call 「ソフトクリーム製造機械の製造を主業務とする機械メーカー」 a character string similar to 「ソフトメーカー」.

【００７６】これらのことから、文字列を似ていると感
じるかどうかの人間の感覚をまとめると、From these facts, the human sense of whether or not character strings are similar can be summarized as follows.

【００７７】(A) 連続して一致する文字が多いほど似て
いると感じ、(B) 途中にはさまる不一致文字が多いほど
似ていないと感じ、(C) 途中にはさまる不一致文字が多
すぎると一つの文字列とは感じられないということがい
える。(A) the more characters match consecutively, the more similar the strings feel; (B) the more non-matching characters are interposed, the less similar they feel; and (C) if too many non-matching characters are interposed, the result is no longer perceived as a single character string.

【００７８】このとき、入力文字列のある部分が文書中
の近い位置で重複して出現する特殊な場合を考慮しなく
てはならない。例をあげると、入力文字列が「理学部長
に就任」、文書中に「理学部部長に就任」とあった場合
である。重複して出現している「部」という文字の一方
は余分な文字であるが、「理学部の長に就任」の「の」
のようにまったく無関係な文字よりは、一致文字に近い
文字と考えるのが妥当である。Here, a special case must be taken into account in which a part of the input character string appears repeatedly at nearby positions in the document. An example is the case where the input character string is 「理学部長に就任」 and the document contains 「理学部部長に就任」. One of the two occurrences of the character 「部」 is an extra character, but it is reasonable to regard it as closer to a matching character than a completely unrelated character such as the 「の」 in 「理学部の長に就任」.

【００７９】Ｅ２．索引ファイルの構造との整合性図３で示す索引ファイルの構造は、Ｎ文字の文字パター
ンを文書番号・文書内位置番号と索引づけしたものであ
り、検索処理の最小単位は、ひとつの文字パターンを探
して、その文書番号・文書内位置番号を得る処理であ
る。Ｎ文字未満の文字列を検索する場合には、検索した
い文字列で始まる文字パターンのすべてについて、その
個数分、最小単位の検索処理を繰り返す必要があり、そ
の個数はかなり多い場合がある。入力文字列がＮ文字以
上の検索は高々入力文字列の文字数の回数の最小単位検
索を実行すればよいのに比べて、Ｎ文字未満の入力文字
列の検索は負荷が大きいといえる。E2. Consistency with the Structure of the Index File. The index file shown in FIG. 3 indexes N-character patterns to document numbers and in-document position numbers, and the minimum unit of search processing is to look up one character pattern and obtain its document numbers and in-document position numbers. To search for a character string shorter than N characters, this minimum-unit search must be repeated for every character pattern that begins with the search string, and the number of such patterns can be quite large. A search for an input character string of N or more characters requires at most as many minimum-unit searches as there are characters in the input string, so by comparison a search for an input string shorter than N characters imposes a heavy load.

【００８０】したがって、Ｎ文字未満の部分一致は切り
捨てて、Ｎ文字以上連続して一致する部分を元に似てい
る文字列を決めることが、高速性を保つために適当と言
える。Therefore, to maintain high speed, it is appropriate to discard partial matches shorter than N characters and to determine similar character strings from the portions that match consecutively for N or more characters.

【００８１】Ｅ３．似ている文字列と似ている度合いの
決定ルール入力文字列とＭ文字以上連続で一致する文字列の中か
ら、互いに入力文字列中と同じ順序関係である程度近く
にあるものを集めて似ている文字列とし、一致する文字
数、一致しない文字数から、似ている度合いを決めるの
がルールの概要である。E3. Rules for Determining Similar Character Strings and the Degree of Similarity. In outline, the rule is as follows: from the character strings that match the input character string consecutively for M or more characters, those that lie reasonably close to one another, in the same order as in the input character string, are gathered to form a similar character string, and the degree of similarity is determined from the number of matching characters and the number of non-matching characters.

【００８２】まず、説明で使う言葉を定義する。First, words used in the description are defined.

【００８３】一致文字列：検索したい文字列と文書テキ
ストとがＭ文字以上連続して一致する部分。同じ文字か
ら始めた中では長さが最大になるものを選ぶ。Matching character string: a part where the search character string and the document text match consecutively for M or more characters. Among the matches that start at the same character, the longest one is chosen.

【００８４】（例）(Example)
Character string to be searched: 政治資金規正法案
Document text: ...資金規正のために法の力で...

【００８５】Ｍ＝２とする。すると、「資金規正」が
一致文字列。このとき、最長選択のため「資金」や「資
金規」は一致文字列とは呼ばない。また、「法」は２文
字未満なので一致文字列にはならない。Let M = 2. Then 「資金規正」 is a matching character string. Because the longest match is selected, 「資金」 and 「資金規」 are not called matching character strings. Likewise, 「法」 is shorter than two characters and therefore is not a matching character string.
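The definition of a matching character string can be checked against the example with a brute-force sketch such as the one below. It compares the two strings directly rather than using the index file, and it additionally skips runs that could be extended to the left, which is an assumption of this sketch rather than something the definition states. The name `matching_strings` and the tuple layout (s_D, e_D, s_C, e_C), with 1-based inclusive positions, are hypothetical.

```python
def matching_strings(query, text, m=2):
    """Return maximal common runs of at least m characters.

    Each run is (s_D, e_D, s_C, e_C): start and end in the document text,
    then start and end in the search string, 1-based and inclusive.
    """
    found = []
    for d in range(len(text)):
        for c in range(len(query)):
            if text[d] != query[c]:
                continue
            if d > 0 and c > 0 and text[d - 1] == query[c - 1]:
                continue                  # belongs to a run that starts earlier
            k = 0
            while (d + k < len(text) and c + k < len(query)
                   and text[d + k] == query[c + k]):
                k += 1
            if k >= m:
                found.append((d + 1, d + k, c + 1, c + k))
    return found
```

For the example above with M = 2, the only run reported is 「資金規正」, at document positions 1 to 4 and search string positions 3 to 6.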

【００８６】有効一致文字列：似ている文字列を構成す
る一致文字列。Valid match character string: A match character string that constitutes a similar character string.

【００８７】最大不一致文字列長Ｌ：似ている文字列中
に含める不一致文字は連続Ｌ文字までとする。Ｌは１以
上の定数とする。Maximum non-matching character string length L: at most L consecutive non-matching characters may be included in a similar character string. L is a constant of 1 or more.

【００８８】"似ている文字列"の選びだし方と、"似て
いる度合い"の数値化の方法について説明する。The method of selecting a "similar character string" and the method of quantifying the "degree of similarity" are described below.

【００８９】(1) １番目の有効一致文字列の決定文書中での順序で、１番目の一致文字列を、１番目の有
効一致文字列とする。ここで、(1) Determination of the first valid matching character string: the first matching character string, in document order, is taken as the first valid matching character string.

【００９０】ｉ番目の有効一致文字列の文書中での開始
位置を s(D, i) ｉ番目の有効一致文字列の文書中での終了位置を e(D,
i) ｉ番目の有効一致文字列の検索したい文字列中での開始
位置を s(C, i) ｉ番目の有効一致文字列の検索したい文字列中での終了
位置を e(C, i) と表記することにする。The following notation is used:
s(D, i): start position in the document of the i-th valid matching character string
e(D, i): end position in the document of the i-th valid matching character string
s(C, i): start position in the search character string of the i-th valid matching character string
e(C, i): end position in the search character string of the i-th valid matching character string

【００９１】(2) 次の有効一致文字列の決定ｉ番目の有効一致文字列が決定しているとき、ｉ＋１番
目の有効一致文字列を次のようにして決定する。(2) Determination of the Next Valid Matching Character String When the i-th valid matching character string has been determined, the (i + 1) th valid matching character string is determined as follows.

【００９２】次の２つの条件a),b)を満たす最初の一致
文字列を、ｉ＋１番目の有効一致文字列とする。The first matching character string that satisfies the following two conditions a) and b) is defined as the (i + 1) th valid matching character string.

【００９３】[0093]

【数１】 a) e(D, i) + 1 ≦ s(D, i+1) ≦ e(D, i) + L + 1

【００９４】これはｉ番目の有効一致文字列とｉ＋１番
目の有効一致文字列の間に入る余分な文字はＬ文字まで
許すことを意味する。（後述する例３参照）This means that up to L extra characters are allowed between the i-th valid matching character string and the (i+1)-th valid matching character string. (See Example 3 below.)

【数２】 b) s(C, i+1) ＞ e(C, i) - (M-1)

【００９５】そのような有効一致文字列が選べなくなる
まで繰り返す。Repeat until such a valid matching character string cannot be selected.

【００９６】(3) "似ている文字列"とその"似ている度
合い"（類似度）の決定それ以上有効一致文字列が選べなくなったら１番目の有
効一致文字列の最初の文字から最後の有効一致文字列の
最後の文字までを"似ている文字列"とし、次の式で"似
ている度合い"を計算する。(3) Determination of the "similar character string" and its "degree of similarity": when no further valid matching character string can be selected, the span from the first character of the first valid matching character string to the last character of the last valid matching character string is taken as the "similar character string", and the "degree of similarity" is calculated by the following expression.

【数３】類似度 =minimum ( 検索したい文字列中で有効
一致文字列に属している文字数/ 検索したい文字列の文
字数,"似ている文字列"中で有効一致文字列に属してい
る文字数/ "似ている文字列"の文字数)[Equation 3] Similarity = minimum( (number of characters in the search character string that belong to valid matching character strings) / (number of characters in the search character string), (number of characters in the "similar character string" that belong to valid matching character strings) / (number of characters in the "similar character string") )

【００９７】Ｅ４．．"似ている文字列"中で有効一致文字列に属している文
字数の数え方２つの文字が、検索したい文字列の同一文字と対応して
いる場合には１つ目は１と数え、２つ目は０．５と数え
る。その他の場合には１文字を１と数える。（後述する
例４を参照）E4. How to count the characters in a "similar character string" that belong to valid matching character strings: when two characters correspond to the same character of the search character string, the first is counted as 1 and the second as 0.5. In all other cases, each character is counted as 1. (See Example 4 below.)
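The similarity expression together with the counting rule of E4 can be sketched as follows. The function name `similarity` and the tuple layout (s_D, e_D, s_C, e_C) are hypothetical, positions are 1-based and inclusive, and counting every repeat after the first as 0.5 is an assumption, since the text only describes the case of two characters.

```python
def similarity(query_len, matches):
    """Degree of similarity for valid matching strings given in document order.

    Each match is (s_D, e_D, s_C, e_C), 1-based and inclusive.
    """
    covered_query = set()   # search string positions inside a valid match
    doc_score = 0.0         # weighted count of matched document characters
    for s_d, e_d, s_c, e_c in matches:
        for k in range(e_d - s_d + 1):
            c = s_c + k
            # A document character that maps to an already used search string
            # character counts as 0.5 (assumed for any repeat, not only the second).
            doc_score += 0.5 if c in covered_query else 1.0
            covered_query.add(c)
    similar_len = matches[-1][1] - matches[0][0] + 1
    return min(len(covered_query) / query_len, doc_score / similar_len)
```

Under these assumptions, the search string 「理学部長に就任」 against the document text 「理学部部長に就任」 gives min(7/7, 7.5/8) = 0.9375, because the second 「部」 is counted as 0.5.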

【００９８】Ｅ５．"似ている文字列"の決定順序１番目の"似ている文字列"は文書の先頭から比較を始め
て決定する。ｉ番目の"似ている文字列"が決定している
時、ｉ＋１番目の"似ている文字列"は、ｉ番目の"似て
いる文字列"の先頭の文字より後ろで、ｉ番目の"似てい
る文字列"を構成する有効一致文字列に属さない最初の
文字から比較を始めて決定する。E5. Order of determining "similar character strings": the first "similar character string" is determined by starting the comparison at the beginning of the document. Once the i-th "similar character string" has been determined, the (i+1)-th is determined by starting the comparison at the first character that follows the first character of the i-th "similar character string" and does not belong to any valid matching character string constituting the i-th "similar character string".

【００９９】定数Ｌ,Ｍを適当な値に設定することによ
り、文字の並びが似ているかどうかについて、人間の一
般的な判断とかなり一致した"似ている度合い"を算出す
ることができる。By setting the constants L and M to appropriate values, a "degree of similarity" can be calculated that agrees closely with ordinary human judgment as to whether character sequences are similar.

【０１００】なお、"似ている度合い"が最高値１になる
とき、文字列は完全に一致しており、また、文字列が完
全に一致していれば必ず"似ている度合い"は１になる。Ｅ６．曖昧検索の処理フローチャート以上の処理をフローチャートであらわすと、図６のよう
になる。図６では先ず、ステップ６０２で、検索文字列
の入力がプロンプトされる。また、ステップ６０４で
は、０〜１の類似度の入力がプロンプトされる。通常、
ステップ６０２とステップ６０４における文字列及び値
の入力は、単一のダイアログ・ボックスで、入力ボック
スとスクロール・バーを使用して行われる。When the "degree of similarity" takes its maximum value of 1, the character strings match completely; conversely, whenever the character strings match completely, the "degree of similarity" is 1. E6. Ambiguous Search Processing Flowchart. The above processing is represented by the flowchart of FIG. 6. In FIG. 6, the user is first prompted in step 602 to enter a search character string. In step 604, the user is prompted to enter a similarity from 0 to 1. Normally, the character string of step 602 and the value of step 604 are entered in a single dialog box, using an input box and a scroll bar.

【０１０１】ステップ６０６では、有効一致文字列の番
号ｉが１にセットされ、ステップ６０８では、有効一致
文字列の検索が行われる。今、有効一致文字列の長さが
Ｍ以上であるという条件があったとすると、図４の処理
で、Ｍ文字パターンの索引ファイルを作成しておけば有
利である。というのは、そのような索引ファイルが予め
存在すると、任意のＭ文字パターンの検索が、索引ファ
イルの２分探索によって高速に実行されるからである。
続いて、検索文字列での、Ｍ文字パターンをとる開始位
置を１つずらしＭ文字パターンの検索を索引ファイルで
行い、その結果得られた文書番号が一回前のＭ文字パタ
ーンの検索と同一であり、且つ、文書内位置番号が順次
的であれば、Ｍ＋１の長さの有効一致文字列が得られた
ことになる。そのようにして、文書番号が一回前のＭ文
字パターンの検索と同一であり、且つ、文書内位置番号
が順次的である、という条件が満たされる毎に、有効一
致文字列の長さも１つ増分される。しかし、索引ファイ
ルを使用したＭ文字パターンの探索で何も見つからない
か、返される文書番号が不一致か、文書内位置番号が順
次的でなくなれば、有効一致文字列の終了位置が見出さ
れたことになる。In step 606, the number i of the valid matching character string is set to 1, and in step 608 a valid matching character string is searched for. Given the condition that a valid matching character string has a length of M or more, it is advantageous to have created an index file of M-character patterns in the process of FIG. 4, because with such an index file any M-character pattern can be found quickly by a binary search of the index file. Next, the starting position of the M-character pattern taken from the search character string is shifted by one, and that M-character pattern is looked up in the index file. If the resulting document number is the same as for the preceding M-character pattern and the in-document position numbers are sequential, a valid matching character string of length M + 1 has been obtained. In this way, each time the conditions are satisfied that the document number is the same as for the preceding M-character pattern and that the in-document position numbers are sequential, the length of the valid matching character string is incremented by one. When the lookup of an M-character pattern in the index file finds nothing, returns a different document number, or returns in-document position numbers that are not sequential, the end position of the valid matching character string has been found.

【０１０２】場合によっては、全く有効一致文字列が見
出されないこともあり、そのような場合、ステップ６１
０での判断で、ステップ６２６に進み、そこで、見つか
らなかったことを表示して処理を終了する。In some cases, no valid matching character string is found at all. In that case, the determination in step 610 sends the process to step 626, where an indication that nothing was found is displayed, and the process ends.

【０１０３】ステップ６１０で、有効一致文字列が見つ
かったと判断されると処理は、ステップ６１２に進み、
文書中ではs(D,i)からe(D,i)、検索文字列中ではs(C,i)
からe(C,i)までが、有効文字列であるとしてマークされ
る。If it is determined in step 610 that a valid matching character string was found, the process proceeds to step 612, where the span from s(D,i) to e(D,i) in the document and the span from s(C,i) to e(C,i) in the search character string are marked as a valid character string.

【０１０４】ステップ６１４では、At step 614,

【数４】 a) e(D, i) + 1 ≦ s(D, i+1) ≦ e(D, i) + L + 1, and b) s(C, i+1) ＞ e(C, i) - (M-1)

【０１０５】という条件をみたす、i+1番目の有効一致
文字列がやはり索引ファイルを使用して検索され、もし
見つかると、ステップ６１２に戻って、そのi+1番目の
有効一致文字列に関して、文書中ではs(D,i+1)からe(D,
i+1)、検索文字列中ではs(C,i+1)からe(C,i+1)までが、
有効文字列であるとしてマークされる（ステップ６１８
でのiの増分は、次の有効一致文字列に注目することを
示す）。The (i+1)-th valid matching character string satisfying the above conditions is searched for, again using the index file. If it is found, the process returns to step 612, and for that (i+1)-th valid matching character string the span from s(D,i+1) to e(D,i+1) in the document and the span from s(C,i+1) to e(C,i+1) in the search character string are marked as a valid character string. (The increment of i in step 618 indicates that attention moves to the next valid matching character string.)
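The loop over steps 612 to 618 can be sketched as a greedy selection over candidate matching character strings. The candidates are assumed to be given as (s_D, e_D, s_C, e_C) tuples sorted by document start position, with 1-based inclusive positions; the name `chain_valid_matches` and the default values of M and L are hypothetical, and the index lookups that produce the candidates are not shown.

```python
def chain_valid_matches(matches, m=2, l=3):
    """Select valid matching strings from candidates sorted by document start."""
    if not matches:
        return []
    chain = [matches[0]]                  # the first match in document order
    while True:
        _, e_d, _, e_c = chain[-1]
        nxt = next(
            (x for x in matches
             if e_d + 1 <= x[0] <= e_d + l + 1    # condition a)
             and x[2] > e_c - (m - 1)),           # condition b)
            None,
        )
        if nxt is None:                   # no further valid match (step 616)
            return chain
        chain.append(nxt)
```

Condition a) forces each selected candidate to start after the previous one ends in the document, so the loop always terminates.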

【０１０６】 When no further valid matching character string is found at step 616, the similarity is calculated at step 620. As described above, this is given, for example, by

【０１０７】

【数５】 similarity = minimum( number of characters in the search character string that belong to valid matching character strings / number of characters in the search character string, number of characters in the "similar character string" that belong to valid matching character strings / number of characters in the "similar character string" )

【０１０８】 Here, the "similar character string" is the character string in the document that runs from the start position of the first valid matching character string to the end position of the last valid matching character string.

【０１０９】 At step 622, the results are filtered by comparing the similarity calculated at step 620 with the similarity input at step 604; only results whose similarity is equal to or higher than the value input at step 604 are displayed at step 624.

【０１１０】 At step 624, the document content stored in the database is accessed based on the document numbers and in-document position numbers returned by the index file searches at steps 608 and 614, and the lines containing the matching locations are displayed.

【０１１１】 Note that "similar character strings" for one search character string may be found in several documents at once, and may also be found at several places within a single document. Steps 606 to 622 are therefore applied to each of these "similar character strings", and at step 624 only those that satisfy the similarity condition are selected and displayed.

【０１１２】 E7. Examples of determining the "similar character string" and the similarity

The examples use M = 2 and L = 3.

【０１１３】 (Example 1) (* アイビーエム is a trademark of IBM Corporation.)

【０１１４】 [0114]

【０１１５】 Since the first longest matching character string is "アイ",

【０１１６】 the first valid matching character string is "アイ": s(C,1) = 1, e(C,1) = 2, s(D,1) = 1, e(D,1) = 2

【０１１７】 Since e(C,1)-(M-1) = 1, character strings starting at the second or a later character of the search character string are compared with character strings starting at the third, fourth, fifth, or sixth character of the document to find the second valid matching character string (because e(D,1)+1 = 3 and e(D,1)+L+1 = 6).

【０１１８】 The second valid matching character string is "ビー": s(C,2) = 3, e(C,2) = 4, s(D,2) = 4, e(D,2) = 5

【０１１９】 Since e(C,2)-(M-1) = 3, character strings starting at the fourth or a later character of the search character string are compared with character strings starting at the sixth, seventh, eighth, or ninth character of the document to find the third valid matching character string (because e(D,2)+1 = 6 and e(D,2)+L+1 = 9). The third valid matching character string is "エム": s(C,3) = 5, e(C,3) = 6, s(D,3) = 7, e(D,3) = 8

【０１２０】 Since the end of the search character string has been reached, the third valid matching character string is the last one.

【０１２１】[0121]

【表４】 [Table 4]

【０１２２】 The numbers are the numbers of the valid matching character strings.

【０１２３】 The "similar character string" is therefore "アイ・ビー・エム", from s(D, 1) to e(D, 3). "Similarity" = minimum( 6 / 6 , 6 / 8 ) = 6 / 8 = 0.75

【０１２４】 (Example 2)

【表５】 [Table 5]

【０１２５】[0125]

【表６】 [Table 6]

【０１２６】 Similar character string = "ソフト開発メーカー"; similarity = minimum( 7 / 10 , 7 / 9 ) = 0.7

【０１２７】 (Example 3)

【表７】 [Table 7]

【表８】
Position:   1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Document D: 在 宅 の ま ま で 起 訴 に ふ み き っ た 。

【０１２８】 Since the first longest matching character string is "在宅",

【数６】 the first valid matching character string is "在宅": s(C, 1) = 1, e(C, 1) = 2, s(D, 1) = 1, e(D, 1) = 2

【０１２９】 Character strings starting at the third, fourth, fifth, or sixth character of the document (because e(D,1)+1 = 3 and e(D,1)+L+1 = 6) are compared with character strings starting at the second or a later character of the search character string (because e(C,1)-(M-1) = 1) to find the second valid matching character string.

【０１３０】 No second valid matching character string is found and the end of the search character string has been reached, so the first valid matching character string is the only one.

【表９】
Search character string C: 在宅起訴 (valid matching character string 1: 在宅)
Document D: 在宅のままで起訴にふみきった。 (valid matching character string 1: 在宅)

【０１３１】 The first "similar character string" is therefore "在宅", from s(D, 1) to e(D, 1). "Similarity" = minimum( 2 / 4 , 2 / 2 ) = 2 / 4 = 0.5

【０１３２】 The first character after "在" that does not belong to a valid matching character string is "の". Searching for the second "similar character string" from "の" onward gives:

【表１０】
Search character string C: 在宅起訴
Document D: 在宅のままで起訴にふみきった。
(the valid matching character string found in this second search is numbered 1)

【０１３３】 Note, however, that "在宅" and "起訴" are four characters apart in the document and L = 3 in this example, so the "起訴" above is not regarded as a valid matching character string.

【０１３４】 (Example 4)
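Examples 1 and 3 above can be reproduced with a brute-force sketch that scans the strings directly instead of consulting the index file. Several things here are assumptions made for illustration: the function names, the choice to take the earliest document occurrence for the first match, and the scan order (search-string position first, then document position). The half weight that Table 12 gives to a repeated character in Example 4 is not modeled.

```python
def find_valid_matches(query, doc, m=2, l=3):
    """Greedy search; returns 1-based inclusive (s_c, e_c, s_d, e_d) tuples."""
    def run_length(qs, ds):
        # Length of the common run starting at query[qs] and doc[ds] (0-based).
        n = 0
        while qs + n < len(query) and ds + n < len(doc) and query[qs + n] == doc[ds + n]:
            n += 1
        return n

    matches = []
    q_from, d_range = 0, range(len(doc))
    while True:
        found = None
        for qs in range(q_from, len(query) - m + 1):
            for ds in d_range:
                n = run_length(qs, ds)
                if n >= m:
                    found = (qs + 1, qs + n, ds + 1, ds + n)
                    break
            if found:
                break
        if not found:
            return matches
        matches.append(found)
        _, e_c, _, e_d = found
        q_from = e_c - (m - 1)             # 0-based form of s(C,i+1) > e(C,i) - (M-1)
        d_range = range(e_d, e_d + l + 1)  # 0-based form of e(D,i)+1 .. e(D,i)+L+1

def similarity_of(query, matches):
    """Smaller of the two matched-character ratios (Equation 5 of the first embodiment)."""
    if not matches:
        return 0.0
    q_chars = set()
    d_matched = 0
    for s_c, e_c, s_d, e_d in matches:
        q_chars.update(range(s_c, e_c + 1))
        d_matched += e_d - s_d + 1
    similar_len = matches[-1][3] - matches[0][2] + 1
    return min(len(q_chars) / len(query), d_matched / similar_len)
```

For the search string アイビーエム against アイ・ビー・エム this yields the three valid matching character strings of Example 1 and a similarity of 0.75; for 在宅起訴 against 在宅のままで起訴にふみきった。 it yields only 在宅 and a similarity of 0.5.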

【表１１】 [Table 11]

【０１３５】 Since the first longest matching character string is "銀行",

【０１３６】

【数７】 the first valid matching character string is "銀行": s(C,1) = 1, e(C,1) = 2, s(D,1) = 2, e(D,1) = 3

【０１３７】 Character strings starting at the second or a later character of the search character string (because e(C,1)-(M-1) = 1) are compared with character strings starting at the fourth, fifth, sixth, or seventh character of the document (because e(D,1)+1 = 4 and e(D,1)+L+1 = 7) to find the second valid matching character string.

【０１３８】 The second valid matching character string is "行員": s(C,2) = 2, e(C,2) = 3, s(D,2) = 4, e(D,2) = 5

【０１３９】 Since the end of the search character string has been reached, there are two valid matching character strings.

【表１２】 1, 1, 0.5, 1 -> 3.5

【０１４０】 The "similar character string" is "銀行行員", from s(D, 1) to e(D, 2). "Similarity" = minimum( 3 / 3 , 3.5 / 4 ) = 3.5 / 4 = 0.875

【０１４１】 E8. Examples of fuzziness close to human perception

【表１３】
Search character string: ソフトウェアメーカー
String in document              Similarity
ソフトウェアのメーカー          0.909
ソフトウェア開発メーカー        0.833
ソフトウェアの開発メーカー      0.769

【０１４２】 This example shows that the "similarity" falls as extra characters are inserted.

【表１４】
String in document              Similarity
ニットウェアメーカー            0.800
ソフトメーカー                  0.700
ソフトウェア                    0.600

【０１４３】 This example shows that the "similarity" falls as the number of matching characters decreases.

【表１５】
Search character string: 理学部長選挙
String in document              Similarity
理学部長選挙                    1.000
理学部部長選挙                  0.929
理学の部長選挙                  0.857

【０１４４】 E9. Relationship between the index structure and the search for "similar character strings"

By choosing the value of M appropriately, a fuzzy search for "similar character strings" can be carried out at considerably high speed with the index structure of the present invention.

【０１４５】 How to choose the constants N and M

【表１６】
N: number of characters in each character pattern stored in the index
M: minimum length of a valid matching character string in a fuzzy search
L: maximum length of a run of characters within a "similar character string" that belong to no valid matching character string

【０１４６】 Increasing N increases the number of distinct character patterns and reduces the amount of data per character pattern, so searches become faster, but the size of the index file grows. For typical Japanese documents, N = 2 gives sufficient search speed.
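The effect of N on the number of distinct patterns and on the data held per pattern can be observed with a toy measurement. This is only a sketch: the in-memory dictionary is a stand-in for the index file, and the function name `ngram_index_stats` is hypothetical.

```python
from collections import defaultdict

def ngram_index_stats(text, n):
    """Return (number of distinct n-character patterns, average positions stored per pattern)."""
    index = defaultdict(list)
    for pos in range(len(text) - n + 1):
        index[text[pos:pos + n]].append(pos)
    total_positions = sum(len(positions) for positions in index.values())
    return len(index), total_positions / len(index)
```

On the string `"abracadabra"`, moving from N = 1 to N = 2 raises the number of distinct patterns from 5 to 7 while the average number of positions per pattern drops, which is the trade-off the paragraph above describes.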

【０１４７】 If M is chosen so that M ≥ N, sufficient search speed is obtained in fuzzy searches. Since a smaller M allows a finer-grained fuzzy search, choosing M = N appears to be a good choice.

【０１４８】 E10. Second embodiment of the similarity determination

【０１４９】 The fuzzy search process of the second embodiment pays particular attention to the balance between two observations: "the more mismatched characters lie in between, the less similar the strings feel" and "if too many mismatched characters lie in between, the text no longer feels like a single character string". When a document contains, in order, a character string that matches the input character string, a non-matching character string, and another matching character string, it is unnatural for the degree of similarity to drop because the second matching character string was taken into the similar character string. For example, suppose the input character string is 「在宅起訴」, document 1 contains 「在宅のまま起訴」, and document 2 contains 「在宅」. A rule under which "the similar character string is 「在宅のまま起訴」 for document 1 and 「在宅」 for document 2, and 「在宅」 has the higher degree of similarity" runs against human intuition. It is unnatural for document 1 to be rated lower precisely because it contains the additional matching character string 「起訴」. The natural outcome is either that 「在宅のまま起訴」 scores higher than 「在宅」, or that document 1 is judged to contain two similar character strings, 「在宅」 and 「起訴」.

【０１５０】 Next, the processing of the second embodiment is described. Referring to the flowchart of FIG. 6, steps 602 to 612 are the same in this embodiment, and the processing of step 614, which gives the conditions for searching for the (i+1)-th valid matching character string, is changed as follows.

【０１５１】[0151]

【数８】
s(C, i+1) > e(C, i) - (M - 1) ... (Equation 1)
s(D, i+1) > e(D, i) ... (Equation 2)
and
s(D, i+1) - e(D, i) - 1 + max(e(C, i) - s(C, i+1) + 1, 0) ≤ L ... (Equation 3)

【０１５２】 Here s(C, i), e(C, i), s(D, i), e(D, i), and so on are as defined above.

【０１５３】 Equation 1 means that up to M-1 repeated characters, such as the 「部」 in the 「理学部部長」 mentioned above, are allowed, and that otherwise only character strings appearing in the same order as the characters of the input character string are treated as valid.

【０１５４】 Equation 2 means that valid matching character strings do not overlap one another in the document.

【０１５５】 Equation 3 means that the mismatched characters lying in between and repeated characters such as the 「部」 in 「理学部部長」 are allowed up to L characters in total.
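Equations 1 to 3 can be checked together with a short predicate. This is a sketch only: positions are 1-based and inclusive as in the text, and the function name and the `(s_c, e_c, s_d, e_d)` tuple layout are hypothetical.

```python
def is_next_valid_match_v2(prev, cand, m, l):
    """Second-embodiment test for the (i+1)-th valid match, given the i-th one."""
    _, e_c, _, e_d = prev
    s_c, _, s_d, _ = cand
    eq1 = s_c > e_c - (m - 1)        # Equation 1: at most M-1 repeated characters
    eq2 = s_d > e_d                  # Equation 2: no overlap in the document
    gap = s_d - e_d - 1              # mismatched characters between the two matches
    repeated = max(e_c - s_c + 1, 0) # search-string characters matched a second time
    eq3 = gap + repeated <= l        # Equation 3: gap plus repeats within L
    return eq1 and eq2 and eq3
```

With M = 2 and L = 3, the 「部長に就任」 match following 「理学部」 passes (no gap, one repeated character), whereas 「起訴」 four characters after 「在宅」 fails Equation 3.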

【０１５６】 Unlike the previous embodiment, which calculates the proportion occupied by valid matching character strings in the search character string and in the similar character string in the document and takes the smaller of the two as the similarity, this embodiment assigns a score to the similar character string and divides it by the full score (the score for a perfect match) to obtain a ratio. The score of a similar character string is calculated by scoring each character according to the following rules and summing the scores. The processing at step 620 of FIG. 6 is therefore as follows.

【０１５７】
Character belonging to the first valid matching character string: 1 point
Character belonging to the i-th (i > 1) valid matching character string:
  position in the search character string ≥ e(C,i-1)+1 (Equation 4): 1 point
  position in the search character string ≤ e(C,i-1) (Equation 5): -1/(2*L) points
Character not belonging to any valid matching character string: -1/L points

【０１５８】 In this embodiment too, once the i-th similar character string has been determined, the (i+1)-th similar character string is determined by starting the comparison from the first character that comes after the first character of the i-th similar character string and does not belong to any valid matching character string making up the i-th similar character string.

【０１５９】 The negative score for characters that do not belong to a valid matching character string is set with the balance between "the more mismatched characters lie in between, the less similar the strings feel" and "if too many mismatched characters lie in between, the text no longer feels like a single character string" in mind. The largest total penalty for one run of non-matching characters is (1 / L) * L = 1, while the smallest gain from taking in the next matching character string is N ≥ 1 (2 is recommended, particularly for Japanese), so the penalty never exceeds the gain. Equation 5 identifies repeated characters such as the 「部」 in the 「理学部部長」 mentioned above, and Equation 4 identifies simple matching characters that are not repeated. Characters covered by Equation 5 receive a smaller penalty than simple non-matching characters, which handles the case where a character appears twice.
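The scoring rules of the second embodiment can be sketched as below. The sketch assumes that the full score equals the number of characters in the input character string (one point per character), which is what the worked examples divide by; the function name and the `(s_c, e_c, s_d, e_d)` tuple layout (1-based, inclusive) are hypothetical.

```python
def similarity_v2(query_len, matches, l):
    """Score a similar character string from its valid matches, listed in document order."""
    score = 0.0
    prev_e_c = prev_e_d = None
    for i, (s_c, e_c, s_d, e_d) in enumerate(matches):
        if i > 0:
            # Characters between two valid matches belong to no valid match: -1/L each.
            score -= (s_d - prev_e_d - 1) / l
        for pos_c in range(s_c, e_c + 1):
            if i > 0 and pos_c <= prev_e_c:
                score -= 1 / (2 * l)  # Equation 5: repeated character
            else:
                score += 1            # first valid match, or Equation 4
        prev_e_c, prev_e_d = e_c, e_d
    return score / query_len
```

For アイビーエム against アイ・ビー・エム with L = 3 this gives (6 - 2/3) / 6, about 0.889, which the text reports as 0.88; for 理学部長に就任 against 理学部部長に就任 it gives (7 - 1/6) / 7, about 0.976, reported as 0.97.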

【０１６０】 E11. Examples of determining the similar character string and the degree of similarity in the second embodiment

Again, the examples use N = 2 and L = 3.

【表１７】 [Table 17]

【０１６１】 Since the first matching character string is 「アイ」, the first valid matching character string is 「アイ」:

【数９】 s(C,1) = 1, e(C,1) = 2, s(D,1) = 1, e(D,1) = 2

【０１６２】 By Equations 1, 2, and 3, the second valid matching character string is 「ビー」:

【数１０】 s(C,2) = 3, e(C,2) = 4, s(D,2) = 4, e(D,2) = 5

【０１６３】 By Equations 1, 2, and 3, the third valid matching character string is 「エム」:

【数１１】 s(C,3) = 5, e(C,3) = 6, s(D,3) = 7, e(D,3) = 8

【０１６４】 Since the end of the input character string has been reached, there are three valid matching character strings.

【表１８】 [Table 18]

【０１６５】 The similar character string is 「アイ・ビー・エム」, from s(D, 1) to e(D, 3). Degree of similarity = ( ( 1 * 6 + (-1/3) * 2 ) / 6 ) = 0.88

【０１６６】[0166]

【表１９】 (Example 6)
Position:                 1 2 3 4 5 6 7 8 9 10
Input character string C: ソ フ ト ウ ェ ア メ ー カ ー
Position:                 1 2 3 4 5 6 7 8 9
Part of document D:       ・・・ソフト開発メーカー・・・

【０１６７】 Similar character string = "ソフト開発メーカー"; degree of similarity = ( ( 1 * 7 + (-1/3) * 2 ) / 10 ) = 0.63

【０１６８】[0168]

【表２０】 (Example 7)
Position:                 1 2 3 4
Input character string C: 在 宅 起 訴
Position:                 1 2 3 4 5 6 7 8 9 10 11 12 13 14
Part of document D:       ・・・在宅のままで起訴にふみきった。・・・

【０１６９】 Since the first matching character string is 「在宅」, the first valid matching character string is 「在宅」. The next matching character string, 「起訴」, does not satisfy Equation 3, so the first is the only valid matching character string.

【表２１】
C: 在宅起訴 (valid matching character string 1: 在宅)
D: 在宅のままで起訴にふみきった。 (valid matching character string 1: 在宅)

【０１７０】 The similar character string is 「在宅」. Degree of similarity = 2 / 4 = 0.5

【０１７１】 The first character after 「在」 that does not belong to a valid matching character string is 「の」. The second similar character string is searched for from 「の」 onward.

【０１７２】
C: 在宅起訴 (valid matching character string 1: 起訴)
D: 在宅のままで起訴にふみきった。 (valid matching character string 1: 起訴)

【０１７３】 Accordingly, the second similar character string is 「起訴」.

【０１７４】[0174]

【表２２】 [Table 22]

【０１７５】 There are two valid matching character strings: 「理学部」 and 「部長に就任」.

【０１７６】[0176]

【表２３】 [Table 23]

【０１７７】 The similar character string is 「理学部部長に就任」. The second 「部」 satisfies Equation 5. The degree of similarity is therefore ( ( 1 * 7 + (-1/6) * 1 ) / 7 ) = 0.97.

【０１７８】 E12. Summary of the results of the second embodiment

【表２４】
Input character string    String in document      Similarity
=====================================================
ソフトメーカー            ソフトのメーカー        0.95
                          ソフトの開発メーカー    0.85
政治資金規正法案          政治資金規正法          0.87
                          政治資金                0.50
理学部長に就任            理学部部長に就任        0.97
                          理学部の長に就任        0.95

【０１７９】[0179]

[Effects of the Invention] As described above, the present invention makes it possible to carry out, at high speed and using its own index structure, a fuzzy search over document files or a database that feels natural to human perception.

[Brief description of the drawings]

【図１】 FIG. 1 is a block diagram showing the hardware configuration.

【図２】 FIG. 2 is a block diagram of the processing elements.

【図３】 FIG. 3 is a diagram showing the structure of the index file.

【図４】 FIG. 4 is a flowchart showing the index file creation process.

【図５】 FIG. 5 is a flowchart of the character string search process using the index file.

【図６】 FIG. 6 is a flowchart of the fuzzy search process using the index file.

Continuation of front page

(56) References: JP-A-4-215181 (JP, A); JP-A-4-293161 (JP, A); JP-A-2-279973 (JP, A); Information Processing Society of Japan research report (92-FI-25), pp. 1-8 and pp. 9-16

Claims

(57) [Claims]

1. A method for determining, by computer processing, the similarity between a search character string and a character string in a document stored so as to be searchable, the method comprising: (a) inputting the search character string; (b) finding, from the beginning of the search character string, a start position and an end position in the document that match a partial character string of M or more characters (M is a predetermined integer of 2 or more) (hereinafter, a partial character string of M or more characters determined by such a start position and end position is referred to as a valid matching character string); (c) in response to no valid matching character string being found in step (b), finding the valid matching character string for each partial character string of M or more characters obtained by shifting the start position of the partial character string in the search character string; (d) in response to the valid matching character string being found, shifting the start position of the partial character string in the search character string and the start position of the search in the document each by the length of the immediately preceding valid matching character string, and finding a valid matching character string whose start position is within L characters (L is a predetermined integer of 1 or more) of the end position of the immediately preceding valid matching character string; (e) repeating step (d) as long as the valid matching character string is found; and (f) calculating the similarity between the search character string and the character string from the start position of the first valid matching character string in the document to the end position of the last valid matching character string in the document, based at least on information about the presence of valid matching character strings within that character string.

2. The information search method according to claim 1, wherein the M is 2 and the L is 3 or more.

3. The calculation of the similarity is based on the ratio of the valid matching character string in the search character string and the start position of the first valid matching character string in the document, to the end of the last valid matching character string in the document. The information search method according to claim 1, wherein the method is performed using a smaller value of the ratios of the valid matching character strings in the character strings up to the position.

4. The information search method according to claim 1, wherein the similarity is calculated by, for each character of the character string from the start position of the first valid matching character string in the document to the end position of the last valid matching character string in the document, adding points if the character belongs to a valid matching character string and deducting points otherwise, and dividing the resulting score by the score for a perfect match.

5. An information search method for finding, by computer processing, locations where a search character string appears in a document stored so as to be searchable, comprising: (a) inputting the search character string; (b) inputting a degree of similarity; (c) finding, from the beginning of the search character string, a start position and an end position in the document that match a partial character string of M or more characters (M is a predetermined integer of 2 or more) (hereinafter, a partial character string of M or more characters determined by such a start position and end position is referred to as a valid matching character string); (d) in response to no valid matching character string being found in step (c), finding the valid matching character string for a partial character string of M or more characters obtained by shifting the start position of the partial character string in the search character string by one; (e) in response to the valid matching character string being found, shifting the start position of the partial character string in the search character string and the start position of the search in the document each by the length of the immediately preceding valid matching character string, and finding a valid matching character string whose start position is within L characters (L is an integer of 1 or more) of the end position of the immediately preceding valid matching character string; (f) repeating step (e) as long as the valid matching character string is found; (g) calculating the similarity between the search character string and the character string from the start position of the first valid matching character string in the document to the end position of the last valid matching character string in the document, based at least on information about the presence of valid matching character strings within that character string; and (h) displaying the content of the document including the valid matching character string in response to the calculated similarity being equal to or greater than the similarity input in step (b).

6. The information retrieval method according to claim 5, wherein the M is 2 and the L is 3 or more.

7. The calculation of the similarity is based on the ratio of the valid matching character string in the search character string and the start position of the first valid matching character string in the document, and the end of the last valid matching character string in the document. The information search method according to claim 5, wherein the method is performed by using a smaller value of the ratios of the valid matching character strings in the character strings up to the position.

8. The information search method according to claim 5, wherein the similarity is calculated by, for each character of the character string from the start position of the first valid matching character string in the document to the end position of the last valid matching character string in the document, adding points if the character belongs to a valid matching character string and deducting points otherwise, and dividing the resulting score by the score for a perfect match.

9. The information search method according to claim 8, wherein the added point is 1 for one character, and the deducted point is 1 / L for one character.

10. A method for determining, by computer processing, the similarity between a search character string and a character string in a document in a database including a plurality of documents stored so as to be searchable, comprising: (a) inputting the search character string; (b) finding, from the beginning of the search character string, a start position and an end position in the same document in the database that match a partial character string of M or more characters (M is a predetermined integer of 2 or more) (hereinafter, a partial character string of M or more characters determined by such a start position and end position is referred to as a valid matching character string); (c) in response to no valid matching character string being found in step (b), finding the valid matching character string for a partial character string of M or more characters obtained by shifting the start position of the partial character string in the search character string by one; (d) in response to the valid matching character string being found, shifting the start position of the partial character string in the search character string and the start position of the search in the same document each by the length of the immediately preceding valid matching character string, and finding a valid matching character string whose start position is within L characters (L is an integer of 1 or more) of the end position of the immediately preceding valid matching character string; (e) repeating step (d) as long as the valid matching character string is found; and (f) calculating the similarity between the search character string and the character string from the start position of the first valid matching character string in the same document to the end position of the last valid matching character string in the same document, based at least on information about the presence of valid matching character strings within that character string.

11. The information search method according to claim 10, wherein M is 2 and L is 3.

12. The calculation of the degree of similarity is based on the ratio of the valid matching character string in the search character string and the start position of the first valid matching character string in the document to the end of the last valid matching character string in the document. The information search method according to claim 10, which is performed using a smaller value of the ratios of the valid matching character strings in the character strings up to the position.

13. The information retrieval method according to claim 10, wherein the similarity is calculated by, for each character of the character string in the document from the start position of the first valid matching character string to the end position of the last valid matching character string, adding points if the character belongs to a valid matching character string and deducting points otherwise, and dividing the resulting score by the score of a perfect match.

14. An information retrieval method for finding a place where a search character string appears in a database including a plurality of documents stored so as to be searchable by computer processing, comprising: (a) inputting a search character string; (b) inputting a similarity; (c) from the beginning of the search character string, finding, for a partial character string of M characters or more (M is a predetermined integer of 2 or more), a matching start position and end position in the same document in the database (hereinafter, a partial character string of M characters or more determined by that start position and end position is referred to as a valid matching character string); (d) in response to no valid matching character string being found in step (c), finding a valid matching character string for a partial character string of M characters or more whose start position in the search character string is shifted by one; (e) in response to a valid matching character string being found, shifting the start position of the partial character string in the search character string and the start position of the search in the same document each by the length of the immediately preceding valid matching character string, and finding a valid matching character string whose start position is within L characters (L is a predetermined integer of 1 or more) of the end position of the immediately preceding valid matching character string; (f) repeating step (e) as long as a valid matching character string is found; (g) calculating a similarity between the search character string and the character string in the same document from the start position of the first valid matching character string to the end position of the last valid matching character string, based at least on information on the presence of valid matching character strings within that character string; and (h) in response to the calculated similarity being equal to or greater than the similarity input in step (b), displaying content of the document including the valid matching character strings.

15. The information retrieval method according to claim 14, further comprising the step of previously assigning a unique number or symbol to the plurality of documents.

16. The information retrieval method according to claim 15, wherein the unique number or symbol is a sequential number.

17. The information retrieval method according to claim 14, wherein M is 2 and L is 3.

18. The information retrieval method according to claim 14, wherein the similarity is calculated using the smaller of (i) the proportion of the search character string occupied by valid matching character strings and (ii) the proportion occupied by valid matching character strings in the character string in the document from the start position of the first valid matching character string to the end position of the last valid matching character string.

19. The information retrieval method according to claim 14, wherein the similarity is calculated by, for each character of the character string in the document from the start position of the first valid matching character string to the end position of the last valid matching character string, adding points if the character belongs to a valid matching character string and deducting points otherwise, and dividing the resulting score by the score of a perfect match.

20. The information retrieval method according to claim 19, wherein the added point is 1 for one character, and the deducted point is 1 / L for one character.
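
The point-based calculation with these values (1 point added per matched character, 1/L deducted per unmatched character) might be sketched as follows. The claims do not spell out the "score of a perfect match"; the sketch assumes it equals the length of the search character string, and the function name and `(query_start, doc_start, length)` tuple layout are likewise hypothetical.

```python
from typing import List, Tuple


def score_similarity(query_len: int, matches: List[Tuple[int, int, int]], L: int = 3) -> float:
    """Add 1 per matched character, deduct 1/L per unmatched one, then normalize."""
    if not matches or query_len == 0:
        return 0.0
    covered = set()
    for _q, d, n in matches:
        covered.update(range(d, d + n))   # document positions inside valid matches
    first, last = min(covered), max(covered)
    score = 0.0
    for pos in range(first, last + 1):
        score += 1.0 if pos in covered else -1.0 / L
    # Assumption: a perfect match scores one point per search-string character.
    return score / query_len


if __name__ == "__main__":
    print(score_similarity(6, [(0, 0, 3), (3, 5, 3)], L=3))
```

Because a gap between consecutive valid matches is limited to L characters, a deduction of 1/L per character keeps the penalty for any single gap at no more than one point.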

21. An information retrieval system for finding a position where a search character string appears in a document stored so as to be searchable by computer processing, comprising: (a) means for inputting a search character string; (b) means for finding, for a partial character string of the search character string having a length of M characters or more (M is a predetermined integer of 2 or more), a matching start position and end position in the document (hereinafter, a partial character string of M characters or more determined by that start position and end position is referred to as a valid matching character string); (c) means for applying means (b) from the beginning of the search character string and, in response to a valid matching character string being found, shifting the start position of the partial character string in the search character string and the start position of the search in the document each by the length of the immediately preceding valid matching character string and finding a valid matching character string whose start position is within L characters (L is a predetermined integer of 1 or more) of the end position of the immediately preceding valid matching character string, and, in response to no valid matching character string being found, finding a valid matching character string for a partial character string of M characters or more whose start position in the search character string is shifted by one; (d) means for repeatedly applying means (c) as long as a valid matching character string is found; and (e) means for calculating a similarity between the search character string and the character string in the document from the start position of the first valid matching character string to the end position of the last valid matching character string, based at least on information on the presence of valid matching character strings within that character string.

22. An information retrieval system for finding a position where a search character string appears in a document stored so as to be searchable by computer processing, comprising: (a) means for inputting a search character string; (b) means for inputting a similarity; (c) means for finding, for a partial character string of the search character string having a length of M characters or more (M is a predetermined integer of 2 or more), a matching start position and end position in the document (hereinafter, a partial character string of M characters or more determined by that start position and end position is referred to as a valid matching character string); (d) means for applying means (c) from the beginning of the search character string and, in response to a valid matching character string being found, shifting the start position of the partial character string in the search character string and the start position of the search in the document each by the length of the immediately preceding valid matching character string and finding a valid matching character string whose start position is within L characters (L is an integer of 1 or more) of the end position of the immediately preceding valid matching character string, and, in response to no valid matching character string being found, finding a valid matching character string for a partial character string of M characters or more whose start position in the search character string is shifted by one; (e) means for repeatedly applying means (d) as long as a valid matching character string is found; (f) means for calculating a similarity between the search character string and the character string in the document from the start position of the first valid matching character string to the end position of the last valid matching character string, based at least on information on the presence of valid matching character strings within that character string; and (g) means for displaying, in response to the calculated similarity being equal to or greater than the similarity input via means (b), content of the document including the valid matching character strings.

23. The information retrieval system according to claim 22, wherein M is 2 and L is 3.

24. The information retrieval system according to claim 22, wherein the similarity is calculated using the smaller of (i) the proportion of the search character string occupied by valid matching character strings and (ii) the proportion occupied by valid matching character strings in the character string in the document from the start position of the first valid matching character string to the end position of the last valid matching character string.

25. The information retrieval system according to claim 22, wherein the similarity is calculated by, for each character of the character string in the document from the start position of the first valid matching character string to the end position of the last valid matching character string, adding points if the character belongs to a valid matching character string and deducting points otherwise, and dividing the resulting score by the score of a perfect match.

26. The information retrieval system according to claim 25, wherein the added point is 1 for one character, and the deduction point is 1 / L for one character.

27. An information retrieval system for finding a position where a search character string appears in a database including a plurality of documents stored so as to be searchable by computer processing, comprising: (a) means for inputting a search character string; (b) means for inputting a similarity; (c) means for finding, for a partial character string of the search character string having a length of M characters or more (M is an integer of 2 or more), a matching start position and end position in the same document (hereinafter, a partial character string of M characters or more determined by that start position and end position is referred to as a valid matching character string); (d) means for applying means (c) from the beginning of the search character string and, in response to a valid matching character string being found, shifting the start position of the partial character string in the search character string and the start position of the search in the same document each by the length of the immediately preceding valid matching character string and finding a valid matching character string whose start position is within L characters (L is a predetermined integer of 1 or more) of the end position of the immediately preceding valid matching character string, and, in response to no valid matching character string being found, finding a valid matching character string for a partial character string of M characters or more whose start position in the search character string is shifted by one; (e) means for repeatedly applying means (d) as long as a valid matching character string is found; (f) means for calculating a similarity between the search character string and the character string in the same document from the start position of the first valid matching character string to the end position of the last valid matching character string, based at least on information on the presence of valid matching character strings within that character string; and (g) means for displaying, in response to the calculated similarity being equal to or greater than the similarity input via means (b), content of the document including the valid matching character strings.

28. The information retrieval system according to claim 27, wherein M is 2 and L is 3.

29. The information retrieval system according to claim 27, wherein the similarity is calculated using the smaller of (i) the proportion of the search character string occupied by valid matching character strings and (ii) the proportion occupied by valid matching character strings in the character string in the document from the start position of the first valid matching character string to the end position of the last valid matching character string.

30. The information retrieval system according to claim 27, wherein the similarity is calculated by, for each character of the character string in the document from the start position of the first valid matching character string to the end position of the last valid matching character string, adding points if the character belongs to a valid matching character string and deducting points otherwise, and dividing the resulting score by the score of a perfect match.

31. The information retrieval system according to claim 30, wherein the added point is 1 for one character, and the deduction point is 1 / L for one character.

32. An information retrieval method for finding a place where a search character string appears in a database including a plurality of documents stored so as to be searchable by computer processing, comprising: (a) inputting a search character string; (b) inputting a similarity; (c) from the beginning of the search character string, finding, for a partial character string of M characters or more (M is a predetermined integer of 2 or more), a matching start position and end position in the same document in the database (hereinafter, a partial character string of M characters or more determined by that start position and end position is referred to as a valid matching character string); (d) where s(D, i) is the start position in the document of the i-th valid matching character string, e(D, i) is its end position in the document, s(C, i) is its start position in the search character string, and e(C, i) is its end position in the search character string, finding the (i+1)-th valid matching character string satisfying the conditions e(D, i) + 1 ≤ s(D, i+1) ≤ e(D, i) + L + 1 and s(C, i+1) > e(C, i) - (M - 1) (where L is an integer of 1 or more); (e) repeating step (d) as long as a valid matching character string is found; (f) calculating a similarity between the search character string and the character string in the same document from the start position of the first valid matching character string to the end position of the last valid matching character string, based at least on information on the presence of valid matching character strings within that character string; and (g) in response to the calculated similarity being equal to or greater than the similarity input in step (b), displaying content of the document including the valid matching character strings.
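
The two conditions of step (d) in claim 32 can be written as a small predicate. The sketch below is illustrative only: the `Match` record, its field names, and the function name are hypothetical, and positions are taken as inclusive indices.

```python
from typing import NamedTuple


class Match(NamedTuple):
    """One valid matching character string; all positions are inclusive."""
    s_c: int  # s(C, i): start in the search character string
    e_c: int  # e(C, i): end in the search character string
    s_d: int  # s(D, i): start in the document
    e_d: int  # e(D, i): end in the document


def may_follow_claim32(prev: Match, nxt: Match, M: int = 2, L: int = 3) -> bool:
    """Check whether nxt can be the (i+1)-th match after prev (the i-th)."""
    # e(D,i) + 1 <= s(D,i+1) <= e(D,i) + L + 1: at most L skipped characters.
    doc_ok = prev.e_d + 1 <= nxt.s_d <= prev.e_d + L + 1
    # s(C,i+1) > e(C,i) - (M-1): overlap in the search string is below M.
    query_ok = nxt.s_c > prev.e_c - (M - 1)
    return doc_ok and query_ok


if __name__ == "__main__":
    print(may_follow_claim32(Match(0, 7, 3, 10), Match(9, 20, 11, 22)))
```

The second condition lets the next match begin slightly before the end of the previous one in the search character string, so two consecutive matches may share up to M - 1 search-string characters.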

33. The information search method according to claim 32, wherein the M is 2 and the L is 3 or more.

34. The information retrieval method according to claim 32, wherein the similarity is calculated using the smaller of (i) the proportion of the search character string occupied by valid matching character strings and (ii) the proportion occupied by valid matching character strings in the character string in the document from the start position of the first valid matching character string to the end position of the last valid matching character string.

35. The information retrieval method according to claim 32, wherein, in calculating the proportion occupied by valid matching character strings in the character string in the document from the start position of the first valid matching character string to the end position of the last valid matching character string, characters that belong to a plurality of valid matching character strings in the search character string are weighted lower than other characters.

36. An information retrieval method for finding a place where a search character string appears in a database including a plurality of documents stored so as to be searchable by computer processing, comprising: (a) inputting a search character string; (b) inputting a similarity; (c) from the beginning of the search character string, finding, for a partial character string of M characters or more (M is a predetermined integer of 2 or more), a matching start position and end position in the same document in the database (hereinafter, a partial character string of M characters or more determined by that start position and end position is referred to as a valid matching character string); (d) where s(D, i) is the start position in the document of the i-th valid matching character string, e(D, i) is its end position in the document, s(C, i) is its start position in the search character string, and e(C, i) is its end position in the search character string, finding the (i+1)-th valid matching character string satisfying the conditions s(C, i+1) > e(C, i) - (M - 1), s(D, i+1) > e(D, i), and s(D, i+1) - e(D, i) - 1 + max(e(C, i) - s(C, i+1) + 1, 0) ≤ L (where L is a predetermined integer of 1 or more); (e) repeating step (d) as long as a valid matching character string is found; (f) calculating a similarity between the search character string and the character string in the same document from the start position of the first valid matching character string to the end position of the last valid matching character string, based at least on information on the presence of valid matching character strings within that character string; and (g) in response to the calculated similarity being equal to or greater than the similarity input in step (b), displaying content of the document including the valid matching character strings.
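
The three conditions of step (d) in claim 36 differ from a pure gap limit in that any overlap in the search character string is counted against the budget L together with the gap in the document. A minimal sketch, using a hypothetical `Match` record with inclusive positions and a hypothetical function name:

```python
from typing import NamedTuple


class Match(NamedTuple):
    """One valid matching character string; all positions are inclusive."""
    s_c: int  # s(C, i): start in the search character string
    e_c: int  # e(C, i): end in the search character string
    s_d: int  # s(D, i): start in the document
    e_d: int  # e(D, i): end in the document


def may_follow_claim36(prev: Match, nxt: Match, M: int = 2, L: int = 3) -> bool:
    """Check whether nxt can be the (i+1)-th match after prev (the i-th)."""
    if not nxt.s_c > prev.e_c - (M - 1):   # overlap in the search string below M
        return False
    if not nxt.s_d > prev.e_d:             # next match lies after prev in the document
        return False
    gap = nxt.s_d - prev.e_d - 1                # skipped document characters
    overlap = max(prev.e_c - nxt.s_c + 1, 0)    # shared search-string characters
    return gap + overlap <= L


if __name__ == "__main__":
    print(may_follow_claim36(Match(0, 7, 3, 10), Match(7, 12, 12, 17)))
```

In the example call, one skipped document character plus one overlapping search-string character gives a total of 2, which is within L = 3.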

37. The information search method according to claim 36, wherein the M is 2 and the L is 3 or more.

38. The information retrieval method according to claim 36, wherein the similarity is calculated using the smaller of (i) the proportion of the search character string occupied by valid matching character strings and (ii) the proportion occupied by valid matching character strings in the character string in the document from the start position of the first valid matching character string to the end position of the last valid matching character string.

39. The information retrieval method according to claim 36, wherein the similarity is calculated by, for each character of the character string in the document from the start position of the first valid matching character string to the end position of the last valid matching character string, adding points if the character belongs to a valid matching character string and deducting points otherwise, and dividing the resulting score by the score of a perfect match.

40. The information search method according to claim 39, wherein the added point is 1 for one character, and the deducted point is 1 / L for one character.

41. The information retrieval method according to claim 40, wherein 1/(2L) is deducted for each character that belongs to the i-th valid matching character string and whose corresponding character in the search character string also belongs to the (i-1)-th valid matching character string.