JPH0782504B2 - Information retrieval processing method and retrieval file creation device

Info

Publication number: JPH0782504B2
Application number: JP2338546A
Authority: JP
Inventors: 忠一菊池
Original assignee: 株式会社テレマティーク国際研究所
Priority date: 1990-11-30
Filing date: 1990-11-30
Publication date: 1995-09-06
Anticipated expiration: 2010-09-06
Also published as: JPH04205560A

Description

[Detailed Description of the Invention]

[Industrial Field of Application]

The present invention relates to an information retrieval processing method for performing information retrieval. It is particularly suited to full-text search: it greatly reduces the number of comparisons between the search input and the full text, and so retrieves information at high speed. The invention is suitable for information retrieval processing that performs full-text search in a database system.

[Overview]

The present invention concerns an information retrieval processing method that retrieves information by matching the character string to be searched against the character string of a search input. For each character of the string to be searched, character position information is generated, consisting of the identification code of the search unit to which the character belongs, a character position order code indicating the character's position within the search unit, and an attribute code indicating the logical division of the search unit. This information is grouped by character type to form a search file in advance. When a search input arrives, the character position information of the characters making up the search input is extracted from the search file and collated, and character strings that share a search unit identification code, have the same character order as the search input, and have the same attribute code are extracted from the search file. This allows full-text search to be performed at high speed.

[Conventional technology]

Two full-text search techniques have been in common use: the sequential search method, which matches the search input string against the full text from beginning to end and selects the documents that match the input string and search conditions specified by the searcher, and the index method, which extracts keywords from the full text in advance to create a search file. There is also the pre-search method, which tabulates the characters and character strings that appear in the full text and narrows down the documents in which the characters and character strings obtained by decomposing the search input string appear.

[Problems to be Solved by the Invention]

In the sequential search method, the full text is compared with the search input string from beginning to end, so searching documents that contain a large number of characters takes a long time. Dedicated processors and LSIs that perform high-speed string matching have therefore been proposed for searching large volumes of documents. With these approaches, however, the hardware is restricted, and transferring character strings between the computer that performs the search and the dedicated processor or LSI takes time, so achieving satisfactory speed for the system as a whole remains a problem.

The pre-search method, in turn, requires a parallel processing mechanism and dedicated string-matching hardware to achieve high speed, and improving the accuracy of the character strings extracted at registration time is a further problem.

The present inventor noted that, in Japanese, the same character or the same character string appears in a full text with low frequency, and found that search can be accelerated by creating a search file in which the characters of the search target are classified and grouped by character type, and by checking the continuity of character strings within the search file at search time.

In view of the above, an object of the present invention is to provide a versatile information retrieval processing method that speeds up full-text search over large volumes of documents using software alone, is not restricted to specific hardware, needs no transfer of character strings to or from a dedicated processor or LSI because the search is performed in main memory, and can search for arbitrary character strings because it works from characters and character positions.

[Means for Solving the Problems]

A first feature of the present invention is a search file creation device for an information retrieval method that extracts character strings matching a given search input string from a series of character strings made up of a plurality of search units, each of which is itself a character string and is the unit in which searching is performed, and each of which has an attribute determined by its logical division. The device comprises: search unit identification code assigning means for assigning a code in ascending order to each search unit as it appears; attribute code assigning means for assigning to each search unit the attribute code determined by its attribute; character position order code assigning means for assigning to each character of the string to be searched a character position order code indicating its position within the search unit; and means for creating character position information consisting of the search unit identification code, the character position order code, and the attribute code, and storing this character position information in an area for each character type to create the search file.

The character position information is preferably given as the number

character position information = {(search unit identification code × n) + character position order code} × a + attribute code

where n is the maximum number of characters in a search unit and a is the maximum number of attributes.

A second feature of the present invention comprises: the search file created according to the first feature; means for extracting from the search file the character position information of the characters that are the same as the constituent characters of the search input string; means for extracting, from the character position information of the extracted characters, the character position information whose search unit identification codes are common, whose character position order codes are in the same order as the character string of the search input, and whose attribute code equals that of the search input; and means for outputting as the search result, based on the extracted character position information, the search unit and character position to which a character string equal to the search input belongs.

The extraction of character position information matching the character string of the search input is preferably performed starting from the character of the search input with the lowest frequency of occurrence.

[Operation]

In Japanese text, a given character recurs less often than in languages such as English. This is especially true of kanji. For example, the explanatory text for the headwords of the Kojien dictionary runs to about 9 million characters, and the JIS level 1 kanji occur in it an average of 1155 times each. For the 2965 JIS level 1 kanji, therefore, when the search input is n characters long, the collation targets extracted from the full text amount to n × 1155 characters on average. Since a search input is generally no more than a few dozen characters, even a string of frequently occurring characters requires far fewer comparisons than sequentially matching every character.
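The arithmetic behind this estimate can be checked with a short sketch. The two constants are the figures quoted above; the function name `candidates_to_collate` is a hypothetical label introduced only for this illustration.

```python
# Figures quoted in the text; everything else here is illustrative.
FULL_TEXT_CHARS = 9_000_000   # approximate size of the Kojien explanatory text
AVG_OCCURRENCES = 1155        # average occurrences of one JIS level 1 kanji


def candidates_to_collate(query_length: int) -> int:
    """Average number of stored positions fetched for a query of this length."""
    return query_length * AVG_OCCURRENCES


for n in (2, 4, 20):
    c = candidates_to_collate(n)
    share = c / FULL_TEXT_CHARS
    print(f"n={n}: {c} positions, {share:.4%} of the full text")
```

Even a 20-character query touches well under one percent of the characters that a sequential scan would visit.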

Moreover, in Japanese text, and kanji strings in particular, the same character string recurs very rarely. For example, many terms use the two-character string 「通信」 (communication), but four-character strings of the form 「通信・・」, such as 「通信回線」 (communication line) and 「通信装置」 (communication device), coincide in all four characters far less often. As matching against the full text proceeds through the constituent characters of the search input string, candidate strings that differ from the search input are discarded from the candidates obtained so far, so the search targets are narrowed with each constituent character matched. In particular, matching the characters of the search input in ascending order of their frequency in the full text narrows the candidates further and reduces the number of comparisons.

Accordingly, by creating a search file in which character position information, indicating where in the string each character of the search target (the full text) is located, is grouped by character type, and matching the search input string against this search file, the number of matching operations in string search can be greatly reduced.

The search file is created as follows.

First, the character string to be searched is divided into search units. When the search target is, for example, a book or a paper, it is organized in the order: table of contents, preface, titles of chapters or sections, body text, titles of figures or tables, and references. Because these parts are logically distinct, each can serve as a search unit. The book or paper is therefore divided logically into search units, and identification codes are assigned to the search units in ascending order of appearance. The body text may be split into several search units, each given an identification code in the same series as the other search units. Since each search unit has a logical type, such as table of contents, preface, title, or body text, that type is treated as an attribute and an attribute code indicating it is assigned.

The character string is then decomposed into individual characters. For each character, character position information is generated, consisting of the search unit identification code, a character position order code indicating where the character is located within the search unit, and the attribute code of the search unit. This information is stored in an area provided for each character type, creating a search file grouped by the character types that make up the search target.

The resulting search file has a structure in which character position information is stored for each character type, and it is stored on a conventional storage medium.
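The file structure just described can be sketched in a few lines. This is a minimal in-memory illustration, not the device's actual layout: it assumes a Python dictionary keyed by character in place of the per-character-type areas, and the function name `build_search_file` and the sample units are hypothetical. The packing of the three codes into one integer uses the formula given earlier, with n = 10000 and a = 10 as assumed defaults.

```python
from collections import defaultdict


def build_search_file(units, n=10000, a=10):
    """Group character position information by character.

    units: list of (attribute code, text) in order of appearance.
    Search unit identification codes are assigned from 1 in ascending order.
    """
    search_file = defaultdict(list)
    for unit_no, (attr, text) in enumerate(units, start=1):
        for pos, ch in enumerate(text, start=1):
            # {(unit code x n) + position code} x a + attribute code
            search_file[ch].append((unit_no * n + pos) * a + attr)
    return search_file


# Hypothetical sample: a title (attribute 3) followed by body text (attribute 5).
units = [(3, "通信回線"), (5, "通信装置と通信文書")]
sf = build_search_file(units)
print(sf["通"])  # [100013, 200015, 200065]
```

Because units and characters are visited in order, each group ends up sorted in ascending numerical order without any explicit sort.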

In search processing, the character string of the search input is divided into its constituent characters, the character position information of the same characters is extracted from the search file, and the character position information whose search unit identification codes are common, whose character order equals that of the search input string, and whose attribute code is the same is collated and extracted.

This collation checks that the character strings of the search input and the search file agree in continuity and in attribute. It is performed by extracting, from the character position information in the search file, the character strings that share a search unit identification code, follow the same character position order as the search input, and have the same attribute code.
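The collation described above can be illustrated with a small sketch. It assumes the search file is a dictionary mapping each character to a list of integer codes of the form {(unit × n) + position} × a + attribute, with n = 10000 and a = 10; the function name `search` and the sample data are hypothetical, and the set-based filtering is one possible way to test continuity, not necessarily the one used in the embodiment.

```python
def search(search_file, query, attr, n=10000, a=10):
    """Return sorted (search unit, start position) pairs where query occurs
    inside a search unit whose attribute code equals attr."""
    if not query:
        return []

    def occurrences(ch):
        # Decode every stored code for ch and keep those with the right attribute.
        found = set()
        for code in search_file.get(ch, []):
            rest, code_attr = divmod(code, a)
            if code_attr == attr:
                found.add(divmod(rest, n))  # (unit, position)
        return found

    candidates = occurrences(query[0])
    for offset, ch in enumerate(query[1:], start=1):
        following = occurrences(ch)
        # Continuity: same unit, position advanced by exactly `offset`.
        candidates = {(u, p) for (u, p) in candidates if (u, p + offset) in following}
        if not candidates:
            break
    return sorted(candidates)


# Hypothetical search file for a title "通信回線" (unit 1, attribute 3)
# and body text "通信装置と通信文書" (unit 2, attribute 5).
sf = {
    "通": [100013, 200015, 200065],
    "信": [100023, 200025, 200075],
    "文": [200085],
    "書": [200095],
}
print(search(sf, "通信", 5))      # [(2, 1), (2, 6)]
print(search(sf, "通信文書", 5))  # [(2, 6)]
print(search(sf, "通信", 3))      # [(1, 1)]
```

Only the groups for the characters of the query are ever read, which is the source of the reduction in comparisons.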

As a result, the whole search file need not be collated; only the character position information of the characters that also occur in the search input has to be matched, so the number of comparisons is far lower than with sequential matching. In addition, because identical character strings are rare in Japanese documents, the search targets are narrowed with every character matched, and the number of comparisons keeps falling.

Furthermore, when the character position information extracted from the search file is collated in ascending order of the full-text frequency of the characters in the search input, the search targets are narrowed further and the number of comparisons is reduced still more.
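The effect of starting from the rarest character can be seen in a simplified sketch. It ignores search units and attributes and works on absolute positions in a single string, purely to count membership checks; the names `positions_by_char` and `match_rarest_first` are hypothetical.

```python
def positions_by_char(text):
    """Group the positions of every character in text (0-based)."""
    groups = {}
    for p, ch in enumerate(text):
        groups.setdefault(ch, set()).add(p)
    return groups


def match_rarest_first(groups, query):
    """Return (sorted start positions of query, number of membership checks)."""
    if not query:
        return [], 0
    # Visit query offsets in ascending order of how often their character occurs.
    order = sorted(range(len(query)), key=lambda i: len(groups.get(query[i], ())))
    first = order[0]
    # Each occurrence of the rarest character implies one candidate start.
    candidates = {p - first for p in groups.get(query[first], ())}
    checks = 0
    for i in order[1:]:
        positions = groups.get(query[i], set())
        checks += len(candidates)
        candidates = {s for s in candidates if s + i in positions}
    return sorted(candidates), checks


groups = positions_by_char("aaaaabaaaaab")
hits, checks = match_rarest_first(groups, "ab")
print(hits, checks)  # [4, 10] 2
```

In this toy input the character `b` occurs twice and `a` ten times, so starting from `b` leaves only two candidates to verify, whereas starting from `a` would begin with ten.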

When an identical character string is found in this way, the search unit to be retrieved is identified from its search unit identification code and output to the searcher as the search result.

[Embodiment]

An embodiment of the present invention is described below with reference to the drawings.

FIG. 1 shows the configuration of an information retrieval processing device according to an embodiment of the present invention.

The information retrieval processing device of this embodiment comprises: a CPU 1 that performs arithmetic and decision processing; a memory 2 that stores the programs for search processing, search file creation, and so on, the search file that has been created or is used for search processing, and the search input; an input/output unit 3 to which a keyboard 4 and a display 5 are connected; an external storage device control unit 6 to which an external storage device 7 storing various information is connected; and a common bus 8 connecting the CPU 1, the memory 2, the input/output unit 3, and the external storage device control unit 6.

Information retrieval processing in this embodiment is divided into two parts: search file creation processing, which creates a search file in which the characters of the search target are grouped by character type for use in searching, and search processing, which matches against the search file and extracts the character strings that match the search input.

The search file creation processing is described first.

Search file creation can be broadly divided into three steps: (1) securing the search file area, (2) assigning character position information to each constituent character, and (3) storing the character position information, grouped by character type, in the file. Each step is described below.

(1) Securing the search file area

The constituent characters of the full text are classified according to the JIS code table, and the frequency of occurrence is counted for each character type listed in the table. This gives the number of character position information entries registered in each character type group of the search file, so the area for the search file, which consists of all the character type groups, can be secured. At the same time, the number of entries registered in each character type group gives the start address of each character type group, since the groups are stored consecutively in the search file. The character column address table shown in FIG. 2 lists these start addresses in the order of the JIS code table.
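Counting occurrences and turning the counts into start addresses amounts to a running sum. The sketch below is an illustration under stated assumptions: Unicode code point order stands in for JIS code table order, addresses are entry indices rather than byte offsets, an extra final entry holding the total size is appended for convenience, and the function name `build_address_table` is hypothetical. An ASCII string is used only to keep the example easy to verify by eye; the method does not depend on the script.

```python
from collections import Counter
from itertools import accumulate


def build_address_table(text):
    """Return (chars, starts): the character types in code order and the start
    index of each type's group; starts[-1] is the total number of entries."""
    counts = Counter(text)
    chars = sorted(counts)  # stand-in for JIS code table order
    starts = [0] + list(accumulate(counts[c] for c in chars))
    return chars, starts


chars, starts = build_address_table("abracadabra")
print(chars)   # ['a', 'b', 'c', 'd', 'r']
print(starts)  # [0, 5, 7, 8, 9, 11]
```

The last value, 11, is the number of entries the search file must hold, which is the size of the area to be secured.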

(2) Assigning character position information to each constituent character

The character position information described here consists of a search unit number indicating the order in which the search unit containing the character appears, a character position number indicating where the character appears within the search unit, and an attribute number indicating the logical type of the search unit.

The search unit and its attribute are described first. A typical book, for example, consists of a table of contents, a preface, chapter or section titles, body text, figure or table titles, references, and so on, which appear roughly in this order. When searching the contents of such a book, it is convenient, and often matches the purpose of the search, to treat each of these parts as a search unit and to search and output results per search unit. This is because in practice a search is often restricted to titles only or to body text only, depending on its purpose.

When a book is the target of full-text search, it is therefore preferable to output search results separately for each logical part of the book. Because a search unit represents a logical classification of the search target, an attribute number is assigned to each search unit according to its logical division. For example, the attribute number may be 1 for the table of contents, 2 for the preface, 3 for chapter or section titles, 4 for figure or table titles, 5 for body text, and 6 for references.

Numbers are then assigned to the search units in ascending order from 1, in the order in which they appear in the book. This is the search unit number. If the body text is long, it may be split at suitable points into several search units, each numbered in its order of appearance.

Next, within each search unit, the characters are numbered 1, 2, 3, ... in ascending order from the first character. This is the character position number.

From the search unit number, character position number, and attribute number obtained in this way, each character of the search unit is converted into an integer code, which is its character position information.

Which parts of a book form the search units (table of contents, preface, titles, and so on) is determined in advance, as is the attribute to which the table of contents, preface, and so on correspond. The search unit number is therefore assigned as each of these predetermined search units is recognized, and the attribute number assigned is the one fixed for that search unit.

With n the maximum number of characters in a search unit and a the maximum number of attributes, the character position information is the integer code given by

character position information code = {search unit number × n + character position number} × a + attribute number … (1)

For example, let the maximum number of characters in a search unit be n = 10000 and the maximum number of attributes be a = 10, and suppose the character string 「通信文書」 (communication document) occupies the 121st to 124th character positions of the eighth search unit, which is body text (attribute number 5). The characters 「通」, 「信」, 「文」, and 「書」 are then given the character position information 801215, 801225, 801235, and 801245, respectively.
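Equation (1) and its inverse can be written out directly. The helper names `encode` and `decode` are hypothetical; the constants are the example values n = 10000 and a = 10 from the text. Decoding is unambiguous only under the assumption that the character position number is below n and the attribute number is below a.

```python
N = 10000  # maximum number of characters in a search unit (n)
A = 10     # maximum number of attributes (a)


def encode(unit_no, pos, attr, n=N, a=A):
    """Equation (1): pack unit number, position number, and attribute number."""
    return (unit_no * n + pos) * a + attr


def decode(code, n=N, a=A):
    """Recover (unit number, position number, attribute number) from a code."""
    rest, attr = divmod(code, a)
    unit_no, pos = divmod(rest, n)
    return unit_no, pos, attr


# 「通信文書」 at positions 121 to 124 of search unit 8, body text (attribute 5)
codes = [encode(8, p, 5) for p in range(121, 125)]
print(codes)           # [801215, 801225, 801235, 801245]
print(decode(801215))  # (8, 121, 5)
```

Consecutive characters of one string differ by exactly a in their codes, which is what makes the continuity check at search time a simple arithmetic comparison.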

If the character position information is held as a 4-byte code in this way, about 2³² / (n × a) ≈ 40,000 search units of up to 10,000 characters each can be handled.

(3) Registering character position information in the search file

The character position information assigned to each character is then registered in the search file.

As described above, the character type groups are stored in the search file in the order of the JIS code table, and character position information is registered in the character type groups. Each item of character position information is registered by storing it at the end of its character type group. Consequently, if search units are registered in order, the character position information within each character type group is in ascending numerical order.

FIG. 3 shows an example in which the character position information of 「通信文書」 above has been registered in the search file. The character position information within each group is stored in ascending order. With 4-byte character position information, the size of this file is [expression missing from the source text].

Character position information is added by appending a new code to the end of the group for each character of the added document. It is deleted by changing the relevant character position information, in the group for each character of the deleted document, to a special symbol. Additions and deletions can thus be carried out quickly.
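Appending and marking-as-deleted can be sketched as follows. The choice of -1 as the special symbol, the dictionary-of-lists representation of the search file, and the names `add_unit`, `delete_unit`, and `TOMBSTONE` are all assumptions made for this illustration; the text does not specify them.

```python
TOMBSTONE = -1  # hypothetical "special symbol" marking a deleted entry


def add_unit(search_file, unit_no, attr, text, n=10000, a=10):
    """Append the codes of a new search unit to the end of each character group."""
    for pos, ch in enumerate(text, start=1):
        search_file.setdefault(ch, []).append((unit_no * n + pos) * a + attr)


def delete_unit(search_file, unit_no, text, n=10000, a=10):
    """Overwrite the entries of a search unit in place; nothing is shifted."""
    for ch in set(text):
        group = search_file.get(ch, [])
        for i, code in enumerate(group):
            # code // (n * a) recovers the search unit number
            if code != TOMBSTONE and code // (n * a) == unit_no:
                group[i] = TOMBSTONE


sf = {}
add_unit(sf, 1, 5, "通信")
add_unit(sf, 2, 5, "通話")
delete_unit(sf, 1, "通信")
print(sf["通"])  # [-1, 200015]
```

Because entries are overwritten rather than removed, the remaining entries keep their positions and no group has to be rewritten; a search routine would then need to skip the tombstone value.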

As noted above, the character position information stored for each character type group of the search file can be retrieved by using the column start address in the character column address table as a directory.

FIG. 4 shows the flow of the search file creation processing described above.

That is, the usage count of each character type is counted to create the character column address table (S11, S12), and the search file area is secured (S13). Next, the search unit registration order counter is initialized to k = 1, which sets the search unit number to 1, and the maximum number of characters in a search unit and the maximum number of attributes are set to n = 10000 and a = 10 (S14). The first search unit is then fetched (S15). This completes the preprocessing for registration.

Registration then proceeds search unit by search unit. First, the character position number is set to p = 1 and the attribute number of the search unit being registered is set to a_i (S16). Then, starting from the first character of the search unit, the character position information for character position number p is created (S17) using equation (2):

D = (k × 10000 + p) × 10 + a_i … (2)

The character column directory (column start address) indicating the column of the search file that holds the character type group of the character at position p is fetched from the character column address table (S18), and the character position information is stored in the row following the last entry of that column (S19). Then p = p + 1 and l = l − 1 are set, and once all the characters in the search unit have been processed, processing moves on to the next search unit (S23, S24).

Search processing using the search file created in this way is described next.

This embodiment is described using an example in which full-text search is performed by matching, on the basis of the character position information extracted from the search file, character strings identical to the character string of the search input.

Search processing broadly consists of the following steps.

(1) Calculate the column start addresses in the character column address table that correspond to the characters of the search input string.

(2) Reorder the characters of the search input string in ascending order of frequency of occurrence.

(3) Taking the characters of the reordered string in turn from the front, fetch the corresponding character type group from the search file and, from the character position information stored there, extract the character position information that agrees with the character order of the search input.

(4) From the extracted character position information, extract the character position information that has the same attribute as the search input.

(5) Output the search units containing the matched characters as the search result.

Each step is described in detail below.

(1) Calculating the column start addresses in the character column address table for the search input string

As when the search file is created, the position of each search input character in the JIS code table is calculated and used as the address pointer for that character in the character column address table.

Rearrangement in order of appearance frequency
Next, the character field start addresses in the character field address table, which indicate the start address of each character type group in the search file, are consulted to determine the appearance frequency of each search input character, and the characters of the search input are rearranged in ascending order of their frequency in the full text. As described above, each character field start address points to the start of a character type group stored in the search file. Taking the difference between one start address and the next therefore gives the number of character position information entries stored in that group, which is the frequency with which that character type appears in the full text.
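The frequency lookup and reordering can be sketched as follows. The table contents, the helper names `group_sizes` and `order_by_rarity`, and the assumption that one address unit corresponds to one character position information entry are all illustrative and not taken from the embodiment.

```python
def group_sizes(field_starts: dict, end_address: int) -> dict:
    """Number of entries per character type group.

    field_starts maps each character to its group's start address, in
    address order; end_address is the address just past the last group.
    Assumption: one entry occupies one address unit.
    """
    chars = list(field_starts)
    starts = list(field_starts.values()) + [end_address]
    # Size of a group = next start address - this start address.
    return {c: starts[k + 1] - starts[k] for k, c in enumerate(chars)}


def order_by_rarity(query: str, sizes: dict) -> list:
    """(index in query, character) pairs, least frequent character first."""
    return sorted(enumerate(query), key=lambda pair: sizes[pair[1]])


# Hypothetical address table holding only the four characters of the example.
field_starts = {"信": 0, "文": 40, "書": 65, "通": 70}
sizes = group_sizes(field_starts, end_address=170)
order = order_by_rarity("通信文書", sizes)
```

Keeping each character's original index alongside it matters, because the later position-difference check needs to know how far apart two characters sit in the search input.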

The characters are matched starting from the least frequent one because this greatly reduces the number of comparisons against the character position information stored in the search file. Checking the continuity of a character string means comparing the character position numbers held in two character type groups, so the fewer character position numbers those two groups contain, the fewer comparisons are needed. Starting the matching from the characters with low appearance frequency therefore reduces the number of comparisons.
The effect grows with the length of the search input, because a longer input is more likely to contain a character with a low appearance frequency.

Character string matching
Starting from the character with the lowest appearance frequency, the character field address table is consulted and the character position information stored in each character type group is retrieved. Then, working from the least frequent character type group, the entries are extracted for which the search unit is the same across the groups and the difference in character position numbers equals the difference in the characters' positions within the search input character string.

This position difference is checked by extracting the character position information that satisfies

{(character position information in the group for the i-th character of the search input) − (character position information in the group for the j-th character of the search input)} ÷ a = i − j … (3)

where a is the maximum number of attributes.
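Equation (3) can be tried out with the packing formula stated in claim 2, {(search unit identification code × n) + character position order code} × a + attribute code. In the sketch below the values n = 10000 and a = 10 are assumptions chosen because they reproduce the figures of the worked example in this description; the function names are hypothetical.

```python
N_MAX = 10_000  # assumed maximum number of characters per search unit (n)
A_MAX = 10      # assumed maximum number of attributes (a)


def encode(unit: int, pos: int, attr: int) -> int:
    """Pack unit number, character position and attribute into one integer."""
    return (unit * N_MAX + pos) * A_MAX + attr


def satisfies_eq3(info_i: int, info_j: int, i: int, j: int) -> bool:
    """True if (info_i - info_j) / a equals i - j exactly.

    The difference must be a whole multiple of a, which also means the
    two entries carry the same attribute code.
    """
    diff = info_i - info_j
    return diff % A_MAX == 0 and diff // A_MAX == i - j


# 「書」 is the 4th and 「文」 the 3rd character of 「通信文書」.
sho = encode(unit=8, pos=124, attr=5)   # 801245
bun = encode(unit=8, pos=123, attr=5)   # 801235
adjacent = satisfies_eq3(sho, bun, 4, 3)
```

Because the unit number sits in the high-order digits, two entries from different search units differ by far more than any i − j within a query, so a single subtraction checks unit, position and attribute together.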

In this matching between character type groups, continuity of the characters is checked by taking the difference between the character position information of the less frequent character type group and that of a more frequent character type group.

When extracting the entries that satisfy this character position number difference, let the two character type groups be A and B, let their character position difference be L, and let A_x and B_y be character position numbers in group A and group B respectively. Entries are then removed from consideration as follows, which reduces the number of comparisons:

if A_x + L > B_y, remove B_y;
if A_x + L < B_y, remove A_x;
if A_x + L = B_y, record A_x and B_y as a match and remove both.

For example, suppose the character position numbers are

group A: 5, 13, 100, 200, 1000, 1100
group B: 3, 18, 101, 150, 180

Then only 7 comparisons in total are needed between the two groups, and there is no need to compare every entry in one group against every entry in the other.
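The elimination rule amounts to a single pass over two sorted lists, and can be sketched as below. The function name `match_pairs` is hypothetical, and the example does not state the value of L; the sketch assumes L = 1 (adjacent characters), which is the value for which the rule needs exactly the 7 comparisons mentioned.

```python
def match_pairs(a_positions, b_positions, offset):
    """Find pairs with a + offset == b in two ascending lists.

    Returns (matches, number of comparisons performed). Advancing an
    index plays the role of "removing" an entry from consideration.
    """
    matches = []
    comparisons = 0
    x = y = 0
    while x < len(a_positions) and y < len(b_positions):
        comparisons += 1
        target = a_positions[x] + offset
        if target > b_positions[y]:
            y += 1            # B_y is too small to match anything left in A
        elif target < b_positions[y]:
            x += 1            # A_x is too small to match anything left in B
        else:
            matches.append((a_positions[x], b_positions[y]))
            x += 1
            y += 1
    return matches, comparisons


group_a = [5, 13, 100, 200, 1000, 1100]
group_b = [3, 18, 101, 150, 180]
matches, comparisons = match_pairs(group_a, group_b, offset=1)
```

A naive all-pairs check of the same groups would need 6 × 5 = 30 comparisons, so the ordering of the entries is what makes the saving possible.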

Attribute number matching
From the character position information obtained by character string matching, the entries whose attribute number equals that of the search input are extracted. This yields the character position information that matches the attribute specified in the search input.

Extraction of search units
The search unit number and the character position numbers are extracted from the resulting character position information as the search result.

When there are multiple search inputs, for the second and subsequent inputs only the character position information carrying a search unit number obtained so far is retrieved from the character type group of the first character, and processing then continues from the second character onward.
In other words, the second and subsequent search inputs are matched against the results obtained for the first search input.

The operations of the steps above are now explained with a concrete example.

Suppose that the body text is specified as the search target and 「通信文書」 ("communication document") is specified as the search input character string. The attribute number of the body text is taken to be "5".

Suppose, for example, that the full-text appearance frequencies of the characters are in the order 「書」 < 「文」 < 「信」 < 「通」 and that matching is performed in this order. First, the character position information retrieved from the character field for 「書」 in the search file and that retrieved from the character field for 「文」 are compared using equation (3) above, and the entries whose difference is "−10" are extracted. This yields "801245" in 「書」 and "801235" in 「文」 as character position information for consecutive characters.

Next, the character position information remaining in 「書」 after that comparison and the character position information retrieved from the character field for 「信」 in the search file are compared using equation (3), and the entries whose difference is "−20" are extracted. This yields "801245" in 「書」 and "801225" in 「信」 as character position information for consecutive characters. In the same way, "801245" in 「書」 and "801215" in 「通」 are extracted. Finally, since the search condition is "body text", the entries with attribute number "5" are selected from the character position information remaining after string matching, giving "801215" to "801245".

Accordingly, search unit number "8", to which this character string belongs, and character position numbers "121 to 124" are output as the search result.

This search processing is shown as a flowchart in FIG. 5.

That is, the search input is read, its number of characters and attribute number are set, the appearance frequency of each search input character is looked up in the character field address table, and the characters are rearranged in ascending order of frequency (S41 to S43). The character position information stored in the character type group (character field) for each rearranged search input character is then retrieved from the search file (S44). Between two character type groups, the entries that satisfy

(character position information in the less frequent group) − (character position information in the more frequent group) = (difference in character position numbers between the two characters of the rearranged search input) × (maximum number of attributes)

and whose attribute number is a_i are extracted as matches (S45). After it has been determined whether matching is complete, the search units and character position numbers that match the search input are output as the search result (S48).
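The flow S41 to S48 can be put together in a short end-to-end sketch. Everything here is illustrative: the function names, the sample search units, the values n = 10000 and a = 10, and the simplification that each search unit carries a single attribute are assumptions. For brevity the sketch also swaps the sorted elimination pass for a set lookup, which finds the same matches but does not reproduce the comparison counts discussed earlier.

```python
from collections import defaultdict

N_MAX = 10_000  # assumed maximum characters per search unit (n)
A_MAX = 10      # assumed maximum number of attributes (a)


def build_search_file(units):
    """units: iterable of (unit number, attribute number, text)."""
    groups = defaultdict(list)
    for unit, attr, text in units:
        for pos, ch in enumerate(text, start=1):
            groups[ch].append((unit * N_MAX + pos) * A_MAX + attr)
    return groups


def search(groups, query, attr):
    """Return (unit number, first position, last position) for each hit."""
    if not query:
        return []
    # S41-S43: process the query characters from rarest to most frequent.
    order = sorted(range(len(query)),
                   key=lambda k: len(groups.get(query[k], [])))
    first = order[0]
    # S44: start from the rarest character, keeping the requested attribute.
    survivors = [p for p in groups.get(query[first], [])
                 if p % A_MAX == attr]
    # S45: keep entries whose partner lies at the expected distance.
    for k in order[1:]:
        other = set(groups.get(query[k], []))
        shift = (k - first) * A_MAX
        survivors = [p for p in survivors if p + shift in other]
    # S48: unpack unit number and positions for output.
    hits = []
    for p in survivors:
        unit, pos = divmod(p // A_MAX, N_MAX)
        start = pos - first
        hits.append((unit, start, start + len(query) - 1))
    return hits


groups = build_search_file([
    (3, 5, "通信と文通"),                    # body text without the query
    (8, 5, "あ" * 120 + "通信文書です"),      # body text, query at 121-124
    (9, 2, "通信文書"),                      # same string, other attribute
])
hits = search(groups, "通信文書", attr=5)
```

With this sample data 「書」 is the rarest character, so matching starts from its two entries, and only the one in search unit 8 survives the attribute filter and the three distance checks.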

〔Effects of the Invention〕

As described above, the present invention creates a search file that stores, for each character type in the character strings to be searched, character position information consisting of the identification code of the search unit to which the character belongs, a character position order code, and an attribute number indicating the type of the search unit. For each character type that makes up the search input character string, the character position information is retrieved from this search file and the character strings matching the search input are found. This gives the following advantages:

(1) The number of character string comparisons needed for search processing is reduced, so matching is fast.
(2) Because the search operates on characters and their positions, arbitrary character strings can be searched, and there is no need to extract character strings at registration time as in the pre-search method.
(3) High-speed search is achieved by software alone, without dedicated hardware, so full-text search can be performed efficiently on a general-purpose information processing device, which makes the method highly versatile.
(4) When the method is used in a full-text search database system, no keyword extraction is needed to create the search file, and the search file can be created automatically from machine-input character strings such as papers, so the database system can be built economically and efficiently.

〔Brief Description of the Drawings〕

FIG. 1 shows a configuration example of the information search processing device used in an embodiment of the present invention.
FIG. 2 shows the character field address table of the embodiment.
FIG. 3 shows an example of the search file of the embodiment.
FIG. 4 is a flowchart explaining the search file creation procedure of the embodiment.
FIG. 5 is a flowchart explaining the search procedure of the embodiment.

1: CPU, 2: memory, 3: input/output unit, 4: keyboard, 5: display, 6: external storage device control unit, 7: external storage device, 8: common bus.

Claims

1. A search file creation device for an information search method in which the search target is a series of character strings made up of a plurality of search units, each search unit being a character string that serves as the unit of search and having an attribute determined by its logical division, and in which character strings matching a given search input character string are extracted from that series, the device comprising: search unit identification code assigning means for assigning a code in ascending order to each search unit as it appears; attribute code assigning means for assigning to the search unit an attribute code determined by its attribute; character position order code assigning means for assigning to each character of the character string to be searched a character position order code indicating its position within the search unit; and means for creating character position information consisting of the search unit identification code, the character position order code, and the attribute code, and storing the character position information in an area for each character type to create a search file.

2. The search file creation device according to claim 1, wherein the character position information is given as the number

{(search unit identification code × n) + character position order code} × a + attribute code

where n is the maximum number of characters in a search unit and a is the maximum number of attributes.

3. An information search processing method in which the search target is a series of character strings made up of a plurality of search units, each search unit being a character string that serves as the unit of search and having an attribute determined by its logical division, and in which character strings matching a given search input character string are extracted from that series, the method comprising: a search file that stores, for each character type, character position information for each character of the character strings, the character position information consisting of a search unit identification code assigned in ascending order to the search unit that is the unit of search, a character position order code indicating the position of the character within the search unit, and an attribute code indicating the logical division of the search unit; means for retrieving from the search file the character position information of the characters that make up the search input character string; means for extracting, from among the character position information of those characters, the entries whose search unit identification code is common, whose character position order codes follow the same order as the search input character string, and whose attribute code equals that of the search input; and means for outputting, on the basis of the extracted character position information, the search unit and character position to which the character string equal to the search input belongs.

4. The information search processing method according to claim 3, wherein the extraction of the character position information that is the same as the character string of the search input is performed in order from the character with the lowest appearance frequency of the search input character.