JP3489237B2

JP3489237B2 - Document search method

Info

Publication number: JP3489237B2
Application number: JP00240695A
Authority: JP
Inventors: 勝己多田; 敦畠山; 奈津子水谷; 川口　　久光; 寛次加藤; 悟志浅川
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1995-01-11
Filing date: 1995-01-11
Publication date: 2004-01-19
Anticipated expiration: 2019-01-19
Also published as: JPH08190572A

Description

Detailed Description of the Invention

【０００１】[0001]

【Field of Industrial Application】The present invention relates to a document search method (a full-text search method) that rapidly retrieves, from the full text of a large-scale document database, the documents in which a specified character string appears. It is applicable in particular to databases, document management systems, document filing systems, and DTP (desktop publishing) systems.

【０００２】[0002]

【Prior Art】Conventionally, the following document search method requires no index information. At registration time, documents are registered directly in the computer as character-coded text data and accumulated as a text database. At search time, the contents of all texts in the text database are read to find the documents that contain a specified search character string (hereinafter called a search term). This is full-text search (L. A. Hollar, "Text Retrieval Computers", COMPUTER, March 1979). Full-text search scans the entire text file, in which the texts of all documents making up the text database are gathered, one character at a time from the beginning, and checks whether the specified search term is present, thereby finding the documents that contain it.

【０００３】Because this removes the need for indexing with a dictionary such as a thesaurus, any word that appears in a document can be specified as a search term.

【０００４】However, because full-text search scans the entire text file from the beginning, search processing takes a long time, and the method cannot be applied to large-scale databases.

【０００５】To solve this processing-time problem, Japanese Patent Application No. 2-193015 (see Japanese Patent Laid-Open No. 3-174652) was proposed. It is an information retrieval apparatus that uses dedicated hardware to speed up both the reading of text data and the search for the search term, and that, before searching the text itself, searches auxiliary files in which the text has been compressed in advance, narrowing down the number of documents whose text body must be searched (hierarchical presearch). The result is an effectively high-speed full-text search.

【０００６】Hierarchical presearch, the distinguishing feature of this conventional example, is described below.

【０００７】Hierarchical presearch effectively accelerates the search by hierarchically searching two auxiliary files compressed in advance, namely the "character component table" and the "condensed text", before searching the text. Documents unrelated to the search term are sifted out of the search target, which reduces the number of documents whose text must be searched. That is, the character component table search first narrows the candidates at the character level, and the condensed text search then narrows the surviving documents at the word level.

【０００８】In this character component table search, the table records information about the registered text only one character at a time, so every document that contains all the characters used in the search term becomes a search candidate. For example, when "イラン" (Iran) is specified as the search term, any document in which the three characters "イ", "ラ", and "ン" appear somewhere in the text, such as a document containing "ライオン" (lion) or "オンライン" (online), is also hit by the character component table search.

【０００９】That is, many documents that do not actually contain the search term "イラン" but are treated as if they did (hereinafter called noise) may be output as search results. In such a case the number of documents passed to the condensed text search cannot be narrowed down, so the condensed text search takes a long time and a sufficient search response cannot be obtained.

【００１０】As a way to solve this problem, Japanese Patent Application No. 3-174064 (see Japanese Patent Laid-Open No. 5-174064) proposed the concatenated character component table method, which obtains a higher narrowing rate than single characters by making each character component a combination of several characters.

【００１１】In the concatenated character component table method, information on whether each character string of a predetermined length (two or more characters) exists in the text data is recorded in the concatenated character component table at registration time. At search time, before the condensed text is searched, the search term is divided into character strings of that predetermined length, and the documents that contain all of those strings are extracted by referring to the concatenated character component table. In this way, documents unrelated to the input search term can be sifted out with high precision at the substring level, and the documents whose condensed text must be searched can be narrowed down sufficiently.

【００１２】For example, as shown in FIG. 2, when text data (document 1, document 2, ..., document N) is searched with a single-character component table, the search term "イラン" makes the bit strings corresponding to "イ", "ラ", and "ン", indicated by the arrows, the search targets. However, documents containing "ライオン" or "オンライン", that is, both document 1 and document 2, contain "イ", "ラ", and "ン". They are therefore retrieved from the character component table even though they do not contain the string "イラン", and they become noise.

【００１３】With the concatenated character component table, by contrast, only document N is identified as containing both "イラ" and "ラン", and the noise seen with the single-character component table does not occur.
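The contrast between the single-character table and the two-character concatenated table can be illustrated with a small sketch. This is only an illustration under assumptions: the table is modelled as a Python dictionary of document sets rather than bit strings, the function names `component_table` and `search` are hypothetical, and the assignment of "ライオン" to document 1 and "オンライン" to document 2 is assumed, since FIG. 2 is not reproduced here.

```python
def component_table(docs, n):
    """Map every n-character string to the set of documents containing it."""
    table = {}
    for doc_no, text in docs.items():
        for i in range(len(text) - n + 1):
            table.setdefault(text[i:i + n], set()).add(doc_no)
    return table


def search(table, term, n):
    """Return documents that contain every n-character piece of the term."""
    pieces = [term[i:i + n] for i in range(len(term) - n + 1)]
    hits = [table.get(piece, set()) for piece in pieces]
    return set.intersection(*hits) if hits else set()


# Assumed document contents, following the example in the text.
docs = {"doc1": "ライオン", "doc2": "オンライン", "docN": "イラン"}

single = search(component_table(docs, 1), "イラン", 1)  # character-level filter
paired = search(component_table(docs, 2), "イラン", 2)  # two-character filter
print(sorted(single))  # doc1 and doc2 survive as noise
print(sorted(paired))  # only docN survives
```

With n = 1 all three documents pass the filter, while with n = 2 the pieces "イラ" and "ラン" are found together only in document N.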

【００１４】By using the concatenated character component table in this way, documents such as document 1 and document 2, which contain the characters of the search term only in scattered positions, can be removed from the search target, so more unnecessary condensed text searches are avoided than with the single-character component table. As a result, the condensed text search runs only on sufficiently narrowed-down documents, an effectively high-speed full-text search is achieved, and full-text search can be executed with a practical search response even on a large-scale text database.

【００１５】With concatenated character components, in the case of a two-character concatenated character component table for example, a character component must be recorded for every combination of characters, that is, for the square of the number of character kinds. The known example reduces the size of the character component table while limiting the loss of narrowing rate by taking into account how often concatenated character components appear in the text and recording the presence or absence of several concatenated character components superimposed on a single entry (this is called hashing).

【００１６】[0016]

【Problems to Be Solved by the Invention】The character component table search in the concatenated character component table method described above has the following two problems.

【００１７】The first problem is that when this character component table search is applied to phonetic scripts such as English, which have few kinds of characters and express meaning through sequences of characters, a great deal of search noise arises from character combinations. For example, a string of three mutually adjacent characters (hereinafter called sequential concatenated characters) such as "ain" is contained in all of the words "mountain", "paint", and "Spain". As a result, the narrowing rate of the character component table does not improve, the number of documents passed to the condensed text search cannot be reduced, and a sufficient search response cannot be obtained. This is because in phonetic scripts such as English, meaning is expressed by combinations of several consonants and vowels, so many words contain the same concatenated characters, and sequential concatenated characters cannot form a sequence that is specific to one word.

【００１８】The second problem is that, because hashing is used, the search results from the character component table contain noise. Since several concatenated character components are assigned to one entry of the character component table, when a concatenated character is specified and the corresponding entry is read, the bit information may yield documents that contain an entirely different concatenated character component. Consequently, in a large-scale document search system in which a large number of documents are registered, documents unrelated to the search word may not be sifted out accurately, that is, the narrowing may be imprecise, which can degrade search performance. One conceivable remedy is to drop hashing and give every concatenated character its own entry. This is impractical, however, because the character component table would become enormous.

【００１９】Specifically, about 8,000 character codes are used in Japanese, so there are 64 million kinds of two-character concatenated characters (8,000 kinds × 8,000 kinds). If 1 million documents are registered, 1 million bits of document identification information must be associated with each of these 64 million kinds of concatenated characters, so the character component table needs a capacity of 8 TB (64 million kinds × 1 million bits). By comparison, even at 20 kB per document, the document bodies occupy 20 GB for 1 million documents (20 kB per document × 1 million documents), so the character component table is overwhelmingly larger than the documents themselves.
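The arithmetic in this paragraph can be checked with a few lines. The figures are the ones quoted in the text; decimal units (1 kB = 1,000 bytes) are an assumption made here because they reproduce the 8 TB and 20 GB values.

```python
# Figures taken from the paragraph above; decimal units are assumed.
CHAR_KINDS = 8_000          # approximate number of character codes in Japanese
DOCS = 1_000_000            # registered documents
DOC_SIZE_BYTES = 20_000     # 20 kB per document

pair_kinds = CHAR_KINDS ** 2            # two-character combinations
table_bits = pair_kinds * DOCS          # one bit per (combination, document)
table_bytes = table_bits // 8
corpus_bytes = DOC_SIZE_BYTES * DOCS

print(pair_kinds)                       # 64000000 kinds
print(table_bytes / 1e12)               # table size in TB
print(corpus_bytes / 1e9)               # document bodies in GB
print(table_bytes // corpus_bytes)      # table is this many times larger
```

Under these assumptions the unhashed table is 400 times the size of the documents it indexes.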

【００２０】In view of the problems described above, the first object of the present invention is to realize a character component table search with little search noise even when a word made up of phonetic characters, such as English letters, is specified as the search term.

【００２１】The second object of the present invention is to realize, with a practical capacity even for a large-scale document database, a character component table in which no search noise due to hashing occurs.

【００２２】[0022]

【Means for Solving the Problems】In the present invention, the first object is achieved by adopting the following configuration.

【００２３】The invention is a document search method in which substrings are extracted in a predetermined form from each document stored in advance; a concatenated character component table indicating whether each substring exists in each document is created; search substrings are extracted in a predetermined form from a search term entered to retrieve a desired document from the stored documents; and, by referring to the concatenated character component table for the extracted search substrings, the documents containing substrings that match every search substring of the search term are found, so that documents unrelated to the search term are sifted out of the search target. The method is characterized in that character strings of a predetermined n characters (n is an integer of 2 or more), taken every predetermined m characters (m is an integer of 1 or more), are extracted from the documents as the substrings, and character strings of the predetermined n characters, taken every predetermined m characters, are extracted from the search term as the search substrings.

【００２４】The second object is achieved by the following configuration.

【００２５】The concatenated character component table described above has two parts. One is a bit list, which registers the occurrence information of a character string by writing 1 at the bit positions corresponding to the numbers of the documents in which a concatenated character whose occurrence frequency is higher than a predetermined threshold appears. The other is a document number list, which, for concatenated character components whose occurrence frequency is lower than the predetermined threshold, stores as a list of binary data the numbers of the documents in which such a concatenated character appears. The kinds of concatenated character components that appear in the documents, and the number of documents in which each appears, are calculated in advance. From the calculated result, it is determined whether the number of documents in which each concatenated character appears is larger than the predetermined threshold. If it is larger, the occurrence information of the concatenated character component is registered in the bit list by writing '1' at the bit positions corresponding to the numbers of the documents in which that concatenated character appears. If it is smaller, the occurrence information is registered by writing the numbers of those documents to the document number list as a list of binary data. For each concatenated character extracted from the search term, the corresponding bit list or document number list is read, and a document number list is converted into a bit list, whereby the concatenated character component table is obtained. The document search method is characterized by these steps.
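The two-representation table described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the names `build_table` and `lookup` are hypothetical, Python integers stand in for bit lists, plain sequential n-character strings are used for brevity, and a count exactly equal to the threshold is assumed to go to the document number list because the text does not specify that case.

```python
from collections import defaultdict


def build_table(docs, n=2, threshold=2):
    """Build the table: frequent strings get a bit list, rare ones a number list."""
    postings = defaultdict(set)
    for doc_no, text in enumerate(docs):
        for i in range(len(text) - n + 1):
            postings[text[i:i + n]].add(doc_no)

    table = {}
    for gram, doc_nos in postings.items():
        if len(doc_nos) > threshold:
            bits = 0
            for doc_no in doc_nos:
                bits |= 1 << doc_no          # write 1 at the document's bit position
            table[gram] = ("bits", bits)
        else:
            table[gram] = ("list", sorted(doc_nos))
    return table


def lookup(table, gram):
    """Read an entry and always return it as a bit list (an int)."""
    kind, value = table.get(gram, ("list", []))
    if kind == "bits":
        return value
    bits = 0
    for doc_no in value:                     # convert document numbers to bits
        bits |= 1 << doc_no
    return bits


docs = ["abc", "abd", "abe", "xyz"]
table = build_table(docs)
print(table["ab"], table["bc"])
```

Here "ab" appears in three documents and is stored as a bit list, while "bc" appears in one and is stored as a document number list; `lookup` hides the difference from the caller.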

【００２６】[0026]

【Operation】First, the operation of the first document search method is described.

【００２７】In this document search method, the skip concatenated character component extraction step of the concatenated character component table creation and registration process cuts out strings of n characters, taken every m characters, from the text data as skip concatenated characters, and registers their occurrence information in the concatenated character component table. The concatenated character component table search process likewise cuts out strings of n characters, taken every m characters, from the search term and searches the table with them. Even in a language such as English, where many words share the same substrings, this yields character components that are specific to a word, so the narrowing rate of the concatenated character component table search can be improved.

【００２８】For example, when the string "mountain" is specified as the search term, the conventional method, which refers to the concatenated character component table using mutually adjacent characters, extracts "mou", "oun", "unt", "nta", "tai", and "ain" as concatenated character components, as shown in FIG. 3. However, "ain", for example, is also contained in words such as "painting" and "Spain", so documents containing those words may be hit as search noise. By contrast, taking every other character, for example, yields "mut", "ona", "uti", and "nan", which are character components specific to that word. Search noise caused by words that share the same substring can therefore be reduced greatly compared with the conventional example.
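The "mountain" example can be reproduced with a short sketch. The function name `skip_ngrams` is hypothetical, and the reading of "every m characters" as "skip m characters between the characters taken" is an interpretation inferred from the example (m = 1, n = 3 gives "mut", "ona", "uti", "nan").

```python
def skip_ngrams(text, m=1, n=3):
    """Return n-character strings formed by skipping m characters between picks."""
    step = m + 1                      # distance between picked characters
    span = (n - 1) * step + 1         # width of text covered by one string
    return [text[i:i + span:step] for i in range(len(text) - span + 1)]


def sequential_ngrams(text, n=3):
    """Conventional adjacent-character strings, for comparison."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


print(sequential_ngrams("mountain"))  # ['mou', 'oun', 'unt', 'nta', 'tai', 'ain']
print(skip_ngrams("mountain"))        # ['mut', 'ona', 'uti', 'nan']

# "painting" shares the sequential component "ain" but no skip component.
shared_seq = set(sequential_ngrams("mountain")) & set(sequential_ngrams("painting"))
shared_skip = set(skip_ngrams("mountain")) & set(skip_ngrams("painting"))
print(shared_seq, shared_skip)
```

For this pair of words the sequential components overlap on "ain", while the skip components do not overlap at all, which is the effect the paragraph describes.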

【００２９】Next, the operation of the second document search method of the present invention is described.

【００３０】In this document search method, at document registration time the character occurrence frequency calculation step calculates in advance the kinds of concatenated character components that appear in the text data and the number of documents in which each concatenated character component appears.

【００３１】Then, when the concatenated character component table is created and registered, the character occurrence frequency determination step uses the result of the character occurrence frequency calculation step to determine whether the number of documents in which each concatenated character appears in the text data is larger than a predetermined threshold.

【００３２】If the character occurrence frequency is larger than the predetermined threshold, the bit list registration step registers the occurrence information of the concatenated character component by writing '1' at the bit positions in the bit list that correspond to the numbers of the documents in which the concatenated character appears.

【００３３】If the character occurrence frequency is smaller than the predetermined threshold, the document number list registration step registers the occurrence information of each concatenated character component by writing the numbers of the documents in which the concatenated character appears to the document number list as a list of binary data.

【００３４】At search time, the concatenated character component table acquisition step reads the bit list or document number list corresponding to each concatenated character cut out from the search term and, in the case of a document number list, converts it into a bit list, thereby obtaining the concatenated character component table.

Thus, in the second document search method of the present invention, the file size of the concatenated character component table can be reduced greatly by representing frequently occurring concatenated characters with a bit list and rarely occurring ones with a document number list. Specifically, a bit list always needs as many bits as there are documents registered in the database, whereas a document number list needs only the number of bits that represent one document number multiplied by the number of documents registered for that concatenated character. For example, suppose the database holds 1 million documents, 32 bits are allocated to represent one document identifier, and 10 of those documents contain the concatenated character "構造". A bit string requires a storage area of 1 million bits = 125 KB. A document number list, by contrast, requires only 32 bits × 10 documents = 320 bits = 40 B.

【００３５】If 900,000 of the 1 million documents contain the concatenated character "構成", a bit string still requires only 1 million bits = 125 KB of storage. The ID list format, by contrast, requires 32 bits × 900,000 documents = 4 B × 900,000 documents = 3.6 MB.

【００３６】Therefore, when these 1 million documents are stored with 32-bit document identifiers, the boundary is 1 million bits ÷ 32 bits = 31,250 documents. The file size of the concatenated character component table can be minimized by using the bit list format when more documents than this are registered for a concatenated character and the document number format when fewer are.
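The storage figures in the three paragraphs above follow from two simple formulas, sketched here. The function names are hypothetical, and decimal units are assumed (1 KB = 1,000 bytes) because they match the quoted 125 KB and 3.6 MB.

```python
def bit_list_bytes(total_docs):
    """One bit per registered document, rounded up to whole bytes."""
    return (total_docs + 7) // 8


def doc_number_list_bytes(hit_docs, id_bits=32):
    """One fixed-width document number per document containing the string."""
    return hit_docs * id_bits // 8


def break_even_docs(total_docs, id_bits=32):
    """Hit count at which both formats occupy the same number of bits."""
    return total_docs // id_bits


TOTAL = 1_000_000
print(bit_list_bytes(TOTAL))              # 125000 bytes, independent of hit count
print(doc_number_list_bytes(10))          # 40 bytes for a string found in 10 documents
print(doc_number_list_bytes(900_000))     # 3600000 bytes for 900,000 documents
print(break_even_docs(TOTAL))             # 31250 documents
```

Below 31,250 hits the document number list is smaller; above it the bit list is smaller, which is why the threshold is placed there.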

【００３７】Furthermore, the concatenated character component table search results of the second document search method of the present invention contain no noise due to hashing. The character component table search result obtained by taking the logical product (AND) of these result sets therefore contains far less noise than the result of a conventional character component table search that uses hashing, and narrowing precision can be improved.
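The logical product of the per-component results can be sketched with integers used as bit lists. This is an illustrative assumption about representation, not the patent's data layout; bit position d standing for document number d and the function name `and_bit_lists` are both hypothetical.

```python
from functools import reduce


def and_bit_lists(bit_lists):
    """AND all bit lists and return the surviving document numbers."""
    if not bit_lists:
        return []
    combined = reduce(lambda a, b: a & b, bit_lists)
    return [d for d in range(combined.bit_length()) if (combined >> d) & 1]


# Hypothetical bit lists for three components of one search term.
survivors = and_bit_lists([0b0111, 0b0110, 0b1110])
print(survivors)  # [1, 2]
```

Only documents whose bit is set in every component's bit list remain as candidates for the condensed text search.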

【００３８】[0038]

【Embodiments】A first embodiment of the present invention is described below with reference to FIG. 1.

【００３９】This embodiment describes, as an example, a skip concatenated character component table for English text that extracts strings of n characters taken every m characters, with m = 1 and n = 3.

【００４０】A document search system to which the present invention is applied consists of a display 100, a keyboard 101, a central processing unit (CPU) 102, a magnetic disk 110, a floppy disk drive (FDD) 106, and a main memory 200, all connected by a bus 108. The magnetic disk 110 stores the text 103, the condensed text 104, the concatenated character component table 105, and the various programs 111 and tables 112 described later. Reference numeral 107 denotes a floppy disk holding the documents to be registered.

【００４１】The system control program 201, document registration control program 202, text registration program 203, condensed text creation and registration program 204, concatenated character component table creation and registration program 205, search control program 209, search condition expression analysis program 210, concatenated character component table search program 211, condensed text search program 214, text search program 215, and hash table 216 are read from the magnetic disk 110 and stored in the main memory 200, where a work area 217 is also reserved.

【００４２】The concatenated character component table creation and registration program 205 consists of a skip concatenated character extraction program 206, a concatenated character component table registration program 207, and a hash table creation program 208. The concatenated character component table search program 211 consists of a skip concatenated character extraction program 212 and a bit AND program 213. These programs are executed under the control of the system control program 201 in response to the user's instructions from the keyboard 101.

【００４３】The registration process and the search process in hierarchical presearch, including the concatenated character component table search that is the subject of the present invention, are described concretely below.

【００４４】When a document is registered, as shown in FIG. 4, the system control program 201 receives a command entered from the keyboard 101 and starts the document registration control program 202. In step 1000, the document registration control program 202 first starts the text registration program 203, which reads the text data of the document to be registered from the floppy disk 107 inserted in the floppy disk drive 106 into the work area 217 and stores it on the magnetic disk 110 as the text 103. Documents to be registered need not be input from a floppy disk; the system may also be configured to receive them from another device over a communication line (not shown in FIG. 1) or the like.

【００４５】次に、文書登録制御プログラム２０２はス
テップ１００１で凝縮テキスト作成登録プログラム２０
４を起動して、テキストデータをスペースや記号などを
区切りとして単語レベルで部分文字列へ分割し、分割し
た部分文字列間で相互に文字列の包含関係を調べ、他の
部分文字列に含まれる文字列を排除し、残った部分文字
列の集合を凝縮テキスト１０４として磁気ディスク１１
０へ格納し、本プログラムの終了を待って、登録処理を
完了する。Next, in step 1001, the document registration control program 202 starts the condensed text creation/registration program 204. This program splits the text data into substrings at the word level, using spaces, symbols, and the like as delimiters; examines the inclusion relations among the resulting substrings; discards any substring that is contained in another substring; and stores the set of remaining substrings on the magnetic disk 110 as the condensed text 104. The document registration control program 202 waits for this program to finish and then completes the registration processing.
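
The condensed text creation described in this paragraph can be sketched as follows. This is only a minimal illustration, not the implementation of program 204: the function name make_condensed_text is hypothetical, and treating every character other than an ASCII letter or digit as a delimiter is an assumption made for the example.

```python
import re

def make_condensed_text(text):
    """Return the words of `text` that are not contained in any other word."""
    # Assumption: anything that is not an ASCII letter or digit is a delimiter.
    words = set(w for w in re.split(r"[^0-9A-Za-z]+", text) if w)
    # Keep a word only if no other (different) word contains it as a substring.
    kept = [w for w in words
            if not any(w != other and w in other for other in words)]
    return sorted(kept)

condensed = make_condensed_text("system, systems and information systems; inform")
print(condensed)
```

Because "system" is contained in "systems" and "inform" in "information", only the longer words remain; a substring search over the remaining words still finds the shorter ones.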

【００４６】最後に、文書登録制御プログラム２０２は
ステップ１００２で連接文字成分表作成登録プログラム
２０５を起動する。Finally, the document registration control program 202 activates the concatenated character component table creation registration program 205 in step 1002.

【００４７】以下、連接文字成分表作成登録プログラム
２０５の処理について図５を用いて説明する。The processing of the concatenated character component table creation / registration program 205 will be described below with reference to FIG.

【００４８】まず、連接文字成分表作成登録プログラム
２０５はステップ１０１０でスキップ連接文字抽出プロ
グラム２０６を起動し、磁気ディスク１１０に格納され
たテキスト１０３からテキストデータをワークエリア２
１７に読み込む。そして、このテキストデータから１文
字おきに３文字の文字列をすべて抽出する。First, in step 1010, the concatenated character component table creation/registration program 205 starts the skip concatenated character extraction program 206, which reads the text data from the text 103 stored on the magnetic disk 110 into the work area 217. It then extracts from this text data every three-character string formed by taking every other character.

【００４９】次に、ステップ１０１１で連接文字成分表
登録プログラム２０７を起動し、スキップ連接文字抽出
プログラム２０６によってテキストデータから抽出され
た文字列を、ワークエリア２１７内の連接文字成分表１
０５にハッシュテーブル２１６に従って登録し、これを
磁気ディスク１１０へ格納する。Next, in step 1011, the concatenated character component table registration program 207 is started. It registers the character strings extracted from the text data by the skip concatenated character extraction program 206 in the concatenated character component table 105 in the work area 217 in accordance with the hash table 216, and stores the table on the magnetic disk 110.

【００５０】連接文字成分表１０５を新規に登録すると
きには、連接文字成分表登録プログラム２０７でハッシ
ュテーブル作成プログラム２０８を起動し、連接文字成
分表１０５の該当エントリを参照するために用いるハッ
シュテーブル２１６を作成するとともに、連接文字成分
表１０５の全エントリを初期化（‘０’クリア）してお
く。When the concatenated character component table 105 is registered for the first time, the concatenated character component table registration program 207 starts the hash table creation program 208, which creates the hash table 216 used to look up the corresponding entry of the concatenated character component table 105 and initializes all entries of the concatenated character component table 105 (clears them to '0').

【００５１】ハッシュテーブル作成プログラム２０８に
より作成されるハッシュテーブル２１６は、連接文字成
分により連接文字成分表１０５のエントリを参照する際
に用いられるが、このハッシュテーブル２１６について
は単純なハッシュ関数でも、あるいは上述した特願平３
−３４２６９５号に示した文書データベース中の連接文
字成分の頻度を利用したハッシュ方式を用いてもよい。The hash table 216 created by the hash table creation program 208 is used to look up an entry of the concatenated character component table 105 from a concatenated character component. The hash table 216 may be based on a simple hash function, or on the hashing scheme described in the above-mentioned Japanese Patent Application No. Hei 3-342695, which uses the frequencies of the concatenated character components in the document database.

【００５２】以上が、連接文字成分表作成登録プログラ
ムの処理内容である。The above describes the processing performed by the concatenated character component table creation/registration program.

【００５３】検索の際には、検索条件式がキーボード１
０１から入力されると、システム制御プログラム２０１
により検索制御プログラム２０９が起動される。そし
て、本制御プログラムの下で検索条件式解析プログラム
２１０、連接文字成分表サーチプログラム２１１、凝縮
テキストサーチプログラム２１４およびテキストサーチ
プログラム２１５が順次起動される。At search time, when a search condition expression is input from the keyboard 101, the system control program 201 starts the search control program 209. Under this control program, the search condition expression analysis program 210, the concatenated character component table search program 211, the condensed text search program 214, and the text search program 215 are started in sequence.

【００５４】以下、図６を用いて、連接文字成分表サー
チプログラム２１１、凝縮テキストサーチプログラム２
１４およびテキストサーチプログラム２１５による階層
検索処理の詳細について説明する。The hierarchical search processing performed by the concatenated character component table search program 211, the condensed text search program 214, and the text search program 215 is described in detail below with reference to FIG. 6.

【００５５】まず、検索制御プログラム２０９はステッ
プ１０２０で連接文字成分表サーチプログラム２１１を
起動する。本プログラムの実行により、まずスキップ連
接文字抽出プログラム２１２を起動し、入力された検索
条件式中の検索タームから１文字おきに３文字の文字列
を抽出し、ハッシュテーブル２１６を用いて、抽出され
たすべての文字列に対応する連接文字成分表のエントリ
に格納されているビットリストをワークエリア２１７に
読み込む。次に、ビットアンドプログラム２１３を起動
し、ワークエリア２１７に読み込まれたすべてのビット
リスト間で各ビット毎に論理積(AND)を取る。この論理
積演算の結果‘１'となったビットに対応する文書番号
を連接文字成分表サーチの結果として検索制御プログラ
ム２０９に出力し、この連接文字成分表サーチの結果件
数が０件であれば、ここで０件という検索結果を本プロ
グラム２０９がディスプレイに表示する。First, in step 1020, the search control program 209 starts the concatenated character component table search program 211. This program first starts the skip concatenated character extraction program 212, which extracts from each search term in the input search condition expression the three-character strings formed by taking every other character, and then, using the hash table 216, reads into the work area 217 the bit lists stored in the concatenated character component table entries corresponding to all of the extracted strings. Next, the bit-AND program 213 is started and takes the bitwise logical product (AND) of all the bit lists read into the work area 217. The document numbers corresponding to the bits that are '1' in the result are output to the search control program 209 as the result of the concatenated character component table search. If this search yields no hits, the search control program 209 shows a result of zero hits on the display.

【００５６】もし、連接文字成分表サーチの結果件数が
０件でなければ、検索制御プログラム２０９はステップ
１０２１で凝縮テキストサーチプログラム２１４を実行
する。ここでは、上述の連接文字成分表サーチプログラ
ム２１１によって出力された文書番号に対応する凝縮テ
キスト１０４をワークエリア２１７に読み込む。If the concatenated character component table search yields at least one hit, the search control program 209 executes the condensed text search program 214 in step 1021. Here, the condensed text 104 corresponding to the document numbers output by the concatenated character component table search program 211 is read into the work area 217.

【００５７】そして、読み込まれた凝縮テキスト１０４
を凝縮テキストサーチプログラム２１４で探索し、検索
タームが含まれる凝縮テキストの文書番号を検索制御プ
ログラム２０９に出力する。The condensed text search program 214 then searches the condensed text 104 that has been read, and outputs to the search control program 209 the document numbers of the condensed texts that contain the search term.

【００５８】この凝縮テキストサーチの結果件数が０件
であれば、ここで０件という結果件数をシステム制御プ
ログラム２０１に出力して検索処理を終了する。If the number of results of the condensed text search is 0, the number of results of 0 is output to the system control program 201 and the search processing is terminated.

【００５９】また、与えられた検索条件式の中に単一の
検索タームか、あるいは複数の検索ターム間の論理的な
関係(AND条件やOR条件)が指定されているだけで、テキ
スト中での位置関係までは指定されていない場合には、
ここで検索を終了し凝縮テキストサーチプログラム２１
４によって出力された文書番号を検索結果としてシステ
ム制御プログラム２０１に出力する。If the given search condition expression specifies only a single search term, or only a logical relationship (an AND or OR condition) among multiple search terms, and does not specify their positional relationship in the text, the search ends here, and the document numbers output by the condensed text search program 214 are output to the system control program 201 as the search result.

【００６０】それ以外の場合、すなわち与えられた検索
条件式の中に複数の検索ターム間のテキスト中での位置
関係が指定されている場合には、ステップ１０２２でテ
キストサーチプログラム２１５を起動し、テキストサー
チを行う。Otherwise, that is, when the given search condition expression specifies a positional relationship in the text among multiple search terms, the text search program 215 is started in step 1022 and a text search is performed.

【００６１】単一の検索タームが指定されたり、あるい
は単にＡＮＤ条件やＯＲ条件が指定されただけの場合に
凝縮テキストサーチで検索を終了できるのは、凝縮テキ
スト１０４にはその作成アルゴリズムからも分かるよう
に、テキスト１０３中に存在する単語が漏れなく抽出さ
れており、凝縮テキスト１０４を検索するだけで指定さ
れた単語がテキストデータ中に現われたか否かが判定で
きるためである。The search can end at the condensed text search when only a single search term, or only an AND or OR condition, is specified because, as its creation algorithm shows, the condensed text 104 contains every word present in the text 103 without omission, so searching the condensed text 104 alone suffices to determine whether a specified word appears in the text data.

【００６２】例えば、「“ｉｎｆｏｒｍａｔｉｏｎ"
〈ＡＮＤ〉“ｓｙｓｔｅｍｓ”」のように記述される
「“ｉｎｆｏｒｍａｔｉｏｎ”と“ｓｙｓｔｅｍｓ”の
両方が現れる文書を探せ」という意味を持つＡＮＤ条件
や、「“ｉｎｆｏｒｍａｔｉｏｎ”〈ＯＲ〉“ｓｙｓｔ
ｅｍｓ”」のように記述される「“ｉｎｆｏｒｍａｔｉ
ｏｎ”か“ｓｙｓｔｅｍｓ”のどちらかが現れる文書を
探せ」という意味を持つＯＲ条件などは、複数の検索タ
ーム間の論理的な関係が指定されているだけで、テキス
ト中での位置関係までは指定されていない。そのため、
“ｉｎｆｏｒｍａｔｉｏｎ”と“ｓｙｓｔｅｍｓ”の存
在分かればよいだけなので凝縮テキストサーチだけで検
索条件の成否を判定することができる。For example, an AND condition, written as "information" <AND> "systems" and meaning "find documents in which both "information" and "systems" appear", and an OR condition, written as "information" <OR> "systems" and meaning "find documents in which either "information" or "systems" appears", specify only a logical relationship among multiple search terms and not their positional relationship in the text. Because it suffices to know whether "information" and "systems" are present, the condensed text search alone can determine whether the search condition is satisfied.

【００６３】これに対し、「“ｉｎｆｏｒｍａｔｉｏ
ｎ”〈Ｓ〉“ｓｙｓｔｅｍｓ”」のように記述される
「“ｉｎｆｏｒｍａｔｉｏｎ”と“ｓｙｓｔｅｍｓ”が
同一の文（センテンス）に共起（同時に出現）する文書
を探せ」という意味を持つ文脈条件や、「“ｉｎｆｏｒ
ｍａｔｉｏｎ”〈２Ｗ〉“ｓｙｓｔｅｍｓ”」のように
記述される「“ｉｎｆｏｒｍａｔｉｏｎ”と“ｓｙｓｔ
ｅｍｓ”が２語以内に近接して現れる文書を探せ」とい
う意味を持つ近傍条件、あるいは「“ｉｎｆｏｒｍａｔ
ｉｏｎ”〈Ａ〉“ｓｙｓｔｅｍｓ”」のように記述され
る「“ｉｎｆｏｒｍａｔｉｏｎ”と“ｓｙｓｔｅｍｓ”
が隣接して現れる文書を探せ」という意味を持つ隣接条
件などは、複数の検索ターム間のテキスト中での位置関
係が指定されているため、単語の出現位置情報を持たな
い凝縮テキストサーチだけでは成否の判定ができず、テ
キストサーチまで行わなければならない。In contrast, a context condition, written as "information" <S> "systems" and meaning "find documents in which "information" and "systems" co-occur in the same sentence"; a proximity condition, written as "information" <2W> "systems" and meaning "find documents in which "information" and "systems" appear within two words of each other"; and an adjacency condition, written as "information" <A> "systems" and meaning "find documents in which "information" and "systems" appear adjacent to each other", all specify a positional relationship in the text among multiple search terms. The condensed text search, which holds no information on where words occur, cannot decide such conditions by itself, so the text search must also be performed.
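
The three positional conditions can be illustrated with the following sketch. The description does not define the exact semantics, so several points here are assumptions for illustration only: terms are matched as whole lower-cased words, sentences end at ".", "!" or "?", the proximity condition compares word positions, and the adjacency condition is order-sensitive. All function names are hypothetical.

```python
import re

def words_of(text):
    # Assumption: a word is a run of ASCII letters or digits, compared in lower case.
    return re.findall(r"[A-Za-z0-9]+", text.lower())

def same_sentence(text, a, b):
    # <S>: both terms occur in one sentence.
    for sentence in re.split(r"[.!?]", text):
        w = words_of(sentence)
        if a in w and b in w:
            return True
    return False

def within_words(text, a, b, k):
    # <kW>: some occurrence of a and some occurrence of b lie at most k words apart.
    w = words_of(text)
    pos_a = [i for i, x in enumerate(w) if x == a]
    pos_b = [i for i, x in enumerate(w) if x == b]
    return any(abs(i - j) <= k for i in pos_a for j in pos_b)

def adjacent(text, a, b):
    # <A>: b immediately follows a.
    w = words_of(text)
    return any(x == a and y == b for x, y in zip(w, w[1:]))

text = "Information retrieval systems are large. New information arrived."
print(same_sentence(text, "information", "systems"),
      within_words(text, "information", "systems", 2),
      adjacent(text, "information", "systems"))
```

Each check needs the word order of the original text, which is exactly the information the condensed text discards.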

【００６４】凝縮テキストサーチの結果件数が０件でな
く、かつ上述した文脈条件、近傍条件あるいは隣接条件
が指定されている場合には、テキストサーチプログラム
２１５が起動され、凝縮テキストサーチプログラム２１
４で出力された文書番号に対応するテキストデータをテ
キスト１０３からワークエリア２１７に読み込む。そし
て、テキストサーチプログラム２１５はこのテキストデ
ータを探索し、与えられた検索タームを含み、かつ検索
ターム間の位置関係に関する指定条件を満たすものを抽
出し、この抽出テキストデータに対応する文書番号を検
索結果として検索制御プログラム２０９に出力する。When the condensed text search yields at least one hit and a context, proximity, or adjacency condition as described above is specified, the text search program 215 is started, and the text data corresponding to the document numbers output by the condensed text search program 214 is read from the text 103 into the work area 217. The text search program 215 then searches this text data, extracts the text data that contains the given search terms and satisfies the specified condition on the positional relationship among them, and outputs the corresponding document numbers to the search control program 209 as the search result.
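
The control flow of steps 1020 to 1022 can be summarized by the sketch below. It is a simplified illustration under stated assumptions: only an AND of the given terms is handled, the candidate document numbers from the concatenated character component table search are passed in as a list, and the names hierarchical_search and position_check are hypothetical.

```python
def hierarchical_search(terms, candidates, condensed, texts, position_check=None):
    """AND-search `terms` over the candidates left by the component table search."""
    if not candidates:
        return []  # stage 1 found nothing: report zero hits
    # Stage 2 (condensed text search): a term is present in a document if it is
    # a substring of one of the words retained in that document's condensed text.
    stage2 = [d for d in candidates
              if all(any(t in w for w in condensed[d]) for t in terms)]
    if not stage2 or position_check is None:
        return stage2  # no positional condition: the search ends here
    # Stage 3 (text search): only now is the original text read and checked.
    return [d for d in stage2 if position_check(texts[d])]

texts = {1: "information systems", 2: "systems of information", 3: "informal notes"}
condensed = {1: ["information", "systems"],
             2: ["information", "of", "systems"],
             3: ["informal", "notes"]}
print(hierarchical_search(["information", "systems"], [1, 2, 3], condensed, texts))
```

Each stage reads progressively larger data for progressively fewer documents, which is the point of the hierarchy.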

【００６５】以上が本発明のフルテキストサーチ方法を
適用した第一の実施例のフルテキストサーチシステムの
概略である。The above is the outline of the full-text search system of the first embodiment to which the full-text search method of the present invention is applied.

【００６６】本実施例における連接文字成分表の登録手
順は図７に示す通りである。以下、この登録手順を詳細
に説明する。The procedure for registering the concatenated character component table in this embodiment is as shown in FIG. Hereinafter, this registration procedure will be described in detail.

【００６７】まず、連接文字成分表の登録処理の詳細に
ついて説明する。ここでは、前述したように連接文字成
分表作成登録プログラム２０５により、スキップ連接文
字抽出プログラム２０６が起動される。First, the details of the process of registering the concatenated character component table will be described. Here, as described above, the concatenated character component table creation registration program 205 activates the skip concatenated character extraction program 206.

【００６８】本プログラムの実行により、磁気ディスク
１１０に格納されたテキスト１０３から文書毎にテキス
トデータがワークエリア２１７に読み込まれ、このテキ
ストデータから１文字おきに３文字の文字列が抽出され
る。By executing this program, text data is read from the text 103 stored on the magnetic disk 110 into the work area 217 for each document, and a character string of three characters is extracted every other character from the text data.

【００６９】次に、連接文字成分表登録プログラム２０
７が起動される。ここでは、スキップ連接文字抽出プロ
グラム２０５によってテキストデータから抽出された上
記連接文字成分に対してハッシュテーブル２１６を用い
て対応するエントリを算出し、該当するビット位置に
‘１'を設定し、連接文字成分の存在を記す。Next, the concatenated character component table registration program 207 is started. For each concatenated character component extracted from the text data by the skip concatenated character extraction program 205, it computes the corresponding entry using the hash table 216 and sets the corresponding bit position to '1' to record that the concatenated character component is present.

【００７０】このスキップ連接文字抽出プログラム２０
６の処理では、例えば図７に示すように“Ｍｕｌｔｉｍ
ｅｄｉａ"というテキストデータに対しては、１文字お
きに３文字の連接文字として“Ｍｌｉ”，“ｕｔｍ”，
“ｌｉｅ”，“ｔｍｄ”，“ｉｅｉ”および“ｍｄａ”
が抽出される。In the processing of the skip concatenated character extraction program 206, as shown for example in FIG. 7, the text data "Multimedia" yields "Mli", "utm", "lie", "tmd", "iei", and "mda" as the three-character concatenated characters formed by taking every other character.
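
The extraction in this example can be reproduced with a short sketch. The function name skip_ngrams and its parameters m (characters skipped) and n (characters per component) are hypothetical names chosen for illustration.

```python
def skip_ngrams(text, m=1, n=3):
    """Return every n-character string formed by taking every (m+1)-th character."""
    step = m + 1                  # distance between the characters taken
    span = (n - 1) * step + 1     # stretch of text covered by one component
    return [text[i:i + span:step] for i in range(len(text) - span + 1)]

print(skip_ngrams("Multimedia"))
# ['Mli', 'utm', 'lie', 'tmd', 'iei', 'mda']
```

A text shorter than the covered span (five characters for m = 1, n = 3) yields an empty list.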

【００７１】次に、抽出文字列に対して、ハッシュテー
ブル２１６を介して参照した連接文字成分表１０５の該
当エントリの対応ビット位置に‘１’を設定する。図８
の例では、文書１中に“Ｍｌｉ”があるのでハッシュテ
ーブル２１６を用いて参照した該当エントリの文書１に
対応するビット位置に‘１’を設定する。“ｓｓｅ”の
場合も同様に該当エントリに‘１’を設定する。以下、
同様にしてテキストデータ中に存在する連接成分のすべ
てについて、連接文字成分表１０５の該当エントリに
‘１’を設定する。最終的には、同図に示すように全て
の登録文書について‘１’と‘０’の列(ビットリスト)
ができあがる。例えば、“ｎｓｓ”の「１０１・・・
０」の列が１つのビットリストである。Next, for each extracted character string, the bit position corresponding to the document in the relevant entry of the concatenated character component table 105, looked up through the hash table 216, is set to '1'. In the example of FIG. 8, "Mli" occurs in document 1, so the bit position corresponding to document 1 in the entry found through the hash table 216 is set to '1'. The entry for "sse" is set to '1' in the same way. The same is done for every concatenated character component present in the text data. In the end, as shown in the figure, each entry holds a string of '1's and '0's (a bit list) covering all registered documents. For example, the string "101...0" for "nss" is one bit list.
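
The registration step can be sketched as follows. This is an illustration under assumptions, not the table layout of the embodiment: each bit list is held as a Python integer whose bit d stands for document d, the table size is arbitrary, and CRC-32 stands in for the unspecified "simple hash function". The names TABLE_SIZE, entry_of, and register are hypothetical.

```python
import zlib

TABLE_SIZE = 1024  # hypothetical number of entries in the component table

def skip_ngrams(text, m=1, n=3):
    step, span = m + 1, (n - 1) * (m + 1) + 1
    return [text[i:i + span:step] for i in range(len(text) - span + 1)]

def entry_of(gram):
    # Stand-in hash: CRC-32 of the component, reduced to a table index.
    return zlib.crc32(gram.encode("utf-8")) % TABLE_SIZE

def register(table, doc_no, text):
    # Set the bit for this document in the entry of every component of the text.
    for gram in skip_ngrams(text):
        table[entry_of(gram)] |= 1 << doc_no

table = [0] * TABLE_SIZE          # all entries cleared to 0
register(table, 1, "Multimedia")
print((table[entry_of("Mli")] >> 1) & 1)
```

Different components may share an entry, which is one source of the search noise discussed for the second embodiment.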

【００７２】このようにして、連接文字成分表作成登録
プログラム２０５により文書の登録時に連接文字成分表
１０５が作成され、階層プリサーチの準備ができあが
る。In this way, the concatenated character component table preparation registration program 205 prepares the concatenated character component table 105 when a document is registered, and the preparation for the hierarchical pre-search is completed.

【００７３】次に、連接文字成分表の検索手順につい
て、図９を用いて詳細に説明する。Next, the procedure for searching the concatenated character component table will be described in detail with reference to FIG.

【００７４】まず、検索制御プログラム２０９はステッ
プ１０３０でスキップ連接文字抽出プログラム２１２を
起動する。ここでは、検索条件式中の検索タームから１
文字おきに３文字の文字列を抽出する。ただし、本実施
例では３文字の１文字おきの連接文字成分表を用いてい
るため、５文字未満の検索タームの場合は連接文字が得
られないことになる。この場合、本実施例では、連接文
字成分表サーチの結果を全件ヒットとし、すべての文書
に対して凝縮テキストサーチを行うことにする。すなわ
ち、この文書の番号を出力して、連接文字成分表サーチ
プログラム２１１が終了する。First, in step 1030, the search control program 209 starts the skip concatenated character extraction program 212, which extracts from each search term in the search condition expression the three-character strings formed by taking every other character. Because this embodiment uses a concatenated character component table of three characters taken every other character, no concatenated character can be obtained from a search term shorter than five characters. In that case, this embodiment treats the result of the concatenated character component table search as a hit on every document and performs the condensed text search on all documents. That is, the numbers of all documents are output and the concatenated character component table search program 211 ends.

【００７５】５文字以上の検索タームが与えられた場合
には、ステップ１０３１でスキップ連接文字抽出プログ
ラム２１２によって抽出された文字列に対応するビット
リストを、ビットアンドプログラム２１３が連接文字成
分表１０５からハッシュテーブル２１６を介してワーク
エリア２１７に読み込み、ステップ１０３２で読み込ん
だビットリスト間でビット毎に論理積演算を行う。そし
てステップ１０３３で、この論理積演算の結果が‘１’
となったビットに対応する文書番号を算出し、これを連
接文字成分表サーチ結果として出力する。When a search term of five or more characters is given, in step 1031 the bit-AND program 213 reads the bit lists corresponding to the character strings extracted by the skip concatenated character extraction program 212 from the concatenated character component table 105, through the hash table 216, into the work area 217, and in step 1032 takes the bitwise logical product of the bit lists read. In step 1033, the document numbers corresponding to the bits that are '1' in the result are computed and output as the concatenated character component table search result.

【００７６】例えば、図１０に示すように“Ｍｕｌｔｉ
ｍｅｄｉａ”という文字列が検索タームとして与えられ
た場合、“Ｍｌｉ”，“ｕｔｍ”，“ｌｉｅ”，“ｔｍ
ｄ”，“ｉｅｉ”および“ｍｄａ”に対応する連接文字
成分表１０５のビットリストがハッシュテーブル２１６
を介して読み出され、これらすべてのビットリストのビ
ットがすべて‘１’である文書が連接文字成分表サーチ
の検索結果として得られる。For example, as shown in FIG. 10, "Multi
When the character string "media" is given as the search term, "Mli", "utm", "lie", "tm"
The bit list of the concatenated character component table 105 corresponding to “d”, “iei”, and “mda” is the hash table 216.
A document in which all the bits in the bit list are all “1” is obtained as the search result of the concatenated character component table search.

【００７７】すなわち、読み出したすべてのビットリス
トの間でビット毎に論理積演算を施し、ビットアンド演
算結果９００を得る。このビットアンド演算結果のビッ
トリスト中で、‘１’となっているビット位置に対応す
る文書番号が連接文字成分表サーチの検索結果としての
ヒット文書を表わすことになる。That is, a logical product operation is performed for each bit among all the read bit lists to obtain a bit and operation result 900. In the bit list of the bit-and operation result, the document number corresponding to the bit position of "1" represents the hit document as the search result of the concatenated character component table search.

【００７８】これにより、“Ｍｌｉ”，“ｕｔｍ”，
“ｌｉｅ”，“ｔｍｄ”，“ｉｅｉ”および“ｍｄａ”
のすべてを含む文書が抽出されることになる。本図の例
では、文書１と文書Ｎがヒット文書ということになる。As a result, the documents that contain all of "Mli", "utm", "lie", "tmd", "iei", and "mda" are extracted. In the example of this figure, document 1 and document N are the hit documents.
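
Steps 1030 to 1033 can be sketched end to end as below. The sketch rests on the same assumptions as any toy model of the table: integer bit lists with bit d standing for document d, an arbitrary table size, and CRC-32 as a stand-in hash. The names build_table and component_table_search and the sample documents are hypothetical.

```python
import zlib

TABLE_SIZE = 4096  # hypothetical table size

def skip_ngrams(text, m=1, n=3):
    step, span = m + 1, (n - 1) * (m + 1) + 1
    return [text[i:i + span:step] for i in range(len(text) - span + 1)]

def entry_of(gram):
    return zlib.crc32(gram.encode("utf-8")) % TABLE_SIZE

def build_table(docs):
    table = [0] * TABLE_SIZE
    for doc_no, text in docs.items():
        for gram in skip_ngrams(text):
            table[entry_of(gram)] |= 1 << doc_no
    return table

def component_table_search(table, term, doc_numbers):
    grams = skip_ngrams(term)
    if not grams:
        return sorted(doc_numbers)       # term too short: every document is a hit
    bits = table[entry_of(grams[0])]
    for gram in grams[1:]:
        bits &= table[entry_of(gram)]    # bitwise AND of all bit lists read
    return [d for d in sorted(doc_numbers) if (bits >> d) & 1]

docs = {1: "Multimedia systems", 2: "information retrieval", 3: "Hypermedia and Multimedia"}
table = build_table(docs)
print(component_table_search(table, "Multimedia", docs))
```

The result always contains every document that really holds the term (documents 1 and 3 here); it may also contain extra documents, which the later condensed text search and text search remove.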

【００７９】このように、本実施例における連接文字成
分表作成登録処理では、文書の登録時に、テキストデー
タから１文字おきに３文字の文字列を取り出し、この連
接文字の存在情報を予め連接文字成分表に登録しておく
ことにより、単語固有の連接文字成分を取ることができ
るため、検索時の連接文字成分表サーチにおける絞り込
み率を向上させることが可能となる。その結果、階層プ
リサーチにおける凝縮テキストの探索量が削減されるこ
とになるため、等価的に全体の検索速度を向上できるこ
とになる。したがって、より大量のフルテキストサーチ
を実時間で行うことが可能となる。As described above, in the concatenated character component table creation/registration processing of this embodiment, three-character strings formed by taking every other character are extracted from the text data when a document is registered, and the presence information of these concatenated characters is registered in the concatenated character component table in advance. This captures concatenated character components that are specific to individual words, which improves the narrowing-down rate of the concatenated character component table search at search time. As a result, the amount of condensed text to be searched in the hierarchical pre-search is reduced, which effectively increases the overall search speed. A larger volume of text can therefore be full-text searched in real time.

【００８０】なお、本実施例では単語を意識することな
く“_"(スペース)や“."(ピリオド)、“,"(カンマ)など
を含んだ全ての文字列を対象として文字成分表を作成し
ている。このため、“Multimedia information system
s"などのスペースを含んだ文字列が検索タームに指定さ
れた場合にも、単語間にまたがった連接文字成分を利用
した絞り込みを行うことが可能となっている。In this embodiment, the character component table is created from all character strings, including "_" (space), "." (period), "," (comma), and the like, without regard to word boundaries. Consequently, even when a character string containing spaces, such as "Multimedia information systems", is specified as the search term, the search can be narrowed down using concatenated character components that span words.

【００８１】また、本実施例では連接文字成分表を１文
字おきに３文字の文字列、すなわち、ｍ＝１、ｎ＝３で
作成する場合について説明したが、何文字おきに文字列
を抽出しても、また文字列長が２文字および４文字以上
の場合についても同様な処理が可能である。これは、上
記の説明から明らかであろう。This embodiment has been described for a concatenated character component table built from three-character strings taken every other character, that is, with m = 1 and n = 3. The same processing applies, however, whatever the number of characters skipped, and also when the character string length is two characters or four or more characters. This should be clear from the above description.
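
For general m and n, the five-character threshold of this embodiment can be derived as follows. The symbols L_min (shortest search term that yields a component) and N(L) (number of components obtained from a term of L characters) are introduced here for illustration and do not appear in the original description.

```latex
% One component takes n characters spaced (m+1) apart, so it covers
% positions i, i+(m+1), ..., i+(n-1)(m+1) of the search term.
L_{\min} = (n-1)(m+1) + 1,
\qquad
N(L) = \max\bigl(0,\; L - L_{\min} + 1\bigr).
% For m = 1, n = 3: L_min = 2*2 + 1 = 5, and "Multimedia" (L = 10) gives N = 6.
```

With m = 1 and n = 2, L_min falls to 3, so a separate two-character table can serve shorter search terms.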

【００８２】さらに、本実施例では５文字未満の検索タ
ームの場合、連接文字成分表サーチの結果を全件ヒット
として出力するようにしているが、別途１文字おきに２
文字の連接文字成分表を作成し、この連接文字成分表を
用いて５文字未満の検索タームの連接文字成分表サーチ
を行うようにすることもできる。Furthermore, in this embodiment the concatenated character component table search reports a hit on every document for a search term shorter than five characters. Alternatively, a separate concatenated character component table of two characters taken every other character can be created and used to perform the concatenated character component table search for search terms shorter than five characters.

【００８３】このように、本実施例による連接文字成分
表サーチでは、従来例に比べ検索ノイズを大幅に削減す
ることができるため、連接文字成分表サーチの検索結果
は凝縮テキストおよびテキストをサーチすることにより
得られる検索結果と大きな差が生じない。このため、連
接文字成分表サーチの検索結果をシステムの検索結果と
してそのままシステム制御プログラム２０１に出力する
ことも可能である。As described above, the concatenated character component table search of this embodiment greatly reduces search noise compared with the conventional example, so its results differ little from the results obtained by searching the condensed text and the text. The result of the concatenated character component table search can therefore also be output directly to the system control program 201 as the search result of the system.

【００８４】次に、本発明の第二の実施例について説明
する。Next, a second embodiment of the present invention will be described.

【００８５】本発明の第一の実施例では、単語を意識す
ることなく、“_"(スペース)や“."(ピリオド)、“,"
(カンマ)などを含んだ全文字列を対象として文字成分表
を作成している。しかし、この方法では単語間にまたが
ったスキップ連接文字を文字成分表に登録することにな
るため、単語を指定した検索の場合には以下の２種類の
ノイズが発生する。In the first embodiment of the present invention, the character component table is created from all character strings, including "_" (space), "." (period), "," (comma), and the like, without regard to word boundaries. With this method, however, skip concatenated characters that span words are registered in the character component table, so the following two kinds of noise arise in a search that specifies a word.

【００８６】まず、単語間にまたがったスキップ連接文
字成分が他の連接文字と同じエントリにハッシングされ
ることによりノイズが発生する。例えば、“angle"が検
索タームに指定されたときにはスキップ連接文字として
“age"が抽出されることになるが、“Multimedia infor
mation systems"における“_"(スペース)をまたがった
スキップ連接文字である“aif"が“age"と同じエントリ
にハッシングされた場合には“Multimedia information
systems"を含む文書がノイズとしてヒットしてしまう
ことになる。First, noise arises when a skip concatenated character component that spans words is hashed to the same entry as another concatenated character component. For example, when "angle" is specified as the search term, "age" is extracted as a skip concatenated character component; if "aif", a skip concatenated character component that spans the "_" (space) in "Multimedia information systems", is hashed to the same entry as "age", documents containing "Multimedia information systems" are hit as noise.

【００８７】さらに、検索タームから抽出したスキップ
連接文字成分が単語間にまたがって現われる文書を抽出
することによりノイズが発生する。すなわち、テキスト
中に“・・・ a green cup ・・・"という文字列が含まれる文
書が登録された場合には、“_"(スペース)をまたがった
スキップ連接文字である“age"が連接文字成分表に登録
される。これに対し、検索タームに“angle"が指定され
たときには検索タームからスキップ連接文字として“ag
e"が抽出されることになり、前述した文書が検索ノイズ
としてヒットしてしまうことになる。Second, noise arises when the search retrieves a document in which a skip concatenated character component extracted from the search term appears across a word boundary. For example, when a document whose text contains the character string "... a green cup ..." is registered, "age", a skip concatenated character that spans "_" (space), is registered in the concatenated character component table. When "angle" is then specified as the search term, "age" is extracted from it as a skip concatenated character, and the above document is hit as search noise.

【００８８】これらの問題に対し、本発明第二の実施例
では連接文字成分の抽出時にテキストから単語を切り出
し、切り出された単語からスキップ連接文字成分を抽出
して連接文字成分表を作成することによって、二つの単
語にかかる連接文字成分を抽出しないようにして、前記
の検索ノイズを削減する。To address these problems, the second embodiment of the present invention cuts words out of the text when extracting concatenated character components and extracts skip concatenated character components from the words to create the concatenated character component table. Concatenated character components that span two words are thus never extracted, which reduces the search noise described above.

【００８９】本実施例は図１に示した第一の実施例と基
本的に同様の構成をとるが、その中の連接文字成分表作
成登録プログラム２０５が図１１に示すような構成とな
る。This embodiment basically has the same structure as the first embodiment shown in FIG. 1, but the concatenated character component table preparation registration program 205 therein has a structure as shown in FIG.

【００９０】すなわち、本実施例における連接文字成分
表作成登録プログラム２０５は、単語切り出しプログラ
ム３００、スキップ連接文字抽出プログラム２０６、連
接文字成分表登録プログラム２０７およびハッシュテー
ブル作成プログラム２０８で構成される。That is, the concatenated character component table creation / registration program 205 in this embodiment comprises a word cutout program 300, a skip concatenated character extraction program 206, a concatenated character component table registration program 207 and a hash table preparation program 208.

【００９１】本実施例における連接文字成分表作成登録
プログラム２０５は、図１２に示すように、まずステッ
プ１１００で単語切り出しプログラム３００を起動し、
磁気ディスク１１０に格納されたテキスト１０３からテ
キストデータをワークエリア２１７に読み込む。そし
て、このテキストデータからスペースを区切りとして単
語を切り出す。As shown in FIG. 12, the concatenated character component table creation / registration program 205 in this embodiment first activates the word segmentation program 300 in step 1100,
Text data is read into the work area 217 from the text 103 stored in the magnetic disk 110. Then, words are cut out from this text data by separating spaces.

【００９２】次に、ステップ１１０１でスキップ連接文
字抽出プログラム２０６を起動し、単語切り出しプログ
ラム３００によって切り出されたすべての単語から１文
字おきに３文字の文字列をすべて抽出する。Next, in step 1101, the skip concatenated character extraction program 206 is started and extracts, from every word cut out by the word cutout program 300, every three-character string formed by taking every other character.

【００９３】最後に、ステップ１１０２で連接文字成分
表登録プログラム２０７を起動し、スキップ連接文字抽
出プログラム２０６によって単語から抽出された連接文
字を、ワークエリア２１７内の連接文字成分表１０５に
ハッシュテーブル２１６に従って登録し、これを磁気デ
ィスク１１０へ格納する。Finally, in step 1102, the concatenated character component table registration program 207 is started. It registers the concatenated characters extracted from the words by the skip concatenated character extraction program 206 in the concatenated character component table 105 in the work area 217 in accordance with the hash table 216, and stores the table on the magnetic disk 110.

【００９４】このスキップ連接文字の抽出および連接文
字成分表の登録について、例えば“Ｍｕｌｔｉｍｅｄｉ
ａｉｎｆｏｒｍａｔｉｏｎｓｙｓｔｅｍｓｍｕｓ
ｔ・・・”というテキストが登録された場合を例に説明
する。The extraction of skip concatenated characters and the registration of the concatenated character component table are described below, taking as an example the case where the text "Multimedia information systems must ..." is registered.

【００９５】単語切り出しプログラム３００によって文
書１は図１３に示すように、“Ｍｕｌｔｉｍｅｄｉ
ａ”，“ｉｎｆｏｒｍａｔｉｏｎ”，“ｓｙｓｔｅｍ
ｓ”，“ｍｕｓｔ”，・・・に分割される。As shown in FIG. 13, the word cutout program 300 causes the document 1 to be "Multimediadi".
a ”,“ information ”,“ system ”
It is divided into s "," must ", ....

【００９６】次に、スキップ連接文字抽出プログラム２
０６によって、切り出された単語からスキップ連接文字
成分として“Ｍｌｉ”，“ｕｔｍ”，“ｌｉｅ”，“ｔ
ｍｄ”，“ｉｅｉ”，“ｍｄａ”，“ｉｆｒ”，“ｎｏ
ｍ”，“ｆｒａ”，“ｏｍｔ”，・・・が抽出される。Next, the skip concatenated character extraction program 206 extracts "Mli", "utm", "lie", "tmd", "iei", "mda", "ifr", "nom", "fra", "omt", and so on from the cut-out words as skip concatenated character components.

【００９７】さらに、連接文字成分表登録プログラム２
０７では、スキップ連接文字抽出プログラム２０６によ
って切り出された“Ｍｌｉ”に対して、ハッシュテーブ
ル２１６を用いて参照した該当エントリの文書１に対応
するビット位置に‘１’を設定する。“ｕｔｍ”も同様
に‘１’を設定する。以下、同様にしてテキストデータ
中の単語に存在する連接文字成分のすべてについて、連
接文字成分表１０５の該当エントリに‘１’を設定す
る。最終的には、同図に示すように各登録文書について
‘１’と‘０’の列（ビットリスト）ができあがる。Furthermore, for "Mli" extracted by the skip concatenated character extraction program 206, the concatenated character component table registration program 207 sets to '1' the bit position corresponding to document 1 in the entry found through the hash table 216. The bit for "utm" is set to '1' in the same way. The same is done for every concatenated character component present in the words of the text data, setting '1' in the relevant entry of the concatenated character component table 105. In the end, as shown in the figure, a string of '1's and '0's (a bit list) covering the registered documents is completed.

【００９８】検索時には、本発明の第一の実施例と同様
に連接文字成分表サーチプログラム２１１においてスキ
ップ連接文字抽出プログラム２１２が起動され、入力さ
れた検索条件式中の検索タームから１文字おきに３文字
の文字列をすべて抽出する。At search time, as in the first embodiment of the present invention, the concatenated character component table search program 211 starts the skip concatenated character extraction program 212, which extracts from each search term in the input search condition expression every three-character string formed by taking every other character.

【００９９】次に、ビットアンドプログラム２１３を起
動し、スキップ連接文字抽出プログラム２１２によって
抽出されたすべての文字列に対応する連接文字成分表１
０５のエントリに格納されているビットリストを、ハッ
シュテーブル２１６を介してワークエリア２１７に読み
込み、読み込まれたすべてのビットリスト間で各ビット
毎に論理積演算を行う。この論理積演算の結果‘１’と
なったビットに対応する文書番号を連接文字成分表サー
チの結果として出力する。Next, the bit-AND program 213 is started. It reads the bit lists stored in the entries of the concatenated character component table 105 corresponding to all of the character strings extracted by the skip concatenated character extraction program 212 into the work area 217 through the hash table 216, and takes the bitwise logical product of all the bit lists read. The document numbers corresponding to the bits that are '1' in the result are output as the result of the concatenated character component table search.

【０１００】このように、本実施例における連接文字成
分表の作成登録処理では、テキストデータから単語を切
り出してから、その単語中の文字列から１文字おきに３
文字の文字列を抽出し、この連接文字の存在情報を予め
連接文字成分表に登録する。この単語分割により、二つ
の単語にまたがった連接文字成分を削除できるため、ハ
ッシングによるノイズおよび単語間にまたがったスキッ
プ連接文字によるノイズを減らすことができる。例え
ば、テキスト中に“・・・ a green cup ・・・"という文字列
が含まれる文書が登録された場合にも、“_"(スペース)
をまたがったスキップ連接文字である“age"が連接文字
成分表に登録さないため、スキップ連接文字として“ag
e"を持つ“angle"が検索タームに指定された場合にもノ
イズとして検索されることはない。その結果、検索時の
連接文字成分表サーチにおける絞り込み率を向上させる
ことができ、階層プリサーチにおける凝縮テキストの探
索量が削減できることになるため、等価的に全体の検索
速度を向上させることが可能となる。したがって、より
大量のフルテキストサーチを実時間で行うことが可能と
なる。As described above, in the creation and registration of the concatenated character component table in this embodiment, words are first cut out of the text data, three-character strings are then extracted from each word by taking every other character, and the presence information of these concatenated characters is registered in the concatenated character component table in advance. Because of this word division, concatenated character components that span two words are eliminated, which reduces both the noise caused by hashing and the noise caused by skip concatenated characters that span word boundaries. For example, even when a document whose text contains the character string "... a green cup ..." is registered, the skip concatenated character "age", which spans "_" (space), is not registered in the concatenated character component table. Consequently, when "angle", which has "age" as its skip concatenated character, is specified as a search term, that document is not retrieved as noise. As a result, the narrowing rate of the concatenated character component table search is improved and the amount of condensed text to be searched in the hierarchical pre-search is reduced, so the overall search speed is effectively improved. A larger volume of full-text search can therefore be performed in real time.
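
The effect of word division described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the patented implementation: the helper names `skip_ngrams` and `skip_ngrams_per_word` are hypothetical, "_" stands in for the space as in the text, and m = 1, n = 3 are assumed as in the example.

```python
def skip_ngrams(s, n=3, m=1):
    """Return the set of n-character strings formed by skipping m characters
    between picks (m=1 means every other character)."""
    step = m + 1
    span = (n - 1) * step + 1  # number of source characters one skip n-gram covers
    return {s[i:i + span:step] for i in range(len(s) - span + 1)}


def skip_ngrams_per_word(text, n=3, m=1):
    """Cut the text into words at spaces first, then extract per word."""
    grams = set()
    for word in text.split():
        grams |= skip_ngrams(word, n, m)
    return grams


text = "a green cup"
whole_text = skip_ngrams(text.replace(" ", "_"))  # no word division
per_word = skip_ngrams_per_word(text)             # with word division
query = skip_ngrams("angle")                      # components of the search term

print("age" in whole_text)  # the space-spanning component is present
print("age" in per_word)    # word division removes it
```

With word division the query component "age" no longer matches the "a green cup" document, which is the noise reduction the paragraph describes.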

【０１０１】本実施例では、予め定められたｍ文字(ｍ
は１以上の整数)おきの連接文字(スキップ連接文字)を
対象としてテキストを単語に分割してから文字成分表を
作成する方法について説明した。しかし、従来の互いに
隣り合う連接文字(逐次連接文字)に対しても同様に、テ
キストを単語に分割してから連接文字成分表を作成する
ことにより、“_"(スペース)などを含む連接文字が他の
連接文字と同じエントリにハッシングされることによっ
て生じるノイズが削減できることは明らかであろう。This embodiment has described a method in which the text is divided into words before the character component table is created, targeting concatenated characters taken every predetermined m characters (m is an integer of 1 or more), that is, skip concatenated characters. It will be clear, however, that for conventional adjacent concatenated characters (sequential concatenated characters) as well, dividing the text into words before creating the concatenated character component table likewise reduces the noise that arises when concatenated characters containing "_" (space) and the like are hashed to the same entry as other concatenated characters.

【０１０２】次に、本発明の第三の実施例について説明
する。Next, a third embodiment of the present invention will be described.

【０１０３】本発明第二の実施例では、予めテキストデ
ータを単語に分割し、各単語毎にスキップ連接文字を抽
出することにより、単語間にまたがったスキップ連接文
字が他の連接文字と同じエントリにハッシングされるこ
とによって生じるノイズ、および単語間にまたがったス
キップ連接文字成分によって生じるノイズを削減するこ
とができた。しかし、この方法では検索タームに指定し
た単語を部分文字列として含む別の単語が現われる文書
をノイズとして検索してしまうという問題がある。すな
わち、テキスト中に“jangle"という単語を含む文書が
登録された時には、スキップ連接文字として“jnl"と
“age"が抽出され連接文字成分表に登録されることにな
る。それに対し、検索タームに“angle"が指定された時
には、検索タームからスキップ連接文字として“age"が
抽出されるため、検索タームである“angle"を部分文字
列として含む別の単語である“jangle"等の現われる文
書がノイズとしてヒットしてしまう。In the second embodiment of the present invention, the text data is divided into words in advance and skip concatenated characters are extracted from each word. This reduces the noise that arises when skip concatenated characters spanning word boundaries are hashed to the same entry as other concatenated characters, as well as the noise caused by the word-spanning skip concatenated character components themselves. However, this method has the problem that a document containing a different word that includes the search term as a substring is retrieved as noise. For example, when a document whose text contains the word "jangle" is registered, "jnl" and "age" are extracted as skip concatenated characters and registered in the concatenated character component table. When "angle" is then specified as the search term, "age" is extracted from it as its skip concatenated character, so a document containing "jangle" or another word that includes "angle" as a substring is hit as noise.

【０１０４】この問題に対し本発明の第三の実施例で
は、第二の実施例における文書検索方法において連接文
字成分表の登録時および検索時に単語の前後に特殊文字
等の所定の符号（以下特殊文字で説明する）を付加す
る。つまり、特殊文字（例えば、ここでは‘＾’とす
る）を付加し、それを含めて連接文字成分を抽出する特
殊文字付加型の連接文字成分表を作成する。これによ
り、特殊文字で単語の区切りを判別できるようにし、検
索タームを部分文字列として含む別の単語が現われる文
書を排除して、ノイズを削減する。To address this problem, the third embodiment of the present invention modifies the document search method of the second embodiment by adding a predetermined code, such as a special character (the description below uses a special character), before and after each word both when the concatenated character component table is registered and when it is searched. That is, a special character (here, for example, '^') is added, and concatenated character components are extracted from the word including the special character, producing a special-character-added concatenated character component table. Word boundaries can thereby be identified by the special character, documents containing a different word that includes the search term as a substring are excluded, and noise is reduced.

【０１０５】本実施例は図１に示した第一の実施例と基
本的に同様の構成をとるが、その中の連接文字成分表作
成登録プログラム２０５と連接文字成分表サーチプログ
ラム２１１の部分が、それぞれ図１４と図１５に示すよ
うな構成となる。This embodiment has basically the same configuration as the first embodiment shown in FIG. 1, except that the concatenated character component table creation/registration program 205 and the concatenated character component table search program 211 are configured as shown in FIG. 14 and FIG. 15, respectively.

【０１０６】すなわち、連接文字成分表作成登録プログ
ラム２０５は単語切出しプログラム３００、特殊文字付
加プログラム３０１、スキップ連接文字抽出プログラム
２０６、連接文字成分表登録プログラム２０７およびハ
ッシュテーブル作成プログラム２０８で構成され、連接
文字成分表サーチプログラム２１１は特殊文字付加プロ
グラム３０２、スキップ連接文字抽出プログラム２１２
およびビットアンドプログラム２１３で構成される。That is, the concatenated character component table creation/registration program 205 consists of a word cutout program 300, a special character addition program 301, a skip concatenated character extraction program 206, a concatenated character component table registration program 207, and a hash table creation program 208. The concatenated character component table search program 211 consists of a special character addition program 302, a skip concatenated character extraction program 212, and a bit-AND program 213.

【０１０７】連接文字成分表作成登録プログラム２０５
は、図１６に示すように、まずステップ１２００で単語
切出しプログラム３００を起動し、磁気ディスク１１０
に格納されたテキスト１０３からテキストデータがワー
クエリア２１７に読み込む。そして、このテキストデー
タからスペースを区切りとして単語を切り出す。As shown in FIG. 16, the concatenated character component table creation/registration program 205 first starts the word cutout program 300 in step 1200 and reads the text data from the text 103 stored on the magnetic disk 110 into the work area 217. Words are then cut out of this text data using spaces as delimiters.

【０１０８】次に、ステップ１２０１で特殊文字付加プ
ログラム３０１を起動し、単語切出しプログラム３００
によって切り出された単語の前後に特殊文字‘＾’を付
加する。Next, in step 1201, the special character addition program 301 is started, and the special character '^' is added before and after each word cut out by the word cutout program 300.

【０１０９】その後、ステップ１２０２でスキップ連接
文字抽出プログラム２０６を起動し、特殊文字付加プロ
グラム３０１によって特殊文字を付加されたすべての単
語から１文字おきに３文字の文字列をすべて抽出する。Then, in step 1202, the skip concatenated character extraction program 206 is started, and all three-character strings formed by taking every other character are extracted from every word to which the special character addition program 301 has added the special characters.

【０１１０】最後に、ステップ１２０３で連接文字成分
表登録プログラム２０７を起動し、スキップ連接文字抽
出プログラム２０６によって単語から抽出された連接文
字を、ワークエリア２１７内の連接文字成分表１０５に
ハッシュテーブル２１６に従って登録し、これを磁気デ
ィスク１１０へ格納する。Finally, in step 1203, the concatenated character component table registration program 207 is started. The concatenated characters extracted from the words by the skip concatenated character extraction program 206 are registered in the concatenated character component table 105 in the work area 217 according to the hash table 216, and the table is then stored on the magnetic disk 110.

【０１１１】検索時には、連接文字成分表サーチプログ
ラム２１１は、図１７に示すように、まずステップ１２
１０で特殊文字付加プログラム３０２を起動し、検索条
件式中の検索タームの前後に特殊文字‘＾’を付加す
る。At search time, as shown in FIG. 17, the concatenated character component table search program 211 first starts the special character addition program 302 in step 1210 and adds the special character '^' before and after the search term in the search condition expression.

【０１１２】次に、ステップ１２１１でスキップ連接文
字抽出プログラム２１２を起動し、特殊文字付加プログ
ラム３０２によって特殊文字‘＾’が付加された検索タ
ームから３文字の一続きの文字列すべてを抽出する。Next, in step 1211, the skip concatenated character extraction program 212 is started, and all three-character strings are extracted from the search term to which the special character addition program 302 has added the special character '^'.

【０１１３】その後、ステップ１２１２でビットアンド
プログラム２１３を起動し、スキップ連接文字抽出プロ
グラム２１２によって抽出されたすべての文字列に対応
する連接文字成分表１０５のエントリに格納されている
ビットリストを、ハッシュテーブル２１６を介してワー
クエリア２１７に読み込み、ステップ１２１３で読み込
まれたすべてのビットリスト間で各ビット毎に論理積演
算を行う。Then, in step 1212, the bit-AND program 213 is started, and the bit lists stored in the entries of the concatenated character component table 105 that correspond to all the character strings extracted by the skip concatenated character extraction program 212 are read into the work area 217 via the hash table 216. In step 1213, a bitwise logical AND is computed across all the bit lists that were read.

【０１１４】この論理積演算の結果、‘１’となったビ
ットに対応する文書番号を連接文字成分表サーチの結果
として出力する。As a result of this logical product operation, the document number corresponding to the bit that becomes "1" is output as the result of the concatenated character component table search.

【０１１５】以下、上述した連接文字成分表作成登録プ
ログラム２０５の処理内容を詳細に説明する。The processing contents of the concatenated character component table creation registration program 205 described above will be described in detail below.

【０１１６】連接文字成分表作成登録プログラム２０５
では、まずテキスト１０３からスペースを区切りとして
単語が切り出され、各単語の前後に特殊文字‘＾’が付
加される。その後、特殊文字‘＾’が付加された単語か
ら１文字おきに３文字の文字列が抽出される。In the concatenated character component table creation/registration program 205, words are first cut out of the text 103 using spaces as delimiters, and the special character '^' is added before and after each word. Three-character strings are then extracted from every other character of each word to which the special character '^' has been added.

【０１１７】このスキップ連接文字の抽出処理につい
て、例えば、“Ｍｕｌｔｉｍｅｄｉａ　ｉｎｆｏｒｍａｔ
ｉｏｎ　ｓｙｓｔｅｍｓ　ｍｕｓｔ・・・”というテキ
ストが登録された場合を例に説明する。The extraction of skip concatenated characters is described below, taking as an example the case where the text "Multimedia information systems must ..." is registered.

【０１１８】単語切出しプログラム３００によって図１
８に示すように文書１は、“Ｍｕｌｔｉｍｅｄｉａ”，
“ｉｎｆｏｒｍａｔｉｏｎ”，“ｓｙｓｔｅｍｓ”，
“ｍｕｓｔ”，・・・に分割される。As shown in FIG. 18, the word cutout program 300 divides document 1 into "Multimedia", "information", "systems", "must", and so on.

【０１１９】次に、特殊文字付加プログラム３０１によ
って、切り出された各単語の前後に特殊文字‘＾’が付
加され、“＾Ｍｕｌｔｉｍｅｄｉａ＾”，“＾ｉｎｆｏ
ｒｍａｔｉｏｎ＾”，“＾ｓｙｓｔｅｍｓ＾”，“＾ｍ
ｕｓｔ＾”，・・・となる。Next, the special character addition program 301 adds the special character '^' before and after each word that was cut out, giving "^Multimedia^", "^information^", "^systems^", "^must^", and so on.

【０１２０】次に、スキップ連接文字抽出プログラム２
０６によって、特殊文字‘＾’を前後に付加した単語か
ら、“＾ｕｔ”，“Ｍｌｉ”，“ｕｔｍ”，“ｌｉ
ｅ”，“ｔｍｄ”，“ｉｅｉ”，“ｍｄａ”，“ｅｉ
＾”，“＾ｎｏ”，“ｉｆｒ”，“ｎｏｍ”，“ｆｒ
ａ”，“ｏｍｔ”，・・・が抽出される。Next, the skip concatenated character extraction program 206 extracts "^ut", "Mli", "utm", "lie", "tmd", "iei", "mda", "ei^", "^no", "ifr", "nom", "fra", "omt", and so on from the words with the special character '^' added before and after them.

【０１２１】最後に、連接文字成分表登録プログラム２
０７を起動する。ここでは、ハッシュテーブル２１６を
介して、スキップ連接文字抽出プログラム２０６によっ
て、特殊文字‘＾’を付加した単語から抽出された連接
文字成分に対応するエントリに‘１’を設定し、連接文
字成分の存在を記す。Finally, the concatenated character component table registration program 207 is started. Via the hash table 216, it sets '1' in the entry corresponding to each concatenated character component that the skip concatenated character extraction program 206 extracted from the words with the special character '^' added, thereby recording the presence of that component.

【０１２２】図１８の文書１の例では、“＾ｕｔ”があ
るのでハッシュテーブル２１６を用いて参照した該当エ
ントリの文書１に対応するビット位置に‘１’を設定す
る。“Ｍｌｉ”も同様に‘１’を設定する。In the example of document 1 in FIG. 18, "^ut" is present, so '1' is set at the bit position corresponding to document 1 in the entry looked up through the hash table 216. '1' is set in the same way for "Mli".

【０１２３】以下、同様にして特殊文字‘＾’を前後に
付加した単語中の連接文字成分のすべてについて、連接
文字成分表１０５の該当エントリに‘１’を設定する。
最終的には、同図に示すようにテキスト１０３中の各文
書について‘１’と‘０’の列（ビットリスト）ができ
あがる。In the same way, '1' is set in the corresponding entry of the concatenated character component table 105 for every concatenated character component in the words with the special character '^' added before and after them. In the end, as shown in the same figure, a string of '1's and '0's (a bit list) is produced for each document in the text 103.
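
The registration steps above (cut out words, add '^', extract skip components, set bits) can be sketched as follows. This is a hedged illustration only: the table size, the use of `zlib.crc32` as the hash function, and the names `entry_index`, `components`, and `register` are assumptions for the example and are not taken from the patent. Each table entry is modeled as a Python integer used as a bit list, with bit d standing for document d.

```python
import zlib

TABLE_SIZE = 4096  # assumed number of entries in the component table
MARK = "^"         # boundary character used in the text


def entry_index(component: str) -> int:
    """Stand-in for hash table 216: map a component to a table entry."""
    return zlib.crc32(component.encode("utf-8")) % TABLE_SIZE


def components(word: str) -> set[str]:
    """Skip trigrams (every other character) of the boundary-marked word."""
    w = MARK + word + MARK
    return {w[i:i + 5:2] for i in range(len(w) - 4)}


def register(table: list[int], doc_no: int, text: str) -> None:
    """Set the bit for doc_no in every entry touched by the document's words."""
    for word in text.split():
        for comp in components(word):
            table[entry_index(comp)] |= 1 << doc_no


table = [0] * TABLE_SIZE
register(table, 1, "Multimedia information systems must")
print(sorted(components("Multimedia")))
```

Words shorter than three characters yield no components in this sketch, since the marked word is then shorter than the five characters one skip trigram spans.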

【０１２４】次に、連接文字成分表サーチプログラム２
１１の処理内容を詳細に説明する。連接文字成分表サー
チプログラム２１１では、まず検索条件式中の検索ター
ムの前後に特殊文字‘＾’を付加し、その検索タームか
ら３文字の一続きの文字列を抽出する。Next, the processing of the concatenated character component table search program 211 is described in detail. The program first adds the special character '^' before and after the search term in the search condition expression and then extracts three-character strings from that search term.

【０１２５】その後、抽出された各文字列に対応するビ
ットリスト間でビット毎に論理積演算を行い、‘１’と
なったビットに対応する文書番号を連接文字成分表サー
チ結果として出力する。After that, a logical product operation is performed for each bit between the bit lists corresponding to the extracted character strings, and the document number corresponding to the bit that becomes "1" is output as the concatenated character component table search result.

【０１２６】例えば、図１９に示すように“Ｍｕｌｔｉ
ｍｅｄｉａ”という検索タームは特殊文字を付加するこ
とにより、“＾Ｍｕｌｔｉｍｅｄｉａ＾”となる。この
検索タームから１文字おきに３文字の文字列を抽出する
ことにより“＾ｕｔ”，“Ｍｌｉ”，“ｕｔｍ”，“ｌ
ｉｅ”，“ｔｍｄ”，“ｉｅｉ”，“ｍｄａ”および
“ｅｉ＾”が連接文字成分として得られる。そして、こ
れらに対応する連接文字成分表１０５のビットリストを
ハッシュテーブル２１６を介して読み出し、これらすべ
てのビットリストのビットがすべて‘１’である文書が
連接文字成分表サーチの検索結果として得られる。For example, as shown in FIG. 19, the search term "Multimedia" becomes "^Multimedia^" when the special character is added. Extracting three-character strings from every other character of this search term yields "^ut", "Mli", "utm", "lie", "tmd", "iei", "mda", and "ei^" as concatenated character components. The bit lists of the concatenated character component table 105 corresponding to these components are read out via the hash table 216, and the documents whose bits are '1' in all of these bit lists are obtained as the result of the concatenated character component table search.

【０１２７】すなわち、読み出したすべてのビットリス
トの間でビット毎に論理積演算を施し、ビットアンド演
算結果９００を得る。このビットアンド演算結果のビッ
トリスト中で、‘１’となっているビット位置に対応す
る文書番号が連接文字成分表サーチの検索結果としての
ヒット文書を表わすことになる。That is, a bitwise logical AND is computed across all the bit lists that were read out, giving the bit-AND operation result 900. In the bit list of this result, the document numbers corresponding to the bit positions that are '1' represent the hit documents of the concatenated character component table search.

【０１２８】これにより、“＾ｕｔ”，“Ｍｌｉ”，
“ｕｔｍ”，“ｌｉｅ”，“ｔｍｄ”，“ｉｅｉ”，
“ｍｄａ”および“ｅｉ＾”のすべてを含む文書が検索
結果として抽出されることになる。図１９の例では、文
書１がヒット文書ということになる。As a result, the documents that contain all of "^ut", "Mli", "utm", "lie", "tmd", "iei", "mda", and "ei^" are extracted as the search result. In the example of FIG. 19, document 1 is the hit document.
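
The search side (add '^' to the term, extract its components, AND the bit lists) can be sketched in the same style. Again this is only an illustration under assumptions: `zlib.crc32` stands in for the unspecified hash function, and `build_table` and `search` are hypothetical names. Because several components may hash to the same entry, the result is a candidate set that can contain noise but never misses a document that really contains the term as a whole word.

```python
import zlib
from functools import reduce

TABLE_SIZE = 4096  # assumed table size
MARK = "^"


def entry_index(component):
    """Stand-in for hash table 216 (the hash function is an assumption)."""
    return zlib.crc32(component.encode("utf-8")) % TABLE_SIZE


def components(word):
    """Skip trigrams (every other character) of the boundary-marked word."""
    w = MARK + word + MARK
    return {w[i:i + 5:2] for i in range(len(w) - 4)}


def build_table(docs):
    """Register every document; each entry is an int used as a bit list."""
    table = [0] * TABLE_SIZE
    for doc_no, text in docs.items():
        for word in text.split():
            for comp in components(word):
                table[entry_index(comp)] |= 1 << doc_no
    return table


def search(table, term, doc_numbers):
    """AND the bit lists of all components of the term; return candidate docs."""
    all_docs = sum(1 << d for d in doc_numbers)  # start with every document set
    bits = reduce(lambda acc, comp: acc & table[entry_index(comp)],
                  components(term), all_docs)
    return [d for d in sorted(doc_numbers) if (bits >> d) & 1]


docs = {1: "Multimedia information systems must", 2: "database systems"}
table = build_table(docs)
print(search(table, "Multimedia", docs))
```

A term too short to yield any component leaves the initial bit list untouched, so every document remains a candidate and the later search stages do all the filtering.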

【０１２９】すなわち、テキスト中に“jangle"という
単語を含む文書が登録された時には、単語の前後に特殊
文字を付加した“∧jangle∧"からスキップ連接文字と
して“∧ag"，“jnl",“age"と“nl∧"が抽出され連接
文字成分表に登録されることになる。また、検索ターム
に“angle"が指定された時には、単語の前後に特殊文字
を付加した“∧angle∧"からスキップ連接文字として
“∧nl",“age",“nl∧"が抽出されることになるが、
“∧jangle∧"を含む文書中には“∧nl"に対応するスキ
ップ連接文字が含まれないため検索の対象から外され
る。すなわち、“jangle"を含む文書がノイズとして検
索されることがなくなる。That is, when a document whose text contains the word "jangle" is registered, "^ag", "jnl", "age", and "nl^" are extracted as skip concatenated characters from "^jangle^", the word with special characters added before and after it, and are registered in the concatenated character component table. When "angle" is specified as the search term, "^nl", "age", and "nl^" are extracted as skip concatenated characters from "^angle^". Because a document containing "^jangle^" has no skip concatenated character corresponding to "^nl", it is excluded from the search. In other words, a document containing "jangle" is no longer retrieved as noise.
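
The "jangle" versus "angle" case can be checked directly with exact sets of components (hashing is left out here so that only the effect of the boundary character is visible). The function names are hypothetical and '^' is the boundary character assumed in the text.

```python
MARK = "^"


def skip_trigrams(s):
    """All three-character strings formed from every other character of s."""
    return {s[i:i + 5:2] for i in range(len(s) - 4)}


def marked_skip_trigrams(word):
    """Same extraction after adding the boundary character on both sides."""
    return skip_trigrams(MARK + word + MARK)


# Second embodiment (no boundary character): "jangle" matches the query "angle".
plain_hit = skip_trigrams("angle") <= skip_trigrams("jangle")

# Third embodiment (boundary character added): the query needs "^nl",
# which "^jangle^" does not contain, so the document is excluded.
marked_hit = marked_skip_trigrams("angle") <= marked_skip_trigrams("jangle")

print(plain_hit, marked_hit)
```

The subset test plays the role of the bitwise AND: a document is a candidate only if every component of the search term is registered for it.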

【０１３０】このように、本実施例における連接文字成
分表の作成登録処理では、文書の登録時に、テキストか
ら単語を切り出し、切り出された単語の前後に特殊文字
を付加してから、その中の文字列から１文字おきに３文
字の文字列を取り出し、この連接文字の存在情報を予め
連接文字成分表に登録するとともに、連接文字成分表の
検索時に、検索タームの前後に特殊文字を付加してから
検索を行うことにより、特殊文字で単語の前後を判別で
きる。そのため、検索タームを部分文字列としてその文
字列内部に含む無関係な単語が中間一致によってヒット
することを避けることができ、ノイズを減らすことがで
きる。その結果、検索時の連接文字成分表サーチにおけ
る絞り込み率を向上させることができ、階層プリサーチ
における凝縮テキストの探索量が削減できることになる
ため、等価的に全体の検索速度を向上させることが可能
となる。したがって、より大量のフルテキストサーチを
実時間で行うことが可能となる。As described above, in the creation and registration of the concatenated character component table in this embodiment, when a document is registered, words are cut out of the text, special characters are added before and after each word, three-character strings are extracted from the result by taking every other character, and the presence information of these concatenated characters is registered in the concatenated character component table in advance. When the table is searched, special characters are likewise added before and after the search term before the search is performed, so that the beginning and end of a word can be identified by the special characters. This prevents unrelated words that merely contain the search term inside them from being hit through a middle match, and so reduces noise. As a result, the narrowing rate of the concatenated character component table search is improved and the amount of condensed text to be searched in the hierarchical pre-search is reduced, so the overall search speed is effectively improved. A larger volume of full-text search can therefore be performed in real time.

【０１３１】本実施例では、連接文字成分表を１文字お
きに３文字の文字列、すなわち、ｎ＝３で作成する場合
について説明したが、文字列長が２文字および４文字以
上の場合についても同様な処理が可能であることは、上
記の説明から明らかであろう。This embodiment has described the case where the concatenated character component table is built from three-character strings taken at every other character, that is, n = 3. It will be clear from the above description, however, that the same processing is possible when the string length is two characters, or four or more characters.

【０１３２】また、本実施例では、予め定められたｍ文
字おきの連接文字(スキップ連接文字)を対象としてテキ
ストから単語を切り出し、切り出された単語の前後に特
殊文字を付加してから連接文字成分表を作成する方法に
ついて説明した。しかし、従来の互いに隣り合う連接文
字(逐次連接文字)に対しても同様に、テキストから単語
を切り出し、切り出された単語の前後に特殊文字を付加
してから連接文字成分表を作成することにより、検索タ
ームに指定した単語を部分文字列として含む別の単語が
現われる文書をノイズとして検索の対象から外すことが
可能になることも明らかであろう。This embodiment has also described a method in which words are cut out of the text and special characters are added before and after each word before the concatenated character component table is created, targeting concatenated characters taken every predetermined m characters (skip concatenated characters). It will also be clear, however, that for conventional adjacent concatenated characters (sequential concatenated characters), cutting words out of the text and adding special characters before and after each word before creating the concatenated character component table likewise makes it possible to exclude, as noise, documents containing a different word that includes the search term as a substring.

【０１３３】次に、本発明の第四の実施例について説明
する。Next, a fourth embodiment of the present invention will be described.

【０１３４】本発明第一の実施例では、英文に対しｍ文
字おきにｎ文字の文字列を抽出し、これを連接文字成分
表に登録する方法について説明した。しかし、この方法
では文字数の少ない検索タームが指定されたときに抽出
できる連接文字成分の数が少ないため検索ノイズが多く
発生するという問題がある。すなわち、テキスト中に
“argue"という単語を含む文書が登録された時には、ス
キップ連接文字として“age"が抽出され連接文字成分表
に登録されることになる。それに対し、検索タームに
“angle"が指定された時にはスキップ連接文字として
“age"が抽出されることになり、“argue"を含む文書が
ノイズとして検索されることになる。さらに、従来例の
ように互いに隣り合う連接文字(逐次連接文字)から連接
文字成分表を作成する方法では、“angry"と“single"
を同時に含む文書の現われる文書等がノイズとしてヒッ
トしてしまう。In the first embodiment of the present invention, a method was described in which n-character strings taken every m characters are extracted from English text and registered in the concatenated character component table. With this method, however, when a search term with few characters is specified, only a few concatenated character components can be extracted from it, so a large amount of search noise arises. For example, when a document whose text contains the word "argue" is registered, "age" is extracted as a skip concatenated character and registered in the concatenated character component table. When "angle" is then specified as the search term, "age" is extracted as its skip concatenated character, so the document containing "argue" is retrieved as noise. Likewise, with the conventional method of building the concatenated character component table from adjacent concatenated characters (sequential concatenated characters), a document that contains both "angry" and "single" is hit as noise.

【０１３５】この問題に対し本発明の第四の実施例で
は、第一の実施例における文書検索方法において連接文
字成分の抽出時に、従来例の互いに連続するｉ(ｉは２
以上の整数)文字の連接文字を逐次連接文字として抽出
するとともに、ｍ文字おきにｎ文字の文字列をスキップ
連接文字として抽出し、逐次連接文字成分とスキップ連
接文字成分の両方の連接文字成分表を用いて検索対象と
する文書を絞り込むことによってノイズを削減する。To address this problem, the fourth embodiment of the present invention modifies the document search method of the first embodiment so that, when concatenated character components are extracted, strings of i consecutive characters (i is an integer of 2 or more) are extracted as sequential concatenated characters, as in the conventional example, and n-character strings taken every m characters are also extracted as skip concatenated characters. Noise is reduced by narrowing down the documents to be searched using the concatenated character component tables for both the sequential and the skip concatenated character components.

【０１３６】本実施例は図１に示す第一の実施例と基本
的に同様の構成をとるが、その中の連接文字成分表作成
登録プログラム２０５と連接文字成分表サーチプログラ
ム２１１の部分が、それぞれ図２０と図２１に示す構成
となる。This embodiment has basically the same configuration as the first embodiment shown in FIG. 1, except that the concatenated character component table creation/registration program 205 and the concatenated character component table search program 211 are configured as shown in FIG. 20 and FIG. 21, respectively.

【０１３７】すなわち、連接文字成分表作成登録プログ
ラム２０５は逐次連接文字抽出プログラム４００、スキ
ップ連接文字抽出プログラム２０６、連接文字成分表登
録プログラム２０７およびハッシュテーブル作成プログ
ラム２０８で構成され、連接文字成分表サーチプログラ
ム２１１は逐次連接文字抽出プログラム４０１、スキッ
プ連接文字抽出プログラム２１２およびビットアンドプ
ログラム２１３で構成される。That is, the concatenated character component table creation/registration program 205 consists of a sequential concatenated character extraction program 400, a skip concatenated character extraction program 206, a concatenated character component table registration program 207, and a hash table creation program 208. The concatenated character component table search program 211 consists of a sequential concatenated character extraction program 401, a skip concatenated character extraction program 212, and a bit-AND program 213.

【０１３８】連接文字成分表作成登録プログラム２０５
は、図２２に示すように、まずステップ１３００で逐次
連接文字抽出プログラム４００を起動し、磁気ディスク
１１０に格納されたテキスト１０３からテキストデータ
をワークエリア２１７に読み込む。そして、連続する３
文字の文字列を全て抽出する。As shown in FIG. 22, the concatenated character component table creation/registration program 205 first starts the sequential concatenated character extraction program 400 in step 1300 and reads the text data from the text 103 stored on the magnetic disk 110 into the work area 217. All strings of three consecutive characters are then extracted.

【０１３９】次に、ステップ１３０１でスキップ連接文
字抽出プログラム２０６を起動し、ワークエリア２１７
に取り込まれたテキストデータから１文字おきに３文字
の文字列を全て抽出する。Next, in step 1301, the skip concatenated character extraction program 206 is started, and all three-character strings formed by taking every other character are extracted from the text data read into the work area 217.

【０１４０】最後に、ステップ１３０２で連接文字成分
表登録プログラム２０７を起動し、逐次連接文字抽出プ
ログラム４００およびスキップ連接文字抽出プログラム
２０６によって抽出された連接文字を、ワークエリア２
１７内の連接文字成分表１０５にハッシュテーブル２１
６に従って登録し、これを磁気ディスク１１０へ格納す
る。Finally, in step 1302, the concatenated character component table registration program 207 is started. The concatenated characters extracted by the sequential concatenated character extraction program 400 and the skip concatenated character extraction program 206 are registered in the concatenated character component table 105 in the work area 217 according to the hash table 216, and the table is then stored on the magnetic disk 110.

【０１４１】検索時には、連接文字成分表サーチプログ
ラム２１１が、図２３に示すように、ステップ１３１０
で逐次連接文字抽出プログラム４０１を起動し、検索タ
ームから連続する３文字の文字列すべてを抽出する。At search time, as shown in FIG. 23, the concatenated character component table search program 211 starts the sequential concatenated character extraction program 401 in step 1310 and extracts all strings of three consecutive characters from the search term.

【０１４２】次に、ステップ１３１１でスキップ連接文
字抽出プログラム２１２を起動し、検索タームから１文
字おきに３文字の文字列すべてを抽出する。Next, in step 1311, the skip concatenated character extraction program 212 is started, and all three-character strings formed by taking every other character are extracted from the search term.

【０１４３】その後、ステップ１３１２でビットアンド
プログラム２１３を起動し、逐次連接文字抽出プログラ
ム４０１およびスキップ連接文字抽出プログラム２１２
によって抽出されたすべての文字列に対応する連接文字
成分表１０５のエントリに格納されているビットリスト
を、ハッシュテーブル２１６を介してワークエリア２１
７に読み込み、読み込まれたすべてのビットリスト間で
各ビット毎に論理積演算を行う。Then, in step 1312, the bit-AND program 213 is started. The bit lists stored in the entries of the concatenated character component table 105 that correspond to all the character strings extracted by the sequential concatenated character extraction program 401 and the skip concatenated character extraction program 212 are read into the work area 217 via the hash table 216, and a bitwise logical AND is computed across all the bit lists that were read.

【０１４４】この論理積演算の結果‘１’となったビッ
トに対応する文書番号を連接文字成分表サーチの結果と
して出力する。The document number corresponding to the bit which becomes "1" as a result of this logical product operation is output as the result of the concatenated character component table search.

【０１４５】以下、上述した連接文字成分表作成登録プ
ログラム２０５の処理内容を詳細に説明する。The processing contents of the concatenated character component table creation registration program 205 described above will be described in detail below.

【０１４６】連接文字成分表作成登録プログラム２０５
では、まずテキストデータから３文字の連続する文字列
および１文字おきに３文字の連接文字が抽出される。In the concatenated character component table creation/registration program 205, strings of three consecutive characters and three-character strings taken at every other character are first extracted from the text data.

【０１４７】この文字列の抽出については、例えば、
“Ｍｕｌｔｉｍｅｄｉａ　ｉｎｆｏｒｍａｔｉｏｎ　ｓ
ｙｓｔｅｍｓ　ｍｕｓｔ・・・”というテキストが登録さ
れた場合を例に説明する。The extraction of these character strings is described below, taking as an example the case where the text "Multimedia information systems must ..." is registered.

【０１４８】まず、逐次連接文字抽出ステップ４００に
よって図２４に示す文書１から、“Ｍｕｌ”,“ｕｌ
ｔ”，“ｌｔｉ”，“ｔｉｍ”，“ｉｍｅ”，“ｍｅ
ｄ”，“ｅｄｉ”，“ｄｉａ”，“ｉａ_”，“ａ_
ｉ”，・・・が抽出される。さらに、スキップ連接文字
抽出プログラム２０６によって、“Ｍｌｉ”，“ｕｔ
ｍ”，“ｌｉｅ”，“ｔｍｄ”，“ｉｅｉ”，“ｍｄ
ａ”，“ｅｉ_”，“ｄａｉ”，“ｉ_ｎ”，“ａｉ
ｆ”，・・・が抽出される。First, the sequential concatenated character extraction program 400 extracts "Mul", "ult", "lti", "tim", "ime", "med", "edi", "dia", "ia_", "a_i", and so on from document 1 shown in FIG. 24. The skip concatenated character extraction program 206 then extracts "Mli", "utm", "lie", "tmd", "iei", "mda", "ei_", "dai", "i_n", "aif", and so on.
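
The two extractions in this example can be reproduced with a short sketch. It is illustrative only: the function names are hypothetical, the space is written as "_" to match the notation of the text, and in this embodiment the whole text is processed without word division.

```python
def sequential_ngrams(s, n=3):
    """Strings of n consecutive characters, in order of appearance."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def skip_ngrams(s, n=3, m=1):
    """n-character strings taken by skipping m characters between picks."""
    step = m + 1
    span = (n - 1) * step + 1
    return [s[i:i + span:step] for i in range(len(s) - span + 1)]


text = "Multimedia information systems must".replace(" ", "_")
seq = sequential_ngrams(text)
skp = skip_ngrams(text)

print(seq[:10])
print(skp[:10])
```

Components such as "ia_" and "ei_" contain the space, which is why this embodiment can also be combined with the word division of the second and third embodiments.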

【０１４９】最後に、連接文字成分表登録プログラム２
０７が起動される。ここでは、連接文字用ハッシュテー
ブル２１６−ａおよびスキップ連接用ハッシュテーブル
２１６−ｂを介して、それぞれ逐次連接文字抽出プログ
ラム４００およびスキップ連接文字抽出プログラム２０
６によって抽出された連接文字成分に対応するエントリ
に‘１’を設定し、連接文字成分の存在を記す。Finally, the concatenated character component table registration program 207 is started. Via the sequential concatenation hash table 216-a and the skip concatenation hash table 216-b, it sets '1' in the entries corresponding to the concatenated character components extracted by the sequential concatenated character extraction program 400 and the skip concatenated character extraction program 206, respectively, thereby recording the presence of those components.

【０１５０】次に、検索時の処理について詳細に説明す
る。Next, the processing at the time of retrieval will be described in detail.

【０１５１】例えば、図２５に示すように“Ｍｕｌｔｉ
ｍｅｄｉａ”という検索タームから、逐次連接文字抽出
ステップ４０１によって、“Ｍｕｌ”，“ｕｌｔ”，
“ｌｔｉ”，“ｔｉｍ”，“ｉｍｅ”，“ｍｅｄ”，
“ｅｄｉ”および“ｄｉａ”が逐次連接文字成分として
抽出される。さらにスキップ連接文字抽出ステップ２１
２によって、“Ｍｌｉ”，“ｕｔｍ”，“ｌｉｅ”，
“ｔｍｄ”，“ｉｅｉ”および“ｍｄａ”がスキップ連
接文字成分として抽出される。次に、ビットアンドプロ
グラム２１３により、連接文字成分表１０５のビットリ
ストが逐次連接文字成分については逐次連接用ハッシュ
テーブル２１６−ａを介して、スキップ連接文字成分に
ついてはスキップ連接用ハッシュテーブル２１６−ｂを
介して読み出される。そして、これらすべてのビットリ
ストのビットがすべて‘１’である文書を連接文字成分
表サーチの検索結果として得る。For example, as shown in FIG. 25, the sequential concatenated character extraction program 401 extracts "Mul", "ult", "lti", "tim", "ime", "med", "edi", and "dia" from the search term "Multimedia" as sequential concatenated character components. The skip concatenated character extraction program 212 further extracts "Mli", "utm", "lie", "tmd", "iei", and "mda" as skip concatenated character components. Next, the bit-AND program 213 reads out the bit lists of the concatenated character component table 105, via the sequential concatenation hash table 216-a for the sequential components and via the skip concatenation hash table 216-b for the skip components. The documents whose bits are '1' in all of these bit lists are obtained as the result of the concatenated character component table search.

【０１５２】すなわち、読み出したすべてのビットリス
トの間でビット毎に論理積演算を施し、論理積演算結果
９００を得る。このビットアンド演算結果のビットリス
ト中で、‘１’となっているビット位置に対応する文書
番号が連接文字成分表サーチの検索結果としてのヒット
文書を表わすことになる。これにより、図２５の例で
は、文書１がヒット文書ということになる。That is, a logical product operation is performed for each bit among all the read bit lists to obtain a logical product operation result 900. In the bit list of the bit-and operation result, the document number corresponding to the bit position of "1" represents the hit document as the search result of the concatenated character component table search. Thus, in the example of FIG. 25, the document 1 is a hit document.

【０１５３】このように、本実施例における連接文字成
分表の作成登録処理では、文書の登録時に、テキストか
ら連続する３文字の文字列(逐次連接文字)および１文字
おきに３文字の文字列(スキップ連接文字)を取り出し、
この連接文字の存在情報を予め連接文字成分表に登録す
る。検索時には、逐次連接文字およびスキップ連接文字
の両方の連接文字成分を全て含む文書を検索することに
より、連接文字成分表サーチのノイズを削減することが
できる。例えば、検索タームとして“angle"が指定され
た時には逐次連接文字として“ang"，“ngl"および“gl
e"，スキップ連接文字として“age"を含む文書をサーチ
する。これに対し、テキスト中に“argue"という単語を
含む文書が登録された時には、スキップ連接文字として
は“age"が抽出されるが、逐次連接文字として“ang"，
“ngl"および“gle"が抽出されないため、ノイズとして
削除される。また、“angry"と“single"を同時に含む
文書が登録された場合には、逐次連接文字として“an
g"，“ngl"および“gle"が連接文字成分表に登録される
ことになるが、スキップ連接文字として“age"が抽出さ
れないため、やはりノイズとして削除することができ
る。As described above, in the creation and registration of the concatenated character component table in this embodiment, strings of three consecutive characters (sequential concatenated characters) and three-character strings taken at every other character (skip concatenated characters) are extracted from the text when a document is registered, and the presence information of these concatenated characters is registered in the concatenated character component table in advance. At search time, only documents that contain all the concatenated character components of both kinds are retrieved, which reduces the noise of the concatenated character component table search. For example, when "angle" is specified as the search term, documents containing "ang", "ngl", and "gle" as sequential concatenated characters and "age" as a skip concatenated character are searched for. A registered document whose text contains the word "argue" yields "age" as a skip concatenated character, but "ang", "ngl", and "gle" are not extracted from it as sequential concatenated characters, so it is eliminated as noise. A registered document that contains both "angry" and "single" has "ang", "ngl", and "gle" registered in the concatenated character component table as sequential concatenated characters, but "age" is not extracted from it as a skip concatenated character, so it too is eliminated as noise.
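
The combined narrowing can be illustrated with three small documents and the search term "angle". The sketch uses exact sets rather than hashed bit lists so that the outcome depends only on the components themselves; the document texts and the names `seq3`, `skip3`, and `candidates` are made up for the example.

```python
def seq3(s):
    """Sequential trigrams of s."""
    return {s[i:i + 3] for i in range(len(s) - 2)}


def skip3(s):
    """Skip trigrams of s (every other character)."""
    return {s[i:i + 5:2] for i in range(len(s) - 4)}


docs = {1: "a right angle", 2: "they argue", 3: "angry single"}

# Register both kinds of component for each whole text ("_" stands for space).
index = {d: (seq3(t.replace(" ", "_")), skip3(t.replace(" ", "_")))
         for d, t in docs.items()}


def candidates(term, use_seq=True, use_skip=True):
    """Documents containing every requested kind of component of the term."""
    hits = []
    for doc_no, (seq_set, skip_set) in index.items():
        seq_ok = not use_seq or seq3(term) <= seq_set
        skip_ok = not use_skip or skip3(term) <= skip_set
        if seq_ok and skip_ok:
            hits.append(doc_no)
    return hits


print(candidates("angle", use_seq=False))   # skip components only
print(candidates("angle", use_skip=False))  # sequential components only
print(candidates("angle"))                  # both kinds combined
```

Each kind of component on its own lets one unrelated document through ("they argue" or "angry single"); requiring both leaves only the document that actually contains "angle".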

【０１５４】その結果、連接文字成分表サーチにおける
絞り込み率を向上させることができ、階層プリサーチに
おける凝縮テキストの探索量が削減できることになるた
め、等価的に全体の検索速度が向上することになる。し
たがって、より大量のフルテキストサーチが実時間で可
能となる。As a result, the narrowing rate in the concatenated character component table search can be improved, and the search amount of the condensed text in the hierarchical presearch can be reduced, so that the overall search speed is equivalently improved. . Therefore, a larger amount of full text search is possible in real time.

【０１５５】本実施例では、逐次連接文字の連接文字成
分表を連続する３文字の文字列で作成する場合について
説明したが、文字列長が２文字および４文字以上の場合
についても同様な処理が可能であることは、上記の説明
から明らかであろう。また、スキップ連接文字の連接文
字成分表を１文字おきに３文字の文字列で作成する場合
について説明したが、何文字おきに文字列を抽出して
も、また文字列長が２文字および４文字以上の場合につ
いても同様な処理が可能であることは明らかであろう。In this embodiment, the case where the concatenated character component table for sequential concatenated characters is created from character strings of three consecutive characters has been described, but it will be clear from the above description that the same processing is possible when the character string length is two characters or four or more characters. Likewise, the case where the concatenated character component table for skip concatenated characters is created from three-character strings taken at every other character has been described, but it will be clear that the same processing is possible whatever the skip interval, and when the character string length is two characters or four or more characters.

【０１５６】また、本実施例では単語を意識することな
く“_"(スペース)や“."(ピリオド)、“,"(コンマ)など
を含んだ全文字列を対象として文字成分表を作成した
が、第二の実施例に示したようにテキストを単語に分割
してから文字成分表を作成する方式、および第三の実施
例に示したようにテキストを単語に分割した後、単語の
前後に特殊文字を付加してから文字成分表を作成する方
法をとっても同様な効果が得られることは明らかであろ
う。In this embodiment, the character component table was created from the entire character string, including "_" (space), "." (period), "," (comma), and so on, without regard to words. However, it will be clear that the same effect is obtained with the method of dividing the text into words before creating the character component table, as shown in the second embodiment, and with the method of dividing the text into words and then adding special characters before and after each word before creating the character component table, as shown in the third embodiment.

【０１５７】次に、本発明の第五の実施例として第四の
実施例における文書検索方法を日本語テキストに適用し
た場合について説明する。Next, as a fifth embodiment of the present invention, a case where the document retrieval method in the fourth embodiment is applied to Japanese text will be described.

【０１５８】日本語は１文字１文字がそれぞれ意味を持
つ表意文字であるため、表音文字である英語などに比
べ、従来例で示されている互いに隣り合う文字列(逐次
連接文字)による文字成分表で、かなり検索ノイズを削
減することができる。しかし、単語の組合せで構成され
る文字列が検索タームに指定された場合には、日本語に
おいてもノイズが多く発生する。例えば、“動画像"と
いう文字列が検索タームに指定された時には逐次連接文
字成分表サーチでは“動画"と“画像"をともに含む文書
が検索されてしまう。その結果、“動画像"を含まない
にもかかわらず“動画"と“画像"が別々の単語として同
時に現われる文書等がノイズとしてヒットしてしまう。Because Japanese is written with ideographic characters, each of which carries its own meaning, a character component table built from adjacent character strings (sequential concatenated characters), as in the conventional example, reduces search noise considerably compared with a phonetic script such as English. However, when a character string composed of a combination of words is specified as the search term, a large amount of noise occurs even in Japanese. For example, when the character string “動画像” (moving image) is specified as the search term, the sequential concatenated character component table search retrieves every document that contains both “動画” and “画像”. As a result, documents that do not contain “動画像” but in which “動画” and “画像” appear as separate words are hit as noise.

【０１５９】この問題に対し、本発明の第五の実施例で
は日本語文書に対しｉ文字(ｉは２以上の整数)の連続し
た文字列(逐次連接文字)に対し連接文字成分表を作成す
るともに、ｍ文字(ｍは１以上の整数)おきにｎ文字(ｎ
は２以上の整数)の文字列(スキップ連接文字)に対し連
接文字成分表を作成することにより、単語の組合せで構
成される文字列が検索タームに指定された場合にも検索
ノイズの少ない連接文字成分表サーチを実現する。To address this problem, the fifth embodiment of the present invention creates, for Japanese documents, a concatenated character component table for character strings of i consecutive characters (i is an integer of 2 or more; sequential concatenated characters), and also creates a concatenated character component table for n-character strings (n is an integer of 2 or more) taken at every m-th character (m is an integer of 1 or more; skip concatenated characters). This realizes a concatenated character component table search with little search noise even when a character string composed of a combination of words is specified as the search term.

【０１６０】本実施例における文書検索方法の構成は第
四の実施例と同じである。また、本実施例では日本文に
おけるｉ文字の連続した文字列においてｉ＝２、ｍ文字
おきにｎ文字の文字列においてｍ＝１およびｎ＝２とし
た場合について以下に例を挙げて説明する。The configuration of the document search method in this embodiment is the same as in the fourth embodiment. This embodiment is described below with an example for Japanese text in which i = 2 for the character strings of i consecutive characters, and m = 1 and n = 2 for the n-character strings taken at every m-th character.

【０１６１】まず、連接文字成分表作成登録プログラム
２０５によって、まずテキスト１０３から２文字の連続
した文字列および１文字おきに２文字の連接文字が抽出
される。First, the concatenated character component table creation and registration program 205 extracts from the text 103 the character strings of two consecutive characters and the two-character strings taken at every other character.

【０１６２】この文字列の抽出については、例えば、
“自動画質調整機能を備えた画像処理装置・・・”とい
うテキストが入力された場合を例に説明する。The extraction of these character strings is described using the case where the text “自動画質調整機能を備えた画像処理装置・・・” ("an image processing apparatus with an automatic image quality adjustment function ...") is input.

【０１６３】逐次連接文字抽出ステップ４００によって
図２６に示す文書１から、“自動”、“動画”、“画
質”、“質調”、“調整”、“整機”、“機能”、“能
を”、・・・、“た画”、“画像”、“像処”、“処
理”、・・・が抽出される。さらに、スキップ連接文字
抽出プログラム２０６によって、“自画”、“動質”、
“画調”、“質整”、“調機”、“整能”、“機を”、
“能備”、・・・、“た像”、“画処”、“像理”、
“処装”、・・・が抽出される。In the sequential concatenated character extraction step 400, “自動”, “動画”, “画質”, “質調”, “調整”, “整機”, “機能”, “能を”, ..., “た画”, “画像”, “像処”, “処理”, ... are extracted from document 1 shown in FIG. 26. In addition, the skip concatenated character extraction program 206 extracts “自画”, “動質”, “画調”, “質整”, “調機”, “整能”, “機を”, “能備”, ..., “た像”, “画処”, “像理”, “処装”, ....
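The extraction for i = 2, m = 1, n = 2 can be reproduced with a short sketch. The helper names are hypothetical and the code only illustrates the slicing arithmetic; it is not taken from the patent.

```python
def sequential_components(text, i=2):
    """Strings of i consecutive characters, in order of appearance."""
    return [text[k:k + i] for k in range(len(text) - i + 1)]


def skip_components(text, n=2, m=1):
    """n-character strings whose characters are m positions apart in the text."""
    step = m + 1
    span = (n - 1) * step + 1
    return [text[k:k + span:step] for k in range(len(text) - span + 1)]


text = "自動画質調整機能を備えた画像処理装置"
print(sequential_components(text)[:8])
# ['自動', '動画', '画質', '質調', '調整', '整機', '機能', '能を']
print(skip_components(text)[:8])
# ['自画', '動質', '画調', '質整', '調機', '整能', '機を', '能備']
```

Note that “動像” never appears among the skip components of this text, which is what later allows the document to be rejected for the search term “動画像”.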

【０１６４】最後に、連接文字成分表登録プログラム２
０７が起動される。ここでは、連接文字用ハッシュテー
ブル２１６−ａおよびスキップ連接用ハッシュテーブル
２１６−ｂを介して、それぞれ逐次連接文字抽出プログ
ラム４００およびスキップ連接文字抽出プログラム２０
６によって抽出された連接文字成分に対応するエントリ
に‘１’を設定し、連接文字成分の存在を記す。Finally, the concatenated character component table registration program 207 is activated. Via the hash table 216-a for concatenated characters and the hash table 216-b for skip concatenation, ‘1’ is set in the entries corresponding to the concatenated character components extracted by the sequential concatenated character extraction program 400 and the skip concatenated character extraction program 206, respectively, thereby recording the presence of each concatenated character component.

【０１６５】次に、検索時の処理について詳細に説明す
る。Next, the processing at the time of retrieval will be described in detail.

【０１６６】例えば、図２７に示すように“動画像”と
いう検索タームから、逐次連接文字抽出ステップ４０１
によって、“動画”および“画像”が逐次連接文字成分
として抽出される。さらにスキップ連接文字抽出ステッ
プ２１２によって、“動像”がスキップ連接文字成分と
して抽出される。次に、ビットアンドプログラム２１３
により、連接文字成分表１０５のビットリストが逐次連
接文字成分については逐次連接用ハッシュテーブル２１
５−ａを介して、スキップ連接文字成分についてはスキ
ップ連接用ハッシュテーブル２１６−ｂを介して読み出
される。そして、これらすべてのビットリストのビット
がすべて‘１’である文書を連接文字成分表サーチの検
索結果として得る。For example, as shown in FIG. 27, the sequential concatenated character extraction step 401 extracts “動画” and “画像” from the search term “動画像” as sequential concatenated character components. The skip concatenated character extraction step 212 then extracts “動像” as a skip concatenated character component. Next, the bit AND program 213 reads the bit lists of the concatenated character component table 105, via the sequential concatenation hash table 215-a for the sequential concatenated character components and via the skip concatenation hash table 216-b for the skip concatenated character components. The documents whose bits are ‘1’ in all of these bit lists are obtained as the result of the concatenated character component table search.

【０１６７】すなわち、読み出したすべてのビットリス
トの間でビット毎に論理積演算を施し、論理積演算結果
９００を得る。このビットアンド演算結果のビットリス
ト中で、‘１’となっているビット位置に対応する文書
番号が連接文字成分表サーチの検索結果としてのヒット
文書を表わすことになる。これにより、図２７の例で
は、文書Ｎがヒット文書ということになる。That is, a logical product operation is performed for each bit among all the read bit lists to obtain a logical product operation result 900. In the bit list of the bit-and operation result, the document number corresponding to the bit position of "1" represents the hit document as the search result of the concatenated character component table search. As a result, in the example of FIG. 27, the document N is a hit document.
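Registration and the bitwise AND search can be sketched together as follows. This is an illustration under stated assumptions: the class `ComponentTable` and its methods are hypothetical, Python dictionaries stand in for the two hash tables (so there are no hash collisions), and a Python integer serves as each bit list, with bit d representing document number d.

```python
from functools import reduce


def sequential_components(text, i=2):
    return [text[k:k + i] for k in range(len(text) - i + 1)]


def skip_components(text, n=2, m=1):
    step = m + 1
    span = (n - 1) * step + 1
    return [text[k:k + span:step] for k in range(len(text) - span + 1)]


class ComponentTable:
    """Hypothetical stand-in for the concatenated character component table."""

    def __init__(self):
        self.seq_bits = {}   # sequential component -> bit list (int)
        self.skip_bits = {}  # skip component -> bit list (int)
        self.num_docs = 0

    def register(self, text):
        """Set the document's bit to 1 for every component found in its text."""
        doc_no = self.num_docs
        self.num_docs += 1
        mask = 1 << doc_no
        for c in sequential_components(text):
            self.seq_bits[c] = self.seq_bits.get(c, 0) | mask
        for c in skip_components(text):
            self.skip_bits[c] = self.skip_bits.get(c, 0) | mask
        return doc_no

    def search(self, term):
        """AND all bit lists of the term's components; return candidate documents."""
        bit_lists = [self.seq_bits.get(c, 0) for c in sequential_components(term)]
        bit_lists += [self.skip_bits.get(c, 0) for c in skip_components(term)]
        all_docs = (1 << self.num_docs) - 1
        hits = reduce(lambda a, b: a & b, bit_lists, all_docs)
        return [d for d in range(self.num_docs) if (hits >> d) & 1]


table = ComponentTable()
table.register("自動画質調整機能を備えた画像処理装置")  # document 0
table.register("動画像を圧縮する")                      # document 1
print(table.search("動画像"))  # [1]
```

Document 0 has both “動画” and “画像” set in the sequential table, but its bit for “動像” in the skip table is 0, so the AND removes it. The second document text is an invented example used only to produce a hit.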

【０１６８】このように、本実施例における連接文字成
分表の作成登録処理では、文書の登録時に、日本語テキ
ストデータから連続する２文字の文字列(逐次連接文字)
および１文字おきに２文字の文字列(スキップ連接文字)
を取り出し、この連接文字の存在情報を予め連接文字成
分表に登録する。検索時には、逐次連接文字およびスキ
ップ連接文字の両方の連接文字成分を全て含む文書を検
索することにより、単語の組合せで構成される文字列が
検索タームが指定された場合にも、連接文字成分表サー
チの検索ノイズを削減することができる。例えば、検索
タームとして“動画像"が指定された時には逐次連接文
字として“動画"および“画像"、スキップ連接文字とし
て“動像"を含む文書をサーチする。これに対し、テキ
スト中に“自動画質調整機能を備えた画像処理装置・・
・”という文字列を含む文書が登録された時には、逐次
連接文字として“動画"および“画像"がが連接文字成分
表に登録されることになるが、スキップ連接文字として
“動像"が抽出されないため、ノイズとして削除するこ
とができる。As described above, in the concatenated character component table creation and registration process of this embodiment, when a document is registered, character strings of two consecutive characters (sequential concatenated characters) and two-character strings taken at every other character (skip concatenated characters) are extracted from the Japanese text data, and the existence information of these concatenated characters is registered in the concatenated character component table in advance. At search time, retrieving only documents that contain all concatenated character components of both kinds reduces the search noise of the concatenated character component table search, even when a character string composed of a combination of words is specified as the search term. For example, when “動画像” is specified as the search term, the search looks for documents that contain “動画” and “画像” as sequential concatenated characters and “動像” as a skip concatenated character. By contrast, when a document containing the character string “自動画質調整機能を備えた画像処理装置・・・” is registered, “動画” and “画像” are registered in the concatenated character component table as sequential concatenated characters, but “動像” is not extracted as a skip concatenated character, so the document can be eliminated as noise.

【０１６９】その結果、連接文字成分表サーチにおける
絞り込み率を向上させることができ、階層プリサーチに
おける凝縮テキストの探索量が削減できることになるた
め、等価的に全体の検索速度が向上することになる。し
たがって、より大量のフルテキストサーチが実時間で可
能となる。As a result, the narrowing-down rate of the concatenated character component table search improves and the amount of condensed text scanned in the hierarchical presearch is reduced, so the overall search speed is effectively improved. A larger volume of full text search therefore becomes possible in real time.

【０１７０】なお、本実施例では文字種を意識すること
なく漢字、平仮名、カタカナ、アルファベット、数字お
よび記号などの混在した全ての文字列を対象として文字
成分表を作成している。このため、“半導体レーザ"な
どの文字種の混在した検索タームに指定された場合で
も、文字種間にまたがった連接文字成分を利用した絞り
込みが行える。In the present embodiment, the character component table is prepared for all character strings in which kanji, hiragana, katakana, alphabets, numbers, and symbols are mixed, regardless of the character type. Therefore, even when a search term such as "semiconductor laser" in which character types are mixed is specified, narrowing down can be performed using the concatenated character component that spans character types.

【０１７１】また、本実施例では逐次連接文字の連接文
字成分表を連続する２文字の文字列で作成する場合につ
いて説明したが、文字列長が３文字以上の場合について
も同様な処理が可能であることは、上記の説明から明ら
かであろう。また、スキップ連接文字の連接文字成分表
を１文字おきに２文字の文字列で作成する場合について
説明したが、何文字おきに文字列を抽出しても、また文
字列長が３文字以上の場合についても同様な処理が可能
であることは明らかであろう。In this embodiment, the case where the concatenated character component table for sequential concatenated characters is created from character strings of two consecutive characters has been described, but it will be clear from the above description that the same processing is possible when the character string length is three or more characters. Likewise, the case where the concatenated character component table for skip concatenated characters is created from two-character strings taken at every other character has been described, but it will be clear that the same processing is possible whatever the skip interval, and when the character string length is three or more characters.

【０１７２】本実施例では、逐次連接文字の連接文字成
分表を連続する２文字の文字列で作成する場合について
説明したが、文字列長が３文字以上の場合についても同
様な処理が可能であることは、上記の説明から明らかで
あろう。また、スキップ連接文字の連接文字成分表を１
文字おきに２文字の文字列で作成する場合について説明
したが、何文字おきに文字列を抽出しても、また文字列
長が３文字以上の場合についても同様な処理が可能であ
ることは明らかであろう。In this embodiment, the case where the concatenated character component table for sequential concatenated characters is created from character strings of two consecutive characters has been described, but it will be clear from the above description that the same processing is possible when the character string length is three or more characters. Likewise, the case where the concatenated character component table for skip concatenated characters is created from two-character strings taken at every other character has been described, but it will be clear that the same processing is possible whatever the skip interval, and when the character string length is three or more characters.

【０１７３】また、本実施例では単語を意識することな
く“_"(スペース)や“."(ピリオド)、“,"(コンマ)など
を含んだ全文字列を対象として文字成分表を作成した
が、第二の実施例に示したようにテキストを単語に分割
してから文字成分表を作成する方法、および第三の実施
例に示したようにテキストを単語に分割した後、単語の
前後に特殊文字を付加してから文字成分表を作成する方
法についても同様の処理が可能であることは明らかであ
ろう。In this embodiment, the character component table was created from the entire character string, including "_" (space), "." (period), "," (comma), and so on, without regard to words. However, it will be clear that the same processing is possible with the method of dividing the text into words before creating the character component table, as shown in the second embodiment, and with the method of dividing the text into words and then adding special characters before and after each word before creating the character component table, as shown in the third embodiment.

【０１７４】次に、本発明の第六の実施例について説明
する。Next, a sixth embodiment of the present invention will be described.

【０１７５】本発明の第五の実施例では、日本語文書に
対し文字種を意識することなく漢字、平仮名、カタカ
ナ、英字、数字および記号などの混在した全文字列を対
象として文字成分表を作成した。しかし、この方法では
文字種間にまたがった連接文字を文字成分表に登録する
ことになるため以下に示す種類のノイズが発生する。In the fifth embodiment of the present invention, the character component table was created for Japanese documents from all character strings, in which kanji, hiragana, katakana, alphabetic characters, numerals, and symbols are mixed, without regard to character type. With this method, however, concatenated characters that span character types are registered in the character component table, which produces the following kinds of noise.

【０１７６】まず、文字種間にまたがった逐次連接文字
成分および文字スキップ連接文字成分が他の連接文字と
同じエントリにハッシングされることによるノイズが発
生する。すなわち、“動画像"が検索タームに指定され
たときには逐次連接文字として“動画"と“画像"が、ス
キップ連接文字として“動像"が抽出される。これに対
し、テキストデータ中に“自動画質調整機能を備えた画
像処理装置"という文字列を含む文書が登録され、スキ
ップ連接文字“能備"に対するエントリがスキップ連接
文字“動像"と同じエントリにハッシングされた場合に
は、平仮名の“を"をまたがったスキップ連接文字であ
る“能備"を含む本文書が抽出されることになり、“自
＜動＞＜画＞質調整機＜能＞を＜備＞えた＜画＞＜像＞
処理装置”を含む文書がノイズとしてヒットしてしまう
ことになる。First, noise arises when a sequential concatenated character component or skip concatenated character component that spans character types is hashed to the same entry as another concatenated character. Suppose “動画像” is specified as the search term, so that “動画” and “画像” are extracted as sequential concatenated characters and “動像” as a skip concatenated character. If a document containing the character string “自動画質調整機能を備えた画像処理装置” is registered in the text data, and the entry for the skip concatenated character “能備” is hashed to the same entry as the skip concatenated character “動像”, then this document, which contains “能備” (a skip concatenated character spanning the hiragana “を”), is retrieved. The document containing “自＜動＞＜画＞質調整機＜能＞を＜備＞えた＜画＞＜像＞処理装置” is therefore hit as noise.

【０１７７】さらに、検索タームから抽出したスキップ
連接文字成分が異なる文字種間にまたがって現われる文
書を抽出することによりノイズとして検索されてしま
う。すなわち、テキストデータ中に“動画”と“画像"
を含み、かつ“・・・感＜動＞の＜像＞を写し出す・・・"と
いう文字列を含む文書が登録された場合には、逐次連接
文字として“動画"と“画像"、またスキップ連接文字と
して平仮名である“の"をまたがった“動像"が連接文字
成分表に登録される。これに対し、検索タームに“動画
像"が指定されたときには検索タームから逐次連接文字
として“動画"と“画像"が、そしてスキップ連接文字と
して“動像"が抽出されることになり、前述した文書が
ノイズとしてヒットしてしまうことになる。Furthermore, noise arises when a skip concatenated character component extracted from the search term appears in a document across different character types. Suppose a document is registered whose text data contains “動画” and “画像” and also contains the character string “・・・感＜動＞の＜像＞を写し出す・・・”. In that case “動画” and “画像” are registered in the concatenated character component table as sequential concatenated characters, and “動像”, which spans the hiragana “の”, is registered as a skip concatenated character. When “動画像” is then specified as the search term, “動画” and “画像” are extracted from the search term as sequential concatenated characters and “動像” as a skip concatenated character, so this document is hit as noise.

【０１７８】これらの問題に対し、本発明第六の実施例
では連接文字成分の抽出時にテキストデータを文字種毎
に分割し、分割された文字列から逐次連接文字成分およ
びスキップ連接文字成分を抽出して連接文字成分表を作
成することにより、異なる文字種間にまたがった連接文
字成分を抽出しないようにして、前記のノイズを削減す
る方法を取る。To address these problems, the sixth embodiment of the present invention divides the text data by character type when extracting concatenated character components, and extracts the sequential and skip concatenated character components from the divided character strings to create the concatenated character component table. Concatenated character components that span different character types are thus never extracted, which reduces the noise described above.

【０１７９】本実施例は第四および第五の実施例と基本
的に同様の構成をとるが、図２０に示した連接文字成分
表作成登録プログラム２０５が図２８に示した構成に、
また図２１に示した連接文字成分表サーチプログラム２
１１が図２９に示す構成となる。This embodiment has basically the same configuration as the fourth and fifth embodiments, except that the concatenated character component table creation and registration program 205 shown in FIG. 20 takes the configuration shown in FIG. 28, and the concatenated character component table search program 211 shown in FIG. 21 takes the configuration shown in FIG. 29.

【０１８０】すなわち、本実施例における連接文字成分
表作成登録プログラム２０５は、文字種分割プログラム
５００、逐次連接文字抽出プログラム４００、スキップ
連接文字抽出プログラム２０６、連接文字成分表登録プ
ログラム２０７およびハッシュテーブル作成プログラム
２０８で構成され、連接文字成分表サーチプログラム２
１１は文字種分割プログラム５０１、逐次連接文字抽出
プログラム４０１、スキップ連接文字抽出プログラム２
１２およびビットアンドプログラム２１３で構成され
る。That is, the concatenated character component table creation and registration program 205 in this embodiment consists of the character type division program 500, the sequential concatenated character extraction program 400, the skip concatenated character extraction program 206, the concatenated character component table registration program 207, and the hash table creation program 208. The concatenated character component table search program 211 consists of the character type division program 501, the sequential concatenated character extraction program 401, the skip concatenated character extraction program 212, and the bit AND program 213.

【０１８１】本実施例における連接文字成分表作成登録
プログラム２０５は、図３０に示すように、まずステッ
プ１４００で文字種分割プログラム５００を起動し、磁
気ディスク１１０に格納されたテキスト１０３からテキ
ストデータをワークエリア２１７に読み込み、テキスト
データを文字種毎に分割する。As shown in FIG. 30, the concatenated character component table creation and registration program 205 of this embodiment first activates the character type division program 500 in step 1400, reads text data from the text 103 stored on the magnetic disk 110 into the work area 217, and divides the text data by character type.

【０１８２】次に、ステップ１４０１で逐次連接文字抽
出プログラム４００を起動し、文字種分割プログラム５
００によって文字種毎に分割されたテキストデータから
連続する２文字の文字列を抽出する。Next, in step 1401, the sequential concatenated character extraction program 400 is activated and extracts character strings of two consecutive characters from the text data that the character type division program 500 has divided by character type.

【０１８３】その後、ステップ１４０２でスキップ連接
文字抽出プログラム２０６を起動し、文字種分割プログ
ラム５００によって文字種毎に分割されたテキストデー
タから１文字おきに２文字の文字列を抽出する。After that, in step 1402, the skip concatenated character extraction program 206 is activated, and a character string of two characters is extracted every other character from the text data divided by the character type division program 500 for each character type.

【０１８４】最後に、ステップ１４０３で連接文字成分
表登録プログラム２０７を起動し、逐次連接文字抽出プ
ログラム４００およびスキップ連接文字抽出プログラム
２０６によって抽出された連接文字を、ワークエリア２
１７内の連接文字成分表１０５にハッシュテーブル２１
６に従って登録し、これを磁気ディスク１１０へ格納す
る。Finally, in step 1403, the concatenated character component table registration program 207 is activated. It registers the concatenated characters extracted by the sequential concatenated character extraction program 400 and the skip concatenated character extraction program 206 in the concatenated character component table 105 in the work area 217 according to the hash table 216, and stores the table on the magnetic disk 110.

【０１８５】検索時には、まず連接文字成分表サーチプ
ログラム２１１は図３１に示すようにステップ１４１０
で文字種分割プログラム５０１を起動し、検索条件式中
の検索タームを文字種毎に分割する。At search time, as shown in FIG. 31, the concatenated character component table search program 211 first activates the character type division program 501 in step 1410 and divides the search term in the search condition expression by character type.

【０１８６】次に、ステップ１４１１で逐次連接文字抽
出プログラム４０１を起動し、文字種分割プログラム５
０１によって文字種毎に分割された検索タームから連続
する２文字の文字列すべてを抽出する。Next, in step 1411, the sequential concatenated character extraction program 401 is activated and extracts all character strings of two consecutive characters from the search term that the character type division program 501 has divided by character type.

【０１８７】その後、ステップ１４１２でスキップ連接
文字抽出プログラム２１２を起動し、文字種分割プログ
ラム５０１によって文字種毎に分割された検索タームか
ら１文字おきに２文字の文字列すべてを抽出する。Then, in step 1412, the skip concatenated character extraction program 212 is activated, and all two character strings are extracted every other character from the search term divided for each character type by the character type division program 501.

【０１８８】さらに、連接文字成分表サーチプログラム
２１１はステップ１４１３でビットアンドプログラム２
１３を起動し、逐次連接文字抽出プログラム４０１およ
びスキップ連接文字抽出プログラム２１２によって抽出
されたすべての文字列に対応する連接文字成分表１０５
のエントリに格納されているビットリストを、ハッシュ
テーブル２１７を介してワークエリア２１６に読み込
み、ステップ１４１４で読み込まれたすべてのビットリ
スト間で各ビット毎に論理積演算を行う。Furthermore, in step 1413 the concatenated character component table search program 211 activates the bit AND program 213, which reads into the work area 216, via the hash table 217, the bit lists stored in the entries of the concatenated character component table 105 that correspond to all the character strings extracted by the sequential concatenated character extraction program 401 and the skip concatenated character extraction program 212. In step 1414, a logical AND is computed bit by bit across all the bit lists read.

【０１８９】この論理積演算の結果、‘１’となったビ
ットに対応する文書番号を連接文字成分表サーチの結果
として出力する。As a result of the logical product operation, the document number corresponding to the bit that becomes "1" is output as the result of the concatenated character component table search.

【０１９０】以下、上述した連接文字成分表作成登録プ
ログラム２０５の処理内容を詳細に説明する。The processing contents of the concatenated character component table creation registration program 205 described above will be described in detail below.

【０１９１】連接文字成分表作成登録プログラム２０５
では、まずテキストデータを文字種毎に分割し、分割さ
れたテキストデータから連続する２文字の文字列および
１文字おきに２文字の文字列を抽出する。The concatenated character component table creation and registration program 205 first divides the text data by character type, and then extracts character strings of two consecutive characters and two-character strings taken at every other character from the divided text data.

【０１９２】この連接文字の抽出処理について、例え
ば、“自動画質調整機能を備えた画像処理装置”という
テキストが登録された場合を例に説明する。This concatenated character extraction process is described using the case where the text “自動画質調整機能を備えた画像処理装置” ("an image processing apparatus with an automatic image quality adjustment function") is registered.

【０１９３】文字種分割プログラム５００によって、文
書１は図３２に示すように、“自動画質調整機能”、
“を”、“備”、“えた”、“画像処理装置”・・・に
分割される。As shown in FIG. 32, the character type division program 500 divides document 1 into “自動画質調整機能”, “を”, “備”, “えた”, “画像処理装置”, ....
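The division by character type can be sketched as follows. The classification by Unicode code point ranges and the names `char_type` and `split_by_char_type` are assumptions made for illustration; the patent does not specify how character types are determined.

```python
from itertools import groupby


def char_type(ch):
    """Classify one character; the code point ranges are an illustrative assumption."""
    code = ord(ch)
    if 0x4E00 <= code <= 0x9FFF:   # CJK Unified Ideographs
        return "kanji"
    if 0x3040 <= code <= 0x309F:   # Hiragana block
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:   # Katakana block (includes the long vowel mark)
        return "katakana"
    if ch.isdigit():
        return "digit"
    if ch.isalpha():
        return "alphabet"
    return "symbol"


def split_by_char_type(text):
    """Cut the text wherever the character type changes."""
    return ["".join(group) for _, group in groupby(text, key=char_type)]


print(split_by_char_type("自動画質調整機能を備えた画像処理装置"))
# ['自動画質調整機能', 'を', '備', 'えた', '画像処理装置']
```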

【０１９４】次に、逐次連接文字抽出プログラム４００
によって、文字種毎に分割されたテキストデータから逐
次連接文字成分として“自動”、“動画”、“画質”、
“質調”、“調整”、“整機”、“機能”、“えた”、
“画像”、“像処”、“処理”、“理装”、“装置”、
・・・が抽出される。Next, the sequential concatenated character extraction program 400 extracts “自動”, “動画”, “画質”, “質調”, “調整”, “整機”, “機能”, “えた”, “画像”, “像処”, “処理”, “理装”, “装置”, ... as sequential concatenated character components from the text data divided by character type.

【０１９５】さらに、スキップ連接文字抽出プログラム
２０６によって、文字種毎に分割されたテキストデータ
からスキップ連接文字成分として“自画”、“動質”、
“画調”、“質整”、“調機”、“整能”、“画処”、
“像理”、“処装”、“理置”、・・・が抽出される。In addition, the skip concatenated character extraction program 206 extracts “自画”, “動質”, “画調”, “質整”, “調機”, “整能”, “画処”, “像理”, “処装”, “理置”, ... as skip concatenated character components from the text data divided by character type.
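Extracting components segment by segment reproduces the two lists above. As before, this is a sketch with hypothetical helper names, and the character type classification by Unicode range is an assumption.

```python
from itertools import groupby


def char_type(ch):
    """Rough character type classification (assumed Unicode ranges)."""
    code = ord(ch)
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    if 0x3040 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    return "other"


def split_by_char_type(text):
    return ["".join(group) for _, group in groupby(text, key=char_type)]


def segmented_components(text):
    """Extract bigrams and every-other-character pairs inside each segment only."""
    sequential, skip = [], []
    for seg in split_by_char_type(text):
        sequential += [seg[k:k + 2] for k in range(len(seg) - 1)]
        skip += [seg[k] + seg[k + 2] for k in range(len(seg) - 2)]
    return sequential, skip


sequential, skip = segmented_components("自動画質調整機能を備えた画像処理装置")
print(sequential)
print(skip)
```

Because no pair is formed across a segment boundary, components such as “能を”, “機を”, and “能備” from the fifth embodiment no longer appear.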

【０１９６】最後に、連接文字成分表登録プログラム２
０７が起動される。ここでは、連接文字用ハッシュテー
ブル２１６−ａおよびスキップ連接用ハッシュテーブル
２１６−ｂを介して、それぞれ逐次連接文字抽出プログ
ラム４００およびスキップ連接文字抽出プログラム２０
６によって抽出された連接文字成分に対応するエントリ
に‘１’を設定し、連接文字成分の存在を記す。Finally, the concatenated character component table registration program 207 is activated. Via the hash table 216-a for concatenated characters and the hash table 216-b for skip concatenation, ‘1’ is set in the entries corresponding to the concatenated character components extracted by the sequential concatenated character extraction program 400 and the skip concatenated character extraction program 206, respectively, thereby recording the presence of each concatenated character component.

【０１９７】次に、検索時の処理について詳細に説明す
る。Next, the processing at the time of retrieval will be described in detail.

【０１９８】まず、文字種分割プログラム５０１によっ
て検索タームを文字種毎に分割する。図３３に示す例で
は、検索タームは“動画像”であり、全て漢字で構成さ
れているため文字種分割により“動画像”がそのまま切
り出される次に、文字種分割された検索タームから逐次
連接文字抽出プログラム２０６によって、“動画”およ
び“画像”が逐次連接文字成分として抽出される。さら
にスキップ連接文字抽出２１２によって、“動像”がス
キップ連接文字成分として抽出される。次に、ビットア
ンドプログラム２１３により、連接文字成分表１０５の
ビットリストが逐次連接文字成分については逐次連接用
ハッシュテーブル２１５−ａを介して、スキップ連接文
字成分についてはスキップ連接用ハッシュテーブル２１
６−ｂを介して読み出される。そして、これらすべての
ビットリストのビットがすべて‘１’である文書を連接
文字成分表サーチの検索結果として得る。これにより、
図３３の例では、文書Ｎがヒット文書として得られる。First, the character type division program 501 divides the search term by character type. In the example shown in FIG. 33, the search term is “動画像”, which consists entirely of kanji, so character type division cuts out “動画像” unchanged. Next, the sequential concatenated character extraction program 206 extracts “動画” and “画像” from the divided search term as sequential concatenated character components. The skip concatenated character extraction 212 then extracts “動像” as a skip concatenated character component. Next, the bit AND program 213 reads the bit lists of the concatenated character component table 105, via the sequential concatenation hash table 215-a for the sequential concatenated character components and via the skip concatenation hash table 216-b for the skip concatenated character components. The documents whose bits are ‘1’ in all of these bit lists are obtained as the result of the concatenated character component table search. In the example of FIG. 33, document N is thus obtained as a hit document.

【０１９９】このように、本実施例における連接文字成
分表の作成登録処理では、文書の登録時に、日本語テキ
ストデータを文字種毎に分割する。そして、文字種毎に
分割されたテキストデータから連続する２文字の文字列
(逐次連接文字)および１文字おきに２文字の文字列(ス
キップ連接文字)を取り出し、この連接文字の存在情報
を予め連接文字成分表に登録する。検索時にも、文字種
毎に分割された検索タームに対し逐次連接文字成分およ
びスキップ連接文字成分を抽出することによって、文字
種間にまたがった連接文字およびスキップ連接文字が文
字成分表に登録されて生じるノイズを削減することがで
きる。As described above, in the concatenated character component table creation and registration process of this embodiment, the Japanese text data is divided by character type when a document is registered. Character strings of two consecutive characters (sequential concatenated characters) and two-character strings taken at every other character (skip concatenated characters) are then extracted from the divided text data, and the existence information of these concatenated characters is registered in the concatenated character component table in advance. At search time as well, the sequential and skip concatenated character components are extracted from the search term after it has been divided by character type. This reduces the noise caused by sequential and skip concatenated characters that span character types being registered in the character component table.

【０２００】例えば、テキストデータ中に“動画"と
“画像"を含み、かつ“・・・感動の像を写し出す・・・"とい
う文字列を含む文書が登録された場合には、逐次連接文
字として“動画"と“画像"が抽出されるが、文字種毎に
文字列を分割してから連接文字成分を抽出することによ
り、文字種間にまたがって現われるスキップ連接文字
“動像"を排除することができる。このため、検索ター
ムとして“動画像"が指定されたときには逐次連接文字
である“動画"と“画像"は抽出されるが、スキップ連接
文字である“動像"は抽出されない。したがって、前記
の文書をノイズとして削除することができる。For example, when a document is registered whose text data contains “動画” and “画像” and also contains the character string “・・・感動の像を写し出す・・・”, “動画” and “画像” are extracted as sequential concatenated characters, but because the character string is divided by character type before the concatenated character components are extracted, the skip concatenated character “動像”, which would appear only across character types, is excluded. Consequently, when “動画像” is specified as the search term, the sequential concatenated characters “動画” and “画像” are found for this document, but the skip concatenated character “動像” is not. The document can therefore be eliminated as noise.
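The effect of character type division on this kind of noise can be checked with a small sketch. The document string is an invented example that contains “動画”, “画像”, and “感動の像を写し出す” but not “動画像”; the function names and the Unicode range classification are assumptions.

```python
from itertools import groupby


def char_type(ch):
    """Rough character type classification (assumed Unicode ranges)."""
    code = ord(ch)
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    if 0x3040 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    return "other"


def components(text, segmented):
    """Return (sequential, skip) component sets, optionally per character type segment."""
    if segmented:
        parts = ["".join(g) for _, g in groupby(text, key=char_type)]
    else:
        parts = [text]
    sequential, skip = set(), set()
    for part in parts:
        sequential.update(part[k:k + 2] for k in range(len(part) - 1))
        skip.update(part[k] + part[k + 2] for k in range(len(part) - 2))
    return sequential, skip


def is_candidate(term, document, segmented):
    term_seq, term_skip = components(term, segmented)
    doc_seq, doc_skip = components(document, segmented)
    return term_seq <= doc_seq and term_skip <= doc_skip


doc = "動画と画像で感動の像を写し出す"
print(is_candidate("動画像", doc, segmented=False))  # True: "動像" spans the hiragana "の"
print(is_candidate("動画像", doc, segmented=True))   # False: no pair crosses a type boundary
```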

【０２０１】その結果、連接文字成分表サーチにおける
絞り込み率を向上させることができ、階層プリサーチに
おける凝縮テキストの探索量が削減できることになるた
め、等価的に全体の検索速度が向上することになる。し
たがって、より大量のフルテキストサーチが実時間で可
能となる。As a result, the narrowing-down rate of the concatenated character component table search improves and the amount of condensed text scanned in the hierarchical presearch is reduced, so the overall search speed is effectively improved. A larger volume of full text search therefore becomes possible in real time.

【０２０２】なお、本実施例では漢字、平仮名、カタカ
ナ、アルファベット、数字および記号などを単位にテキ
ストデータおよび検索タームを文字種分割する方法につ
いて説明した。しかし、ある特定の文字種間では文字種
分割を行わず、連続した文字列として逐次連接文字成分
およびスキップ連接文字成分を抽出することにより、文
字種の混在した文字列が検索タームに指定された場合に
も、文字種間にまたがった連接文字成分を利用して高精
度に絞り込みを行うことが可能である。例えば、漢字と
カタカナの間で文字種分割を行わないことにより、“半
導体レーザ"や“磁気ディスク"などの文字列についても
高精度に絞り込みを行うことができる。さらに、漢字と
平仮名の間で文字種分割を行わないことにより、“日の
出"や“読み込み"などの文字列についても高精度に絞り
込みを行うことができる。This embodiment has described a method of dividing the text data and the search term by character type, treating kanji, hiragana, katakana, alphabetic characters, numerals, symbols, and so on as separate units. However, by not dividing between certain specific character types and instead extracting the sequential and skip concatenated character components from the undivided character string, documents can be narrowed down with high precision, using concatenated character components that span those character types, even when a character string of mixed character types is specified as the search term. For example, by not dividing between kanji and katakana, character strings such as “半導体レーザ” (semiconductor laser) and “磁気ディスク” (magnetic disk) can be narrowed down with high precision. Similarly, by not dividing between kanji and hiragana, character strings such as “日の出” (sunrise) and “読み込み” (reading) can be narrowed down with high precision.
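One way to express "no division between specific character types" is to let the splitter treat a chosen set of types as a single group. The `merged` parameter below is a hypothetical design used only for illustration, and the Unicode range classification is again an assumption.

```python
from itertools import groupby


def char_type(ch):
    """Rough character type classification (assumed Unicode ranges)."""
    code = ord(ch)
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    if 0x3040 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    return "other"


def split_by_char_type(text, merged=()):
    """Split on character type changes, except between types listed together in `merged`."""
    def group_key(ch):
        kind = char_type(ch)
        for group in merged:
            if kind in group:
                return "+".join(sorted(group))
        return kind
    return ["".join(g) for _, g in groupby(text, key=group_key)]


def sequential_components(segments):
    return [seg[k:k + 2] for seg in segments for k in range(len(seg) - 1)]


strict = split_by_char_type("半導体レーザ")
relaxed = split_by_char_type("半導体レーザ", merged=[{"kanji", "katakana"}])
print(strict)                          # ['半導体', 'レーザ']
print(relaxed)                         # ['半導体レーザ']
print(sequential_components(relaxed))  # now includes the boundary component '体レ'
```

With the relaxed split, the component “体レ” that links the kanji part to the katakana part becomes available for narrowing down, at the cost of registering more boundary components.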

【０２０３】また、本実施例では単純にテキストデータ
および検索タームを単純に文字種毎に分割する方法につ
いて説明したが、辞書や日本語処理を用いた単語切り出
し方法によってテキストデータおよび検索タームを分割
した後、逐次連接文字およびスキップ連接文字成分を抽
出する場合についても、同様の効果が得られることは上
記の説明から明らかであろう。This embodiment has described a method that simply divides the text data and the search term by character type. However, it will be clear from the above description that the same effect is obtained when the text data and the search term are divided by a word segmentation method that uses a dictionary or Japanese language processing, and the sequential and skip concatenated character components are extracted afterwards.

【０２０４】本実施例では逐次連接文字とスキップ連接
文字の両方の連接文字成分表を用いることにより検索対
象となる文書を絞り込む方法について説明した。しか
し、従来の互いに隣り合う連接文字(逐次連接文字)だけ
を用いる方法についても同様に、テキストデータおよび
検索タームを文字種単位に分割してから逐次連接文字成
分表を作成することにより、文字種間にまたがった連接
文字が他の連接文字と同じエントリにハッシングされる
ことによって生じるノイズを削減できることは明らかで
ある。This embodiment has described narrowing down the documents to be searched by using concatenated character component tables for both sequential and skip concatenated characters. However, it is clear that the conventional method using only adjacent concatenated characters (sequential concatenated characters) benefits in the same way: by dividing the text data and the search term by character type before creating the sequential concatenated character component table, the noise that arises when a concatenated character spanning character types is hashed to the same entry as another concatenated character can be reduced.

【０２０５】次に、本発明の第七の実施例について説明
する。Next, a seventh embodiment of the present invention will be described.

【０２０６】本発明の第六の実施例では、日本語文書に
対しテキストデータおよび検索タームを文字種毎に分割
してから逐次連接文字およびスキップ連接文字を抽出す
ることにより連接文字成分表サーチのノイズを削減する
方法について説明した。しかし、この方法では検索ター
ムに指定した文字列を部分文字列して含む別の単語が現
われる文書をノイズとして検索してしまうという問題が
ある。すなわち、検索タームとして“動画像"という文
字列が指定された場合には、逐次連接文字として“動
画"と“画像"が、スキップ連接文字として“動像"が抽
出される。これに対し、テキストデータ中に“高速な自
動画像処理装置・・・"という文字列を含む文書が登録され
た場合には、やはりテキストデータから逐次連接文字と
して“動画"と“画像"が、スキップ連接文字として“動
像"が抽出される。このため、検索タームである“動画
像"を部分文字列として含む別の単語である“自動画像
処理装置"の現われる文書がノイズとしてヒットしてし
まうことになる。In the sixth embodiment of the present invention, a method was described for reducing noise in the concatenated character component table search by dividing the text data and the search term of a Japanese document by character type before extracting sequential and skip concatenated characters. However, this method has the problem that a document containing a different word that includes the search term as a substring is retrieved as noise. For example, when the string "動画像" (moving image) is specified as the search term, "動画" and "画像" are extracted as sequential concatenated characters and "動像" as a skip concatenated character. If a document whose text data contains the string "高速な自動画像処理装置・・・" (high-speed automatic image processing device ...) has been registered, "動画" and "画像" are likewise extracted from its text data as sequential concatenated characters, and "動像" as a skip concatenated character. Consequently, a document containing "自動画像処理装置", a different word that includes the search term "動画像" as a substring, is hit as noise.
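The false hit described above can be reproduced with a minimal Python sketch. The function names `sequential_pairs` and `skip_pairs` are hypothetical, and plain Python sets stand in for the hashed component tables, so this is an illustration of the problem rather than the embodiment's implementation.

```python
def sequential_pairs(s):
    """Strings of two consecutive characters (sequential concatenated characters)."""
    return {s[i:i + 2] for i in range(len(s) - 1)}


def skip_pairs(s):
    """Two-character strings formed by skipping one character (skip concatenated characters)."""
    return {s[i] + s[i + 2] for i in range(len(s) - 2)}


term = "動画像"               # search term
segment = "自動画像処理装置"   # kanji run taken from the registered text

# Every component of the search term also occurs in the longer word,
# so a component table lookup cannot tell the two apart.
seq_covered = sequential_pairs(term) <= sequential_pairs(segment)
skip_covered = skip_pairs(term) <= skip_pairs(segment)
false_hit = seq_covered and skip_covered
```

Here `false_hit` is true even though "動画像" does not occur as a word of its own in the document.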

【０２０７】この問題に対し本発明の第七の実施例で
は、第六の実施例における文書検索方法において文字種
毎に分割されたテキストデータおよび検索タームの前後
に特殊文字（例えば、ここでは‘＾’とする）を付加
し、それを含めて連接文字成分を抽出する特殊文字付加
型の連接文字成分表を作成する。これにより、特殊文字
で文字種の区切りを判別できるようにし、検索タームを
部分文字列として含む別の単語が現われる文書を排除し
て、ノイズを削減する。To address this problem, the seventh embodiment of the present invention adds a special character (here, '＾' as an example) before and after each piece of the text data and the search term divided by character type in the document search method of the sixth embodiment, and creates a special-character-added concatenated character component table from the concatenated character components extracted with the special character included. This makes character type boundaries identifiable by the special character, so that documents in which a different word containing the search term as a substring appears are excluded and noise is reduced.

【０２０８】本実施例は第六の実施例と基本的に同様の
構成をとるが、図２８に示した連接文字成分表作成登録
プログラム２０５が図３４に示した構成に、また図２９
に示した連接文字成分表サーチプログラム２１１が図３
５に示す構成となる。This embodiment has basically the same configuration as the sixth embodiment, except that the concatenated character component table creation/registration program 205 shown in FIG. 28 has the configuration shown in FIG. 34, and the concatenated character component table search program 211 shown in FIG. 29 has the configuration shown in FIG. 35.

【０２０９】すなわち、本実施例における連接文字成分
表作成登録プログラム２０５は、文字種分割プログラム
５００、特殊文字付加プログラム３０１、逐次連接文字
抽出プログラム４００、スキップ連接文字抽出プログラ
ム２０６、連接文字成分表登録プログラム２０７および
ハッシュテーブル作成プログラム２０８で構成され、連
接文字成分表サーチプログラム２１１は文字種分割プロ
グラム５０１、特殊文字付加プログラム３０２、逐次連
接文字抽出プログラム４０１、スキップ連接文字抽出プ
ログラム２１２およびビットアンドプログラム２１３で
構成される。That is, the concatenated character component table creation/registration program 205 in this embodiment consists of the character type division program 500, the special character addition program 301, the sequential concatenated character extraction program 400, the skip concatenated character extraction program 206, the concatenated character component table registration program 207, and the hash table creation program 208. The concatenated character component table search program 211 consists of the character type division program 501, the special character addition program 302, the sequential concatenated character extraction program 401, the skip concatenated character extraction program 212, and the bit AND program 213.

【０２１０】本実施例における連接文字成分表作成登録
プログラム２０５は、図３６に示すように、まずステッ
プ１５００で文字種分割プログラム５００を起動し、磁
気ディスク１１０に格納されたテキスト１０３からテキ
ストデータをワークエリア２１７に読み込み、テキスト
データを文字種毎に分割する。As shown in FIG. 36, the concatenated character component table creation/registration program 205 in this embodiment first starts the character type division program 500 in step 1500, reads the text data from the text 103 stored on the magnetic disk 110 into the work area 217, and divides the text data by character type.

【０２１１】次に、ステップ１５０１で特殊文字付加プ
ログラム３０１を起動し、文字種分割プログラム５００
によって文字種毎に分割されたテキストデータの前後に
特殊文字‘＾’を付加する。Next, in step 1501, the special character addition program 301 is started, and the special character '＾' is added before and after each piece of the text data divided by character type by the character type division program 500.

【０２１２】さらに、ステップ１５０２で逐次連接文字
抽出プログラム４００を起動し、特殊文字付加プログラ
ム３０１によって特殊文字の付加されたテキストデータ
から連続する２文字の文字列を抽出する。Further, in step 1502, the sequential concatenated character extraction program 400 is started and extracts strings of two consecutive characters from the text data to which the special characters were added by the special character addition program 301.

【０２１３】その後、連接文字成分表作成登録プログラ
ム２０５はステップ１５０３でスキップ連接文字抽出プ
ログラム２０６を起動し、特殊文字付加プログラム３０
１によって特殊文字の付加されたテキストデータから１
文字おきに２文字の文字列をすべて抽出する。Thereafter, in step 1503, the concatenated character component table creation/registration program 205 starts the skip concatenated character extraction program 206, which extracts all two-character strings formed from every other character of the text data to which the special characters were added by the special character addition program 301.

【０２１４】最後に、ステップ１５０４で連接文字成分
表登録プログラム２０７を起動し、逐次連接文字抽出プ
ログラム４００およびスキップ連接文字抽出プログラム
２０６によって抽出された連接文字を、ワークエリア２
１７内の連接文字成分表１０５にハッシュテーブル２１
６に従って登録し、これを磁気ディスク１１０へ格納す
る。Finally, in step 1504, the concatenated character component table registration program 207 is started. The concatenated characters extracted by the sequential concatenated character extraction program 400 and the skip concatenated character extraction program 206 are registered, according to the hash table 216, in the concatenated character component table 105 in the work area 217, and the table is then stored on the magnetic disk 110.

【０２１５】検索時には、まず連接文字成分表サーチプ
ログラム２１１は図３７に示すようにステップ１５１０
で文字種分割プログラム５０１を起動し、検索条件式中
の検索タームを文字種毎に分割する。At search time, as shown in FIG. 37, the concatenated character component table search program 211 first starts the character type division program 501 in step 1510 and divides the search term in the search condition expression by character type.

【０２１６】次に、ステップ１５１１で特殊文字付加プ
ログラム３０２を起動し、文字種分割プログラム５０１
によって文字種毎に分割された検索タームの前後
に特殊文字‘＾’を付加する。Next, in step 1511, the special character addition program 302 is started, and the special character '＾' is added before and after each piece of the search term divided by character type by the character type division program 501.

【０２１７】その後、ステップ１５１２で逐次連接文字
抽出プログラム４０１を起動し、特殊文字付加プログラ
ム３０２によって特殊文字の付加された検索タームから
連続する２文字の文字列すべてを抽出する。Then, in step 1512, the sequential concatenated character extraction program 401 is started and extracts all strings of two consecutive characters from the search term to which the special characters were added by the special character addition program 302.

【０２１８】また、ステップ１５１３でスキップ連接文
字抽出プログラム２１２を起動し、特殊文字付加プログ
ラム３０２によって特殊文字の付加された検索タームか
ら１文字おきに２文字の文字列すべてを抽出する。In step 1513, the skip concatenated character extraction program 212 is started and extracts all two-character strings formed from every other character of the search term to which the special characters were added by the special character addition program 302.

【０２１９】さらに、ステップ１５１４でビットアンド
プログラム２１３を起動し、逐次連接文字抽出プログラ
ム４０１およびスキップ連接文字抽出プログラム２１２
によって抽出されたすべての文字列に対応する連接文字
成分表１０５のエントリに格納されているビットリスト
を、ハッシュテーブル２１６を介してワークエリア２１
７に読み込み、読み込まれたすべてのビットリスト間で
各ビット毎に論理積演算を行う。Further, in step 1514, the bit AND program 213 is started. The bit lists stored in the entries of the concatenated character component table 105 that correspond to all of the strings extracted by the sequential concatenated character extraction program 401 and the skip concatenated character extraction program 212 are read into the work area 217 via the hash table 216, and a bitwise logical AND is computed across all of the bit lists read.

【０２２０】この論理積演算の結果、‘１’となったビ
ットに対応する文書番号を連接文字成分表サーチの結果
として出力する。As a result of the logical product operation, the document number corresponding to the bit which becomes "1" is output as the result of the concatenated character component table search.
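The bitwise AND step can be sketched as follows. This is a minimal illustration under stated assumptions: each bit list is modelled as a Python integer in which bit (n - 1) stands for document number n, the table contents are invented toy data, and hashing of components to entries is omitted.

```python
from functools import reduce


def hit_documents(bit_lists, num_docs):
    """AND all bit lists and return the document numbers whose bit stays 1."""
    all_ones = (1 << num_docs) - 1
    combined = reduce(lambda a, b: a & b, bit_lists, all_ones)
    return [doc for doc in range(1, num_docs + 1) if (combined >> (doc - 1)) & 1]


# Toy component table for 4 documents (illustrative values only).
table = {
    "＾動": 0b1010, "動画": 0b1110, "画像": 0b1011, "像＾": 0b1010,  # sequential
    "＾画": 0b1010, "動像": 0b1110, "画＾": 0b1010,                  # skip
}

hits = hit_documents(table.values(), num_docs=4)
```

Only documents 2 and 4 have every component of the search term, so only their bits survive the AND.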

【０２２１】以下、上述した連接文字成分表作成登録プ
ログラム２０５の処理内容を詳細に説明する。The processing contents of the concatenated character component table creation registration program 205 described above will be described in detail below.

【０２２２】連接文字成分表作成登録プログラム２０５
では、まずテキストデータを文字種毎に分割し、分割さ
れたテキストデータから連続する２文字の文字列および
１文字おきに２文字の文字列を抽出する。The concatenated character component table creation/registration program 205 first divides the text data by character type, and then extracts strings of two consecutive characters and two-character strings formed from every other character from the divided text data.

【０２２３】この連接文字の抽出処理について、例え
ば、“高速な自動画像処理装置”というテキストが登録
された場合を例に説明する。This concatenated character extraction processing is described below, taking as an example the case where the text "高速な自動画像処理装置" (high-speed automatic image processing device) is registered.

【０２２４】文字種分割プログラム５００によって図３
８に示すように文書１は、“高速”、“な”、“自動画
像処理装置”・・・に分割される。As shown in FIG. 38, the character type division program 500 divides document 1 into "高速", "な", "自動画像処理装置", and so on.

【０２２５】次に特殊文字付加プログラム３０１によっ
て、分割されたテキストデータの前後に特殊文字‘＾’
が付加され、“＾高速＾”、“＾な＾”、“＾自動画像
処理装置＾”、・・・となる。Next, the special character addition program 301 adds the special character '＾' before and after each piece of the divided text data, yielding "＾高速＾", "＾な＾", "＾自動画像処理装置＾", and so on.

【０２２６】次に、逐次連接文字抽出プログラム４００
によって、特殊文字の付加されたテキストデータから逐
次連接文字成分として、“＾高”、“高速”、“速
＾”、“＾な”、“な＾”、“＾自”、“自動”、“動
画”、“画像”、“像処”、“処理”、“理装”、“装
置”、・・・が抽出される。Next, the sequential concatenated character extraction program 400 extracts "＾高", "高速", "速＾", "＾な", "な＾", "＾自", "自動", "動画", "画像", "像処", "処理", "理装", "装置", and so on from the text data with special characters added, as sequential concatenated character components.

【０２２７】さらに、スキップ連接文字抽出プログラム
２０６によって、特殊文字の付加されたテキストデータ
からスキップ連接文字成分として、“＾速”、“高
＾”、“＾動”、“自画”、“動像”、“画処”、“像
理”、“処装”、“理置”、“装＾”、・・・が抽出さ
れる。Further, the skip concatenated character extraction program 206 extracts "＾速", "高＾", "＾動", "自画", "動像", "画処", "像理", "処装", "理置", "装＾", and so on from the text data with special characters added, as skip concatenated character components.
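The registration-side steps above (character type division, special character addition, and the two extractions) can be sketched in Python as follows. The Unicode ranges and all function names are assumptions made for this illustration, not part of the specification.

```python
from itertools import groupby

MARK = "＾"  # special character marking a character type boundary


def char_type(ch):
    """Classify one character (simplified Unicode ranges, an assumption)."""
    code = ord(ch)
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    if 0x3040 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    return "other"


def split_by_char_type(text):
    """Divide text into runs of a single character type."""
    return ["".join(group) for _, group in groupby(text, key=char_type)]


def add_marks(segments):
    """Add the special character before and after each segment."""
    return [MARK + seg + MARK for seg in segments]


def sequential_pairs(s):
    """Strings of two consecutive characters, in order of appearance."""
    return [s[i:i + 2] for i in range(len(s) - 1)]


def skip_pairs(s):
    """Two-character strings formed by skipping one character."""
    return [s[i] + s[i + 2] for i in range(len(s) - 2)]


segments = split_by_char_type("高速な自動画像処理装置")
marked = add_marks(segments)
sequential = [pair for seg in marked for pair in sequential_pairs(seg)]
skip = [pair for seg in marked for pair in skip_pairs(seg)]
```

Read literally, the skip extraction also produces "＾＾" from the one-character segment "＾な＾"; the listing in the text does not show this pair, and whether it is registered is left open here.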

【０２２８】最後に、連接文字成分表登録プログラム２
０７を起動する。ここでは、連接文字用ハッシュテーブ
ル２１６−ａおよびスキップ連接用ハッシュテーブル２
１６−ｂを介して、それぞれ逐次連接文字抽出プログラ
ム４００およびスキップ連接文字抽出プログラム２０６
によって抽出された連接文字成分に対応するエントリに
‘１’を設定し、連接文字成分の存在を記す。Finally, the concatenated character component table registration program 207 is started. Here, '1' is set in the entries corresponding to the concatenated character components extracted by the sequential concatenated character extraction program 400 and the skip concatenated character extraction program 206, via the sequential concatenation hash table 216-a and the skip concatenation hash table 216-b respectively, to record the presence of those concatenated character components.

【０２２９】次に、検索時の処理について詳細に説明す
る。Next, the processing at the time of retrieval will be described in detail.

【０２３０】まず、文字種分割プログラム５０１によっ
て検索タームを文字種毎に分割する。図３９に示す例で
は、検索タームは“動画像”であり、全て漢字で構成さ
れているため文字種分割により“動画像”がそのまま切
り出される。次に特殊文字付加プログラム３０２によっ
て、検索ターム“動画像”の前後に“＾”が付加され、
“＾動画像＾”となる。First, the character type division program 501 divides the search term by character type. In the example shown in FIG. 39, the search term is "動画像" (moving image); because it consists entirely of kanji, character type division yields "動画像" unchanged. Next, the special character addition program 302 adds "＾" before and after the search term "動画像", giving "＾動画像＾".

【０２３１】次に、特殊文字の付加された検索タームか
ら逐次連接文字抽出プログラム４０１によって“＾
動”、“動画”、“画像”および“像＾”が逐次連接文
字成分として抽出される。さらにスキップ連接文字抽出
２１２によって、“＾画”、“動像”および“画＾”が
スキップ連接文字成分として抽出される。次に、ビット
アンドプログラム２１３により、連接文字成分表１０５
のビットリストが逐次連接文字成分については逐次連接
用ハッシュテーブル２１６−ａを介して、スキップ連接
文字成分についてはスキップ連接用ハッシュテーブル２
１６−ｂを介して読み出される。そして、これらすべて
のビットリストのビットがすべて‘１’である文書を連
接文字成分表サーチの検索結果として得る。これによ
り、同図の例では、文書Ｎがヒット文書として得られ
る。Next, the sequential concatenated character extraction program 401 extracts "＾動", "動画", "画像", and "像＾" from the search term with special characters added, as sequential concatenated character components. The skip concatenated character extraction program 212 further extracts "＾画", "動像", and "画＾" as skip concatenated character components. The bit AND program 213 then reads the bit lists of the concatenated character component table 105, via the sequential concatenation hash table 216-a for the sequential concatenated character components and via the skip concatenation hash table 216-b for the skip concatenated character components. The documents whose bits are '1' in all of these bit lists are obtained as the result of the concatenated character component table search. In the example of the figure, document N is thereby obtained as the hit document.

【０２３２】すなわち、検索タームとして“動画像”が
指定された時には、検索タームの前後に特殊文字を付加
した“＾動画像＾”から逐次連接文字として“＾動”、
“動画”、“画像”および“像＾”が、スキップ連接文字
として“＾画”、“動像”および“画＾”が抽出され
る。それに対し、“自動画像処理”を含む文書中からは
“＾動”と“像＾”に対応する逐次連接文字成分、およ
び“＾画”と“画＾”に対応するスキップ連接文字成分
が抽出されない。このため、検索タームである“動画
像”を部分文字列として含む別の単語の“自動画像処
理”が現われる文書をノイズとして検索の対象から外す
ことができる。That is, when "動画像" is specified as the search term, "＾動", "動画", "画像", and "像＾" are extracted as sequential concatenated characters, and "＾画", "動像", and "画＾" as skip concatenated characters, from "＾動画像＾", the search term with special characters added before and after it. In contrast, a document containing "自動画像処理" (automatic image processing) yields no sequential concatenated character components corresponding to "＾動" and "像＾", and no skip concatenated character components corresponding to "＾画" and "画＾". A document in which "自動画像処理", a different word containing the search term "動画像" as a substring, appears can therefore be excluded from the search as noise.

【０２３３】このように、本実施例における連接文字成
分表の作成登録処理では、文書の登録時に、文字種毎に
分割された日本語テキストデータの前後に特殊文字を付
加する。そして、特殊文字の付加されたテキストデータ
から連続する２文字の文字列(逐次連接文字)および１文
字おきに２文字の文字列(スキップ連接文字)を取り出
し、この連接文字の存在情報を予め連接文字成分表に登
録する。検索時にも、文字種単位に分割した検索ターム
の前後に特殊文字を付加した後、逐次連接文字およびス
キップ連接文字を抽出することにより、指定した検索タ
ームを部分文字列として含む別の単語が中間一致によっ
てヒットすることを避けることができ、ノイズを削減す
ることができる。すなわち、テキスト中に“自動画像処
理"という単語が登録された時には、逐次連接文字とし
て“＾自"、“自動"、“動画"、“画像"、“像処"、
“処理"および“理＾"が、スキップ連接文字として“＾
動"、“自画"、“動像"、“画処"、“像理"および“処
＾"が抽出される。これに対し、検索タームとして“動
画像"が指定されたときには、逐次連接文字として“＾
動"、“動画"、“画像"および“像＾"が、スキップ連接文
字として、“＾画"、“動像"および“画＾"が抽出され
ることになるが、テキスト中に“自動画像処理"という
単語を含む文書からは逐次連接文字として“＾動"およ
び“像＾"がスキップ連接文字として“＾画"および“画
＾"が抽出されないため、前記文書をノイズとして削除
することができる。As described above, in the concatenated character component table creation and registration processing of this embodiment, special characters are added before and after each piece of Japanese text data divided by character type when a document is registered. Strings of two consecutive characters (sequential concatenated characters) and two-character strings formed from every other character (skip concatenated characters) are then extracted from the text data with special characters added, and the presence information of these concatenated characters is registered in the concatenated character component table in advance. At search time as well, special characters are added before and after the search term divided by character type, and sequential and skip concatenated characters are then extracted. This prevents a different word containing the specified search term as a substring from being hit through a mid-string match, and so reduces noise. That is, when the word "自動画像処理" is registered in the text, "＾自", "自動", "動画", "画像", "像処", "処理", and "理＾" are extracted as sequential concatenated characters, and "＾動", "自画", "動像", "画処", "像理", and "処＾" as skip concatenated characters. When "動画像" is specified as the search term, "＾動", "動画", "画像", and "像＾" are extracted as sequential concatenated characters, and "＾画", "動像", and "画＾" as skip concatenated characters. Because a document containing the word "自動画像処理" yields neither "＾動" and "像＾" as sequential concatenated characters nor "＾画" and "画＾" as skip concatenated characters, that document can be eliminated as noise.

【０２３４】その結果、連接文字成分表サーチにおける
絞り込み率を向上させることができる。そのため、階層
プリサーチにおける凝縮テキストの探索量が削減でき、
等価的に全体の検索速度が向上することになる。したが
って、より大量のフルテキストサーチが実時間で可能と
なる。As a result, the narrowing rate of the concatenated character component table search can be improved. This reduces the amount of condensed text searched in the hierarchical pre-search, which is equivalent to raising the overall search speed. A full-text search over a larger volume of text therefore becomes possible in real time.

【０２３５】なお、本実施例では逐次連接文字とスキッ
プ連接文字の両方の連接文字成分表を用いることにより
検索対象となる文書を絞り込む方法について説明した。
しかし、従来の互いに隣り合う連接文字(逐次連接文字)
だけを用いる方法についても同様に、文字種単位に分割
したテキストデータおよび検索タームの前後に特殊文字
を付加したし、逐次連接文字およびスキップ連接文字を
抽出することにより、指定した検索タームを部分文字列
として含む別の単語が中間一致によってヒットすること
によって生じるノイズを削減できることは明らかであろ
う。This embodiment has described narrowing down the documents to be searched by using concatenated character component tables for both sequential and skip concatenated characters. However, it will be clear that the same applies to the conventional method using only adjacent concatenated characters (sequential concatenated characters): by adding special characters before and after the text data and the search term divided by character type, and then extracting sequential and skip concatenated characters, the noise caused by a different word that contains the specified search term as a substring being hit through a mid-string match can be reduced.

【０２３６】次に、本発明の第八の実施例について説明
する。Next, an eighth embodiment of the present invention will be described.

【０２３７】本発明第五の実施例では、連接文字成分表
の１個のエントリに複数の連接文字成分を割り付ける、
すなわちハッシングすることにより実用的な容量で連接
文字成分表を実現する方法について説明した。しかしこ
の方法では、ある連接文字を指定して該当する文字成分
表のエントリを読み出した場合、そのビット情報から全
く別の連接文字成分を含む文書が得られる可能性があ
る。そのため、大量の文書を登録する大規模な文書検索
システムで、検索語に関係しない文書のふるい落とし、
すなわち絞込みが適確に行なわれず検索性能の低下につ
ながるおそれがある。In the fifth embodiment of the present invention, a method was described for realizing the concatenated character component table with a practical capacity by assigning a plurality of concatenated character components to one entry of the table, that is, by hashing. With this method, however, when the table entry for a given concatenated character is read, its bit information may identify documents that contain an entirely different concatenated character component. In a large-scale document search system in which a large number of documents are registered, documents unrelated to the search term may therefore not be screened out, that is, narrowed down, properly, which may degrade search performance.

【０２３８】この問題に対し本発明第八の実施例では、
連接文字成分表を作成する際に、出現頻度の高い連接文
字成分に対しては各連接文字の出現した文書番号に対応
するビット位置に‘１'を記したビットリストで、各連
接文字の出現した文書番号を格納する。さらに、出現頻
度の低い連接文字成分に対しては、各連接文字の出現し
た文書番号をバイナリデータのリストとして格納するこ
とにより、ハッシングによる検索ノイズの生じない連接
文字成分表を実用的な容量で実現する方法を取る。To address this problem, when creating the concatenated character component table, the eighth embodiment of the present invention stores the document numbers in which each high-frequency concatenated character component appears as a bit list, with '1' written at the bit position corresponding to each such document number. For low-frequency concatenated character components, the document numbers in which each concatenated character appears are stored as a list of binary data. In this way, a concatenated character component table free of the search noise caused by hashing is realized with a practical capacity.

【０２３９】本実施例は図１に示した第一の実施例と基
本的に同様の構成をとるが、その中の連接文字成分表１
０５、連接文字成分表作成登録プログラム２０５および
連接文字成分表サーチプログラム２１１の部分が、それ
ぞれ図４０、図４１および図４２に示す構成となる。This embodiment has basically the same configuration as the first embodiment shown in FIG. 1, except that the concatenated character component table 105, the concatenated character component table creation/registration program 205, and the concatenated character component table search program 211 have the configurations shown in FIGS. 40, 41, and 42, respectively.

【０２４０】すなわち、本実施例における連接文字成分
表１０５は図４０に示すようにビットリスト１０５−ａ
および文書番号リスト１０５−ｂで構成される。また、
連接文字成分表作成登録プログラム２０５は、図４１に
示すように逐次連接文字抽出プログラム４００、スキッ
プ連接文字抽出プログラム２０６、連接文字成分表登録
プログラム２０７および文字出現頻度算出プログラム６
００で構成され、連接文字成分表作成登録プログラム２
０５における連接文字成分表登録プログラム２０７は、
文書出現頻度判定プログラム６０１、ビットリスト登録
プログラム６０２および文書番号リスト登録プログラム
６０３で構成される。さらに、連接文字成分表サーチプ
ログラム２１１は図４２に示すように逐次連接文字抽出
プログラム４０１、スキップ連接文字抽出プログラム２
１２、連接文字成分表取得プログラム６０４およびビッ
トアンドプログラム２１３で構成される。That is, the concatenated character component table 105 in this embodiment consists of a bit list 105-a and a document number list 105-b, as shown in FIG. 40. As shown in FIG. 41, the concatenated character component table creation/registration program 205 consists of the sequential concatenated character extraction program 400, the skip concatenated character extraction program 206, the concatenated character component table registration program 207, and the character appearance frequency calculation program 600. The concatenated character component table registration program 207 within it consists of the document appearance frequency determination program 601, the bit list registration program 602, and the document number list registration program 603. Further, as shown in FIG. 42, the concatenated character component table search program 211 consists of the sequential concatenated character extraction program 401, the skip concatenated character extraction program 212, the concatenated character component table acquisition program 604, and the bit AND program 213.

【０２４１】以下、本実施例における連接文字成分表の
登録処理、およびサーチ処理の概要について説明する。The outline of the concatenated character component table registration processing and search processing in this embodiment will be described below.

【０２４２】本実施例では、まず登録処理の前処理とし
て連接文字成分表作成登録プログラム２０５は文字出現
頻度算出プログラム６００を起動し、テキスト１０３か
らテキストデータをワークエリア２１７に読み出す。そ
して、テキストデータ中に現われた連続する２文字の文
字列(逐次連接文字)および１文字おきに２文字の文字列
(スキップ連接文字)に対し、各文字列の出現した文書件
数(出現文書数)を算出する。In this embodiment, as preprocessing for the registration processing, the concatenated character component table creation/registration program 205 first starts the character appearance frequency calculation program 600 and reads the text data from the text 103 into the work area 217. Then, for each string of two consecutive characters (sequential concatenated character) and each two-character string formed from every other character (skip concatenated character) appearing in the text data, it calculates the number of documents in which the string appears (the appearing document count).

【０２４３】次に、連接文字成分表作成登録プログラム
２０５は図４３に示すようにステップ１６００で逐次連
接文字抽出プログラム４００を起動する。そして、各文
書毎にテキストデータ中に現われる連続する２文字の文
字列を逐次連接文字として抽出する。Next, as shown in FIG. 43, the concatenated character component table creation/registration program 205 starts the sequential concatenated character extraction program 400 in step 1600 and, for each document, extracts the strings of two consecutive characters appearing in the text data as sequential concatenated characters.

【０２４４】また、連接文字成分表作成プログラム２０
５はステップ１６０１でスキップ連接文字抽出プログラ
ム２０６を起動し、各文書毎に１文字おきに２文字の文
字列をスキップ連接文字として抽出する。The concatenated character component table creation/registration program 205 also starts the skip concatenated character extraction program 206 in step 1601 and, for each document, extracts two-character strings formed from every other character as skip concatenated characters.

【０２４５】さらに、連接文字成分表作成登録プログラ
ム２０５はステップ１６０２で連接文字成分表登録プロ
グラム２０７を起動し、逐次連接文字抽出プログラム４
００およびスキップ連接文字抽出プログラム２０６によ
って抽出された各連接文字の出現情報を連接文字成分表
に登録する。Further, in step 1602, the concatenated character component table creation/registration program 205 starts the concatenated character component table registration program 207, which registers the appearance information of each concatenated character extracted by the sequential concatenated character extraction program 400 and the skip concatenated character extraction program 206 in the concatenated character component table.

【０２４６】次に、連接文字成分表登録プログラム２０
７の処理の概要を図４４に示す。Next, FIG. 44 shows an outline of the processing of the concatenated character component table registration program 207.

【０２４７】連接文字成分表登録プログラム２０７は、
始めにステップ１６１０で文字出現頻度判定プログラム
６０１を起動し、逐次連接文字抽出プログラム４００お
よびスキップ連接文字抽出プログラム２０６によって抽
出された各連接文字の出現頻度が所定のしきい値より大
きいか否かを判定する。そして、大きい場合にはステッ
プ１６１１でビットリスト登録プログラム６０２を起動
し、各連接文字の出現した文書番号に該当するビット位
置に‘１’を記すことによって出現情報を記録する。ま
た、小さい場合にはステップ１６１２で文書番号リスト
登録プログラム６０３を起動し、各連接文字の出現した
文書番号をバイナリデータとして文書番号リストに登録
することにより出現情報を記録する。The concatenated character component table registration program 207 first starts the character appearance frequency determination program 601 in step 1610 and determines whether the appearance frequency of each concatenated character extracted by the sequential concatenated character extraction program 400 and the skip concatenated character extraction program 206 is greater than a predetermined threshold. If it is greater, the bit list registration program 602 is started in step 1611, and the appearance information is recorded by writing '1' at the bit positions corresponding to the document numbers in which the concatenated character appears. If it is smaller, the document number list registration program 603 is started in step 1612, and the appearance information is recorded by registering the document numbers in which the concatenated character appears in the document number list as binary data.

【０２４８】以上が登録処理の概要である。The above is the outline of the registration processing.

【０２４９】サーチ時には、連接文字成分表サーチプロ
グラム２１１は図４５に示すようにステップ１６２０で
逐次連接文字抽出プログラム４０１を起動し、検索ター
ムから連続する２文字の文字列を逐次連接文字として抽
出する。At search time, as shown in FIG. 45, the concatenated character component table search program 211 starts the sequential concatenated character extraction program 401 in step 1620 and extracts strings of two consecutive characters from the search term as sequential concatenated characters.

【０２５０】さらに、連接文字成分表サーチプログラム
２１１はステップ１６２１でスキップ連接文字抽出プロ
グラム２１２を起動し、１文字おきに２文字の文字列を
スキップ連接文字として抽出する。Further, in step 1621, the concatenated character component table search program 211 starts the skip concatenated character extraction program 212 and extracts two-character strings formed from every other character as skip concatenated characters.

【０２５１】次に、連接文字成分表サーチプログラム２
１１はステップ１６２２で連接文字成分表取得プログラ
ム６０４を起動する。連接文字成分表取得プログラム６
０４では、図４６に示すようにステップ１６３０で各連
接文字に対応する文字成分表がビットリストで格納され
ているか、文書番号リストで格納されているかを判定す
る。そして、ビットリストで格納されている場合には
ステップ１６３１を起動し、該当するビット列をそのま
ま連接文字成分表として取得する。また、文書番号リス
トで格納されている場合にはステップ１６３２を起動
し、文書番号リスト中の各文書番号に該当するビット位
置に‘１’を設定することによりビットリストに変換
し、該当する連接文字の文字成分表を取得する。Next, in step 1622, the concatenated character component table search program 211 starts the concatenated character component table acquisition program 604. As shown in FIG. 46, the concatenated character component table acquisition program 604 determines in step 1630 whether the character component table corresponding to each concatenated character is stored as a bit list or as a document number list. If it is stored as a bit list, step 1631 is executed and the bit string is acquired as is as the concatenated character component table. If it is stored as a document number list, step 1632 is executed: the list is converted to a bit list by setting '1' at the bit position corresponding to each document number in the document number list, and the character component table of the concatenated character is thus acquired.
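The acquisition step, in which a document number list is expanded to a bit list before the AND, can be sketched as follows. The entry format `("bits", int)` / `("docs", list)` and the function names are assumptions made for this illustration, and the table contents are toy data.

```python
def to_bit_list(entry):
    """Return the entry as a bit list (int); bit (n - 1) stands for document n."""
    kind, payload = entry
    if kind == "bits":
        return payload              # already a bit list: use as is
    bits = 0
    for doc in payload:             # document number list: set one bit per number
        bits |= 1 << (doc - 1)
    return bits


def search(table, components, num_docs):
    """AND the bit lists of all components and list the surviving documents."""
    result = (1 << num_docs) - 1
    for component in components:
        result &= to_bit_list(table[component])
    return [doc for doc in range(1, num_docs + 1) if (result >> (doc - 1)) & 1]


table = {
    "動画": ("bits", 0b10110),   # documents 2, 3, 5
    "画像": ("bits", 0b11110),   # documents 2, 3, 4, 5
    "動像": ("docs", [3, 5]),    # low-frequency component
}
hits = search(table, ["動画", "画像", "動像"], num_docs=5)
```

Converting at read time keeps the AND logic identical for both storage formats.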

【０２５２】最後に、連接文字成分表サーチプログラム
２１１はステップ１６２３でビットアンドプログラム２
１３を起動し、連接文字成分表取得プログラム６０４に
よって取得されたビットリストの間で各ビット毎に論理
積演算を行う。この論理積演算の結果‘１’となったビ
ットに対応する文書番号を連接文字成分表サーチの結果
として検索制御プログラム２０９に出力する。Finally, in step 1623, the concatenated character component table search program 211 starts the bit AND program 213 and computes a bitwise logical AND across the bit lists acquired by the concatenated character component table acquisition program 604. The document numbers corresponding to the bits that are '1' in the result of this AND operation are output to the search control program 209 as the result of the concatenated character component table search.

【０２５３】以上が、本発明による連接文字成分表の登
録およびサーチ処理の概略である。The above is the outline of the registration and search processing of the concatenated character component table according to the present invention.

【０２５４】さらに、実施例における連接文字成分表の
登録方法およびサーチ方法の詳細について、以下に例を
挙げて説明する。なお、本実施例では、全登録文書の件
数を100万件とし文書番号を32ビットのバイナリデータ
として文書番号リストに格納した場合について説明す
る。The details of the method of registering and searching the concatenated character component table in the embodiment will be described below with reference to examples. In the present embodiment, the case where the number of all registered documents is 1 million and the document number is stored in the document number list as 32-bit binary data will be described.
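For the figures given here (1,000,000 registered documents, 32-bit document numbers), the storage cost of the two formats can be compared with simple arithmetic. The sketch below only works out the break-even point implied by those two numbers; the threshold the embodiment actually uses is not stated in this passage, and the variable names are illustrative.

```python
NUM_DOCS = 1_000_000      # registered documents (from the text)
DOC_NUMBER_BITS = 32      # bits per stored document number (from the text)

# A bit list needs one bit per registered document, whatever the frequency.
bit_list_bytes = NUM_DOCS // 8

# A document number list needs 4 bytes per document in which the component appears.
doc_number_bytes = DOC_NUMBER_BITS // 8


def doc_number_list_bytes(appearing_docs):
    """Size of a document number list for a component appearing in that many documents."""
    return appearing_docs * doc_number_bytes


# Document count at which both formats occupy the same space.
break_even_docs = bit_list_bytes // doc_number_bytes
```

Under these figures a bit list occupies 125,000 bytes, so the document number list is the smaller format for any component that appears in fewer than 31,250 documents.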

【０２５５】まず、サーチ処理から先に説明する。The search process will be described first.

【０２５６】本実施例では、検索タームから抽出された
連接文字成分に対する連接文字成分表を取得するための
管理テーブルとして、文字テーブルとファイルポインタ
テーブルを用いる。図４７は文字テーブルとファイルポ
インタテーブルを用いた検索処理の概要を示す図であ
る。In this embodiment, a character table and a file pointer table are used as a management table for acquiring the concatenated character component table for the concatenated character components extracted from the search term. FIG. 47 is a diagram showing an outline of search processing using the character table and the file pointer table.

【０２５７】前述したように、連接文字成分表サーチ時
には連接文字成分表サーチプログラム２１１は、まず始
めに逐次連接文字抽出プログラム４０１を起動し、検索
ターム中から連続する２文字の文字列を逐次連接文字と
して抽出する。例えば、“動画像"という文字列が検索
タームに指定された場合には、“動画"および“画像"を
逐次連接文字として抽出する。As described above, when searching the concatenated character component table, the concatenated character component table search program 211 first starts the sequential concatenated character extraction program 401 and extracts strings of two consecutive characters from the search term as sequential concatenated characters. For example, when the string "動画像" (moving image) is specified as the search term, "動画" and "画像" are extracted as sequential concatenated characters.

[0258] Next, the skip concatenated character extraction program 212 is started, which extracts two-character strings formed by pairing characters that are one character apart in the search term (that is, skipping one character) as skip concatenated characters. For example, when “動画像” is specified as the search term, “動像” is extracted as a skip concatenated character.
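The two extraction steps above can be illustrated with a minimal Python sketch. The function names `sequential_bigrams` and `skip_bigrams` are hypothetical and do not come from the patent; the sketch only assumes that the search term is handled as a sequence of characters.

```python
def sequential_bigrams(term: str) -> list[str]:
    """Return every pair of adjacent characters, in order of appearance."""
    return [term[i:i + 2] for i in range(len(term) - 1)]


def skip_bigrams(term: str, skip: int = 1) -> list[str]:
    """Return pairs of characters separated by `skip` intervening characters."""
    step = skip + 1
    return [term[i] + term[i + step] for i in range(len(term) - step)]


print(sequential_bigrams("動画像"))  # ['動画', '画像']
print(skip_bigrams("動画像"))        # ['動像']
```

A term shorter than two characters yields no sequential pairs, and a term shorter than three characters yields no skip pairs.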

[0259] For simplicity, the following description mainly covers registration and search for the sequential concatenated character component table. Search using skip concatenated characters can be realized by the same processing.

[0260] Next, the concatenated character component table acquisition program 604 refers to the record in the character table that corresponds to the character code of the first character of each concatenated character extracted from the search term, and thereby obtains pointer information into the file pointer table. For example, for the sequential concatenated character “動画”, it refers to the character table record corresponding to the character code of the first character, “動”, and obtains 560 as the pointer information into the file pointer table.

[0261] Next, the file pointer table is referenced using the pointer information obtained from the character table, to obtain the identifier of the file in which the concatenated character component table for the concatenated character is stored (hereinafter, the file ID) and the position within that file (the storage position measured from the beginning of the file; hereinafter, the offset). In the example of FIG. 49, using the value 560 obtained from the character table, the records from byte 560 of the file pointer table onward are examined in order to find the record whose second character is “画”. This yields file ID 1 and offset 1,034 as the information for accessing the concatenated character component table for “動画”. In the file pointer table, the first record for each first character is a record whose second character is 0; it holds the file ID and offset for accessing the single-character component table for that first character alone. In the example of this figure, byte 580 of the file pointer table holds the file ID and offset of the character component table for the single character “動”. With this arrangement, if the next record whose second character is 0 is reached without “画” having matched as the second character, it can be concluded that the concatenated character “動画” does not appear in the text data.
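The two-stage lookup through the character table and the file pointer table might be sketched as follows. This is an illustrative model only: it uses record indices rather than byte offsets, represents the "second character is 0" marker with an empty string, and the names `PointerRecord` and `find_component_table` are hypothetical. The locations given for the single-character records are made-up values.

```python
from typing import NamedTuple, Optional

SENTINEL = ""  # stands in for the "second character = 0" marker


class PointerRecord(NamedTuple):
    second_char: str  # SENTINEL for the single-character record
    file_id: int
    offset: int


def find_component_table(char_table, pointer_table, bigram) -> Optional[tuple[int, int]]:
    """Return (file ID, offset) for a two-character string, or None if absent."""
    first, second = bigram[0], bigram[1]
    start = char_table.get(first)
    if start is None:
        return None  # the first character never appears
    # The record at `start` is the single-character record; pairs follow it.
    for record in pointer_table[start + 1:]:
        if record.second_char == SENTINEL:
            return None  # reached the next first character's group
        if record.second_char == second:
            return record.file_id, record.offset
    return None


pointer_table = [
    PointerRecord(SENTINEL, 1, 0),        # single-character table for 動 (made-up location)
    PointerRecord("画", 1, 875_000),       # 動画
    PointerRecord(SENTINEL, 1, 125_000),  # single-character table for 画 (made-up location)
    PointerRecord("像", 2, 1_084),         # 画像
]
char_table = {"動": 0, "画": 2}

print(find_component_table(char_table, pointer_table, "動画"))  # (1, 875000)
print(find_component_table(char_table, pointer_table, "動像"))  # None
```

Hitting the next sentinel record ends the scan early, which is the role the text assigns to the record whose second character is 0.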

[0262] Next, the concatenated character component table for the concatenated character component is obtained using the file ID and offset read from the file pointer table. In this embodiment, the file ID for bit lists and the file ID for document number lists are defined in advance, so that the value of the file ID indicates whether the concatenated character component table for a given concatenated character is stored as a bit list or as a document number list. In the example of FIG. 47, the file with file ID 1 holds bit lists and the file with file ID 2 holds document number lists. When the file ID for the specified concatenated character is 1, a bit list as long as the number of registered documents is read from the given offset in file 1. When the file ID is 2, the number of documents in which the concatenated character appears (the number of appearing documents) is first read from the given offset in file 2, and then that many document numbers are read, giving the list of numbers of the documents in which the concatenated character appears. The document number list is then converted into bit list form to obtain the concatenated character component table for the concatenated character.

[0263] In the example of FIG. 47, file ID 1 and offset 875,000 are obtained as the access information for the concatenated character component table of “動画”, so the 125-kbyte (= 1,000,000-bit) bit string “0111010101....”, covering 1,000,000 documents, is read starting at byte 875k of file 1. Starting from the first bit, each bit of this string corresponds to a document number, and a ‘1’ marks a document containing “動画”. For the concatenated character “画像”, file ID 2 and offset 1,084 are obtained, so byte 1,084 from the beginning of file 2 is referenced and 34 is read as the number of documents containing “画像”. Reading 34 document numbers from the document number list then shows that “画像” appears in documents 783, 1038, .... From this result, the list is converted into a bit string by setting ‘1’ at the positions in the bit list that correspond to document numbers 783, 1038, ....
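Reading either representation and normalizing it to a bit list could look like the sketch below. Several details are assumptions made for illustration and are not stated in the text: document numbers start at 1, the first bit of each byte is the most significant bit, the count and the document numbers are big-endian 32-bit integers, and the files are modeled as in-memory byte strings. The names `load_component_table` and `doc_numbers_to_bits` are hypothetical.

```python
import struct

BIT_LIST_FILE_ID = 1    # file holding bit lists
DOC_LIST_FILE_ID = 2    # file holding document number lists


def doc_numbers_to_bits(doc_numbers, n_docs: int) -> bytes:
    """Set one bit per listed document (1-based, most significant bit first)."""
    bits = bytearray((n_docs + 7) // 8)
    for doc in doc_numbers:
        index = doc - 1
        bits[index // 8] |= 0x80 >> (index % 8)
    return bytes(bits)


def load_component_table(files, file_id: int, offset: int, n_docs: int) -> bytes:
    """Return the component table as a bit list, whichever way it is stored."""
    data = files[file_id]
    if file_id == BIT_LIST_FILE_ID:
        return data[offset:offset + (n_docs + 7) // 8]
    # Document number list: a 32-bit count followed by that many 32-bit numbers.
    (count,) = struct.unpack_from(">I", data, offset)
    doc_numbers = struct.unpack_from(f">{count}I", data, offset + 4)
    return doc_numbers_to_bits(doc_numbers, n_docs)


# Tiny example with 16 documents instead of 1,000,000.
files = {
    BIT_LIST_FILE_ID: bytes([0b01110101, 0b01000000]),
    DOC_LIST_FILE_ID: struct.pack(">3I", 2, 3, 10),  # count = 2, documents 3 and 10
}
print(load_component_table(files, 2, 0, 16).hex())  # 2040
```

Because both branches return the same bit-list form, the later AND step does not need to know how each table was stored.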

[0264] Finally, the bit AND program 213 takes the logical product (bitwise AND) of all of these bit strings, and the documents whose resulting bit is ‘1’ are obtained as the result of the concatenated character component table search.
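As a rough illustration of this final step, the sketch below ANDs a set of equal-length bit lists and lists the surviving document numbers. It assumes 1-based document numbers with the most significant bit of each byte first; the function names are hypothetical.

```python
def and_bit_lists(bit_lists: list[bytes]) -> bytes:
    """Bitwise AND of equal-length bit lists (at least one list is required)."""
    n_bytes = len(bit_lists[0])
    combined = int.from_bytes(bit_lists[0], "big")
    for bits in bit_lists[1:]:
        combined &= int.from_bytes(bits, "big")
    return combined.to_bytes(n_bytes, "big")


def matching_documents(bits: bytes) -> list[int]:
    """Document numbers (1-based) whose bit is set."""
    return [i + 1 for i in range(len(bits) * 8) if bits[i // 8] & (0x80 >> (i % 8))]


result = and_bit_lists([bytes([0b01110101]), bytes([0b00110001])])
print(matching_documents(result))  # [3, 4, 8]
```

Only documents that contain every extracted concatenated character survive the AND, so they are the candidates passed on to the next search stage.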

[0265] This concludes the description of the concatenated character component table search processing in this embodiment.

[0266] The registration processing for the concatenated character component table, and the method of creating the character table and file pointer table, that make this search possible are described next. In this embodiment, the total number of registered documents is 1,000,000 and each document number is stored in the document number list as 32-bit binary data, so 31,250 documents (1,000,000 bits ÷ 32 bits per document) is used as the character appearance frequency threshold.
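The threshold is the break-even point at which a document number list would take as much space as a bit list. The short sketch below reproduces that arithmetic; the function names are hypothetical, and the byte counts ignore any per-list header such as a stored count.

```python
def frequency_threshold(n_docs: int, doc_number_bits: int = 32) -> int:
    """Number of appearing documents at which both representations cost the same."""
    return n_docs // doc_number_bits


def bit_list_bytes(n_docs: int) -> int:
    """One bit per registered document."""
    return (n_docs + 7) // 8


def doc_list_bytes(appearing_docs: int, doc_number_bits: int = 32) -> int:
    """One fixed-width document number per appearing document."""
    return appearing_docs * doc_number_bits // 8


print(frequency_threshold(1_000_000))  # 31250
print(bit_list_bytes(1_000_000))       # 125000
print(doc_list_bytes(31_250))          # 125000
```

Above 31,250 appearing documents the fixed 125,000-byte bit list is the smaller representation; below it the document number list is.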

[0267] First, the method of calculating the number of appearing documents for each concatenated character component is described. The character appearance frequency table shown in FIG. 48 is used for this calculation. Initially, every entry of the character appearance frequency table is set to 0. Next, text data is read from the text 103 into the work area 217 one document at a time, and every string of two consecutive characters is extracted from the text data. For each extracted concatenated character, 1 is added to the entry of the character appearance frequency table corresponding to its character codes, which gives the number of documents in which each concatenated character component appears (the number of appearing documents). For example, for the text data “自動画質調整機能を備えた画像処理装置・・・” ("an image processing apparatus with an automatic image quality adjustment function ..."), the two-character sequences “自動”, “動画”, “画質”, ... are extracted and 1 is added to the corresponding entries. In this way the number of documents in which each concatenated character component appears is calculated over all registered documents.
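The counting pass can be sketched with a dictionary-based counter in place of the fixed table indexed by character code. One assumption is made explicit here: each two-character string is counted at most once per document, so that the totals are document counts rather than occurrence counts. The function name `document_frequencies` is hypothetical.

```python
from collections import Counter


def document_frequencies(documents: list[str]) -> Counter:
    """Count, for each two-character string, the documents it appears in."""
    frequencies = Counter()
    for text in documents:
        # A set removes repeats so that a document is counted only once.
        bigrams = {text[i:i + 2] for i in range(len(text) - 1)}
        frequencies.update(bigrams)
    return frequencies


docs = ["自動画質調整", "動画像", "画像処理"]
freq = document_frequencies(docs)
print(freq["動画"], freq["画像"], freq["動動"])  # 2 2 0
```

A string that never appears, such as “動動” here, keeps a count of 0 and is therefore skipped when storage is allocated.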

[0268] Next, the concatenated characters that actually appear in the text data are identified by extracting those whose value in the character appearance frequency table just created is not 0.

[0269] In the example shown in FIG. 50, the value of the character appearance frequency table for “動動” is 0, so it is not extracted as a concatenated character appearing in the text data.

[0270] For “動画”, the value in the character appearance frequency table is not 0, so it is extracted as a concatenated character appearing in the text data. Because its number of appearing documents is larger than the threshold of 31,250, a 125-kbyte (= 1,000,000-bit) area, corresponding to the number of registered documents, is allocated in the bit list. In the file pointer table, “画” is written as the second character, “1” (denoting the bit list) as the file ID, and 875,000, the starting offset of the area allocated in the bit list, as the offset. When “0” is written as the second character, the offset of that record within the file pointer table is written into the character table entry corresponding to the first character of the concatenated character component.

[0271] Likewise, the value for “画像” in the character appearance frequency table is not 0, so it is extracted as a concatenated character appearing in the text data. Because its number of appearing documents is smaller than the threshold of 31,250, the number of appearing documents, “34”, is written to the document number list, and a 136-byte area (4 bytes per document × 34 documents) corresponding to that number is allocated. In the file pointer table, “像” is written as the second character, “2” (denoting the document number list) as the file ID, and 1,084, the starting offset of the area allocated in the document number list, as the offset.
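The allocation step might be sketched as below. This is a simplified model under stated assumptions: areas are laid out back to back starting at offset 0 in each file, the stored count occupies 4 bytes ahead of the document numbers, and a string whose count equals the threshold exactly goes to the document number list (the text only covers the larger and smaller cases). The function name `allocate` and the frequency values are hypothetical.

```python
def allocate(frequencies: dict[str, int], n_docs: int, doc_number_bytes: int = 4):
    """Assign (file ID, offset) to every two-character string that appears."""
    threshold = n_docs // (doc_number_bytes * 8)
    bit_list_bytes = (n_docs + 7) // 8
    next_offset = {1: 0, 2: 0}  # file 1: bit lists, file 2: document number lists
    layout = {}
    for bigram in sorted(frequencies):
        count = frequencies[bigram]
        if count == 0:
            continue  # never appears, so nothing is allocated
        if count > threshold:
            file_id, size = 1, bit_list_bytes
        else:
            file_id, size = 2, doc_number_bytes * (1 + count)  # count field + numbers
        layout[bigram] = (file_id, next_offset[file_id])
        next_offset[file_id] += size
    return layout, next_offset


layout, used = allocate({"動画": 40_000, "画像": 34, "動動": 0}, n_docs=1_000_000)
print(layout)  # {'動画': (1, 0), '画像': (2, 0)}
print(used)    # {1: 125000, 2: 140}
```

The resulting layout is what the file pointer table records for each second character.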

[0272] As described above, in this embodiment the storage areas in the bit list and the document number list are allocated in advance for each concatenated character, in preparation for registering the concatenated character component table.

[0273] Next, the registration processing for the concatenated character component table is described for the case in which the character string “動画像” appears in the text data of document number 783.

[0274] First, text data is read from the text 103 into the work area 217 one document at a time, and every string of two consecutive characters appearing in the text data is extracted, giving “動画” and “画像” as sequential concatenated characters. Next, as in the concatenated character component table search, the character table and the file pointer table are referenced to obtain the file ID and offset at which the table for each concatenated character is stored. For the sequential concatenated character “動画”, for example, file ID “1” and offset 875,000 are obtained. A bit string for 1,000,000 documents, that is, a 125-kbyte bit string, is then read starting at byte 875k of the bit list (file ID “1”), and ‘1’ is set at the bit position corresponding to document number 783 to record that “動画” appears in document 783. Similarly, for “画像”, the character table and file pointer table give file ID “2” and offset 1,084. Referring to byte 1,084 of the document number list (file ID “2”) shows that the number of appearing documents for “画像” is 34. The following 34 document numbers in the document number list are then read, and 783 is written as binary data into the first entry that contains “0”, recording that “画像” appears in document 783.
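The two registration paths can be sketched as in-place updates on preallocated buffers. The layout details are assumptions for illustration: 1-based document numbers, most significant bit first, a big-endian 32-bit count followed by 32-bit entries, and 0 used as the "empty entry" marker (so 0 cannot be a valid document number). The function names are hypothetical.

```python
import struct


def register_in_bit_list(buf: bytearray, offset: int, doc: int) -> None:
    """Set the bit for `doc` in the bit list that starts at `offset`."""
    index = doc - 1
    buf[offset + index // 8] |= 0x80 >> (index % 8)


def register_in_doc_list(buf: bytearray, offset: int, doc: int) -> None:
    """Write `doc` into the first empty (zero) entry of a document number list."""
    (count,) = struct.unpack_from(">I", buf, offset)
    for i in range(count):
        position = offset + 4 + 4 * i
        (value,) = struct.unpack_from(">I", buf, position)
        if value == doc:
            return  # already registered
        if value == 0:
            struct.pack_into(">I", buf, position, doc)
            return
    raise ValueError("no free entry left in the document number list")


bit_list = bytearray(125_000)                    # room for 1,000,000 documents
register_in_bit_list(bit_list, 0, 783)
doc_list = bytearray(struct.pack(">4I", 3, 0, 0, 0))  # count = 3, three empty entries
register_in_doc_list(doc_list, 0, 783)
register_in_doc_list(doc_list, 0, 1038)
print(struct.unpack(">4I", doc_list))  # (3, 783, 1038, 0)
```

Because the area was sized from the precomputed document count, the list cannot overflow as long as the registered text matches the text used for counting.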

[0275] The above are the details of the registration processing for the concatenated character component table in this embodiment.

[0276] Thus, in the concatenated character component table of this embodiment, appearance information is stored as a bit list for concatenated characters with a high appearance frequency and as a document number list for concatenated characters with a low appearance frequency. This makes it possible to realize, with a practical storage capacity, a concatenated character component table that is free of the noise caused by hashing.

[0277] In this embodiment, as preprocessing for registration, the areas of the bit list and the document number list are allocated by referring to the text data of all registered documents and calculating the appearance frequency of each concatenated character component. Alternatively, statistical information can be used in advance: a bit list is allocated to concatenated characters judged to have a high appearance frequency, and an area sized for the expected number of appearing documents is allocated to those judged to have a low appearance frequency. This makes it unnecessary to run the appearance frequency calculation program.

[0278] Furthermore, this embodiment has described applying, to the document search method of the fifth embodiment of the present invention, the scheme of storing appearance information as a bit list for concatenated characters with a high appearance frequency and as a document number list for those with a low appearance frequency. It will be clear, however, that the scheme is equally applicable to all of the embodiments described so far.

[0279] Finally, this embodiment has described registering the concatenated character component table for 1,000,000 documents in a single registration pass. Alternatively, bit lists and document number lists may be created for batches of, for example, 10,000 documents of text data and merged afterwards to produce the concatenated character component table for all 1,000,000 documents. In that case the tables needed to build the concatenated character component table are small. As a result, even on a computer with little memory the tables can be held in memory while documents are registered, which has the advantage of shortening registration time.
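The batch-and-merge variant can be illustrated in a simplified form. The sketch assumes that each batch has already been reduced to a mapping from two-character string to the global document numbers in which it appears; the text does not specify the intermediate format, so this representation and the name `merge_batches` are hypothetical. Choosing between a bit list and a document number list would then be done on the merged counts.

```python
def merge_batches(batches: list[dict[str, list[int]]]) -> dict[str, list[int]]:
    """Combine per-batch appearance lists into one sorted list per string."""
    merged: dict[str, list[int]] = {}
    for batch in batches:
        for bigram, doc_numbers in batch.items():
            merged.setdefault(bigram, []).extend(doc_numbers)
    # Remove duplicates and sort so each list is ready to be written out.
    return {bigram: sorted(set(docs)) for bigram, docs in merged.items()}


batch_1 = {"動画": [1, 3], "画像": [3]}
batch_2 = {"動画": [10_001], "像処": [10_002]}
print(merge_batches([batch_1, batch_2]))
# {'動画': [1, 3, 10001], '画像': [3], '像処': [10002]}
```

Only one batch's tables need to be resident while that batch is processed, which is the memory saving the paragraph describes.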

[0280]

[Effects of the Invention] According to the present invention, a concatenated character component table that is free of hashing noise in hierarchical pre-search can be realized with a practical storage capacity. Furthermore, even when the search term is a character string composed of phonetic characters, as in English, or a character string composed of a combination of words, the narrowing rate of the concatenated character component table can be further improved and unnecessary condensed text searches can be largely eliminated, so that full-text search can be performed with a practical response time even on a large-scale document database.

[Brief description of drawings]

FIG. 1 is a diagram showing the configuration of the first embodiment of the present invention.

FIG. 2 is an explanatory diagram of a conventional example.

FIG. 3 is a diagram showing the operation of the first document search method.

FIG. 4 is a PAD diagram showing the document registration procedure in the first embodiment of the present invention.

FIG. 5 is a PAD diagram showing the processing procedure of the concatenated character component table creation and registration program in the first embodiment of the present invention.

FIG. 6 is a PAD diagram showing the control procedure of hierarchical search.

FIG. 7 is a diagram showing the method of extracting concatenated character components in the first embodiment of the present invention.

FIG. 8 is a diagram showing the method of creating the concatenated character component table in the first embodiment of the present invention.

FIG. 9 is a PAD diagram showing the method of searching the concatenated character component table in the first embodiment of the present invention.

FIG. 10 is a diagram showing the method of searching the concatenated character component table in the first embodiment of the present invention.

FIG. 11 is a diagram showing the structure of the concatenated character component table creation and registration program in the second embodiment of the present invention.

FIG. 12 is a PAD diagram showing the processing procedure of the concatenated character component table creation and registration program in the second embodiment of the present invention.

FIG. 13 is a diagram showing the method of creating the concatenated character component table in the second embodiment of the present invention.

FIG. 14 is a diagram showing the structure of the concatenated character component table creation and registration program in the third embodiment of the present invention.

FIG. 15 is a diagram showing the structure of the concatenated character component table search program in the third embodiment of the present invention.

FIG. 16 is a PAD diagram showing the processing procedure of the concatenated character component table creation and registration program in the third embodiment of the present invention.

FIG. 17 is a PAD diagram showing the processing procedure of the concatenated character component table search program in the third embodiment of the present invention.

FIG. 18 is a diagram showing the method of creating the concatenated character component table in the third embodiment of the present invention.

FIG. 19 is a diagram showing the method of searching the concatenated character component table in the third embodiment of the present invention.

FIG. 20 is a diagram showing the structure of the concatenated character component table creation and registration program in the fourth embodiment of the present invention.

FIG. 21 is a diagram showing the structure of the concatenated character component table search program in the fourth embodiment of the present invention.

FIG. 22 is a PAD diagram showing the processing procedure of the concatenated character component table creation and registration program in the fourth embodiment of the present invention.

FIG. 23 is a PAD diagram showing the processing procedure of the concatenated character component table search program in the fourth embodiment of the present invention.

FIG. 24 is a diagram showing the method of creating the concatenated character component table in the fourth embodiment of the present invention.

FIG. 25 is a diagram showing the method of searching the concatenated character component table in the fourth embodiment of the present invention.

FIG. 26 is a diagram showing the method of creating the concatenated character component table in the fifth embodiment of the present invention.

FIG. 27 is a diagram showing the method of searching the concatenated character component table in the fifth embodiment of the present invention.

FIG. 28 is a diagram showing the structure of the concatenated character component table creation and registration program in the sixth embodiment of the present invention.

FIG. 29 is a diagram showing the structure of the concatenated character component table search program in the sixth embodiment of the present invention.

FIG. 30 is a PAD diagram showing the processing procedure of the concatenated character component table creation and registration program in the sixth embodiment of the present invention.

FIG. 31 is a PAD diagram showing the processing procedure of the concatenated character component table search program in the sixth embodiment of the present invention.

FIG. 32 is a diagram showing the method of creating the concatenated character component table in the sixth embodiment of the present invention.

FIG. 33 is a diagram showing the method of searching the concatenated character component table in the sixth embodiment of the present invention.

FIG. 34 is a diagram showing the structure of the concatenated character component table creation and registration program in the seventh embodiment of the present invention.

FIG. 35 is a diagram showing the structure of the concatenated character component table search program in the seventh embodiment of the present invention.

FIG. 36 is a PAD diagram showing the processing procedure of the concatenated character component table creation and registration program in the seventh embodiment of the present invention.

FIG. 37 is a PAD diagram showing the processing procedure of the concatenated character component table search program in the seventh embodiment of the present invention.

FIG. 38 is a diagram showing the method of creating the concatenated character component table in the seventh embodiment of the present invention.

FIG. 39 is a diagram showing the method of searching the concatenated character component table in the seventh embodiment of the present invention.

FIG. 40 is a diagram showing the structure of the concatenated character component table in the eighth embodiment of the present invention.

FIG. 41 is a diagram showing the structure of the concatenated character component table creation and registration program in the eighth embodiment of the present invention.

FIG. 42 is a diagram showing the structure of the concatenated character component table search program in the eighth embodiment of the present invention.

FIG. 43 is a PAD diagram showing the processing procedure of the concatenated character component table creation and registration program in the eighth embodiment of the present invention.

FIG. 44 is a PAD diagram showing the processing procedure of the concatenated character component table registration program in the eighth embodiment of the present invention.

FIG. 45 is a PAD diagram showing the processing procedure of the concatenated character component table search program in the eighth embodiment of the present invention.

FIG. 46 is a PAD diagram showing the processing procedure of the concatenated character component table acquisition program in the eighth embodiment of the present invention.

FIG. 47 is a diagram showing the method of searching the concatenated character component table in the eighth embodiment of the present invention.

FIG. 48 is a diagram showing the method of creating the concatenated character component table in the eighth embodiment of the present invention.

[Explanation of symbols]

100 ... Display
101 ... Keyboard
102 ... CPU
103 ... Text
104 ... Condensed text
105 ... Concatenated character component table
106 ... Floppy disk drive (FDD)
107 ... Floppy disk
108 ... Bus
110 ... Magnetic disk
200 ... Main memory

Continuation of the front page

(72) Inventor: Natsuko Mizutani, c/o Systems Development Laboratory, Hitachi, Ltd., 1099 Ozenji, Asao-ku, Kawasaki-shi, Kanagawa
(72) Inventor: Hisamitsu Kawaguchi, c/o Systems Development Laboratory, Hitachi, Ltd., 1099 Ozenji, Asao-ku, Kawasaki-shi, Kanagawa
(72) Inventor: Kanji Kato, c/o Systems Development Laboratory, Hitachi, Ltd., 1099 Ozenji, Asao-ku, Kawasaki-shi, Kanagawa
(72) Inventor: Satoshi Asakawa, c/o Software Development Division, Hitachi, Ltd., 5030 Totsuka-cho, Totsuka-ku, Yokohama-shi, Kanagawa

(56) References: JP-A-7-319920 (JP, A); JP-A-5-174064 (JP, A); JP-A-5-81321 (JP, A); JP-A-4-274557 (JP, A)

(58) Fields surveyed (Int. Cl.7, DB name): G06F 17/30; JICST file (JOIS)

Claims

(57) [Claims]

1. A document search method in which partial character strings are extracted in a predetermined format from each document stored in advance; a concatenated character component table indicating whether or not each partial character string exists in each document is created; search partial character strings are extracted in a predetermined format from a search term input to search for a desired document among the documents; the concatenated character component table corresponding to the extracted search partial character strings is referred to in order to find the documents containing partial character strings that match each search partial character string constituting the search term; and documents unrelated to the search term are thereby filtered out of the search target, wherein:
a character string of a predetermined n characters (n is an integer of 2 or more) is extracted from the document every predetermined m characters (m is an integer of 1 or more) as the partial character string;
a character string of a predetermined n characters (n is an integer of 2 or more) is extracted from the search term every predetermined m characters (m is an integer of 1 or more) as the search partial character string;
the concatenated character component table comprises a bit list which, for a concatenated character component whose appearance frequency is higher than a predetermined threshold value, registers the appearance information of the character string by writing 1 in the bit positions corresponding to the document numbers in which it appears, and a document number list which, for a concatenated character component whose appearance frequency is lower than the predetermined threshold value, stores the document numbers in which it appears as a list of binary data;
the number of documents in which each concatenated character component appears is calculated in advance;
when the calculated number of documents is larger than the predetermined threshold value, the appearance information of the concatenated character component is registered by writing '1' in the bit positions of the bit list corresponding to the document numbers in which the component appears;
when the calculated number of documents is smaller than the predetermined threshold value, the appearance information of the concatenated character component is stored by writing the document numbers in which the component appears to the document number list as a list of binary data; and
for each concatenated character component extracted from the search term, the corresponding bit list or document number list is read and, in the case of a document number list, converted into a bit list, whereby the concatenated character component table is obtained.
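The two storage forms recited in claim 1 can be illustrated with the following minimal Python sketch. It is not the patented implementation. The names build_table, as_bits and candidates are hypothetical, a Python integer stands in for the bit list, a sorted Python list stands in for the binary document number list, and plain contiguous two-character strings are used as a stand-in for the claimed extraction format. The claim does not say what happens when the document count equals the threshold; the sketch assumes the document number list in that case.

```python
from collections import defaultdict

def bigrams(text):
    """Contiguous 2-character strings (a stand-in for the claimed extraction format)."""
    return {text[i:i + 2] for i in range(len(text) - 1)}

def build_table(docs, threshold):
    """Map each component to ('bits', int) or ('ids', sorted document numbers)."""
    postings = defaultdict(set)
    for doc_no, text in enumerate(docs):
        for comp in bigrams(text):
            postings[comp].add(doc_no)
    table = {}
    for comp, doc_nos in postings.items():
        if len(doc_nos) > threshold:
            # Frequent component: one bit per document number.
            bits = 0
            for d in doc_nos:
                bits |= 1 << d
            table[comp] = ("bits", bits)
        else:
            # Rare component (assumption: ties go here): list of document numbers.
            table[comp] = ("ids", sorted(doc_nos))
    return table

def as_bits(entry):
    """Return the entry as a bit list, converting a document number list if needed."""
    kind, value = entry
    if kind == "bits":
        return value
    bits = 0
    for d in value:
        bits |= 1 << d
    return bits

def candidates(table, term, n_docs):
    """Document numbers that contain every component of the search term."""
    result = (1 << n_docs) - 1  # start with all documents as candidates
    for comp in bigrams(term):
        if comp not in table:
            return []  # a component that never appears rules out every document
        result &= as_bits(table[comp])
    return [d for d in range(n_docs) if (result >> d) & 1]

docs = ["full text search", "text database", "document search method", "bit list"]
table = build_table(docs, threshold=1)
print(candidates(table, "text", len(docs)))
print(candidates(table, "search", len(docs)))
```

The surviving documents are only candidates: a document can contain every component of the term without containing the term itself, so the filter narrows the search target rather than confirming matches.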

2. The document search method according to claim 1, wherein, when the partial character strings are extracted from the document, words are cut out from the document, and a character string of a predetermined n characters (n is an integer of 2 or more) is extracted from each cut-out word every predetermined m characters (m is an integer of 1 or more) as the partial character string.

3. The document search method according to claim 2, wherein a contiguous character string of a predetermined n characters (n is an integer of 2 or more) is extracted from the cut-out word as the partial character string, and a contiguous character string of a predetermined n characters (n is an integer of 2 or more) is extracted from the search term as the search partial character string.

4. The document search method according to claim 2, wherein a predetermined code is added before and after the cut-out word, and a character string of a predetermined n characters (n is an integer of 2 or more) is extracted every predetermined m characters (m is an integer of 1 or more) from the word to which the predetermined code has been added; and a predetermined code is added before and after the search term, and a character string of a predetermined n characters (n is an integer of 2 or more) is extracted every predetermined m characters (m is an integer of 1 or more) from the search term to which the predetermined code has been added, as the search partial character string.
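The effect of adding a code before and after each word, as in claim 4, can be illustrated with the following minimal Python sketch. The claim only recites "a predetermined code"; the character "$" used here is a hypothetical choice, the function names are illustrative, and contiguous two-character strings again stand in for the claimed extraction format.

```python
BOUNDARY = "$"  # hypothetical boundary code; the claim does not specify one

def padded_bigrams(word, boundary=BOUNDARY):
    """Contiguous 2-character strings of the word with the code added at both ends."""
    padded = boundary + word + boundary
    return {padded[i:i + 2] for i in range(len(padded) - 1)}

def may_contain_word(document_words, term):
    """True if every component of the padded term occurs among the document's components."""
    doc_components = set()
    for word in document_words:
        doc_components |= padded_bigrams(word)
    return padded_bigrams(term) <= doc_components

# "cat" occurs inside "concatenate", but the component "t$" (word end) does not,
# so the document is filtered out for the search term "cat".
print(may_contain_word(["concatenate"], "cat"))
print(may_contain_word(["cat", "dog"], "cat"))
```

Without the boundary code, the components "ca" and "at" of the term would both be found in "concatenate" and the document would pass the filter.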

5. The document search method according to claim 1, wherein a contiguous string of a predetermined i characters (i is an integer of 2 or more) is extracted from the document as the partial character string and, in addition, a character string of a predetermined n characters (n is an integer of 2 or more) is extracted from the document every predetermined m characters (m is an integer of 1 or more) as the partial character string; and a contiguous string of a predetermined i characters (i is an integer of 2 or more) is extracted from the search term as the search partial character string and, in addition, a character string of a predetermined n characters (n is an integer of 2 or more) is extracted from the search term every predetermined m characters (m is an integer of 1 or more) as the search partial character string.
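The combination recited in claim 5 can be illustrated with the following minimal Python sketch. It rests on an interpretation that the translated claim text does not settle: "every m characters" is assumed here to mean that the n characters are taken with m characters skipped between them, so that m = 1 and n = 2 applied to "ABCDE" gives "AC", "BD" and "CE". The function names and default parameter values are hypothetical.

```python
def contiguous(text, i=2):
    """Contiguous i-character strings."""
    return {text[k:k + i] for k in range(len(text) - i + 1)}

def spaced(text, n=2, m=1):
    """n-character strings whose characters are m characters apart (assumed reading)."""
    step = m + 1
    span = (n - 1) * step + 1  # number of text positions one extracted string covers
    return {text[k:k + span:step] for k in range(len(text) - span + 1)}

def passes_filter(doc, term, use_spaced):
    """True if the document contains every component extracted from the term."""
    ok = contiguous(term) <= contiguous(doc)
    if use_spaced:
        ok = ok and spaced(term) <= spaced(doc)
    return ok

# "ABxBC" contains "AB" and "BC" but not the term "ABC".
print(passes_filter("ABxBC", "ABC", use_spaced=False))  # contiguous strings alone
print(passes_filter("ABxBC", "ABC", use_spaced=True))   # with spaced strings added
```

Under this reading, the spaced strings carry information about characters that are not adjacent, which lets the filter reject documents that contain all the contiguous strings of the term in separate places.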

6. A document search device having means for extracting partial character strings in a predetermined format from each document stored in advance and creating a concatenated character component table indicating whether or not each partial character string exists in each document, and means for extracting search partial character strings in a predetermined format from a search term input to search for a desired document among the documents, the device referring to the concatenated character component table corresponding to the extracted search partial character strings to find the documents containing partial character strings that match each search partial character string constituting the search term, wherein:
the concatenated character component table comprises a bit list which, for a concatenated character component whose appearance frequency is higher than a predetermined threshold value, registers the appearance information of the character string by writing 1 in the bit positions corresponding to the document numbers in which it appears, and a document number list which, for a concatenated character component whose appearance frequency is lower than the predetermined threshold value, stores the document numbers in which it appears as a list of binary data;
and the device comprises:
means for extracting from the document, every predetermined m characters (m is an integer of 1 or more), a character string of a predetermined n characters (n is an integer of 2 or more) as the partial character string;
means for extracting from the search term, every predetermined m characters (m is an integer of 1 or more), a character string of a predetermined n characters (n is an integer of 2 or more) as the search partial character string;
means for calculating in advance the number of documents in which each concatenated character component appearing in the documents appears;
means for registering, when the calculated number of documents is larger than the predetermined threshold value, the appearance information of the concatenated character component by writing '1' in the bit positions of the bit list corresponding to the document numbers in which the component appears;
means for registering, when the calculated number of documents is smaller than the predetermined threshold value, the appearance information of the concatenated character component by writing the document numbers in which the component appears to the document number list as a list of binary data; and
means for obtaining the concatenated character component table by reading, for each concatenated character component extracted from the search term, the corresponding bit list or document number list and, in the case of a document number list, converting it into a bit list.