JPH08329116A - Method for retrieving structured document

Info

Publication number: JPH08329116A
Application number: JP7161397A
Authority: JP
Inventors: Atsushi Hatakeyama; 敦畠山; Katsumi Tada; 勝己多田; Kanji Kato; 寛次加藤; Satoshi Asakawa; 悟志浅川
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1995-06-05
Filing date: 1995-06-05
Publication date: 1996-12-13
Anticipated expiration: 2019-04-12
Also published as: JP3518933B2

Abstract

PURPOSE: To provide a character component table that reduces search noise at a practical capacity, and to realize efficient structure-specified search in a large-scale information retrieval system for structured documents. CONSTITUTION: For each document to be registered, a character component table describing the occurrence of characters in the text data is created, the document structure is recognized according to predetermined document structure names, and the text data is divided by structure. For each character that appears, a '1' is set at the bit position corresponding to each document structure in which the character appears, and the resulting structure bit string, which records the document structures in which each character appears, is stored. When a user specifies 'extreme work' as the search string and 'title of the invention', 'claims', or 'effects' as the target structures, the character component table is searched for 'extreme work', yielding documents 1, 7, 15, 38, ... as the search result. A bitwise AND between the specified-document-structure bit string '100100001', built from the specified document structures, and the structure bit strings of the retrieved documents then yields documents 1, 7, 38, ... as the final result.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a document retrieval method that searches a document database, using a search string specified by the user, only within the document structure portions that the user designates, and retrieves the desired documents. In particular, it relates to an information retrieval method suited to registering a large volume of documents and searching them at high speed, and it is applicable to large-scale document databases.

[0002]

[Prior Art] A full-text search method that does not require keywords to be assigned when documents are registered was previously proposed in Japanese Patent Laid-Open No. 03-174652. That method uses a condensed text, in which each document is compressed word by word, and a character component table, in which the characters used in each document are registered one character at a time, to screen out documents unrelated to the search term. This effectively raises the search speed, with the aim of making full-text search fast enough for practical use. A concatenated character component table method, which improves on this character component table and achieves still faster full-text search, was then proposed in Japanese Patent Laid-Open No. 05-174064. The concatenated character component table used in that publication extracts, without duplication, every concatenated character string of a predetermined length contained in the text, and describes the identifier information of the documents containing each string as a bit string in which the bit position corresponding to the document number is set to 1. However, if the identifier information for every concatenated character were described as a bit string, one bit string would be needed for each combination of characters, and the concatenated character component table would become enormous. That publication therefore uses a hash function to assign multiple concatenated characters to a single bit string, keeping the capacity down.

[0003]

[Problems to Be Solved by the Invention] However, when a conventional hash function is used to assign multiple concatenated characters to a single bit string, that is, to the document identifier information holding the numbers of the documents in which a concatenated character appears, the document identifier information of entirely unrelated concatenated characters is superimposed on the same bit string. Consequently, when document identifier information is read from the bit string for a given concatenated character, it may yield documents that contain only some entirely different concatenated character. In other words, search results obtained from a hashed concatenated character component table contain search noise. In a large-scale document retrieval system holding a large number of documents, this means that unneeded documents that cannot contain the search string may not be screened out, that is, the candidates may not be narrowed properly, and search performance then degrades.
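The noise mechanism described above can be illustrated with a small sketch. The hash function, the number of bit strings, and the sample documents are assumptions chosen only so that a collision occurs; the prior-art publication's actual hash function is not given in this text.

```python
# Sketch: several concatenated characters share one bit string through a
# hash function, so a lookup can return documents that lack the one asked for.
NUM_BIT_STRINGS = 4  # deliberately tiny so that collisions occur (assumption)


def bit_string_index(pair):
    # Hypothetical hash: sum of the code points modulo the table size.
    return sum(ord(ch) for ch in pair) % NUM_BIT_STRINGS


def build_table(docs):
    # table[i] is an int used as a bit string; bit n set means document n.
    table = [0] * NUM_BIT_STRINGS
    for doc_id, text in docs.items():
        for i in range(len(text) - 1):
            table[bit_string_index(text[i:i + 2])] |= 1 << doc_id
    return table


def documents_in(bits):
    # Expand a bit string into the set of document numbers whose bit is set.
    return {n for n in range(bits.bit_length()) if (bits >> n) & 1}


docs = {1: "ab", 2: "ba", 3: "xy"}  # made-up documents
table = build_table(docs)

candidates = documents_in(table[bit_string_index("ab")])
true_hits = {doc_id for doc_id, text in docs.items() if "ab" in text}
noise = candidates - true_hits  # returned only because "ba" shares the bit string
```

Here "ab" and "ba" hash to the same bit string, so a lookup for "ab" also returns document 2, which must then be eliminated by a later, more expensive check.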

[0004] It is also conceivable to associate one bit string with every concatenated character without using a hash function, but the volume of bit-string data would then be enormous and impractical. Concretely, about 8,000 character codes are currently used in Japanese, so the number of two-character concatenated characters is 8,000 × 8,000 = 64 million. If 1 million documents are registered, each of these 64 million concatenated characters is associated with 1 million bits of document identifier information as a bit string, so 64 million × 1 million bits = 8 TB of capacity is required. By comparison, even if each document body is 20 KB, 1 million documents amount to 20 KB × 1 million = 20 GB, so the character component table is overwhelmingly larger than the documents themselves. To reduce the capacity of the character component table, instead of always storing the identifiers of the documents in which a character appears as a fixed-length bit string, the document numbers can be written directly when few documents are involved. This is called the ID list storage format, and the conventional format that stores document identifier information as a bit string is called the bit list storage format. For example, in a database holding 1 million documents, storing the document identifier information for each character in the bit list storage format always requires 1 million bits = 125 KB per character, even if the character appears in only one document. In the ID list storage format, which stores the numbers of the documents in which the character appears, with 4 B for a document ID and 4 B for the count of stored document IDs, 4 B + 4 B = 8 B suffices.
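The capacity figures in this paragraph can be checked with a few lines of arithmetic. The sketch below assumes decimal units (1 KB = 1,000 bytes), which is what the figures in the text imply; the variable names are illustrative.

```python
CHARACTER_KINDS = 8_000      # character codes used in Japanese (from the text)
REGISTERED_DOCS = 1_000_000  # documents in the database (from the text)
BITS_PER_BYTE = 8

pair_kinds = CHARACTER_KINDS ** 2                   # two-character combinations
bit_list_bytes = REGISTERED_DOCS // BITS_PER_BYTE   # one bit string per combination
full_table_bytes = pair_kinds * bit_list_bytes      # unhashed table, bit lists only
corpus_bytes = 20_000 * REGISTERED_DOCS             # 20 KB per document body
id_list_bytes_one_doc = 4 + 4                       # 4 B count + one 4 B document ID

print(f"two-character kinds    : {pair_kinds:,}")            # 64,000,000
print(f"bit list per character : {bit_list_bytes:,} B")      # 125,000 B = 125 KB
print(f"full table             : {full_table_bytes:,} B")    # 8,000,000,000,000 B = 8 TB
print(f"document bodies        : {corpus_bytes:,} B")        # 20,000,000,000 B = 20 GB
print(f"ID list, one document  : {id_list_bytes_one_doc} B")  # 8 B
```

On these figures the unhashed table would be 400 times larger than the document bodies it indexes.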

[0005] Meanwhile, many documents, such as patent publications, have a structure, with parts such as "Title of the Invention", "Inventor", "Applicant", "Claims", and "Problems to Be Solved by the Invention", each holding a particular kind of content. Performing a full-text search on such structured documents while designating the document structures to be searched is called structure-specified search. Realizing structure-specified search requires creating a character component table for each structure and searching a separate table for each structure at the character component table search stage. However, creating a character component table for each structure of a document increases capacity, because each structure's table carries document identifier information for much the same character components. For example, if there are ten kinds of document structure, such as "Title of the Invention", "Inventor", "Applicant", and "Claims", creating a character component table for each structure requires ten times the table capacity. This cancels out the capacity saved by storing the document identifier information in ID list format. In addition, when several document structures are searched, the character component table search must be repeated once per structure, which increases the number of file input/output operations and is inefficient. For example, for the condition "find documents containing the string 'extreme work' in the 'Title of the Invention', 'Claims', or 'Effects' structure", the character component tables for the three structures must each be searched and the results then ORed together. The object of the present invention is to provide, in a large-scale information retrieval system that stores structured documents, a character component table with little search noise at a practical capacity, and to realize efficient structure-specified search.

[0006]

[Means for Solving the Problems] To achieve the above object, the present invention provides, in a document retrieval system that stores documents having a document structure and retrieves matching documents when a user specifies the document structure names to be searched and a search string, a method comprising the steps of: creating, for each document to be registered, a character component table describing the occurrence of characters in the document's text data; recognizing, for each document to be registered, the document structure according to predetermined document structure names and dividing the text data by structure; storing, for each document to be registered, a structure bit string that records, for each character that appears, the document structures in which it appears, by setting a specific bit value at the specific bit position corresponding to each such structure; receiving from the user the document structure names to be searched and a search string; searching the character component table for documents in which all the character components making up the search string are present; and, for each document so retrieved, reading the structure bit string corresponding to each character of the search string and extracting the documents in which the bit position of the user-specified document structure holds the specific bit value. In this way, documents that contain the search string in the user-specified document structure are retrieved. Further, a correspondence table consisting of records that associate each document structure name with a bit position in the structure bit string is provided, and document structure names are mapped to bit positions according to this table. Further, each record of the correspondence table consists of a document structure name, a bit position in the structure bit string, and a structure identification tag, which is a special character string denoting that document structure name; the structure identification tags are inserted at the corresponding document structures in the text data, and the text data with the tags inserted is stored. Further, based on the document structure names entered by the user as search targets, a specified-document-structure bit string is created in which the bit positions corresponding to those names hold the specific bit value; the structure bit string read for each character of the search string is ANDed, bit position by bit position, with the specified-document-structure bit string; and the AND or OR condition among the multiple document structure names specified as the search condition is processed based on the result. Further, a document identifier information file, which stores the document identifier information of the character component table, and a structure bit string storage file, which stores the structure bit strings, are created separately, and each record of the document identifier information file stores pointer information into the structure bit string storage file. Further, when the user enters document structure names to be searched, the AND operation and the AND or OR condition processing described above are performed; when the user enters no document structure name, only the character component table is consulted and the structure bit strings are not read.
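The registration-side steps above (a correspondence table, per-character structure bit strings, and a specified-document-structure bit string) can be sketched as follows. This is a minimal illustration, not the patented implementation: the structure names, their bit positions, the use of single characters as components, and the sample document are all assumptions.

```python
# Hypothetical correspondence table: document structure name -> bit position.
BIT_POSITION = {"title": 0, "claims": 1, "effects": 2}


def structure_bit_strings(sections):
    """Return {character: structure bit string} for one document.

    `sections` maps a document structure name to the text of that structure.
    """
    bits = {}
    for name, text in sections.items():
        mask = 1 << BIT_POSITION[name]
        for ch in text:
            bits[ch] = bits.get(ch, 0) | mask  # mark the structure the character appears in
    return bits


def specified_structure_bits(names):
    """Build the specified-document-structure bit string from structure names."""
    mask = 0
    for name in names:
        mask |= 1 << BIT_POSITION[name]
    return mask


doc = {"title": "abc", "claims": "cd", "effects": "e"}  # made-up document
bits = structure_bit_strings(doc)
wanted = specified_structure_bits(["title", "effects"])

# Characters that occur in at least one of the specified structures.
in_wanted = sorted(ch for ch, b in bits.items() if b & wanted)
```

The character "d" occurs only in the claims, so its bit string ANDed with `wanted` is zero and it is excluded, while "c" occurs in both the title and the claims and is kept.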

[0007]

[Operation] With the above means, because structure bit strings are stored, only the documents that contain the search string in the user-specified target document structure can be retrieved by simple processing. In particular, when the user specifies several target document structures, the condition can be evaluated with nothing more than a bitwise AND of the stored structure bit string and the specified-document-structure bit string at corresponding bit positions, so the search is fast. Moreover, whereas structure-specified search conventionally required a character component table for each structure, here the candidate documents are first narrowed using a character component table for the whole document and the search is then refined down to the document structure using the structure bit strings, so a single character component table suffices and capacity is saved.

[0008]

[Embodiment] An embodiment of the present invention is described in detail below. FIG. 1 shows the configuration of the embodiment. The embodiment consists of terminals 101, 102, ..., 110 for registration and search, a network 200, and a document server 1000. The document server 1000 consists of a LAN adapter 1010; a CPU 1020; a work memory 1030; a memory 1050 that stores a character table 1100, a file pointer table 1110, and a document structure identification tag correspondence table 1200; a memory 1300 that stores a character component table creation program 1310, a structure recognition program 1320, a structure bit string storage program 1330, a search condition input program 1340, a character component table search program 1350, and a structure bit string AND program 1360; files 1401, 1402, ... that store the character component table; files 1411, 1412, ... that store the structure bit strings; and text data 1420.

[0009] First, an overview of structure-specified search using structure bit strings is given. FIG. 2 outlines the structure-specified search method, which uses a character component table whose character components are two-character concatenated characters, together with structure bit strings. The figure shows the situation in which the user has specified "any of 'Title of the Invention', 'Claims', or 'Effects'" as the target document structures and "極限作業" ("extreme work") as the search string. In the first step of the search, the character component table is used to find the documents that contain the search string "極限作業". In the example in the figure, the character component table stores, in ID list format, the identifier information of the documents containing each character component, where the character components are two-character concatenated characters. The character component table search finds the documents that contain all three two-character components of the search string "極限作業", namely "極限", "限作", and "作業", by taking the intersection of the document identifier information for each component. In the example of FIG. 2, the document ID sequence "1, 7, 15, 38, ..." is shown as the result of this character component table search. The second step of the search performs a bitwise AND between the specified-document-structure bit string, in which the positions corresponding to the target document structures are set to 1 ("100100001" in FIG. 2, where the first "1" indicates that Title of the Invention is specified, the second "1" Claims, and the last "1" Effects), and the structure bit strings corresponding to the character components of the search string. As the figure shows, the structure bit strings are stored so that, for each character component, the structure bit strings of the documents in which that component appears are lined up in order. From this sequence, the structure bit strings of the documents found in the character component table search are read and ANDed with the specified-document-structure bit string, and the documents for which the result is nonzero for every character component of the search string become the final search result. In the example of FIG. 2, the bitwise AND of the structure bit string for document 15 under the character component "限作" with the specified-document-structure bit string is 0, so document 15 is dropped from the character component table search result. For document 1, the bitwise AND for the components "極限", "限作", and "作業" yields 1 at the "Title of the Invention" and "Effects" structures, so document 1 is in the result; for document 7 it yields 1 at "Claims", so document 7 is also in the result. In this way the document ID sequence "1, 7, 38, ...", narrowed further from the character component table search result, is obtained as the final result.
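The two search steps can be sketched as below. This is an in-memory illustration under stated assumptions: the `index` dictionary stands in for the character component table and structure bit string files, and the individual structure bit strings are invented so that the outcome matches the description of FIG. 2 (documents 1, 7, 15, 38 after step 1; documents 1, 7, 38 after step 2). Only the specified bit string "100100001" comes from the text.

```python
def components(query):
    """Split a search string into two-character concatenated characters."""
    return [query[i:i + 2] for i in range(len(query) - 1)]


def search(index, query, specified_bits=None):
    """index maps each component to {document number: structure bit string}."""
    parts = components(query)
    if not parts:
        raise ValueError("query must contain at least two characters")
    postings = [index.get(part, {}) for part in parts]

    # Step 1: documents that contain every component (set intersection).
    candidates = set(postings[0])
    for posting in postings[1:]:
        candidates &= set(posting)
    if specified_bits is None:
        return sorted(candidates)  # no structure named: bit strings are not read

    # Step 2: keep a document only if the bitwise AND is nonzero
    # for every component of the search string.
    return sorted(
        doc for doc in candidates
        if all(posting[doc] & specified_bits for posting in postings)
    )


def b(text):
    return int(text, 2)  # parse a bit string written as text


# Illustrative structure bit strings (made up, see the note above).
index = {
    "極限": {1: b("100000001"), 7: b("000100000"), 15: b("100000000"), 38: b("000000001")},
    "限作": {1: b("100000001"), 7: b("000100000"), 15: b("010000000"), 38: b("000000001")},
    "作業": {1: b("100000001"), 7: b("000100010"), 15: b("100000000"), 38: b("000000001"),
             40: b("100000000")},
}
SPECIFIED = b("100100001")  # Title of the Invention, Claims, Effects

step1 = search(index, "極限作業")             # [1, 7, 15, 38]
final = search(index, "極限作業", SPECIFIED)  # [1, 7, 38]
```

Document 40 is removed in step 1 because it lacks two of the components, and document 15 is removed in step 2 because "限作" appears only in a structure that was not specified.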

[0010] This concludes the overview of structure-specified search. Next, the structure of the character component table and the structure bit strings used in this embodiment is described. In this embodiment, a character table and a file pointer table are used to manage the document identifier information corresponding to each concatenated character. FIG. 3 outlines the character table and the file pointer table. For example, to find documents containing the string "構成" ("configuration"), the record for the character "構" in the character table is consulted first, giving the pointer 580 into the file pointer table. Next, the records starting at byte 580 from the head of the file pointer table are examined in turn to find the record whose second character is "成". In the file pointer table, for each first character of the concatenated characters, a record whose second character is 0 is stored at the head. The record whose second character is 0 holds a pointer to the document identifier information of all documents that contain the first character. That is, it stores the file identifier (hereinafter also called the file ID) and the byte position within the file (hereinafter also called the offset) for accessing the document identifier information of the single character consisting of the first character alone. Because a record whose second character is 0 therefore always exists for each first character, a search for the concatenated character "構成", for example, starts from the record at byte 580 of the file pointer table, which corresponds to "構", and continues until a record whose second character is 0 is reached again; if "成" has not been found by then, it can be concluded that the concatenated character does not exist. In the example of FIG. 3, a record for "成" exists, and from it the information for accessing the document identifier information, namely file ID 1 and offset 1034, is obtained.
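The lookup through the character table and the file pointer table can be sketched as follows. For simplicity this sketch addresses the file pointer table by record index rather than by byte offset (the text uses byte 580), and it represents the "0" second character as an empty string; both are assumptions, as are the values marked "made-up". The file IDs and offsets for "構成" and "構造" follow the FIG. 3 example.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple


class PointerRecord(NamedTuple):
    second_char: str  # "" plays the role of the "0" second character
    file_id: int
    offset: int


def find_pointer(char_table: Dict[str, int],
                 pointer_table: List[PointerRecord],
                 pair: str) -> Optional[Tuple[int, int]]:
    """Return (file ID, offset) for a two-character pair, or None if absent."""
    first, second = pair[0], pair[1]
    start = char_table.get(first)
    if start is None:
        return None
    # pointer_table[start] is the single-character record; scan what follows
    # until the next record whose second character is "0".
    i = start + 1
    while i < len(pointer_table) and pointer_table[i].second_char != "":
        if pointer_table[i].second_char == second:
            return pointer_table[i].file_id, pointer_table[i].offset
        i += 1
    return None


pointer_table = [
    PointerRecord("", 3, 0),       # single character "構" (made-up values)
    PointerRecord("成", 1, 1034),  # "構成", values from the FIG. 3 example
    PointerRecord("造", 2, 340),   # "構造", values from the FIG. 3 example
    PointerRecord("", 3, 64),      # single character "文" (made-up values)
    PointerRecord("書", 2, 900),   # "文書" (made-up values)
]
char_table = {"構": 0, "文": 3}

found = find_pointer(char_table, pointer_table, "構成")
missing = find_pointer(char_table, pointer_table, "構書")
```

The search for "構書" stops at the "0" record that opens the group for "文", which is how the absence of a concatenated character is detected without scanning the whole table.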

[0011] The document identifier information is divided among multiple files, as shown in FIG. 4. The file ID in the file pointer table identifies the file in which the document identifier information is stored. In addition, it is decided in advance that certain file IDs hold document identifier information in bit list format. In the example of FIG. 4, file 1 holds document identifier information in bit list format. The head of each item of document identifier information holds the offset into the file that stores the structure bit strings. In the example of FIG. 3, file ID 1 and offset 1,034 are obtained as the access information for the document identifier information of the concatenated character "構成" ("configuration"). Reading the data from byte 1,034 of file 1 therefore yields the offset 6,734 into the structure bit string file and the bit string "0111010101..." representing the document identifier information. Each bit of this bit string, counted from the leading bit, corresponds to a document number, and a "1" marks a document that contains the concatenated character "構成". In this example, the document numbers of the documents containing "構成", that is, the ID list, can therefore be derived mechanically as 1, 2, 3, 5, 7, 9, .... The other files in FIG. 4 (file 2 and file 3) store document identifier information in ID list format. As in the files stored in bit list format, each ID list begins with the offset into the file that stores the structure bit strings. The head of the ID list that follows the offset gives the number of document numbers stored. For example, for the concatenated character "構造" ("structure"), the file ID is 2 and the offset is 340 in the example of FIG. 3, so reading from byte 340 of file 2 yields the offset 684 and the ID list information that 56 documents contain "構造" and that their document numbers are 562, 1038, ....
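Both storage formats can be sketched briefly. The bit list conversion follows the example in the text, which implies that the leading bit corresponds to document number 0. The binary record layout for the ID list (4-byte little-endian unsigned integers for the offset, the count, and each document number) is an assumption; the text fixes only the order of the fields, and an earlier paragraph mentions 4 B for a document ID and for the count.

```python
import struct


def bit_list_to_ids(bits):
    """Convert a bit list such as "0111010101" to document numbers."""
    return [number for number, bit in enumerate(bits) if bit == "1"]


def pack_id_list(structure_offset, doc_ids):
    # Assumed layout: structure bit string offset, count, then the IDs,
    # each a 4-byte little-endian unsigned integer.
    return struct.pack(f"<II{len(doc_ids)}I", structure_offset, len(doc_ids), *doc_ids)


def read_id_list(data, offset):
    """Return (structure bit string offset, document numbers) at `offset`."""
    structure_offset, count = struct.unpack_from("<II", data, offset)
    doc_ids = struct.unpack_from(f"<{count}I", data, offset + 8)
    return structure_offset, list(doc_ids)


ids_from_bits = bit_list_to_ids("0111010101")  # [1, 2, 3, 5, 7, 9]

# A stand-in for file 2: 340 filler bytes, then the record for "構造".
# Only two of the 56 document numbers are included to keep the example short.
file2 = bytes(340) + pack_id_list(684, [562, 1038])
structure_offset, doc_ids = read_id_list(file2, 340)
```

Reading at byte 340 returns the structure bit string offset 684 together with the document numbers, which is the information the next stage needs to locate the structure bit strings.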

[0012] The structure bit strings, like the document identifier information, are divided among multiple files according to the file ID stored in the file pointer table. FIG. 5 shows the files that store the structure bit strings. For example, for the concatenated character "構造" ("structure"), the file ID is 2 and the offset is 340 in the example of FIG. 3, so reading from byte 340 of document identifier information file 2 yields the structure bit string offset 684. Reading from byte 684 of structure bit string file 2 then yields the structure bit string "0100111001111111", which indicates the document structures that contain the character component "構造". Starting at the position given by the offset held in the document identifier information file, the structure bit strings are laid out in order, one for each document in which the character component appears. That is, the structure bit strings in structure bit string file 2 are laid out for the 56 documents that contain "構造", in the order of the document IDs in the document identifier information: the structure bit string of document 562, then that of document 1038, and so on. In this embodiment, 16 bits are allotted to each structure bit string so that 16 document structures can be managed per document; extending this to manage more document structures by increasing the number of bits is straightforward.
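Because the structure bit strings are stored in the same order as the document numbers in the ID list, the bit string for a given document can be located by position. The sketch below assumes 2-byte big-endian storage for each 16-bit string and an in-memory stand-in for structure bit string file 2; the bit string for document 1038 is made up. It covers only the ID list case.

```python
import struct

BYTES_PER_BIT_STRING = 2  # 16 bits per structure bit string (from the text)


def read_structure_bits(structure_file, base_offset, doc_ids, doc_id):
    """Return the 16-bit structure bit string of `doc_id` for one component.

    `doc_ids` is the component's ID list; the bit strings are stored in the
    same order, starting at `base_offset`.
    """
    position = doc_ids.index(doc_id)  # raises ValueError if the ID is absent
    (bits,) = struct.unpack_from(
        ">H", structure_file, base_offset + BYTES_PER_BIT_STRING * position
    )
    return bits


doc_ids = [562, 1038]  # first two document numbers for "構造"
structure_file2 = bytes(684) + struct.pack(
    ">HH",
    0b0100111001111111,  # document 562, value from the text
    0b0000000000000001,  # document 1038, made-up value
)

bits_562 = read_structure_bits(structure_file2, 684, doc_ids, 562)
bits_1038 = read_structure_bits(structure_file2, 684, doc_ids, 1038)
```

For a component stored in bit list format, the position would presumably be the number of set bits preceding the document's bit; the text does not spell this out, so it is left out of the sketch.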

【００１３】このように、ファイルポインタテーブルに
は、データベース中に存在する連接文字のみを登録する
ので、データベース中に存在しない文字の組み合わせは
全て排除できるという利点がある。したがって、文字テ
ーブルやファイルポインタテーブルで実現している連接
文字の管理情報を格納するファイル量やメモリ量を大幅
に削減することができる。また、文書識別子情報をビッ
ト列あるいはＩＤリストの形式で格納し、多くの文書を
格納する場合はビット列で、少ない文書を格納する場合
はＩＤリストの形式で管理することによりファイル容量
を大幅に削減することができる。具体的に説明すると、
ビットリスト形式で文書識別子情報を格納するには、常
にデータベースに登録した全件分のビット数が必要にな
るが、ＩＤリストの形式で文書識別子情報を格納する場
合には、文書識別子を表わすビット数×登録文書数です
むことになる。例えば、データベースの全登録件数が１
００万件で、一個の文書識別子情報を表わすのに３２ビ
ットを割り当てるとすると、連接文字“構造”を含む文
書を１０件登録する場合には、ビットリスト形式なら
ば、１００万ｂｉｔ＝１２５ＫＢの格納領域が必要となるが、ＩＤリスト形式ならば、３２ｂｉｔ×１０件＝４０Ｂの格納領域ですむことになる。一方、例えば、連接文字
“構成”を含む文書が１００万件中で９０万件ある場合
には、ビットリスト形式ならば、１００万ｂｉｔ＝１２５ＫＢの格納領域にすむのに対し、ＩＤリスト形式の場合、３２ｂｉｔ×９０万件＝３．６ＭＢの領域が必要となる。したがって、この１００万件を、
文書識別子３２ビットで格納する場合には、１００万ｂｉｔ÷３２ｂｉｔ＝３１，２５０件を境として、これよりも登録件数が多い場合はビットリ
スト形式で、少ない場合はＩＤリスト形式で文書識別子
情報を格納するのが、最も格納領域を有効に使用する方
法である。In this way, only concatenated characters that actually occur in the database are registered in the file pointer table, so all character combinations that do not occur in the database are excluded. This greatly reduces the file and memory space needed for the management information of concatenated characters held in the character table and the file pointer table. In addition, document identifier information is stored either as a bit string or as an ID list: as a bit string when many documents are to be recorded and as an ID list when few are, which greatly reduces file size. Specifically, storing document identifier information in bit list form always requires as many bits as there are documents registered in the database, whereas ID list form requires only (number of bits per document identifier) × (number of documents recorded). For example, suppose the database holds 1,000,000 documents and 32 bits are allotted to each document identifier. If 10 documents contain the concatenated characters “構造” ("structure"), bit list form needs a storage area of 1,000,000 bits = 125 KB, whereas ID list form needs only 32 bits × 10 = 40 B. On the other hand, if 900,000 of the 1,000,000 documents contain the concatenated characters “構成” ("composition"), bit list form still needs only 125 KB, whereas ID list form needs 32 bits × 900,000 = 3.6 MB. For 1,000,000 documents with 32-bit document identifiers, the break-even point is therefore 1,000,000 bits ÷ 32 bits = 31,250 documents: storing the document identifier information in bit list form when more documents than this are recorded, and in ID list form when fewer are, makes the most effective use of the storage area.
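
The arithmetic above can be checked with a short Python sketch. The function names are hypothetical, and the choice of ID list form at exactly the break-even point is an assumption, since the text does not say which form is used there.

```python
def bit_list_bytes(total_docs: int) -> int:
    """Bit list form: one bit per registered document, rounded up to whole bytes."""
    return (total_docs + 7) // 8


def id_list_bytes(matching_docs: int, id_bits: int = 32) -> int:
    """ID list form: one fixed-width identifier per document that contains the component."""
    return matching_docs * id_bits // 8


def choose_format(total_docs: int, matching_docs: int, id_bits: int = 32) -> str:
    """Pick the smaller representation (ties go to the ID list, an assumption)."""
    return "bit list" if matching_docs * id_bits > total_docs else "ID list"


TOTAL = 1_000_000
break_even = TOTAL // 32          # number of documents at which both forms cost the same

sizes = {
    "bit list": bit_list_bytes(TOTAL),        # 125 KB regardless of how many documents match
    "ID list, 10 docs": id_list_bytes(10),
    "ID list, 900,000 docs": id_list_bytes(900_000),
}
```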

【００１４】また、構造ビット列を各文字成分に対応さ
せて格納することにより、文字テーブル、ファイルポイ
ンタテーブルと文書識別子情報ファイルからなる文字成
分表を文書構造毎に複数個作成する必要がないという利
点がある。このことは、データベースのファイル削減に
大きな効果がある。In addition, because the structure bit strings are stored in association with each character component, there is no need to create a separate character component table, consisting of a character table, a file pointer table, and document identifier information files, for each document structure. This is highly effective in reducing the files of the database.

【００１５】以上、構造指定検索の概要と、文字成分表
および構造ビット列の構造について説明した。これよ
り、文書データの登録方法について説明する。図６は、
登録の流れを示すＰＡＤ図である。データの登録時に
は、文字成分表作成プログラム１３１０で登録文書の各
文字成分について必要に応じて文字テーブル、ファイル
ポインタテーブルに文字成分を登録し、各文字成分につ
いて文書識別子情報ファイルに文書ＩＤを格納する（６
０２０）。また、構造認識プログラム１３２０で各文書
の構造毎にテキストデータを分割し（６０３０）、構造
ビット列格納プログラム１３３０により各構造で用いら
れている各文字単位に構造ビット列を作成し格納する
（６０４０）。The above has outlined structure-specified search and described the structure of the character component table and the structure bit strings. The method of registering document data is described next. FIG. 6 is a PAD diagram showing the registration flow. When data is registered, the character component table creation program 1310 registers each character component of the document being registered in the character table and the file pointer table as necessary, and stores the document ID in the document identifier information file for each character component (6020). The structure recognition program 1320 then divides the text data of each document by structure (6030), and the structure bit string storage program 1330 creates and stores a structure bit string for each character unit used in each structure (6040).

【００１６】文字成分表の登録ステップ（６０２０）で
は、文書中に使われている各文字成分について、文字テ
ーブル、ファイルポインタテーブルを参照し、文字成分
が登録されているかチェックし、登録されていない場合
には、文字テーブル、ファイルポインタテーブルに文字
成分を登録する。この文字テーブルあるいはファイルポ
インタテーブルに該当文字が登録されていないときに
は、文書識別子情報を格納するファイルの空領域を確保
して、ファイルＩＤとオフセット値をファイルポインタ
テーブルに格納する。こうして文書中の各文字成分につ
いて、文書識別子情報を格納するファイルＩＤとオフセ
ット値をファイルポインタテーブルから取得し、該当す
る文書識別子情報を格納したファイルに文書ＩＤを格納
していく。In the character component table registration step (6020), the character table and the file pointer table are consulted for each character component used in the document to check whether the character component is already registered; if it is not, it is registered in the character table and the file pointer table. When the character in question is not registered in the character table or the file pointer table, free space is reserved in a file that stores document identifier information, and the file ID and offset value are stored in the file pointer table. In this way, for each character component in the document, the file ID and offset value of the file storing its document identifier information are obtained from the file pointer table, and the document ID is stored in that file.
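
The check-then-register behaviour of step 6020 can be sketched in Python as follows. This is an in-memory model only: dictionaries stand in for the character table and file pointer table, a single hypothetical file with ID 1 stands in for the document identifier information files, and the "offset" is a slot number rather than a byte offset. All class and method names are assumptions.

```python
class CharacterComponentTable:
    def __init__(self):
        self.pointers = {}     # first character -> {second character -> (file ID, offset)}
        self.postings = {}     # (file ID, offset) -> list of document IDs
        self._next_offset = 0  # next free slot in the hypothetical file 1

    def _locate(self, component, create=False):
        """Find the (file ID, offset) of a two-character component, optionally registering it."""
        first, second = component[0], component[1]
        records = self.pointers.get(first)
        if records is None:
            if not create:
                return None
            records = self.pointers[first] = {}
        location = records.get(second)
        if location is None and create:
            # Component not yet registered: reserve a free area and record where it is.
            location = records[second] = (1, self._next_offset)
            self._next_offset += 1
            self.postings[location] = []
        return location

    def register(self, doc_id, text):
        """Store doc_id under every distinct two-character component of text."""
        for component in sorted({text[i:i + 2] for i in range(len(text) - 1)}):
            self.postings[self._locate(component, create=True)].append(doc_id)

    def documents(self, component):
        location = self._locate(component)
        return [] if location is None else self.postings[location]


table = CharacterComponentTable()
table.register(1, "極限作業")
table.register(7, "極限環境")   # hypothetical second document
```

Only components that occur in some registered document ever receive an entry, which is the property the text relies on to keep the tables small.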

【００１７】構造分割のステップ（６０３０）では、図
７に示す文書構造識別タグ対応表にしたがって、識別タ
グ間のテキストデータを抽出する。例えば、図８に示す
ように、テキストデータを“＃ＢＩＪ−Ｔｉｔｌｅ”，
“＃ＢＩＪ−Ｉｎｖｅｎｔｏｒ”のような識別タグで区
切り、それぞれ“発明の名称”，“発明者”の文書構造
のテキストデータとして、次の構造ビット列格納ステッ
プへ送る。文書構造識別タグ対応表には、次の処理ステ
ップの構造ビット列格納ステップで用いる構造ビット列
のそれぞれの文書構造に対応するビット位置も格納して
いる。In the structure division step (6030), the text data between identification tags is extracted according to the document structure identification tag correspondence table shown in FIG. 7. For example, as shown in FIG. 8, the text data is delimited by identification tags such as "#BIJ-Title" and "#BIJ-Inventor", and the delimited pieces are passed to the following structure bit string storage step as the text data of the document structures "title of the invention" and "inventor", respectively. The document structure identification tag correspondence table also stores, for each document structure, the corresponding bit position in the structure bit string used in the structure bit string storage step that follows.
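
A minimal Python sketch of the structure division step is given below. It assumes that each tag is followed directly by the text of its structure and that the text runs until the next tag; the exact layout of FIG. 8 is not reproduced here. The bit position for "inventor", the sample text, and the function name are hypothetical; only the two tag strings and the bit position of the title come from the description.

```python
import re

# Stand-in for the document structure identification tag correspondence table:
# tag -> (document structure name, bit position in the structure bit string)
TAG_TABLE = {
    "#BIJ-Title": ("title of the invention", 1),
    "#BIJ-Inventor": ("inventor", 2),   # bit position assumed for illustration
}


def split_by_structure(text: str) -> dict:
    """Return {structure name: text found after the structure's tag}."""
    pattern = "|".join(re.escape(tag) for tag in TAG_TABLE)
    # With a capturing group, re.split keeps the tags:
    # [preamble, tag1, body1, tag2, body2, ...]
    parts = re.split(f"({pattern})", text)
    result = {}
    for tag, body in zip(parts[1::2], parts[2::2]):
        name, _bit_position = TAG_TABLE[tag]
        result[name] = result.get(name, "") + body.strip()
    return result


sample = "#BIJ-Title極限作業ロボット\n#BIJ-Inventor(inventor name)\n"
structures = split_by_structure(sample)
```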

【００１８】構造ビット列格納ステップ（６０４０）で
は、抽出された各文書構造のテキストデータごとにそこ
で用いられている各文字単位に構造ビット列を作成し格
納する。この構造ビット列の格納では、図７に示した文
書構造識別タグ対応表に格納しているビット位置に、そ
れぞれの文書構造のテキストデータ中に存在する文字成
分のデータを登録していく。例えば図８の例で、文書１
の文書構造「発明の名称」に“極限”という文字成分が
あるので、図５に示した例のように「発明の名称」を表
わす構造ビット列の第１ビットを１とする。各文字成分
に対応する構造ビットリストの格納位置は、本実施例で
既に説明したように、文字テーブル、ファイルポインタ
テーブルおよび文書識別子情報ファイルを参照すること
により行う。例えば、文字成分“極限”の場合には、文
字テーブルを第一文字の“極”で参照し、ファイルポイ
ンタテーブルへのオフセット値８７０を得る。次に、フ
ァイルポインタテーブルの先頭から８７０バイト目から
第二文字が“限”であるレコードを探索して、文書識別
子情報格納ファイルをアクセスするためのファイルＩＤ
とオフセット値を得る。こうして、文字成分“極限”に
対応する文書識別子情報をファイルＩＤが３のオフセッ
ト１０８４から読み出し、先頭４バイトの構造ビット列
を格納するファイルのオフセット値８６９２を得る。本
実施例では、このオフセット値８６９２から、文字成分
“極限”を含む文書について一文書につき１６ビットず
つ構造ビット列を格納する。従って、文書番号が１であ
る構造ビット列は、構造ビット列格納ファイルＩＤが３
で先頭から８６９２バイト目より１６ビットが対応して
いることがわかる。このようにして、登録する各文書の
文字成分について文字テーブル、ファイルポインタテー
ブル、および文書識別子情報の格納を行い、各文書の各
文字成分について、該文字成分が存在する文書構造の位
置を構造ビット列として格納していく。In the structure bit string storage step (6040), a structure bit string is created and stored for each character unit used in the extracted text data of each document structure. The presence of each character component in the text data of a document structure is recorded at the bit position stored for that structure in the document structure identification tag correspondence table shown in FIG. 7. For example, in FIG. 8 the character component “極限” occurs in the document structure "title of the invention" of document 1, so the first bit of the structure bit string, which represents "title of the invention", is set to 1, as in the example shown in FIG. 5. The storage position of the structure bit strings for each character component is found, as already described for this embodiment, by referring to the character table, the file pointer table, and the document identifier information file. For the character component “極限”, for example, the character table is looked up with the first character “極” to obtain the offset value 870 into the file pointer table. Next, starting at byte 870 from the beginning of the file pointer table, the record whose second character is “限” is located, giving the file ID and offset value for accessing the document identifier information storage file. The document identifier information for “極限” is then read from offset 1084 of the file with file ID 3, and its first 4 bytes give the offset value 8692 into the file that stores the structure bit strings. In this embodiment, structure bit strings of 16 bits per document are stored from this offset 8692 for the documents containing “極限”. Accordingly, the structure bit string of document number 1 occupies the 16 bits starting at byte 8692 of the structure bit string storage file with file ID 3. In this way, the character table, file pointer table, and document identifier information are stored for the character components of each document being registered, and for each character component of each document, the document structures in which it occurs are stored as a structure bit string.
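
How the structure bit strings of one document might be built can be sketched in Python as follows. The sketch assumes that bit position 1 is the leftmost (most significant) of the 16 bits and that "claims" uses position 4, which is inferred from the specified document structure bit string "1001000000000000" quoted later in the description. The sample texts and function names are hypothetical.

```python
from collections import defaultdict

NUM_BITS = 16  # bits per structure bit string in this embodiment


def bigrams(text: str) -> set:
    """Two-character concatenated components of a text."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def structure_bits(structures: dict) -> dict:
    """Map each component of one document to its structure bit string.

    `structures` maps a bit position (1-based, counted from the left) to the
    text of the document structure assigned to that position.
    """
    bits = defaultdict(int)
    for position, text in structures.items():
        mask = 1 << (NUM_BITS - position)   # position 1 -> most significant bit
        for component in bigrams(text):
            bits[component] |= mask
    return dict(bits)


# Document with "極限作業" in the title (position 1) and "作業方法" in the claims (position 4).
doc_bits = structure_bits({1: "極限作業", 4: "作業方法"})
as_text = {component: format(value, "016b") for component, value in doc_bits.items()}
```

The component “作業” occurs in both structures, so both of its bits are set, while “極限” has only the title bit.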

【００１９】検索処理は、図９に示す手順で行う。ま
ず、検索語から文字成分を切り出す（９０１０）。次
に、切り出したそれぞれの文字成分について（９０２
０）、文字テーブルを探索する（９０３０）。そして、
該当するファイルポインタテーブルの各レコードについ
て、第二文字目の探索を行い（９０４０）、該当するフ
ァイルＩＤとオフセット値を得る。こうして、文書識別
子情報を格納したファイルとそのオフセット値により該
当する各連接文字に対応する文書識別子情報を取得する
（９０５０）。この文書識別子情報の取得の過程で該当
する連接文字が文字成分表に登録されていない場合（９
０６０，９０７０）には、すなわち検索語を構成する文
字成分のうちどれか一つでも文字成分表に登録されてい
なければ、検索語を含む該当文書がないので検索結果と
して０件という結果を、文書識別子情報探索プログラム
１３５０がＬＡＮアダプタ１０１０を介して検索端末に
返す。Search processing follows the procedure shown in FIG. 9. First, character components are cut out of the search term (9010). Then, for each extracted character component (9020), the character table is searched (9030), the records of the corresponding part of the file pointer table are searched for the second character (9040), and the corresponding file ID and offset value are obtained. The document identifier information for each concatenated-character component is then read from the document identifier information file at that offset (9050). If, while the document identifier information is being obtained, a concatenated-character component turns out not to be registered in the character component table (9060, 9070), that is, if even one of the character components forming the search term is not registered, no document can contain the search term, so the document identifier information search program 1350 returns a result of zero hits to the search terminal via the LAN adapter 1010.

【００２０】検索語を構成する全ての文字成分について
該当する文書識別子情報が得られた場合は、次に各文字
成分に対応する構造ビット列の読み出しを行う。この読
み出しは、ファイルポインタテーブルから得られるファ
イルＩＤと、文書識別子情報から得られる構造ビット列
格納ファイルのオフセット値および該文字成分を含む文
書数から行うことができる。こうして、全ての文字成分
について構造ビット列を読み出し、検索条件として指定
された文書構造を示すビット位置が１である文書のみを
抽出する（９０８０）。例えば、「発明の名称」に“極
限作業”という文字列を含む文書を検索する場合には、
検索文字列の各文字成分 “極限”、“限作”、“作業” のそれぞれについて、それぞれ文書識別子情報に記載さ
れた件数分の構造ビット列を読み込み、「発明の名称」
に該当するビット位置すなわち第１ビットが１である文
書のみを抽出する。このとき、検索文字列を構成する全
ての文字成分について該当する文書がない場合には、検
索文字列を指定の文書構造に含む文書は０件であるとす
ることができる（９０９０）。それ以外の場合には、全
てのこうして得られた、指定箇所に検索文字列の文字成
分を含む文書について、各文字成分を全て含む文書を検
索結果とする（９１００）。これは、得られたそれぞれ
の文字成分を含む文書の積集合（例えば、図２の場合、
検索結果の文書ＩＤ１，７，３８，・・・からなる積
集合）をとることによって行う。When document identifier information has been obtained for all of the character components forming the search term, the structure bit strings corresponding to each character component are read next. They can be located from the file ID obtained from the file pointer table, together with the offset into the structure bit string storage file and the number of documents containing the character component, both obtained from the document identifier information. The structure bit strings are read in this way for every character component, and only those documents whose bit at the position of the document structure specified in the search condition is 1 are extracted (9080). For example, to search for documents containing the character string “極限作業” in "title of the invention", structure bit strings are read for each of the character components “極限”, “限作”, and “作業” of the search string, as many as the document count recorded in the respective document identifier information, and only documents whose bit at the position for "title of the invention", namely the first bit, is 1 are extracted. If at this point no document qualifies for all of the character components forming the search string, the number of documents containing the search string in the specified document structure can be taken to be zero (9090). Otherwise, among the documents thus found to contain character components of the search string in the specified part, those that contain all of the character components are taken as the search result (9100). This is done by taking the intersection of the document sets obtained for the individual character components (in the case of FIG. 2, for example, the intersection consisting of the document IDs 1, 7, 38, ...).
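
The search steps above can be sketched in Python with an in-memory index. The index layout (component → {document ID → structure bits}), the function names, and the bit position labelled OTHER are assumptions made for illustration; the document IDs follow the example in the abstract, where the character component table yields documents 1, 7, 15, 38 and the structure check leaves 1, 7, 38. The sketch checks for unregistered components while iterating rather than in a separate earlier pass, which gives the same result.

```python
TITLE = 0b1000000000000000   # first bit: "title of the invention"
OTHER = 0b0000000001000000   # some other document structure (hypothetical position)


def components(query: str) -> list:
    """Two-character components of the search string, in order."""
    return [query[i:i + 2] for i in range(len(query) - 1)]


def structured_search(index: dict, query: str, mask: int) -> set:
    """Documents in which every component of `query` occurs in a structure selected by `mask`."""
    result = None
    for component in components(query):
        postings = index.get(component)
        if postings is None:
            return set()                      # component not registered: zero hits
        hits = {doc for doc, bits in postings.items() if bits & mask}
        result = hits if result is None else result & hits
        if not result:
            return set()                      # no document satisfies all components
    return result if result is not None else set()


index = {
    "極限": {1: TITLE, 7: TITLE, 15: OTHER, 38: TITLE},
    "限作": {1: TITLE, 7: TITLE, 15: OTHER, 38: TITLE},
    "作業": {1: TITLE, 7: TITLE | OTHER, 15: OTHER, 38: TITLE, 40: TITLE},
}

in_title = structured_search(index, "極限作業", TITLE)
```

Document 15 contains all three components but not in the title, so the bit test removes it before the intersection is taken.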

【００２１】この構造ビット列の読み出しと指定文書構
造に文字成分が含まれているか否かの判定では、複数個
の文書構造についての判定を一度に行うことができる。
図７の文書構造識別タグ対応表に示したビット位置に各
文書構造が対応している場合、「発明の名称」あるいは
「請求の範囲」に検索文字列を含む文書を検索する場合
であれば、これらの構造と対応するビット位置を１とす
る指定文書構造ビット列“１００１０００００００００
０００”と構造ビット列とのビットＡＮＤを行い、結果
が非０となる文書を抽出結果とすればよい。また、「発
明の名称」かつ「請求の範囲」に検索文字列を含む文書
を検索する場合は、指定文書構造ビット列と構造ビット
列とのビットＡＮＤ結果が指定文書構造ビット列と等し
い文書を抽出結果とする。The reading of the structure bit strings and the determination of whether the specified document structure contains a character component can handle several document structures at once. When the document structures correspond to the bit positions shown in the document structure identification tag correspondence table of FIG. 7, a search for documents containing the search string in "title of the invention" or "claims" can be performed by ANDing each structure bit string with the specified document structure bit string "1001000000000000", in which the bit positions corresponding to these structures are set to 1, and extracting the documents for which the result is nonzero. To search for documents containing the search string in both "title of the invention" and "claims", the documents for which the bitwise AND of the specified document structure bit string and the structure bit string equals the specified document structure bit string are extracted.
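
The two tests described above reduce to a pair of one-line predicates, sketched here in Python. The constant and function names are hypothetical, and the assignment of "claims" to the fourth bit from the left is inferred from the quoted bit string "1001000000000000" together with the statement that the title uses the first bit.

```python
TITLE = 0b1000000000000000    # first bit
CLAIMS = 0b0001000000000000   # fourth bit (inferred from the quoted mask)

SPECIFIED = TITLE | CLAIMS    # specified document structure bit string


def matches_any(structure_bits: int, specified: int) -> bool:
    """OR condition: the component occurs in at least one specified structure."""
    return (structure_bits & specified) != 0


def matches_all(structure_bits: int, specified: int) -> bool:
    """AND condition: the component occurs in every specified structure."""
    return (structure_bits & specified) == specified


title_only = TITLE
title_and_claims = TITLE | CLAIMS | 0b1   # also set in some unrelated structure
```

Both conditions cost a single AND and a comparison per document, whatever the number of structures named in the query.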

【００２２】このようにして得られた文字成分表の検索
結果は、検索ノイズが非常に少ないので、文字成分表の
サーチ結果を表示しても十分実用できる。なお、上記の
説明では文書構造名を指定した検索について述べたが、
文書構造名の指定がない場合は、文書識別子ファイルを
参照するだけで、構造ビット列格納ファイルにはアクセ
スしない。Because a character component table search carried out in this way produces very little search noise, its result is practical enough to be displayed as is. The description above covers searches in which a document structure name is specified; when no document structure name is specified, only the document identifier file is referenced and the structure bit string storage file is not accessed.

【００２３】もちろん、文字成分表のサーチ結果をもと
に、文書本文を探索し実際に検索語を含む文書のみに絞
り込むかあるいは、複数の検索語間の位置的関係を満た
す文書を探すことも可能である。また、文字成分表の検
索結果を一度検索端末に表示し、ユーザの指定により本
文の探索を行うかどうかを決定してもよい。Of course, it is also possible to search the document text on the basis of the character component table search result, either to narrow the result down to documents that actually contain the search term or to find documents that satisfy positional relationships among several search terms. Alternatively, the character component table search result may first be displayed on the search terminal, and whether to search the text may be decided at the user's direction.

【００２４】また、本実施例で用いた文字成分表は、連
接する２文字を文字成分としたが、３文字以上の連接文
字を文字成分として本発明を実施することも容易に実現
できる。３文字以上の連接文字を文字成分として文字成
分表を構成することは、文字成分表サーチの検索精度を
高めるという点で効果がある。特に、カタカナなどの文
字種類の少ない文字種によって構成された検索文字列を
検索する場合に効果がある。同様に文字成分表サーチの
精度を上げるために、１文字飛びに２文字の組み合わせ
を文字成分とすることも考えられる。このような飛び飛
びに数文字の組み合わせすなわちスキップ文字列を文字
成分とすることは、前後間の文字の相関が強い英単語な
どの検索をするのに適している。さらに、通常の連接文
字とこのスキップ文字列の両方を併用することもでき
る。この場合には、通常の連接文字を文字成分とするよ
りは文字成分表のファイル容量を必要とするが、３文字
の連接文字をとるよりは、少ないファイル容量でより高
精度な文字成分表サーチを実現できる。The character component table used in this embodiment takes two adjacent characters as a character component, but the invention can just as easily be implemented with concatenations of three or more characters as character components. Building the character component table from concatenations of three or more characters improves the precision of the character component table search. This is particularly effective for search strings written in a script with few distinct characters, such as katakana. Similarly, to raise the precision of the character component table search, a pair of characters separated by one skipped character may be used as a character component. Using such combinations of non-adjacent characters, that is, skip character strings, as character components is well suited to searching text such as English words, in which neighbouring characters are strongly correlated. Ordinary concatenated characters and skip character strings can also be used together. This requires more file space for the character component table than ordinary concatenated characters alone, but achieves a more precise character component table search with less file space than three-character concatenations.
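
The three kinds of character component mentioned above can be compared with a short Python sketch. The function names are hypothetical, and the skip character string is taken to be the pair formed by a character and the character two positions later, which is one reading of "a pair of characters separated by one skipped character".

```python
def ngrams(text: str, n: int = 2) -> list:
    """Concatenations of n adjacent characters."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def skip_bigrams(text: str) -> list:
    """Pairs made of a character and the character two positions later."""
    return [text[i] + text[i + 2] for i in range(len(text) - 2)]


word = "search"
adjacent = ngrams(word)        # ordinary two-character components
skipped = skip_bigrams(word)   # skip character strings
triples = ngrams(word, 3)      # three-character components
```

A document passes a combined adjacent-plus-skip test only if it contains both the adjacent pairs and the skip pairs of the search string, which approximates a three-character test while each table entry is still keyed by just two characters.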

【００２５】以上、本実施例によれば、構造ビット列と
指定文書構造ビット列とのビットＡＮＤ処理だけで、文
書構造を指定した条件判定を高速に行えるという利点が
ある。また、構造指定検索を行うために、従来の文字成
分表と別に構造ビット列を格納することで文字成分表の
容量を最小限にし、構造を指定しない通常の検索処理に
ついても従来の高速性をそのまま維持して、構造指定検
索機能を付加することが可能となる。As described above, this embodiment has the advantage that a condition specifying document structures can be evaluated at high speed with nothing more than a bitwise AND of the structure bit string and the specified document structure bit string. Moreover, because the structure bit strings used for structure-specified search are stored separately from the conventional character component table, the size of the character component table is kept to a minimum, and the structure-specified search function can be added while ordinary searches that specify no structure retain their conventional speed.

【００２６】[0026]

【発明の効果】本発明によれば、構造ビット列を格納し
ておくことにより、ユーザの指定する検索対象文書構造
に検索文字列を含む文書だけを、簡単な処理で検索する
ことができる。特に、ユーザの指定する検索対象文書構
造が複数ある場合、格納された構造ビット列と指定文書
構造ビット列の対応する各ビット位置のビットＡＮＤ処
理だけで条件判定ができるので、高速な検索処理が行え
るという利点がある。また、構造指定検索を行うため
に、従来構造毎に文字成分表を持たなければならなかっ
たが、文書全体の文字成分表を用いて検索対象文書を絞
り、次に構造ビット列にて文書構造まで踏み込んだ検索
を行うことで、文字成分表を単一にしてファイルおよび
メモリ容量を節約できるという利点がある。According to the present invention, storing structure bit strings makes it possible to retrieve, by simple processing, only those documents that contain the search string in the document structure the user specifies as the search target. In particular, when the user specifies several document structures as search targets, the condition can be evaluated with nothing more than a bitwise AND of the corresponding bit positions of the stored structure bit string and the specified document structure bit string, so search processing is fast. Furthermore, whereas structure-specified search conventionally required a separate character component table for each structure, narrowing down the candidate documents with a character component table covering the whole document and then using the structure bit strings to search down to the level of document structure allows a single character component table to be used, which saves file and memory capacity.

[Brief description of drawings]

【図１】第一の実施例の構成を示す図である。FIG. 1 is a diagram showing a configuration of a first embodiment.

【図２】構造指定検索方法の概要を示す図である。FIG. 2 is a diagram showing an outline of a structure designation search method.

【図３】文字成分表のテーブル構成を示す図である。FIG. 3 is a diagram showing a table configuration of a character component table.

【図４】文書識別子情報格納ファイルの概要を示す図で
ある。FIG. 4 is a diagram showing an outline of a document identifier information storage file.

【図５】構造ビット列格納ファイルの概要を示す図であ
る。FIG. 5 is a diagram showing an outline of a structural bit string storage file.

【図６】文書の登録処理を示すＰＡＤ図である。FIG. 6 is a PAD diagram showing a document registration process.

【図７】文書構造識別タグ対応表の一例を示す図であ
る。FIG. 7 is a diagram showing an example of a document structure identification tag correspondence table.

【図８】文書構造別テキストデータ切り出しの例を示す
図である。FIG. 8 is a diagram showing an example of clipping text data by document structure.

【図９】検索処理を示すＰＡＤ図である。FIG. 9 is a PAD diagram showing a search process.

[Explanation of symbols]

１０１，１０２，・・・，１１０端末２００ネットワーク１０００文書サーバ１０１０ＬＡＮアダプタ１０２０ＣＰＵ１０３０，１０５０，１３００メモリ１１００文字テーブル１１１０ファイルポインタテーブル１２００文書構造識別タグ対応表１３１０文字成分表作成プログラム１３２０構造認識プログラム１３３０構造ビット列格納プログラム１３４０検索条件入力プログラム１３５０文字成分表検索プログラム１３６０構造ビット列ＡＮＤプログラム１４０１文書識別子情報ファイル１１４０２文書識別子情報ファイル２１４０３文書識別子情報ファイルｎ１４１１構造ビット列格納ファイル１１４１２構造ビット列格納ファイル２１４１３構造ビット列格納ファイルｎ１４２０テキストデータ
101, 102, ..., 110 Terminal
200 Network
1000 Document server
1010 LAN adapter
1020 CPU
1030, 1050, 1300 Memory
1100 Character table
1110 File pointer table
1200 Document structure identification tag correspondence table
1310 Character component table creation program
1320 Structure recognition program
1330 Structure bit string storage program
1340 Search condition input program
1350 Character component table search program
1360 Structure bit string AND program
1401 Document identifier information file 1
1402 Document identifier information file 2
1403 Document identifier information file n
1411 Structure bit string storage file 1
1412 Structure bit string storage file 2
1413 Structure bit string storage file n
1420 Text data

フロントページの続き (72)発明者 浅川悟志 神奈川県横浜市戸塚区戸塚町5030番地 株式会社日立製作所ソフトウェア開発本部内 Continuation of the front page: (72) Inventor Satoshi Asakawa, Software Development Division, Hitachi, Ltd., 5030 Totsuka-cho, Totsuka-ku, Yokohama-shi, Kanagawa Prefecture

Claims

[Claims]

1. A structured document search method for a document search system that stores documents having a document structure and in which a user specifies the name of a document structure to be searched and a search character string in order to retrieve matching documents, the method comprising: a step of creating a character component table describing the occurrence of characters in the text data; a step of recognizing, for each document to be registered, the document structure according to predetermined document structure names and dividing the text data by structure; a step of storing, for each document to be registered, a structure bit string describing, for each character, the document structures in which it appears, by setting a specific bit value at the specific bit position corresponding to each document structure in which the character appears; a step of receiving the name of the document structure to be searched and the search character string from the user; a step of searching the character component table, for the search character string given by the user, for documents in which all of the character components forming the search character string are present; and a step of reading the structure bit string corresponding to each character of the search character string and extracting the documents in which the bit position of the document structure specified by the user has the specific bit value, whereby documents in which the search character string is present in the document structure specified by the user are retrieved.

2. The structured document search method according to claim 1, further comprising a correspondence table composed of records that associate each document structure name with a bit position of the structure bit string, wherein the correspondence between document structure names and bit positions of the structure bit string is determined on the basis of the correspondence table.

3. The structured document search method according to claim 2, wherein the correspondence table comprises each document structure name, the bit position of the structure bit string, and a structure identification tag, which is a special character string indicating the document structure name, and wherein the structure identification tag is inserted at the corresponding document structure in the text data and the text data with the inserted structure identification tags is stored.

4. The structured document search method according to claim 1, wherein, on the basis of the search target document structure name input by the user, a specified document structure bit string is created in which the bit position corresponding to that document structure name has the specific bit value; an AND operation is performed between the bit values at corresponding bit positions of the structure bit string read for each character of the search character string and of the specified document structure bit string; and an AND or OR condition among a plurality of document structure names specified as the search condition is evaluated on the basis of the result of the operation.

5. The structured document search method according to claim 1, wherein a document identifier information file storing the document identifier information of the character component table and a structure bit string storage file storing the structure bit strings are created separately, and pointer information to the structure bit string storage file is stored in each record of the document identifier information file.

6. The structured document search method according to claim 4, wherein, when the user inputs a document structure name to be searched, an AND operation is performed between the bit values at corresponding bit positions of the structure bit string read for each character of the search character string and of the specified document structure bit string, and an AND or OR condition among a plurality of document structure names specified as the search condition is evaluated on the basis of the result of the operation; and when no document structure name to be searched is input by the user, only the character component table is referenced and the structure bit strings are not read.