JP2001325293A

JP2001325293A - Method and device for retrieving whole sentences and storage medium with stored whole-sentence retrieval program

Info

Publication number: JP2001325293A
Application number: JP2000142121A
Authority: JP
Inventors: Junji Tomita; 準二富田; Genichiro Kikui; 玄一郎菊井; Yoshihiko Hayashi; 林　　良彦
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2000-05-15
Filing date: 2000-05-15
Publication date: 2001-11-22
Anticipated expiration: 2020-05-15
Also published as: JP3578045B2

Abstract

PROBLEM TO BE SOLVED: To provide a full-text search method and device, and a storage medium storing a full-text search program, with which, when the relevance of each document to be searched to a search expression is calculated and the documents are returned as the search result in descending order of relevance, structure information can be used in the relevance calculation and attribute-based filters can be realized through simple operations: a description in a format file and a designation in the search expression. SOLUTION: Indexes are created for individual parts of the document structure. Which indexes to create is described in a format file or within the structured document itself. A plurality of index files are generated based on that description, the search expression designates which one or more of the generated indexes to use, and the document structure information is thereby used in calculating relevance.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a full-text search method and apparatus and to a storage medium storing a full-text search program, and in particular to a full-text search method and apparatus, and a storage medium storing a full-text search program, capable of appropriately calculating a value called "relevance" that indicates how well each structured document to be searched matches a search expression.

[0002]

[Prior Art] The configuration of a conventional full-text search apparatus is described first.

[0003] FIG. 7 shows the configuration of a conventional full-text search apparatus. The apparatus shown in the figure comprises a document index unit 10 and a search execution unit 20, whose functions are as follows.

[0004] The document index unit 10 creates an index as follows.

[0005] A set N of documents (plain text) to be searched is input from a search target document database 11.

[0006] The words i used in each document j ∈ N are extracted, and the importance w_ij of each word i in document j is calculated.

[0007] An inverted index 12 is created and output. Here, an inverted index is a table with the following key and value.

[0008] Key: word i
Value: the set of pairs of the ID of a document j in which word i appears and the importance w_ij of word i in document j

FIG. 8 shows an example of the inverted index 12. In FIG. 8, for example, the word 'language' appears in documents 1, 3, and 6, and its importance in those documents is 0.3, 0.3, and 0.8, respectively.
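As a rough illustration of the table described above, the inverted index can be pictured as a nested mapping. The sketch below is only an assumption about representation (a Python dictionary); the name `postings` is hypothetical, and only the 'language' row is taken from the text.

```python
# Inverted index as a mapping: word -> {document ID: importance w_ij}.
# Only the 'language' row is stated in the text; the rest of FIG. 8 is not reproduced here.
inverted_index = {
    "language": {1: 0.3, 3: 0.3, 6: 0.8},
}


def postings(index, word):
    """Return {doc_id: importance} for a word, or an empty dict if the word is not indexed."""
    return index.get(word, {})


print(sorted(postings(inverted_index, "language")))  # [1, 3, 6]
```

Returning an empty mapping for an unknown word matches the convention, stated later in the text, that an unregistered importance is treated as 0.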

[0009] The search execution unit 20 performs a search as follows.

[0010] A search expression q, written as a single word or as words combined by Boolean operators, is input.

[0011] From the inverted index 12 created by the document index unit 10, the set of documents in which each word i contained in the search expression q appears, and the importance w_ij of word i in each document j, are obtained.

[0012] Using the Boolean operators contained in the search expression q and the word importances w_ij, the relevance of each document j to the search expression q is calculated.

[0013] The documents are arranged in descending order of relevance to form the search result.

[0014] As an example, the following shows how relevance is calculated when the document index unit 10 has created the inverted index 12 of FIG. 8 from the documents to be searched and the following search expression is given.

[0015] (language and processing) or knowledge

How "and" and "or" are processed depends on the implementation; here they are defined as follows.

[0016] or: the evaluation value is the sum of the left and right evaluation values.

[0017] and: the evaluation value is the smaller of the left and right evaluation values.

[0018] For a word, its importance is the evaluation value, and the evaluation value obtained for the whole search expression is the relevance. If the importance w_ij is not registered in the inverted index, the importance of word i in document j is 0.

[0019] The relevance of document 1 is calculated to be 0.4, as follows.

[0020] (language 0.3 and processing 0.2) or knowledge 0.2 → (0.2) or 0.2 → 0.4

The relevance of document 3 is calculated to be 0, as follows.

(language 0.3 and processing 0) or knowledge 0 → (0) or 0 → 0

The documents are arranged in descending order of the relevance calculated in this way, and the top k documents (k is a constant) are returned as the search result.
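The worked example above can be traced with a small evaluator. This is a minimal sketch, assuming the search expression is represented as nested tuples; the names `evaluate` and `rank` are hypothetical. The importances for documents 1 and 3 are those used in the example, and document 6 is left out because its values for 'processing' and 'knowledge' are not given in the text.

```python
# Importances from the worked example; a word missing from a document counts as 0.
index = {
    "language": {1: 0.3, 3: 0.3},
    "processing": {1: 0.2},
    "knowledge": {1: 0.2},
}

# (language and processing) or knowledge
query = ("or", ("and", "language", "processing"), "knowledge")


def evaluate(expr, doc_id, index):
    """Evaluate a nested (op, left, right) expression for one document."""
    if isinstance(expr, str):  # a word: its importance is the evaluation value
        return index.get(expr, {}).get(doc_id, 0.0)
    op, left, right = expr
    a = evaluate(left, doc_id, index)
    b = evaluate(right, doc_id, index)
    if op == "and":  # smaller of the two evaluation values
        return min(a, b)
    if op == "or":  # sum of the two evaluation values
        return a + b
    raise ValueError(f"unknown operator: {op}")


def rank(expr, doc_ids, index, k):
    """Return the top-k (doc_id, relevance) pairs in descending order of relevance."""
    scored = [(doc_id, evaluate(expr, doc_id, index)) for doc_id in doc_ids]
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:k]


for doc_id, relevance in rank(query, [1, 3], index, k=2):
    print(doc_id, relevance)
```

Ties are broken by document ID here only to make the ordering deterministic; the text does not specify a tie-breaking rule.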

[0021]

[Problems to Be Solved by the Invention] However, when the conventional full-text search apparatus described above is used to search structured documents, the following problems arise.

[0022] (1) Relevance cannot be calculated in a way that reflects the structure information of a document.

[0023] In recent years it has become increasingly common to describe structure within a document using tags, as in XML. FIG. 9 shows an example of a document with structure. In the figure, the structure is described by nested tags (<TEXT> ... </TEXT> and so on).

[0024] A conventional full-text search apparatus assumes that its input is plain text with no structure, so even when a document carries structure information, that information cannot be properly incorporated into the relevance calculation. For example, it cannot answer a search request that takes the title structure into account, such as "among documents that contain 'artificial intelligence' anywhere, raise the relevance of those that contain 'learning' in the title."

[0025] (2) Filters that use attributes attached to a document cannot be realized.

[0026] A filter here means a search request such as the following.
(a) Among documents containing 'UNIX (registered trademark)', retrieve only those written in Japanese.
(b) Among documents containing 'knowledge', raise the relevance (rank higher) of those whose author is 'Suzuki'.

[0027] The first request is a filter on the attribute "language" and the second is a filter on the attribute "author". In particular, a filter that narrows down the result, as in the first request, is here called an "and-type filter", and a filter that adjusts the relevance of results to raise the ranking of particular documents, as in the second request, is called an "or-type filter".

[0028] These filters are difficult to realize with the inverted index used in a conventional full-text search apparatus. When a filter is needed, therefore, either an entirely separate index must be prepared independently of the full-text search apparatus, or the full text of each document must be retained and these attributes scanned one by one at search time.

[0029] The prior art also includes a method, aimed at structured documents, of describing a partial structure directly in the search expression. This method, however, is not used in full-text search apparatuses; it is used for purposes such as extracting, from a single target structured document, the parts that satisfy the conditions written in the search expression. For example, if the structured document of FIG. 9 is searched with the search expression

//DOC/AUTHOR (extract the part enclosed by <AUTHOR> nested in <DOC>)

the result "Suzuki Tanaka" is obtained. Such a method therefore cannot be used to calculate the relevance between a search expression and each structured document.
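To make the contrast concrete, the following sketch shows what such path-based extraction returns: fragments of one document, with no relevance score attached. It uses Python's standard `xml.etree.ElementTree`; the sample document is a hypothetical stand-in for FIG. 9 (which is not reproduced here), the element name AUTHOR and the use of two separate elements are assumptions, and `extract` is an illustrative name.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the FIG. 9 document; only the author names come from the text.
DOCUMENT = "<DOC><TITLE>sample</TITLE><AUTHOR>Suzuki</AUTHOR><AUTHOR>Tanaka</AUTHOR></DOC>"


def extract(root, child_tag):
    """Return the text of every direct child of the root element with the given tag."""
    return [node.text for node in root.findall(child_tag)]


root = ET.fromstring(DOCUMENT)
authors = extract(root, "AUTHOR")
print(" ".join(authors))  # Suzuki Tanaka
```

The result is a list of extracted strings from a single document, which is why this style of query does not by itself rank a collection of documents.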

[0030] The present invention has been made in view of the above points. Its object is to provide a full-text search method and apparatus, and a storage medium storing a full-text search program, that, when performing a full-text search in which the relevance between a search expression and each document to be searched is appropriately calculated and the documents are returned in descending order of relevance, can incorporate structure information into the relevance calculation and can realize attribute-based filters through the simple operations of a description in a format file and a designation in the search expression.

[0031]

[Means for Solving the Problems] FIG. 1 is a diagram for explaining the principle of the present invention.

[0032] The present invention (claim 1) is a full-text search method that calculates a "relevance" indicating how well each structured document to be searched matches a search expression consisting of a word or of words combined by Boolean operators, and returns the documents in descending order of relevance as the search result, wherein: which parts of each document's structure indexes are to be created for is described in the structured document or in a separate format file (step 1); a plurality of index files are created based on the description in the format file (step 2); and the search expression designates which one or more of the indexes in the created index files to use (step 3), so that the structure information of the documents is used in calculating relevance.

[0033] In the present invention (claim 2), when the partial structures of the documents to be indexed are described in the structured document or in a separate format file, and the structure of the target documents is nested, each document is parsed to create a tree structure in which the elements representing each structure are nodes and the element representing the whole document is the root node, and which part of the document structure is to be indexed is designated by writing the path from the root node in the format file.

[0034] In the present invention (claim 3), when the index files are created, either or both of an inverted index, whose key is a word and whose value is the set of pairs of the ID of a document in which the word appears and the importance of the word, and a forward index, whose key is a document ID and whose value is the set of pairs of a word appearing in the document and the importance of that word, are created based on the description in the format file, and an index ID is assigned to each index.

[0035] In the present invention (claim 4), which of the plurality of created indexes to use is designated in the search expression by the index ID assigned to each index.

[0036] In the present invention (claim 5), when the index files are used to calculate the relevance between each document to be searched and a search expression written with words, Boolean operator combinations of words, and filter expressions, a search using the inverted index is performed, followed by a search using the forward index.

[0037] In the present invention (claim 6), the search using the forward index uses an "and-type filter", which takes, from a predetermined number of top-ranked documents in the obtained search result, only the documents that match the condition designated by the search expression and returns them as the search result, and an "or-type filter", which increases the relevance of documents that match the condition designated by the search expression and returns the documents re-sorted in descending order of the adjusted relevance.

[0038] FIG. 2 is a diagram showing the principle configuration of the present invention.

[0039] The present invention (claim 7) is a full-text search apparatus that calculates a "relevance" indicating how well each structured document to be searched matches a search expression consisting of a word or of words combined by Boolean operators, and returns the documents in descending order of relevance as the search result, comprising: format file creation means 300 for describing, in the structured document or in a separate format file, which parts of each document's structure indexes are to be created for; document index creation means 100 for creating a plurality of index files 150 based on the description in the format file 120; and search execution means 200 for designating, by the search expression, which one or more of the indexes in the created index files 150 to use, and for using the structure information of the documents in calculating relevance.

[0040] In the present invention (claim 8), the format file creation means 300 includes: means for parsing each document, when the partial structures of the documents to be indexed are described in the structured document or in the separate format file 120 and the structure of the target documents is nested, to create a tree structure in which the elements representing each structure are nodes and the element representing the whole document is the root node; and means for designating which part of the document structure is to be indexed by writing the path from the root node in the format file 120.

[0041] In the present invention (claim 9), the document index creation means 100 includes means for creating, based on the description in the format file 120, either or both of an inverted index, whose key is a word and whose value is the set of pairs of the ID of a document in which the word appears and the importance of the word, and a forward index, whose key is a document ID and whose value is the set of pairs of a word appearing in the document and the importance of that word, and for assigning an index ID to each index.

[0042] In the present invention (claim 10), the search execution means 200 includes means for designating in the search expression, by the index ID assigned to each index, which of the plurality of created indexes to use.

[0043] In the present invention (claim 11), the search execution means 200 has relevance calculation means for using the index files 150 to calculate the relevance between each document to be searched and a search expression written with words, Boolean operator combinations of words, and filter expressions, and the relevance calculation means includes normal search means for performing a search using the inverted index and filter search means for performing a search using the forward index.

[0044] In the present invention (claim 12), the filter search means includes means for using an "and-type filter", which takes, from a predetermined number of top-ranked documents in the obtained search result, only the documents that match the condition designated by the search expression and returns them as the search result, and an "or-type filter", which increases the relevance of documents that match the condition designated by the search expression and returns the documents re-sorted in descending order of the adjusted relevance.

[0045] The present invention (claim 13) is a storage medium storing a full-text search program that calculates a "relevance" indicating how well each structured document to be searched matches a search expression consisting of a word or of words combined by Boolean operators, and returns the documents in descending order of relevance as the search result, the program comprising: a format file creation process of describing, in the structured document or in a separate format file, which parts of each document's structure indexes are to be created for; a document index creation process of creating a plurality of index files based on the description in the format file; and a search execution process of designating, by the search expression, which one or more of the indexes in the created index files to use, and of using the structure information of the documents in calculating relevance.

[0046] In the present invention (claim 14), the format file creation process includes: a process of parsing each document, when the partial structures of the documents to be indexed are described in the structured document or in a separate format file and the structure of the target documents is nested, to create a tree structure in which the elements representing each structure are nodes and the element representing the whole document is the root node; and a process of designating which part of the document structure is to be indexed by writing the path from the root node in the format file.

[0047] In the present invention (claim 15), the document index creation process includes a process of creating, based on the description in the format file, either or both of an inverted index, whose key is a word and whose value is the set of pairs of the ID of a document in which the word appears and the importance of the word, and a forward index, whose key is a document ID and whose value is the set of pairs of a word appearing in the document and the importance of that word, and of assigning an index ID to each index.

[0048] In the present invention (claim 16), the search execution process includes a process of designating in the search expression, by the index ID assigned to each index, which of the plurality of created indexes to use.

[0049] In the present invention (claim 17), the search execution process has a relevance calculation process of using the index files to calculate the relevance between each document to be searched and a search expression written with words, Boolean operator combinations of words, and filter expressions, and the relevance calculation process includes a normal search process of performing a search using the inverted index and a filter search process of performing a search using the forward index.

[0050] In the present invention (claim 18), the filter search process includes a process of using an "and-type filter", which takes, from a predetermined number of top-ranked documents in the obtained search result, only the documents that match the condition designated by the search expression and returns them as the search result, and an "or-type filter", which increases the relevance of documents that match the condition designated by the search expression and returns the documents re-sorted in descending order of the adjusted relevance. As described above, in a full-text search method that appropriately calculates the relevance between a search expression and each document to be searched and returns the documents in descending order of relevance, the present invention makes it possible to incorporate structure information into the relevance calculation through the simple operations of a description in a format file and a designation in the search expression. It also makes it possible to realize attribute-based filters.

[0051]

[Embodiments of the Invention] FIG. 3 shows the configuration of the full-text search apparatus of the present invention. The apparatus shown in the figure has a document index unit 100 and a search execution unit 200. The document index unit 100 receives data from a search target document database 110 and a format file 120 and outputs an index file group 150; the search execution unit 200 performs searches using the files of the index file group 150.

[0052] The index file group 150 contains inverted indexes 151 and forward indexes 152.

[0053] First, the operation of the document index unit 100 is described.

[0054] A set N of structured documents to be searched is input from the search target document database 110. Each document is assumed to have a structure like that shown in FIG. 9 described within it.

[0055] The structure of each document j ∈ N is analyzed and a structure tree is created. FIG. 4 shows an example of a structure tree. When structure is described by nested tags as in FIG. 9, analyzing it yields a tree structure like that of FIG. 4. A tree representing structure in this way is here called a structure tree. Existing techniques are used to create the structure tree from a structured document.
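Since the text leaves tree construction to existing techniques, one way to picture it is with a standard XML parser. The sketch below uses Python's `xml.etree.ElementTree` and lists the root-to-node tag path of every element. The sample document is hypothetical (FIG. 9 is not reproduced here, and only the tag names DOC, SECTION, and SPEC appear in the text), and `structure_paths` is an illustrative name.

```python
import xml.etree.ElementTree as ET

# Hypothetical structured document; TITLE and the text content are assumptions.
SAMPLE = """<DOC>
  <TITLE>Learning</TITLE>
  <SECTION><SPEC>language processing</SPEC></SECTION>
</DOC>"""


def structure_paths(root):
    """Return the root-to-node tag path of every element, in document order."""
    paths = []

    def walk(node, prefix):
        path = prefix + [node.tag]
        paths.append("/" + "/".join(path))
        for child in node:  # iterating an Element yields its direct children
            walk(child, path)

    walk(root, [])
    return paths


for path in structure_paths(ET.fromstring(SAMPLE)):
    print(path)
```

Each printed path corresponds to one node of the structure tree, with the element that represents the whole document as the root.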

[0056] Which parts of the structure tree to index is determined based on the index target specification written in the format file 120. The description format of the format file 120 is not particularly limited; here it is specified as shown in FIG. 5.

[0057] The index targets are specified in the (<FIELD_DEFINITION> ... </FIELD_DEFINITION>) part of FIG. 5. A path of tags from the root of the structure tree (target_tag) specifies which part of the document structure is to be indexed. In target_tag, '//' denotes a descendant and '/' denotes a direct child.
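The '//' and '/' semantics can be sketched as a small path matcher. This is one possible reading under stated assumptions: the pattern is applied starting from the document root, a node is described by the list of tags from the root's child down to the node itself (root excluded), and the last step of the pattern must match the node itself. The function names are hypothetical, and "//SECTION/SPEC" is used as the sample pattern.

```python
import re


def parse_target_tag(pattern):
    """Split a pattern such as '//SECTION/SPEC' into (axis, tag) steps."""
    steps = re.findall(r"(//|/)([^/]+)", pattern)
    if "".join(axis + tag for axis, tag in steps) != pattern:
        raise ValueError(f"malformed target_tag: {pattern!r}")
    return steps


def matches(pattern, path):
    """Check whether the node reached by `path` (tags below the root) is selected."""
    steps = parse_target_tag(pattern)

    def rec(step_index, pos):
        if step_index == len(steps):
            return pos == len(path)  # the final step must land on the node itself
        axis, tag = steps[step_index]
        if axis == "/":  # direct child of the current context
            return pos < len(path) and path[pos] == tag and rec(step_index + 1, pos + 1)
        # '//': the tag may occur at any depth below the current context
        return any(
            path[j] == tag and rec(step_index + 1, j + 1)
            for j in range(pos, len(path))
        )

    return rec(0, 0)


print(matches("//SECTION/SPEC", ["CHAPTER", "SECTION", "SPEC"]))  # True
print(matches("//SECTION/SPEC", ["SECTION", "PARA", "SPEC"]))     # False
```

A node selected this way stands for its whole subtree, so all text beneath it would be passed to the index named by the corresponding index_id.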

[0058] For example, the first line of <FIELD_DEFINITION> in FIG. 5,

target_tag="//SECTION/SPEC"

indicates that the substructure to be indexed is a 'SPEC' that is a direct child of a 'SECTION' that is a descendant of the document root (DOC). In other words, it designates the subtree under 'SPEC' in FIG. 4, meaning that the part enclosed by <SPEC> ... </SPEC> nested between <SECTION> and </SECTION> in the original document is indexed. In addition, index_id="spec" specifies which of the index files described below the words are registered in.

[0059] The index file group 150 is created based on the index method specification written in the format file 120. Each index file in the index file group 150 is one of the following.

[0060] - Inverted index (inverted index file 151): a table with the following key and value, in the same format as the inverted index of the prior art.
  Key: word i
  Value: the set of pairs of the ID of a document j in which word i appears and the importance w_ij of word i in document j
- Forward index (forward index file 152): a table with the following key and value.
  Key: the ID of document j
  Value: the set of pairs of a word i appearing in document j and the importance w_ij of word i in document j

FIG. 6 shows an example of a forward index. In the figure, document 3 contains the words 'language' and 'learning', with importance 0.3 and 0.6, respectively.
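The two table shapes are transposes of each other, which the following sketch makes explicit. It assumes the per-document word importances have already been computed (the text does not say how), and `build_indexes` is a hypothetical name. The sample row for document 3 is the one described for FIG. 6.

```python
from collections import defaultdict


def build_indexes(doc_weights):
    """Build (inverted, forward) indexes from {doc_id: {word: importance}}."""
    inverted = defaultdict(dict)  # word -> {doc_id: importance}
    forward = {}                  # doc_id -> {word: importance}
    for doc_id, weights in doc_weights.items():
        forward[doc_id] = dict(weights)
        for word, importance in weights.items():
            inverted[word][doc_id] = importance
    return dict(inverted), forward


# Document 3 as described for FIG. 6.
inverted, forward = build_indexes({3: {"language": 0.3, "learning": 0.6}})
print(forward[3])
print(inverted["learning"])
```

The inverted form answers "which documents contain this word", while the forward form answers "which words does this document contain", which is the lookup the filter search needs.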

[0061] The index method is specified in the <INDEX_FILE_DEFINITION> ... </INDEX_FILE_DEFINITION> part of the format file 120 in FIG. 5.

[0062] For example, in FIG. 5 the description <INDEX_FILE id="spec" type="INVERTED"/> indicates that an inverted index with the index ID 'spec' is to be created, and the description <INDEX_FILE id="lang" type="SEQUENTIAL"/> indicates that a forward index with the index ID 'lang' is to be created. These index IDs correspond to the index_id values given in the index target specification.
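Because FIG. 5 is not reproduced here, the following is only a guess at what a complete format file might look like and how it could be read. The element and attribute names FIELD_DEFINITION, INDEX_FILE_DEFINITION, INDEX_FILE, target_tag, index_id, id, and type come from the text; the wrapper element FORMAT, the per-field element FIELD, and the "//LANG" path are hypothetical, as is the function `load_format`.

```python
import xml.etree.ElementTree as ET

# Hypothetical format file; FORMAT, FIELD and "//LANG" are assumptions for illustration.
FORMAT_FILE = """<FORMAT>
  <FIELD_DEFINITION>
    <FIELD target_tag="//SECTION/SPEC" index_id="spec"/>
    <FIELD target_tag="//LANG" index_id="lang"/>
  </FIELD_DEFINITION>
  <INDEX_FILE_DEFINITION>
    <INDEX_FILE id="spec" type="INVERTED"/>
    <INDEX_FILE id="lang" type="SEQUENTIAL"/>
  </INDEX_FILE_DEFINITION>
</FORMAT>"""


def load_format(text):
    """Return ([(target_tag, index_id)], {index_id: type}) from a format file."""
    root = ET.fromstring(text)
    index_types = {e.get("id"): e.get("type") for e in root.iter("INDEX_FILE")}
    fields = [(e.get("target_tag"), e.get("index_id")) for e in root.iter("FIELD")]
    unknown = [index_id for _, index_id in fields if index_id not in index_types]
    if unknown:  # every field must name an index that is actually defined
        raise ValueError(f"undefined index IDs: {unknown}")
    return fields, index_types


fields, index_types = load_format(FORMAT_FILE)
print(fields)
print(index_types)
```

The check on undefined index IDs reflects the stated correspondence between index_id in the field definitions and id in the index file definitions.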

[0063] Next, the operation of the search execution unit 200 is described.

[0064] The search execution unit 200 uses the index file group 150 to calculate the relevance between a search expression (written with words, Boolean operator combinations of words, and filter expressions) and each document to be searched. The search is carried out by performing a "normal search" followed by a "filter search". The normal search basically uses conventional full-text search techniques, but differs in the present invention in that the search expression can designate which of the multiple indexes is actually used.

【００６５】ここで、本発明の検索実行部２００におけ
る通常検索（逆引きインデックスを用いた検索）につい
て説明する。Here, a normal search (search using a reverse index) in the search execution unit 200 of the present invention will be described.

【００６６】 Using each word i included in the search expression as a key, the reverse index specified in the search expression is looked up.

【００６７】 From the values obtained from the reverse index, the set of IDs of the documents in which each word i appears and the importance wij of word i in each document j are obtained.

【００６８】 The Boolean operators included in the search expression are processed appropriately, and the "normal search relevance" of each document is calculated using the importance of the words.

【００６９】 The documents are sorted in descending order of "normal search relevance".
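The normal search steps above can be sketched as follows. The text only says that the Boolean operators are "processed appropriately", so the combination rules here are assumptions: 'and' keeps the documents present on both sides and 'or' keeps the documents present on either side, and in both cases the importance values are summed. The index contents and all names are hypothetical toy data.

```python
# Reverse indexes keyed by index ID: word -> {document ID: importance}.
# Toy values chosen for illustration only.
reverse_indexes = {
    "default": {"UNIX": {1: 0.5, 2: 0.25}},
    "ti": {"Network": {1: 0.25, 3: 0.75}},
}


def lookup(index_id, word):
    """Look up a word in the reverse index selected by index_id."""
    return reverse_indexes[index_id].get(word, {})


def combine_and(a, b):
    # Assumed rule: documents in both operands, scores summed.
    return {d: a[d] + b[d] for d in a.keys() & b.keys()}


def combine_or(a, b):
    # Assumed rule: documents in either operand, scores summed.
    return {d: a.get(d, 0.0) + b.get(d, 0.0) for d in a.keys() | b.keys()}


def rank(scores):
    """Sort (document ID, relevance) pairs by descending relevance."""
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


unix = lookup("default", "UNIX")
network_in_title = lookup("ti", "Network")
print(rank(combine_and(unix, network_in_title)))
print(rank(combine_or(unix, network_in_title)))
```

Ties are broken by document ID here only to make the ordering deterministic; the text does not say how ties are handled.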

【００７０】 Next, the filter search (a search using the forward index) in the search execution unit 200 will be described.

【００７１】 Using the document ID of each of the top k documents by "normal search relevance" as a key, the forward index specified in the search expression is looked up.

【００７２】 From the values obtained from the forward index, the set of words appearing in each document and their importance is obtained.

【００７３】 The Boolean operators included in the search expression are processed appropriately, and the "filter relevance" is calculated using the importance of the words.

【００７４】 When an 'and-type filter' is specified in the search expression and the "filter relevance" obtained in the preceding step is 0, the "overall document relevance" is set to 0. When an 'or-type filter' is specified in the search expression, the "overall document relevance" is the "filter relevance" obtained in the preceding step added to the "normal search relevance".
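The filter search steps above can be sketched as follows. Several details are assumptions rather than statements from the text: the filter relevance is taken to be the sum of the importance values of the matching words, a document that passes an and-type filter keeps its normal search relevance, and only the top k documents are returned. All names and values are hypothetical.

```python
def filter_relevance(entry, words, op):
    """entry is the {word: importance} value of one document in the forward index."""
    hits = [entry[w] for w in words if w in entry]
    if op == "and" and len(hits) < len(words):
        return 0.0  # an 'and' condition needs every word to be present
    return sum(hits)  # assumed scoring rule: sum of importance values


def apply_filter(ranked, forward_index, words, op, filter_type, k):
    """ranked: (document ID, normal search relevance) pairs, best first."""
    result = []
    for doc_id, normal in ranked[:k]:
        f = filter_relevance(forward_index.get(doc_id, {}), words, op)
        if filter_type == "and":
            overall = normal if f > 0 else 0.0
        else:  # or-type filter
            overall = normal + f
        if overall > 0:
            result.append((doc_id, overall))
    return sorted(result, key=lambda r: (-r[1], r[0]))


ranked = [(1, 0.75), (3, 0.75), (2, 0.25)]
lang = {1: {"japanese": 0.5}, 2: {"english": 0.5}, 3: {"french": 0.5}}
print(apply_filter(ranked, lang, ["japanese", "english"], "or", "and", 3))
print(apply_filter(ranked, lang, ["japanese", "english"], "or", "or", 3))
```

With the and-type filter, document 3 is removed; with the or-type filter it stays in the result but is overtaken by the documents that match the condition.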

【００７５】 Next, the normal search in the search execution unit 200 will be described.

【００７６】 In the normal search, after the filter search is finished, the documents are arranged in descending order of "overall document relevance" to give the search result. Examples using structure information specification, normal search, and filter search are shown below.

【００７７】 Example 1) Search expression (1):
(UNIX and ti=(Network)) and_filter(lang=(japanese or english))
Here, and_filter denotes an 'and-type filter', ti is the ID of the reverse index covering the title part of a document, and lang is the ID of the forward index covering the language part of a document. When search expression (1) is given, processing proceeds as follows.

【００７８】 The set of documents that contain 'UNIX' in the default reverse index and 'Network' in the reverse index ti is obtained, and the "normal search relevance" of each document is calculated from the importance of the words and the 'and' operator.

【００７９】 The documents are sorted in descending order of "normal search relevance", and the following processing is performed for each of the top k documents j.

【００８０】 Using the ID of document j as a key, the index lang is looked up, and its value is checked for 'japanese' or 'english'. If either is present, the "filter relevance" is calculated using the importance of the words 'japanese' and 'english'. For a document whose value in the index lang contains neither 'japanese' nor 'english', the "filter relevance" is 0.

【００８１】 Since an 'and-type filter' is specified, the "overall document relevance" of a document whose "filter relevance" is 0 is 0, and the "overall document relevance" of a document whose "filter relevance" is positive is its "normal search relevance".

【００８２】 Finally, the documents whose "overall document relevance" is positive are arranged in descending order of this value and output as the search result.

【００８３】 Thus, using search expression (1), "the set of documents that contain UNIX and contain Network in the title, and that are written in Japanese or English" is obtained as the search result.
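Search expression (1) can be traced end to end with a small recursive evaluator. The sketch below represents the Boolean part of the expression as nested tuples, which is a hypothetical encoding and not the patent's syntax, and it reuses the assumed scoring rule of summing importance values. The index contents are toy data.

```python
# Toy indexes; contents and names are illustrative assumptions.
REVERSE = {
    "default": {"UNIX": {1: 0.5, 2: 0.25, 3: 0.5}},
    "ti": {"Network": {1: 0.25, 3: 0.25}},
}
FORWARD = {"lang": {1: {"japanese": 0.5}, 2: {"english": 0.5}, 3: {"french": 0.5}}}


def evaluate(expr):
    """Score documents for a nested ('and' | 'or' | 'term', ...) expression."""
    if expr[0] == "term":
        _, index_id, word = expr
        return dict(REVERSE[index_id].get(word, {}))
    op, left, right = expr[0], evaluate(expr[1]), evaluate(expr[2])
    docs = left.keys() & right.keys() if op == "and" else left.keys() | right.keys()
    return {d: left.get(d, 0.0) + right.get(d, 0.0) for d in docs}


def search_expression_1(k):
    # Normal search: (UNIX and ti=(Network))
    normal = evaluate(("and", ("term", "default", "UNIX"), ("term", "ti", "Network")))
    top = sorted(normal.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
    kept = []
    for doc_id, score in top:
        # Filter search: and_filter(lang=(japanese or english))
        langs = FORWARD["lang"].get(doc_id, {})
        filter_score = langs.get("japanese", 0.0) + langs.get("english", 0.0)
        if filter_score > 0:  # and-type filter drops documents that fail it
            kept.append((doc_id, score))
    return kept


print(search_expression_1(k=10))
```

Document 2 is excluded by the normal search because its title does not contain 'Network', and document 3 is excluded by the filter because its language entry is neither 'japanese' nor 'english'.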

【００８４】 Example 2) Search expression (2):
(UNIX or ti=(Network)) or_filter=(bold=(linux and solaris))
Here, or_filter denotes an 'or-type filter', ti is the ID of the reverse index covering the title part of a document, and bold is the ID of the forward index covering emphasized text. When search expression (2) is given, processing proceeds as follows.

【００８５】 As in Example 1) above, the documents containing 'UNIX' in the default reverse index and the documents containing 'Network' in the index ti are obtained, and the "normal search relevance" of each document is calculated from the importance assigned to each word and the 'or' operator.

【００８６】 The documents are sorted in descending order of "normal search relevance", and the following processing is performed for each of the top k documents j.

【００８７】 Using the ID of document j as a key, the index bold is looked up, and its value is checked for 'linux' and 'solaris' (registered trademark). If both are present, the "filter relevance" is calculated using the importance of the words 'linux' and 'solaris' and the 'and' operator. For a document whose entry in the index bold does not contain both 'solaris' and 'linux', the "filter relevance" is 0.

【００８８】 The "overall document relevance" is the "normal search relevance" obtained above with the "filter relevance" added to it.

【００８９】 Finally, the documents are arranged in descending order of "overall document relevance" and output as the search result.

【００９０】 Thus, using search expression (2), "a set of documents that contain 'UNIX' or contain 'Network' in the title, in which documents that emphasize both 'solaris' and 'linux' are ranked near the top" is obtained as the search result.

【００９１】

[Embodiment] As an embodiment, the following describes how the problems described above are solved.

【００９２】 (a) Solving the problem that relevance cannot be calculated in a way that reflects the structure information of documents: In the present invention, by describing the structure information of documents in the format file 120, an index file 150 covering a specific partial structure of the document structure can be created. Furthermore, by assigning an ID to such an index file 150 and writing this ID in the search expression, the appropriate index file can be selected at search time for the relevance calculation. The structure contained in the documents can therefore be reflected in the relevance calculation.

【００９３】 For example, the search request "Among documents that contain 'artificial intelligence' anywhere, I want documents that contain 'learning' in the title to have particularly high relevance" can be answered as follows.

【００９４】 An index file 150 is created using the format file 120 of FIG. 5. At this time, in addition to the text part, the title part (the part enclosed by TITLE tags) is created as the index ti.

【００９５】 A search is executed using the following search expression.

【００９６】 人工知能 or ti=(学習)
(Here 人工知能 means 'artificial intelligence' and 学習 means 'learning'.)
(b) Solving the problem that a filter using attributes attached to documents cannot be realized: In the present invention, by the simple operation of describing attributes in the format file 120, a reverse index and a forward index corresponding to a specific part of the document structure can be created. Then, by specifying 'and_filter' or 'or_filter' in the search expression, a search using the forward index can be performed in addition to the search using the reverse index. A filter using attributes attached to documents can therefore be realized easily. For example, the search request "I want to retrieve only the Japanese-language documents among those containing 'UNIX'", which involves an 'and-type filter', can be answered as follows.

【００９７】 An index file 150 is created using the format file 120 of FIG. 5. At this time, the language of use assigned to each document in advance (the part enclosed by <LANG> and </LANG>) is created as the forward index lang.

【００９８】 A search is executed using the following search expression, which includes an 'and-type filter'.

【００９９】 UNIX and_filter=(lang=(japanese))
Similarly, the search request "For documents containing 'knowledge', I want higher relevance (a higher ranking) when the author is 'Suzuki'", which involves an 'or-type filter', can be answered as follows.

【０１００】 An index file 150 is created using the format file 120 of FIG. 5. At this time, the author information assigned to each document in advance (the part enclosed by <AUTHOR> tags) is created as the forward index auth.

【０１０１】 A search is executed using the following search expression, which includes an 'or-type filter'.

【０１０２】 知識 or_filter=(auth=(鈴木))
(Here 知識 means 'knowledge' and 鈴木 is the author name 'Suzuki'.)
In the above embodiment, an example was shown in which the format file 120 is created and the index file 150 is created based on its description. Alternatively, which part of the document structure an index is to be created for may be described in the structured document itself.

【０１０３】 In the above description, the document index unit 100 and the search execution unit 200 were described. Their operations can be built as a program and stored on a disk device connected to the computer used as the full-text search device, or on a portable storage medium such as a floppy (registered trademark) disk or CD-ROM, and installed when the present invention is implemented, so that the invention can be realized easily.

【０１０４】 The present invention is not limited to the above embodiment, and various modifications and applications are possible within the scope of the claims.

【０１０５】

[Effects of the Invention] As described above, according to the present invention, by the simple operation of describing in the format file which part of the structure of each document an index is to be created for, an appropriate relevance calculation using the structure information of documents and filters can be performed, and the convenience of a full-text search system can be improved.

[Brief description of the drawings]

FIG. 1 is a diagram for explaining the principle of the present invention.

FIG. 2 is a principle configuration diagram of the present invention.

FIG. 3 is a configuration diagram of the full-text search device of the present invention.

FIG. 4 is an example of a structure tree used in the present invention.

FIG. 5 is an example of a format file according to the present invention.

FIG. 6 is an example of a forward index according to the present invention.

FIG. 7 is a configuration diagram of a conventional full-text search device.

FIG. 8 is an example of a conventional reverse index.

FIG. 9 is an example of a structured document.

[Explanation of symbols]

100 document index creation means, document index unit
110 document database to be searched
120 format file
150 index file group (index file)
151 reverse lookup file
152 forward lookup file
200 search execution means, search execution unit
300 format file creation means

Continuation of front page
(72) Inventor: Yoshihiko Hayashi, c/o Nippon Telegraph and Telephone Corporation, 2-3-1 Otemachi, Chiyoda-ku, Tokyo
F-term (reference): 5B075 ND03 ND35 PP23 PQ02 PQ74 UU06

Claims


1. A full-text search method that calculates a "relevance" value indicating how well each document having a structure to be searched (structured document) matches a search expression consisting of words or of words combined with Boolean operators, and that arranges the documents in descending order of relevance as the search result, characterized in that: which part of the structure of each document an index is to be created for is described in the structured document or in an independent format file; a plurality of index files are created based on the description of the format file; which one or more of the indexes of the created index files are to be used is designated by the search expression; and the structural information of the documents is used in calculating the relevance.

2. The full-text search method according to claim 1, wherein, when the partial structure of a document for which an index is to be created is described in the structured document or in an independent format file, and the structure of the target document is nested, a tree structure is created in which the elements representing the respective structures are nodes and the element representing the entire document is the root node, and which part of the document structure an index is to be created for is designated by describing the path from the root node in the format file.

3. The full-text search method according to claim 1, wherein, when the index files are created, one or both of a reverse index, whose key is a word and whose value is a set of pairs of the ID of a document in which the word appears and the importance of the word, and a forward index, whose key is a document ID and whose value is a set of pairs of a word appearing in the document and the importance of the word, are created based on the description of the format file, and an index ID is assigned to each index.

4. The full-text search method according to claim 1, wherein which of the plurality of created indexes is to be used is designated in the search expression by means of the index ID assigned to each index.

5. The full-text search method according to claim 1, wherein, when the relevance between a search expression, described by words, Boolean combinations of words, and filter expressions, and each document to be searched is calculated using the index files, a search using the reverse index is performed and a search using the forward index is performed.

6. The full-text search method according to claim 5, wherein, when the search using the forward index is performed, use is made of an 'and-type filter', which retrieves, from a predetermined number of the most relevant documents in the acquired search result, only the documents matching the condition specified in the search expression and makes them the search result, and an 'or-type filter', which raises the relevance of the documents matching the condition specified in the search expression, sorts the documents again in descending order of the corrected relevance, and makes them the search result.

7. A full-text search device that calculates a "relevance" value indicating how well each document having a structure to be searched (structured document) matches a search expression consisting of words or of words combined with Boolean operators, and that arranges the documents in descending order of relevance as the search result, comprising: format file creating means for describing, in the structured document or in an independent format file, which part of the structure of each document an index is to be created for; document index creating means for creating a plurality of index files based on the description of the format file; and search execution means in which which one or more of the indexes of the created index files are to be used is designated by the search expression, and which uses the structural information of the documents in calculating the relevance.

8. The format file creating means includes: means for analyzing the document, when the partial structure of a document for which an index is to be created is described in the structured document or in an independent format file and the structure of the target document is nested, and creating a tree structure in which the elements representing the respective structures are nodes and the element representing the entire document is the root node; and means for designating which part of the document structure an index is to be created for by describing the path from the root node in the format file.

9. The full-text search device according to claim 7, wherein the document index creating means includes means for creating, based on the description of the format file, one or both of a reverse index, whose key is a word and whose value is a set of pairs of the ID of a document in which the word appears and the importance of the word, and a forward index, whose key is a document ID and whose value is a set of pairs of a word appearing in the document and the importance of the word, and for assigning an index ID to each of the indexes.

10. The full-text search device, wherein the search execution means includes means for designating in the search expression, by means of the index ID assigned to each index, which of the plurality of created indexes is to be used.

11. The full-text search device, wherein the search execution means includes relevance calculation means for calculating, using the index files, the relevance between a search expression, described by words, Boolean combinations of words, and filter expressions, and each document to be searched, and the relevance calculation means includes normal search means for performing a search using the reverse index and filter search means for performing a search using the forward index.

12. The full-text search device according to claim 11, wherein the filter search means includes means using an 'and-type filter', which retrieves, from a predetermined number of the most relevant documents in the acquired search result, only the documents matching the condition specified in the search expression and makes them the search result, and an 'or-type filter', which raises the relevance of the documents matching the condition specified in the search expression, sorts the documents again in descending order of the corrected relevance, and makes them the search result.

13. A storage medium storing a full-text search program that calculates a "relevance" value indicating how well each document having a structure to be searched (structured document) matches a search expression consisting of words or of words combined with Boolean operators, and that arranges the documents in descending order of relevance as the search result, the program comprising: a format file creation process of describing, in the structured document or in an independent format file, which part of the structure of each document an index is to be created for; a document index creation process of creating a plurality of index files based on the description of the format file; and a search execution process in which which one or more of the indexes of the created index files are to be used is designated by the search expression, and which uses the structural information of the documents in calculating the relevance.

14. The storage medium storing the full-text search program according to claim 13, wherein the format file creation process includes: a process of analyzing the document, when the partial structure of a document for which an index is to be created is described in the structured document or in an independent format file and the structure of the target document is nested, and creating a tree structure in which the elements representing the respective structures are nodes and the element representing the entire document is the root node; and a process of designating which part of the document structure an index is to be created for by describing the path from the root node in the format file.

15. The storage medium storing the full-text search program according to claim 13, wherein the document index creation process includes a process of creating, based on the description of the format file, one or both of a reverse index, whose key is a word and whose value is a set of pairs of the ID of a document in which the word appears and the importance of the word, and a forward index, whose key is a document ID and whose value is a set of pairs of a word appearing in the document and the importance of the word, and of assigning an index ID to each of the indexes.

16. The storage medium storing the full-text search program, wherein the search execution process includes a process of designating in the search expression, by means of the index ID assigned to each index, which of the plurality of created indexes is to be used.

17. The storage medium storing the full-text search program, wherein the search execution process includes a relevance calculation process of calculating, using the index files, the relevance between a search expression, described by words, Boolean combinations of words, and filter expressions, and each document to be searched, and the relevance calculation process includes a normal search process of performing a search using the reverse index and a filter search process of performing a search using the forward index.

18. The storage medium storing the full-text search program according to claim 17, wherein the filter search process includes a process using an 'and-type filter', which retrieves, from a predetermined number of the most relevant documents in the acquired search result, only the documents matching the condition specified in the search expression and makes them the search result, and an 'or-type filter', which raises the relevance of the documents matching the condition specified in the search expression, sorts the documents again in descending order of the corrected relevance, and makes them the search result.