JP4691117B2

JP4691117B2 - Text search device, text search method, text search program, and recording medium recording the program

Info

Publication number: JP4691117B2
Application number: JP2008011125A
Authority: JP
Inventors: 真鬼塚
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2008-01-22
Filing date: 2008-01-22
Publication date: 2011-06-01
Anticipated expiration: 2028-01-22
Also published as: JP2009175826A

Description

The present invention relates to a text search technique for retrieving the text information a user requests from a large body of text information (for example, a large number of documents), using as search conditions a plurality of words contained in the text information and a requested number of documents.

Techniques that use inverted files are known for searching text information (text documents, hereinafter simply "documents") at high speed. Inverted files come in two kinds: document-ID-ordered inverted files, which store the documents for each word in order of document ID (see, for example, Patent Document 1 and Non-Patent Document 1), and impact-ordered inverted files, which store the documents for each word in order of impact value (see, for example, Non-Patent Documents 1 to 4). An impact value is a numerical weight obtained from the word occurrence frequency (tf value) calculated in advance for each word in each text document and from the total number of words contained in that text document. A text search device that uses impact-ordered inverted files is generally known to perform better than one that uses document-ID-ordered inverted files.
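The two layouts can be illustrated with a minimal Python sketch. The source does not give the impact formula, so the sketch assumes the simplest weight consistent with the description (word occurrence frequency divided by the total number of words in the document); the function name `build_postings` and the sample documents are hypothetical.

```python
from collections import Counter, defaultdict

def build_postings(docs):
    """docs maps a document ID to its list of (already tokenized) words."""
    by_doc_id = defaultdict(list)  # word -> [(doc_id, impact)], ascending doc_id
    for doc_id in sorted(docs):
        words = docs[doc_id]
        for word, count in Counter(words).items():
            # ASSUMPTION: impact = tf / document length; the real formula is not given.
            by_doc_id[word].append((doc_id, count / len(words)))
    # Impact-ordered layout: the same postings, highest impact first.
    by_impact = {
        word: sorted(plist, key=lambda entry: (-entry[1], entry[0]))
        for word, plist in by_doc_id.items()
    }
    return dict(by_doc_id), by_impact

docs = {
    1: ["tokyo", "station", "map", "guide"],
    2: ["tokyo", "tower"],
    3: ["tokyo", "tokyo", "station"],
}
by_doc_id, by_impact = build_postings(docs)
doc_id_order = [doc_id for doc_id, _ in by_doc_id["tokyo"]]
impact_order = [doc_id for doc_id, _ in by_impact["tokyo"]]
```

For the word "tokyo", the document-ID-ordered list visits documents 1, 2, 3, while the impact-ordered list visits 3, 2, 1, so a reader of the latter sees the most heavily weighted documents first.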

FIG. 3 is an explanatory diagram schematically showing a conventional text search device 300 that uses impact-ordered inverted files built for each word. The text search device 300 includes a search processing device 320, an index storage device 330, and a word dictionary storage device 350. The search processing device 320 refers to the word dictionary stored in the word dictionary storage device 350 to extract the word group 402 contained in the search condition 401 input from the terminal device 360, searches for documents that are candidates for output (candidate documents) using the impact-ordered inverted files 331 to 334 stored in the index storage device 330, and outputs the search result to the terminal device 360. In the example shown in FIG. 3, the search processing device 320 splits the phrase 「東京の駅」 ("Tokyo station"), which is the search condition 401, and extracts 「東京」 ("Tokyo"), 「の」 (the particle "no"), and 「駅」 ("station") as the word group 402. Then, for the inverted files 331, 333, and 334 corresponding to these words, it calculates a predetermined score for each document in descending order of impact value, as indicated by arrows 403, 404, and 405. As a result, documents common to the three inverted files, such as the document with document ID "16" (d16) indicated by reference numerals 413, 414, and 415, are found and output as candidate documents.

Patent Document 1: Japanese Patent Laid-Open No. 3-108064 (FIG. 1)
Non-Patent Document 1: Justin Zobel and Alistair Moffat, "Inverted Files for Text Search Engines", ACM Computing Surveys, Vol. 38, No. 2, Article 6, July 2006
Non-Patent Document 2: Vo Ngoc Anh, Owen de Kretser, and Alistair Moffat, "Impact Transformation: Effective and Efficient Web Retrieval", Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 35-42, 2001
Non-Patent Document 3: Vo Ngoc Anh and Alistair Moffat, "Vector-Space Ranking with Effective Early Termination", Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 3-10, 2002
Non-Patent Document 4: Vo Ngoc Anh and Alistair Moffat, "Pruned Query Evaluation Using Pre-Computed Impacts", Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 372-379, 2006

However, in a conventional text search device that uses impact-ordered inverted files built for each word, some of the words produced by splitting the search condition, namely particles such as 「の」 (no) and 「が」 (ga), appear throughout the text documents to be searched. Consequently, as shown in FIG. 3 for example, finding the text documents that contain both 「東京」 and 「駅」 within the inverted file 333 for 「の」 is highly likely to require a full scan of the inverted file 333, which is a performance problem.

In addition, when documents on the Web are the search target, for example, there is a demand for keyword search that uses impact-ordered inverted files built for each word in the documents while also taking into account a document-specific score that does not depend on words, such as PageRank (registered trademark). However, it has not been known how such a word-independent, document-specific score should be handled in order to improve search performance when text documents are searched with impact-ordered inverted files.

An object of the present invention is therefore to solve the above problems and to provide a text search technique that can improve the processing performance of search when impact-ordered inverted files are used.

To solve the above problems, the text search device according to claim 1 searches text documents using, as indexes for document search, a plurality of inverted files including: impact-ordered inverted files built for each word using impact values, where an impact value is a numerical weight obtained from the word occurrence frequency calculated in advance for each word in each text document having a document ID and from the total number of words contained in that text document; and document-ID-ordered inverted files built for each word contained in the text documents. The device comprises: search condition input means for inputting, as specified search conditions, a search expression containing a plurality of words and a number of documents; word extraction means for extracting each word contained in the input search expression based on a word dictionary; word group classification means for classifying the extracted words, based on the document frequency calculated in advance for each word across all text documents to be searched, into a group of words whose document frequency is lower than a predetermined threshold and a group of words for which it is not; candidate document search means for searching, as a first-stage search, for candidate documents for output using the impact-ordered inverted files for the words whose document frequency is lower than the predetermined threshold, and, as a second-stage search, for the words whose document frequency is equal to or higher than the predetermined threshold, looking up the document IDs of the candidate documents found in the first-stage search in the document-ID-ordered inverted files, thereby searching for candidate documents for output; and candidate document output means for finalizing and outputting the candidate documents once the input search conditions are satisfied.
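The word group classification described above amounts to a threshold test on document frequency. The following Python sketch shows one way it could look; the function name `classify_words`, the threshold of 1000, and the document-frequency figures are hypothetical values chosen only for illustration.

```python
def classify_words(words, doc_freq, threshold):
    """Split query words into (low-frequency, high-frequency) groups.

    doc_freq maps a word to the number of documents that contain it.
    Words below the threshold are handled with impact-ordered inverted
    files; the rest are handled with document-ID-ordered inverted files.
    """
    low, high = [], []
    for word in words:
        if doc_freq.get(word, 0) < threshold:
            low.append(word)
        else:
            high.append(word)
    return low, high

# ASSUMPTION: made-up document frequencies for the example query words.
doc_freq = {"東京": 120, "の": 9800, "駅": 340}
low, high = classify_words(["東京", "の", "駅"], doc_freq, threshold=1000)
```

With these figures the particle 「の」 falls into the high-frequency group, so it would be deferred to the second-stage search.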

To solve the above problems, the text search method according to claim 7 is a text search method of a text search device that searches text documents as the search target using, as indexes for document search, a plurality of inverted files including: impact-ordered inverted files built for each word using impact values, where an impact value is a numerical weight obtained from the word occurrence frequency calculated in advance for each word in each text document having a document ID and from the total number of words contained in that text document; and document-ID-ordered inverted files built for each word contained in the text documents. The text search device includes search condition input means, word extraction means, word group classification means, candidate document search means, and candidate document output means, and the method executes: a search condition input step in which the search condition input means inputs, as specified search conditions, a search expression containing a plurality of words and a number of documents; a word extraction step in which the word extraction means extracts each word contained in the input search expression based on a word dictionary; a word group classification step in which the word group classification means classifies the extracted words, based on the document frequency calculated in advance for each word across all text documents to be searched, into a group of words whose document frequency is lower than a predetermined threshold and a group of words for which it is not; a candidate document search step in which the candidate document search means searches, as a first-stage search, for candidate documents for output using the impact-ordered inverted files for the words whose document frequency is lower than the predetermined threshold, and, as a second-stage search, for the words whose document frequency is equal to or higher than the predetermined threshold, looks up the document IDs of the candidate documents found in the first-stage search in the document-ID-ordered inverted files, thereby searching for candidate documents for output; and a candidate document output step in which the candidate document output means finalizes and outputs the candidate documents once the input search conditions are satisfied.

According to the text search device of claim 1 or the text search method of claim 7, the text search device first searches for candidate documents using the higher-performing impact-ordered inverted files for the words in the search expression whose document frequency across all documents is not high, and then narrows down the candidate documents using the document-ID-ordered inverted files for the remaining words in the search expression. In this way, the text search device chooses between the two kinds of inverted file according to the document frequency, across all documents to be searched, of each word in the search expression. Used alone, as in the prior art, document-ID-ordered inverted files generally perform worse than impact-ordered inverted files used alone. The text search device of the present invention, however, narrows down the candidate documents to some extent with the impact-ordered inverted files in the first stage and only then uses the document-ID-ordered inverted files in the second stage. In the second stage there is therefore no need to examine the documents one by one in descending order of document ID; entries other than the document IDs of the documents narrowed down in the first stage can be skipped as appropriate. As a result, a full scan of the document-ID-ordered inverted files can be avoided, and the text search device can achieve better performance than before.
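The skipping described above can be sketched in Python with a binary search over a sorted posting list. This is a simplified illustration, not the patented procedure: it assumes the posting list is held in memory as document IDs in ascending order (the text mentions descending order), and the names `second_stage` and `particle_postings` are hypothetical.

```python
from bisect import bisect_left

def second_stage(candidate_ids, doc_id_postings):
    """Keep only the first-stage candidates that also occur in a
    document-ID-ordered posting list, skipping all other entries."""
    kept = []
    pos = 0
    for doc_id in sorted(candidate_ids):
        # Jump straight to the first entry >= doc_id instead of scanning.
        pos = bisect_left(doc_id_postings, doc_id, pos)
        if pos == len(doc_id_postings):
            break  # every remaining candidate is beyond the end of the list
        if doc_id_postings[pos] == doc_id:
            kept.append(doc_id)
    return kept

# ASSUMPTION: a frequent word that occurs in most of 100 documents.
particle_postings = [d for d in range(1, 101) if d % 7 != 0]
result = second_stage({16, 42, 7}, particle_postings)
```

Only three lookups are made against a list of 86 entries, which is the effect the two-stage design relies on: the long list of a frequent word is probed only at the candidate document IDs.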

The text search device according to claim 2 is the text search device according to claim 1, wherein the candidate document search means comprises: first search means for reading the impact-ordered inverted files for the words whose document frequency is lower than the predetermined threshold, calculating for each document, in descending order of impact value, a word-dependent document score that depends on the search expression, and searching for the candidate documents based on the calculated word-dependent document scores; count determination means for determining whether the number of candidate documents found has become larger than the specified number of documents; second search means for, when the number of candidate documents found has become larger than the specified number of documents, reading the document-ID-ordered inverted files for the words whose document frequency is equal to or higher than the predetermined threshold and calculating a score for each document while searching for candidate documents that match the document IDs of the candidate documents found by the first search means; and narrowing control means for narrowing down the candidate documents by using the first search means and the second search means alternately and determining, as the ranking of the candidate documents, a ranking within the specified number of documents.

With this configuration, in the first stage the text search device searches for candidate documents using the impact-ordered inverted files until the number of candidate documents found reaches the specified number of documents. The number of documents requested by the user can therefore be retrieved at high speed. Once the specified number of documents has been exceeded, in the second stage the device calculates a score for each document while matching the candidate documents found in the document-ID-ordered inverted files against the candidate documents found in the first stage. Thus, for example, by providing in advance the maximum score each document can attain, the device can perform cut-off processing that removes a predetermined number of documents from the candidate documents that had reached the specified number. Starting from the number of candidate documents reduced by this cut-off, the text search device then executes the first stage again, searching for candidate documents with the impact-ordered inverted files until the number of candidate documents found reaches the specified number of documents, and the same process repeats thereafter. This makes it possible to output, as high-scoring candidate documents, appropriate candidate documents selected from a broader range of documents without bias within the group of text documents to be searched.

The text search device according to claim 3 is the text search device according to claim 2, wherein the first search means comprises: first processing means for reading the impact-ordered inverted files for the words whose document frequency is lower than the predetermined threshold and calculating for each document, in descending order of impact value, a word-dependent document score that depends on the search expression; and second processing means for reading, from a document-specific-score-ordered inverted file built in advance in order of the document-specific score calculated for each text document, the document-specific score corresponding to each document for which the word-dependent document score was calculated, and searching for the candidate documents using a linear sum of the document-specific score read and the calculated word-dependent document score.

With this configuration, when the text search device searches in the first stage until the number of candidate documents found reaches the specified number of documents, it searches for candidate documents using a linear sum of the word-dependent document score calculated by the first processing means from the impact-ordered inverted files and the document-specific score read from the document-specific-score-ordered inverted file. Because the device uses a linear sum of the two scores, it can, unlike the prior art, treat a document score that does not depend on words (the document-specific score) as if it were just another score that depends on the words in the search expression (a word-dependent document score). Here, the document-specific score is a score that does not depend on the words contained in the document, such as PageRank (registered trademark). For example, if the document-specific score is set so that Web documents viewed and used by more users receive higher values, then, when the text documents to be searched are Web documents, the text search device of the present invention can output as candidate documents those documents that contain the words in the specified search expression from among the Web documents viewed and used by more users. As a result, the text search device of the present invention can improve the performance of search processing compared with the prior art.
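The linear sum can be illustrated with a short Python sketch. The source states only that a linear sum is used; the coefficients `alpha` and `beta`, the function names, and the sample scores below are assumptions made for the example.

```python
def combined_score(word_score, doc_score, alpha=1.0, beta=1.0):
    """Linear sum of a word-dependent score and a document-specific score.
    ASSUMPTION: alpha and beta are illustrative weights, not from the source."""
    return alpha * word_score + beta * doc_score

def rank(word_scores, doc_scores, k, alpha=1.0, beta=1.0):
    """Return the IDs of the k best documents by combined score."""
    scored = {
        doc_id: combined_score(score, doc_scores.get(doc_id, 0.0), alpha, beta)
        for doc_id, score in word_scores.items()
    }
    # Ties are broken by document ID so the result is deterministic.
    return sorted(scored, key=lambda doc_id: (-scored[doc_id], doc_id))[:k]

word_scores = {16: 0.6, 42: 0.7, 7: 0.65}   # from impact-ordered inverted files
doc_scores = {16: 0.5, 42: 0.1, 7: 0.2}     # e.g. a link-based popularity score
top = rank(word_scores, doc_scores, k=2)
```

In this example document 42 has the best word-dependent score but the weakest document-specific score, so documents 16 and 7 overtake it once the two scores are added.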

The text search device according to claim 4 is the text search device according to claim 3, further comprising: document-specific score calculation means for calculating a document-specific score for each text document; and document-specific-score inverted file construction means for building the document-specific-score-ordered inverted file based on the calculated document-specific scores.

With this configuration, before performing search processing, the text search device can calculate a document-specific score for each text document to be searched, build a document-specific-score-ordered inverted file, and store the file it has built. The text search device can therefore execute search processing on, for example, text information held in a storage device, using the document-specific-score-ordered inverted file stored in the storage device that holds it.

The text search device according to claim 5 is the text search device according to any one of claims 1 to 4, wherein the document-ID-ordered inverted file built for each word contained in the text documents holds the maximum impact value of that word across all text documents to be searched.

With this configuration, just as the text search device performs cut-off based on impact values in search processing that uses impact-ordered inverted files, it can perform cut-off in search processing that uses document-ID-ordered inverted files based on the maximum impact value that each such file holds. Cut-off is therefore available with either kind of inverted file when document-ID-ordered inverted files are used together with impact-ordered inverted files. This makes it possible to output appropriate candidate documents selected from a broader range of documents within the group of text documents to be searched.
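One way to read this cut-off is as upper-bound pruning, sketched below in Python. The pruning rule is an assumption, not a rule stated in the source: a candidate is dropped when its accumulated score plus the stored maximum impact of every word not yet examined still cannot reach the k-th best accumulated score. Impact values are assumed to be non-negative, and the name `prune` and the sample numbers are hypothetical.

```python
import heapq

def prune(partial_scores, max_impacts, k):
    """Drop candidates that can no longer reach the top k.

    partial_scores: doc_id -> score accumulated so far.
    max_impacts: the maximum impact value stored in the document-ID-ordered
                 inverted file of each word that has not been examined yet.
    """
    if len(partial_scores) <= k:
        return dict(partial_scores)
    # Largest amount any candidate could still gain from the remaining words.
    bound = sum(max_impacts)
    # Scores only grow, so the k-th best partial score is a lower bound
    # on the final score needed to be in the top k.
    kth_best = heapq.nlargest(k, partial_scores.values())[-1]
    return {
        doc_id: score
        for doc_id, score in partial_scores.items()
        if score + bound >= kth_best
    }

partial_scores = {16: 0.9, 42: 0.8, 7: 0.2, 3: 0.5}
survivors = prune(partial_scores, max_impacts=[0.1], k=2)
```

Documents 7 and 3 cannot reach 0.8 even with the maximum possible contribution of 0.1, so they are removed without their entries in the long posting list ever being consulted.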

The text search device according to claim 6 is the text search device according to any one of claims 1 to 5, further comprising index construction means for building the impact-ordered inverted files and the document-ID-ordered inverted files, wherein the index construction means comprises: document reading means for reading the group of text documents to be searched; document feature calculation means for extracting words from each text document read, based on a word dictionary, calculating the total number of words and the word occurrence frequency of each word as features of the words contained in that text document, and calculating for each word the document frequency across all text documents to be searched; frequency determination means for comparing the document frequency of each word across all text documents to be searched and determining whether the document frequency of each word is lower than a predetermined threshold; first inverted file construction means for building, when the document frequency of a word is lower than the predetermined threshold, an impact-ordered inverted file for that word as an index for document search; and second inverted file construction means for calculating, when the document frequency of a word is equal to or higher than the predetermined threshold, the maximum impact value of that word across all text documents to be searched, and building a document-ID-ordered inverted file containing the calculated maximum impact value as an index for document search.

With this configuration, before performing search processing, the text search device calculates for each word the document frequency across all documents in the group of text documents to be searched, selects either an impact-ordered inverted file or a document-ID-ordered inverted file as the appropriate kind to build based on the calculated document frequency, and builds the selected inverted file. The text search device can therefore store the calculated document frequency of each word together with the impact-ordered and document-ID-ordered inverted files it has built. Using the document frequencies and inverted files held in the storage device that stores them, the text search device can then execute search processing on, for example, text information held in a storage device.
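The index construction described above can be sketched in Python as follows. As before, the impact formula (word occurrence frequency divided by document length) is an assumption because the source does not give it, and the function name `build_index`, the dictionary layout, and the sample documents are hypothetical.

```python
from collections import Counter, defaultdict

def build_index(docs, threshold):
    """Build one inverted file per word, choosing its kind by document frequency."""
    postings = defaultdict(list)  # word -> [(doc_id, impact)], ascending doc_id
    for doc_id in sorted(docs):
        words = docs[doc_id]
        for word, count in Counter(words).items():
            # ASSUMPTION: impact = tf / total number of words in the document.
            postings[word].append((doc_id, count / len(words)))

    impact_ordered, doc_id_ordered = {}, {}
    for word, plist in postings.items():
        document_frequency = len(plist)
        if document_frequency < threshold:
            # Rare word: store postings with the highest impact first.
            impact_ordered[word] = sorted(plist, key=lambda e: (-e[1], e[0]))
        else:
            # Frequent word: keep document-ID order and record the maximum impact.
            doc_id_ordered[word] = {
                "max_impact": max(impact for _, impact in plist),
                "postings": plist,
            }
    return impact_ordered, doc_id_ordered

docs = {
    1: ["東京", "の", "駅"],
    2: ["京都", "の", "寺"],
    3: ["東京", "の", "塔", "の", "上"],
}
impact_ordered, doc_id_ordered = build_index(docs, threshold=3)
```

With a threshold of 3, only 「の」, which occurs in all three documents, receives a document-ID-ordered inverted file; every other word receives an impact-ordered one.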

The text search program according to claim 8 is a program for causing a computer to realize the functions of the text search device according to any one of claims 1 to 6. With this configuration, a computer on which the program is installed can realize each function based on the program.

The computer-readable recording medium according to claim 9 has the text search program according to claim 8 recorded on it. With this configuration, a computer in which the recording medium is loaded can realize each function based on the program recorded on the recording medium.

According to the present invention, the text search device searches for documents using impact-ordered inverted files when a word in the search expression has a low document frequency across all text documents to be searched, and uses document-ID-ordered inverted files, which permit skipping, when the document frequency across all text documents is high. This speeds up search processing, so the processing performance of search can be improved when impact-ordered inverted files are used.

Hereinafter, the best mode for carrying out the text search device and the text search method of the present invention (hereinafter referred to as the "embodiment") will be described in detail with reference to the drawings, divided into a first embodiment and a second embodiment.

(First embodiment)
[Configuration of text search device]
FIG. 1 is a block diagram schematically showing a text search device according to the first embodiment of the present invention.
The text search device 1 searches text documents using a plurality of transposed files as indexes for document search. These include transposed files in order of impact value, constructed for each word using the impact values in each text document having a document ID, and transposed files in order of document ID, constructed for each word included in each text document. Here, the impact value is a numerical weight obtained from the word appearance frequency (tf value) calculated in advance for each word in each text document and the total number of words contained in that text document.

The text search device 1 is composed of, for example, a CPU (Central Processing Unit), RAM (Random Access Memory), ROM (Read Only Memory), an HDD (Hard Disk Drive), and an input/output interface. As shown in FIG. 1, the text search device 1 serves as a text search engine and includes an index construction device 10, a search processing device 20, an index storage device 30, a document storage device 40, and a word dictionary storage device 50.

The text search device 1 receives search conditions (a search expression and a number of documents) entered from an input device M (see FIG. 6), such as a mouse or keyboard, connected by cable to a terminal device 60 such as a personal computer used by the user. The text search device 1 can also receive the user's search conditions via a communication network such as the Internet.

The index construction device (index construction means) 10 reads the text documents to be searched from the document storage device 40, performs morphological analysis on them using the word dictionary stored in the word dictionary storage device 50 to extract words, constructs transposed files, and writes the constructed transposed files to the index storage device 30. In this embodiment, a transposed file is also referred to as an index. An index here is a lookup structure that records which words appear in which documents in order to improve data search speed.

The search processing device 20 receives from the terminal device 60 the number of documents specified by the user (the number of results to return) and a search expression (a plurality of words). It analyzes the search expression to extract a word group, reads the transposed files corresponding to the extracted words from the index storage device 30, searches for candidate documents that satisfy the search conditions, and outputs the search result (candidate documents) to the terminal device 60.

The index storage device 30 stores the transposed files (indexes) and is composed of, for example, a general hard disk.
The document storage device 40 stores the plurality of search target documents (text documents, text information) and is composed of, for example, a general hard disk. Each search target document is assigned a document ID (identification information).
The word dictionary storage device 50 stores the word dictionary and is composed of, for example, a general hard disk. The index storage device 30, the document storage device 40, and the word dictionary storage device 50 can also be implemented as one or more external storage devices.

[Overview of text search device search processing]
FIG. 2 is an explanatory diagram schematically showing an outline of the text search device according to the first embodiment of the present invention. To illustrate the outline of the search processing of the text search device 1, FIG. 2 omits the index construction device 10 and the document storage device 40 shown in FIG. 1.

The search processing device 20 extracts the word group 102 contained in the search condition 101 input from the terminal device 60, referring to the word dictionary stored in the word dictionary storage device 50. It then uses the transposed files 71, 72, and 74 in order of impact value and the transposed file 73 in order of document ID, all stored in the index storage device 30, to search for the candidate documents to be output, and outputs the search result to the terminal device 60. The text search device 1 differs from the conventional text search device 300 shown in FIG. 3 in that it uses the transposed file 73 in order of document ID together with the transposed files 71, 72, and 74 in order of impact value for search processing.

In the example shown in FIG. 2, the search processing device 20 of the text search device 1 splits the phrase "東京の駅" ("station in Tokyo"), which is the search condition 101, and extracts "東京" (Tokyo), "の" (a particle), and "駅" (station) as the word group 102. Then, according to the document appearance frequency (df value) of each word, the search processing device 20 performs the first stage: for "東京" and "駅", it reads the transposed files 71 and 74 in order of impact value, as indicated by the arrows 103 and 104, and calculates a predetermined score for each document in descending order of impact value. As a result, documents common to the two transposed files 71 and 74 are found, such as the document with document ID "16" (d16) indicated by reference numerals 105 and 106.

As the second stage, for "の", the search processing device 20 reads the transposed file 73 in order of document ID. Rather than scanning one document at a time from the document with document ID "1" (d1), it skips ahead as appropriate, as indicated by the arrow 107, and scans in ascending order of document ID, as indicated by the arrow 108, until it reaches a document ID found in the transposed files 71 and 74 in order of impact value. It thus searches for the document with document ID "16" (d16) indicated by reference numeral 109 and, when it finds it, calculates a predetermined score for that document. The first and second stages are then repeated until the designated number of candidate documents can be output. The number of transposed files (indexes) is not limited to the number illustrated (four).

Next, the detailed configuration and operation of the text search device 1 that realize the search processing by the search processing device 20 shown in FIG. 2 will be described. For convenience of explanation, the configuration and operation of the index construction device 10 (see FIG. 1) and of the search processing device 20 are described separately in detail below.

[Configuration of index construction device]
FIG. 4 is a block diagram schematically showing an example of the index construction device shown in FIG. 1.
The index construction device 10 constructs the transposed files in order of impact value and the transposed files in order of document ID. It includes a document reading means 11, a document feature amount calculation means 12, an appearance frequency determination means 13, a first transposed file construction means 14, and a second transposed file construction means 15.

The document reading means 11 reads the text document group to be searched from the document storage device 40.
Based on the word dictionary stored in the word dictionary storage device 50, the document feature amount calculation means 12 extracts words from each text document read by the document reading means 11. As feature amounts of the words contained in each text document, it calculates the total number of words and the word appearance frequency (tf value: Term Frequency) of each word. It also calculates, for each word, the document appearance frequency (df value: Document Frequency) across all text documents to be searched. The df value is the number of documents that contain the word. The calculated df values (document appearance frequencies) are stored, for example, in a predetermined area of the word dictionary storage device 50.
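The feature amounts described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes morphological analysis has already turned each document into a list of words, and the function name `document_features` and the dictionary layout are hypothetical.

```python
from collections import Counter


def document_features(docs):
    """Compute per-document word totals, tf values, and corpus-wide df values.

    docs: dict mapping document ID -> list of already-extracted words
    (morphological analysis is assumed to have been done elsewhere).
    """
    totals = {}      # document ID -> total number of words in the document
    tf = {}          # document ID -> Counter of word occurrences (tf value)
    df = Counter()   # word -> number of documents containing it (df value)
    for doc_id, words in docs.items():
        totals[doc_id] = len(words)
        tf[doc_id] = Counter(words)
        df.update(set(words))  # count each word at most once per document
    return totals, tf, df


# Hypothetical two-document corpus.
docs = {1: ["東京", "の", "駅"], 2: ["トマト", "の", "の"]}
totals, tf, df = document_features(docs)
```

Note that "の" occurs twice in document 2 but contributes only one to its df value, since df counts documents rather than occurrences.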

The appearance frequency determination means 13 compares the document appearance frequency of each word, taken across the entire text document group, with a predetermined threshold and determines for each word whether its document appearance frequency is lower than the threshold. In this embodiment, the appearance frequency determination means 13 selects an unprocessed word and determines whether the df value of the selected word is equal to or greater than the predetermined threshold.

When the document appearance frequency of a word is lower than the predetermined threshold, the first transposed file construction means 14 constructs a transposed file in order of impact value for that word as an index for document search. The first transposed file construction means 14 can construct the transposed file in order of impact value by a known method. Details of constructing transposed files in order of impact value are described, for example, in Non-Patent Documents 1 to 4. The first transposed file construction means 14 stores the constructed transposed file 31 in order of impact value in the index storage device 30. The transposed file 31 in order of impact value lists documents (document IDs) in descending order of impact value. Although FIG. 4 illustrates only one transposed file 31 in order of impact value, one is provided for each such word.

When the document appearance frequency of a word is equal to or greater than the predetermined threshold, the second transposed file construction means 15 calculates the maximum impact value of that word across the entire text document group and constructs, as an index for document search, a transposed file in order of document ID that includes the calculated maximum impact value. The second transposed file construction means 15 can construct the transposed file in order of document ID by a known method. Details of constructing transposed files in order of document ID are described, for example, in Non-Patent Document 1. The second transposed file construction means 15 stores the constructed transposed file 32 in order of document ID in the index storage device 30. The transposed file 32 in order of document ID lists documents (document IDs) in ascending order of document ID. Although FIG. 4 illustrates only one transposed file 32 in order of document ID, one is provided for each such word.

The document reading means 11, the document feature amount calculation means 12, the appearance frequency determination means 13, the first transposed file construction means 14, and the second transposed file construction means 15 are realized by the CPU loading a predetermined program stored in the storage means into the RAM and executing it.

[Operation of index construction device]
The operation of the index construction device 10 shown in FIG. 4 will be described with reference to FIG. 5 (and FIG. 4 as appropriate). FIG. 5 is a flowchart showing the operation of the index construction device shown in FIG. 4. The index construction device 10 reads the text document group to be searched using the document reading means 11 (step S1). Using the document feature amount calculation means 12, it then extracts words from each text document and calculates the total number of words in each document, the tf values, and the df values (step S2).

Next, the index construction device 10 uses the appearance frequency determination means 13 to select an unprocessed word for which no transposed file has yet been constructed (step S3), and to determine whether the df value of the selected word is equal to or greater than the predetermined threshold (step S4). If the df value of the selected word is equal to or greater than the predetermined threshold (step S4: Yes), the index construction device 10 uses the second transposed file construction means 15 to calculate the maximum impact value for the word and to construct a transposed file in order of document ID that includes the maximum impact value (step S5).

If, on the other hand, the df value of the selected word is lower than the predetermined threshold in step S4 (step S4: No), the index construction device 10 uses the first transposed file construction means 14 to construct a transposed file in order of impact value for the word (step S6). Following step S5 or step S6, the index construction device 10 uses the appearance frequency determination means 13 to determine whether all words have been processed (step S7). If an unprocessed word remains (step S7: No), the index construction device 10 returns to step S3. If all target words have been processed (step S7: Yes), the index construction device 10 ends the process of constructing the transposed files (indexes).
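The loop of steps S3 to S7 can be sketched as below. This is an illustrative sketch under stated assumptions: impact values are taken as already computed (this section does not give the formula), and the function name `build_index`, the dictionary layout, and the sample numbers are hypothetical rather than taken from the patent.

```python
def build_index(postings, df, threshold):
    """Choose one transposed-file layout per word from its df value.

    postings: word -> list of (doc_id, impact) pairs; impact values are
    assumed to be precomputed elsewhere.
    df: word -> document appearance frequency.
    """
    index = {}
    for word, entries in postings.items():          # S3 / S7: each word once
        if df[word] >= threshold:                    # S4: Yes
            # S5: document-ID order (ascending) plus the maximum impact value.
            index[word] = {
                "kind": "doc_id",
                "max_impact": max(impact for _, impact in entries),
                "postings": sorted(entries, key=lambda e: e[0]),
            }
        else:                                        # S4: No
            # S6: impact-value order (descending); ties broken by document ID.
            index[word] = {
                "kind": "impact",
                "postings": sorted(entries, key=lambda e: (-e[1], e[0])),
            }
    return index


# Hypothetical postings and df values.
postings = {
    "の": [(3, 1.0), (1, 12.0), (2, 0.5)],
    "東京": [(16, 2.0), (5, 9.0)],
}
index = build_index(postings, df={"の": 3, "東京": 2}, threshold=3)
```

Storing the maximum impact value next to the document-ID-ordered postings is what later allows the search side to bound a word's contribution without reading the whole list.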

Specifically, in FIG. 2, when the index is constructed, the transposed files 71, 72, and 74 in order of impact value are constructed for "東京" (Tokyo), "トマト" (tomato), and "駅" (station), and the transposed file 73 in order of document ID is constructed for "の". FIG. 2, however, omits the maximum impact value from the transposed file 73 in order of document ID. FIG. 14 shows an example of the structure of a transposed file in order of document ID that includes the maximum impact value. The text search device 1B shown in FIG. 14 differs in that the index storage device 30 stores a transposed file 76 in order of document ID. The transposed file 76 in order of document ID is the same as the transposed file 73 in order of document ID shown in FIG. 2, except that it holds the maximum impact value (s12.0) at its left end. The position of the column holding the maximum impact value (s12.0) is not limited to the left end.

[Configuration of search processing device]
FIG. 6 is a block diagram schematically showing an example of the search processing device shown in FIG. 1. As shown in FIG. 6, the search processing device 20 includes a search condition input means 21, a word extraction means 22, a word group classification means 23, a candidate document search means 24, and a candidate document output means 25.

The search condition input means 21 receives, as the designated search conditions, a search expression containing a plurality of words and a number of documents. In this embodiment, the search condition input means 21 receives the search conditions from the terminal device 60 connected to the text search device 1 by cable. Alternatively, the search conditions may be received from the terminal device 60 via a communication network such as the Internet. The user of the terminal device 60 operates the input device M, such as a mouse or keyboard, to enter the search expression and the number of documents into the terminal device 60 as the search conditions.

The word extraction means 22 extracts each word contained in the input search expression based on the word dictionary stored in the word dictionary storage device 50.
Based on the document appearance frequencies calculated in advance, the word group classification means 23 classifies the words extracted by the word extraction means 22 into a word group whose document appearance frequency is lower than the predetermined threshold and a word group for which it is not. In this embodiment, the word group classification means 23 classifies the extracted words into a group with high df values and a group with low df values, based on the df values (document appearance frequencies) obtained during index construction and stored, for example, in a predetermined area of the word dictionary storage device 50.
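The classification step can be sketched in a few lines. The function name `classify_words` and the df values are hypothetical, and treating a word with no recorded df value as having a df of 0 is an assumption made only for this illustration.

```python
def classify_words(words, df, threshold):
    """Split query words into a low-df group and a high-df group.

    Words missing from df are treated as df = 0 (an assumption of this sketch).
    """
    low = [w for w in words if df.get(w, 0) < threshold]
    high = [w for w in words if df.get(w, 0) >= threshold]
    return low, high


# Hypothetical df values for the words of the example query.
low, high = classify_words(
    ["東京", "の", "駅"], {"東京": 120, "の": 9800, "駅": 300}, threshold=1000
)
```

Using the same threshold as at index construction time keeps the two groups aligned with the two kinds of transposed file that were built.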

The candidate document search means 24 first searches for candidate documents to be output using the transposed files in order of impact value for the word group whose document appearance frequency is lower than the predetermined threshold, and then searches for candidate documents using the transposed files in order of document ID for the word group whose document appearance frequency is equal to or greater than the threshold. In this embodiment, the candidate document search means 24 is a module that executes the candidate document search process. As shown in FIG. 6, it includes a first search means 241, a number determination means 242, a second search means 243, and a narrowing control means 244 as submodules.

The first search means 241 reads the transposed files 31 in order of impact value for the word group whose document appearance frequency is lower than the predetermined threshold (the word group with low df values), calculates for each document, in descending order of impact value, a word-dependent document score that depends on the search expression, and searches for candidate documents based on the calculated word-dependent document scores. A predetermined ranking function based on the search expression, such as tf-idf or BM25, can be used for the word-dependent document score. When the first search means 241 searches for candidate documents, it can also cut off the search process early by calculating the k-best of the scores each document can take (the k documents with the highest scores). Details of this cut-off method are described, for example, in Non-Patent Document 4.
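A simplified sketch of this first stage follows. It makes several assumptions that are not in the source: the score of a document is taken to be the plain sum of its impact values (a stand-in for tf-idf or BM25), the early cut-off of Non-Patent Document 4 is left out, and the name `first_stage` is hypothetical.

```python
import heapq
from collections import defaultdict


def first_stage(impact_lists, k):
    """Accumulate scores over impact-ordered postings and return the k best.

    impact_lists: one list of (doc_id, impact) per low-df word, each sorted
    by descending impact value.
    """
    scores = defaultdict(float)
    # Visit postings from all lists in one global descending-impact order.
    merged = heapq.merge(*impact_lists, key=lambda entry: -entry[1])
    for doc_id, impact in merged:
        scores[doc_id] += impact  # stand-in for a real ranking function
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


# Hypothetical postings for "東京" and "駅".
tokyo = [(16, 5.0), (3, 2.0)]
station = [(16, 4.0), (7, 3.0)]
top = first_stage([tokyo, station], k=2)
```

Because the postings arrive in descending impact order, a real implementation could stop reading once the remaining impact values can no longer change the top k; this sketch reads everything for clarity.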

The number determination means 242 determines whether the number of candidate documents found by the first search means 241 has exceeded the number of documents specified in the search conditions. In this embodiment, when the number of candidate documents found by the first search means 241 exceeds the number of documents specified in the search conditions, the number determination means 242 notifies the second search means 243 of that fact.

When the number of candidate documents found by the first search means 241 exceeds the designated number of documents, the second search means 243 reads the transposed files 32 in order of document ID for the word group whose document appearance frequency is equal to or greater than the predetermined threshold (the word group with high df values). It searches for entries that match the document IDs of the candidate documents found by the first search means 241 and calculates a score for each document. When searching for candidate documents, the second search means 243 scans the transposed file 32 in order of document ID while skipping ahead as appropriate. Details of this skipping scan are described, for example, in Non-Patent Document 1. The second search means 243 also performs a cut-off during the search using the maximum impact value held in the transposed file 32 in order of document ID.
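The skipping scan and the cut-off can be illustrated as follows. Both parts are assumptions made for this sketch: binary search via Python's `bisect_left` stands in for the skip mechanism of Non-Patent Document 1, and the rule in `worth_checking` is one plausible way to use the maximum impact value, not the rule stated in the patent. All names are hypothetical.

```python
from bisect import bisect_left


def scan_with_skips(doc_ids, impacts, targets):
    """Look up candidate document IDs in one document-ID-ordered list.

    doc_ids: ascending document IDs; impacts: parallel list of impact values.
    targets: candidate document IDs found in the first stage.
    Returns {doc_id: impact} for the targets present in the list.
    """
    found = {}
    pos = 0
    for target in sorted(targets):
        # Jump to the first entry >= target instead of stepping one by one;
        # the search never moves backwards because targets are ascending.
        pos = bisect_left(doc_ids, target, pos)
        if pos == len(doc_ids):
            break
        if doc_ids[pos] == target:
            found[target] = impacts[pos]
    return found


def worth_checking(partial_score, max_impact, kth_best_score):
    """Assumed cut-off: skip the lookup if even the maximum impact value
    could not lift this candidate above the current k-th best score."""
    return partial_score + max_impact > kth_best_score


# Hypothetical list for "の"; only d16 among the targets is present.
hits = scan_with_skips([1, 2, 3, 5, 8, 16, 21],
                       [0.5, 1.0, 0.2, 0.7, 0.1, 0.9, 0.3], [16, 4])
```

Keeping the search position between lookups is what makes the scan cheaper than restarting from document ID 1 for every candidate.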

The narrowing control means 244 narrows down the candidate documents by using the first search means 241 and the second search means 243 alternately, and determines the ranking of the candidate documents up to the designated number of documents. In this embodiment, the narrowing control means 244 determines whether the ranking for the number of documents specified in the search conditions has been fixed. To identify the candidate documents that satisfy the conditions of the search expression and to perform the final-stage cut-off using the number of results specified by the user, the narrowing control means 244 can use, for example, the method described in Non-Patent Document 4, which performs the cut-off in three stages.

The candidate document output means 25 finalizes and outputs the candidate documents once the input search conditions are satisfied. In this embodiment, the candidate document output means 25 outputs a list of candidate documents to the terminal device 60 connected to the text search device 1 by cable. Alternatively, the list of candidate documents may be transmitted to the terminal device 60 via a communication network such as the Internet. The terminal device 60 displays the acquired list of candidate documents on an output device D, such as a liquid crystal display, and presents it to the user.

The search condition input means 21, the word extraction means 22, the word group classification means 23, the candidate document search means 24, and the candidate document output means 25 are realized by the CPU loading a predetermined program stored in the storage means into the RAM and executing it.

[Operation of search processing device]
The operation of the search processing device 20 shown in FIG. 6 will be described with reference to FIG. 7 (and FIG. 6 as appropriate). FIG. 7 is a flowchart showing the operation of the search processing device shown in FIG. 6. First, the search processing device 20 receives the search expression and the number of documents as the designated search conditions using the search condition input means 21 (step S11: search condition input step). The search processing device 20 then analyzes the search expression and extracts the words it contains using the word extraction means 22 (step S12: word extraction step). Next, using the word group classification means 23, it classifies the extracted words into a word group with high df values (appearance frequencies) and a word group with low df values (step S13: word group classification step). The search processing device 20 then executes the candidate document search process using the candidate document search means 24 (step S14: candidate document search step). Finally, using the candidate document output means 25, it finalizes the candidate documents once the input search conditions are satisfied and outputs them as the search result (step S15: candidate document output step).

<Candidate document search process>
Next, the candidate document search process of step S14 will be described with reference to FIG. 8 (and FIG. 6 as appropriate). FIG. 8 is a flowchart showing the details of the candidate document search process shown in FIG. 7. First, using the first search means 241, the candidate document search means 24 reads the transposed files 31 in order of impact value for the word group with low df values, calculates the word-dependent document scores, and searches for candidate documents based on the calculated scores (step S21). Then, using the number determination means 242, the candidate document search means 24 determines whether the number of candidate documents has exceeded the designated number of documents (step S23). If it has (step S23: Yes), the candidate document search means 24 uses the second search means 243 to read the transposed files 32 in order of document ID for the word group with high df values (step S25) and to search for entries that match the document IDs of the candidate documents found in step S21 (step S27). In step S27, the second search means 243 scans the transposed file in order of document ID while skipping ahead as appropriate, and calculates a score for each candidate document as it matches the document IDs.

Subsequently, the candidate document search means 24 uses the narrowing control means 244 to determine whether the ranking of the number of documents specified in the search conditions has been settled (step S29). If it has (step S29: Yes), the candidate document search means 24 ends the process. If it has not (step S29: No), the candidate document search means 24 returns to step S21 and repeats the process.
If, in step S23, the number of candidate documents does not exceed the number of documents specified in the search conditions (step S23: No), the candidate document search means 24 returns to step S21 and reads further into the transposed file in impact value order to search for other candidate documents. This makes it possible to output, as high-scoring candidate documents, appropriate documents selected from a broader range of the documents to be searched, without bias.
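
The loop over steps S21 to S29 can be sketched for the simplest case of one low-df word and one high-df word. This is only one possible reading: the text does not spell out how step S29 decides that the ranking is settled, so the upper-bound test used here (next unread impact value plus the maximum impact value of the high-df word) is an assumption, as are the function name, the block size, and the sample data.

```python
def two_stage_top_k(impact_list, high_postings, high_max, k, block=2):
    """Return the top-k (doc_id, score) pairs.

    impact_list:   [(doc_id, impact)] for the low-df word, descending impact
    high_postings: {doc_id: impact} standing in for the document-ID-ordered file
    high_max:      maximum impact value of the high-df word
    """
    partial = {}  # doc_id -> word-dependent score from the low-df word
    read = 0
    while True:
        # S21: read the next block of the impact-ordered file.
        for doc_id, impact in impact_list[read:read + block]:
            partial[doc_id] = impact
        read = min(read + block, len(impact_list))
        exhausted = read == len(impact_list)
        # S23: keep reading until there are more than k candidates.
        if len(partial) <= k and not exhausted:
            continue
        # S25/S27: complete the scores from the document-ID-ordered file.
        full = {d: s + high_postings.get(d, 0) for d, s in partial.items()}
        ranked = sorted(full.items(), key=lambda kv: (-kv[1], kv[0]))
        # S29 (assumed test): no unread document can beat the k-th score.
        next_impact = 0 if exhausted else impact_list[read][1]
        bound = next_impact + high_max
        if exhausted or (len(ranked) >= k and ranked[k - 1][1] >= bound):
            return ranked[:k]


# Hypothetical data: five postings for the low-df word, three for the high-df word.
low_list = [(16, 9), (3, 7), (8, 6), (21, 2), (5, 1)]
high = {16: 4, 8: 5, 5: 3}
top2 = two_stage_top_k(low_list, high, high_max=5, k=2)
```

In this example the loop stops after reading four of the five low-df postings, because the one unread document cannot score more than 1 + 5 = 6, which is below the second-best score of 11.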

Note that the text search device 1 can also be realized by having a general-purpose computer execute a text search program that causes it to function as the search processing device 20. The text search device 1 can likewise be realized by having a general-purpose computer execute a text search program that causes it to function as the index construction device 10 and the search processing device 20 described above. These programs can be distributed over a communication line, or written to and distributed on a computer-readable recording medium such as a CD-ROM.

According to this embodiment, the text search device 1 searches for documents using the transposed file 31 in impact value order for words in the search expression whose df value is low, and uses the transposed file 32 in document ID order, which allows skipping, for words whose df value is high, so the search process can be sped up. Therefore, in a text search device that uses the transposed file 31 in impact value order, the performance of the search process can be improved. As a result, consumption of resources (disk, memory, CPU) can be reduced.

(Second Embodiment)
[Overview of search processing of the text search device]
FIG. 9 is an explanatory diagram schematically showing an outline of a text search device according to the second embodiment of the present invention. The text search device 1A includes an index construction device 10A, a search processing device 20A, an index storage device 30, a document storage device 40 (see FIG. 1), and a word dictionary storage device 50. In FIG. 9, the document storage device 40 is omitted in order to explain the outline of the search processing of the text search device 1A. Components similar to those in FIG. 2 are given the same reference numerals, and their description is omitted.

The index construction device 10A constructs, in addition to the transposed file in impact value order and the transposed file in document ID order, a transposed file 75 in document-specific score order, in which documents are arranged in descending order of their document-specific score. The transposed file in document-specific score order stores a per-document score that does not depend on the words included in the search expression. Examples of such a word-independent per-document score include a score for each document (Web page) calculated from click-count data for each Web page obtained from search server logs, and PageRank (registered trademark). PageRank (registered trademark) is described in, for example, Lawrence Page, Sergey Brin, Rajeev Motwani, Terry Winograd, "The PageRank Citation Ranking: Bringing Order to the Web", Jan. 29, 1998, <URL: http://WWW-db.stanford.edu/~backru/pageranksub.ps>.

The search processing device 20A searches for candidate documents to be output using the transposed files 71, 72, and 74 in impact value order, the transposed file 73 in document ID order, and the transposed file 75 in document-specific score order, all stored in the index storage device 30, and outputs the search result (candidate documents) to the terminal device 60. The search processing device 20A differs from the search processing device 20 of the first embodiment in its first-stage search method.

Specifically, in the example shown in FIG. 9, the search processing device 20A of the text search device 1A, as the first stage, calculates a predetermined score for each document for "Tokyo" and "station" using the transposed files 71 and 74 in impact value order, then obtains the sum of the calculated score and the document-specific score read from the transposed file 75 in document-specific score order, and searches for candidate documents based on the summed score. As a result, documents common to the three transposed files 71, 74, and 75 are found, such as the document (d16) with document ID "16" indicated by reference numerals 105, 106, and 111, respectively.

Next, the detailed configuration and operation of the text search device 1A for realizing the search processing by the search processing device 20A shown in FIG. 9 will be described. For convenience of explanation, the configuration and operation of the index construction device 10A and of the search processing device 20A are described separately below.

[Configuration of index construction device]
FIG. 10 is a block diagram schematically showing an example of the index construction device shown in FIG. 9. The index construction device 10A shown in FIG. 10 has the same configuration as the index construction device 10 shown in FIG. 4, except that it additionally includes a unique score calculation means 16 and a document-specific score transposed file construction means 17. The same components are therefore given the same reference numerals, and their description is omitted.

The unique score calculation means 16 reads the group of text documents to be searched and calculates a document-specific score for each text document.
The document-specific score transposed file construction means 17 constructs the transposed file 33 in document-specific score order based on the document-specific scores calculated by the unique score calculation means 16, and stores the constructed transposed file 33 in the index storage device 30.

[Operation of index construction device]
The operation by which the index construction device 10A shown in FIG. 10 constructs the transposed file 31 in impact value order and the transposed file 32 in document ID order is the same as in the index construction device 10 of the first embodiment, so its description is omitted. The differences will be described with reference to FIG. 11 (and FIG. 10 as appropriate). FIG. 11 is a flowchart showing the operation of constructing the document-specific score transposed file by the index construction device shown in FIG. 10. The index construction device 10A uses the unique score calculation means 16 to read the group of text documents to be searched (step S31) and to calculate a document-specific score for each text document (step S32). Then, the index construction device 10A uses the document-specific score transposed file construction means 17 to construct the transposed file 33 in document-specific score order (step S33). The index construction device 10A may construct the transposed file 33 in document-specific score order either before or after constructing the transposed file 31 in impact value order and the transposed file 32 in document ID order, or in parallel with them.
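
Step S33 amounts to arranging documents in descending order of their document-specific score. A minimal sketch follows, assuming the scores of step S32 are already available as a mapping. The function name, the tie-breaking by document ID, and the sample scores are hypothetical and not taken from the source.

```python
def build_score_ordered_file(doc_scores):
    """Return [(doc_id, score)] sorted by descending document-specific score.

    doc_scores: hypothetical mapping doc_id -> document-specific score
                (for example a click-based score or a PageRank-like value).
    Ties are broken by ascending document ID (an assumption).
    """
    return sorted(doc_scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Made-up scores for three documents.
score_file = build_score_ordered_file({1: 0.2, 16: 0.9, 7: 0.5})
```

The resulting list plays the role of the transposed file 33 in the later sketches: a single posting list that covers every document and does not depend on any query word.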

[Configuration of search processing device]
FIG. 12 is a block diagram schematically showing an example of the search processing device shown in FIG. 9. The search processing device 20A has the same configuration as the search processing device 20 shown in FIG. 6, except for the configuration of the first search means 241A, so the same components are given the same reference numerals and their description is omitted.
As shown in FIG. 12, the first search means 241A includes a first processing means 251 and a second processing means 252.

The first processing means 251 reads the transposed file 31 in impact value order for the word group whose document appearance frequency is lower than a predetermined threshold (the word group having a low df value), and calculates, for each document in descending order of impact value, a word-dependent document score that depends on the search expression.

The second processing means 252 reads, from the transposed file 33 in document-specific score order constructed in advance, the document-specific score corresponding to each document for which the first processing means 251 calculated a word-dependent document score, and searches for candidate documents using a linear sum of the read document-specific score and the word-dependent document score calculated by the first processing means 251. For example, if a document has a document-specific score of "5" and a word-dependent document score of "10", the second processing means 252 treats the total score of that document as "15" when searching for candidate documents. The linear sum is not limited to simple addition and also includes addition after weighting.
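
The linear sum described above can be written out as a short sketch. The function name `combined_score` and the weight parameters `w_word` and `w_doc` are hypothetical. With both weights left at 1.0 the sketch reproduces the simple addition in the example, and other weights give the weighted variant the text allows.

```python
def combined_score(word_score, doc_score, w_word=1.0, w_doc=1.0):
    """Linear sum of a word-dependent score and a document-specific score."""
    return w_word * word_score + w_doc * doc_score


# The example from the text: word-dependent score 10, document-specific score 5.
total = combined_score(10, 5)                 # 15.0
weighted = combined_score(10, 5, w_doc=2.0)   # 20.0, document-specific score counted twice
```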

[Operation of search processing device]
The search processing device 20A shown in FIG. 12 operates in the same way as the search processing device 20 of the first embodiment, except for the candidate document search process (step S14, see FIG. 7). The description of the overall operation is therefore omitted, and the candidate document search process will be described with reference to FIG. 13 (and FIG. 12 as appropriate). FIG. 13 is a flowchart showing details of the candidate document search process performed by the search processing device shown in FIG. 12.

The candidate document search means 24 uses the first processing means 251 of the first search means 241A to read the transposed file 31 in impact value order for the word group having a low df value and calculate a word-dependent document score (step S21a). Then, the candidate document search means 24 uses the second processing means 252 of the first search means 241A to read the transposed file in document-specific score order and search for candidate documents based on a linear sum of the read score and the score calculated in step S21a (the word-dependent document score) (step S21b). The subsequent operation is the same as the candidate document search process described with reference to FIG. 8. In step S27b, however, the candidate document search means 24 uses the second search means 243 to search for candidate documents whose document IDs match those of the candidate documents found in step S21b.

According to this embodiment, the text search device 1A can search for candidate documents using a linear sum of the word-dependent document score calculated using the transposed file 31 in impact value order and the document-specific score read from the transposed file 33 in document-specific score order. The text search device 1A can therefore handle a score that does not depend on words (the document-specific score) in the same way as if it were a kind of score that depends on the words in the search expression (the word-dependent document score). As a result, when the documents to be searched are, for example, Web documents, the text search device 1A can output, as candidate documents, documents that contain the words in the specified search expression from among Web documents viewed and used by more users.

The embodiments of the present invention have been described above, but the present invention is not limited to them and can be carried out within a range that does not change its gist. For example, in each embodiment the text search device 1 (1A) has been described, as the best mode, as including the index construction device 10 (10A). However, the index construction device 10 (10A) is not essential; it is sufficient that transposed files constructed in advance and df values calculated in advance are stored. Similarly, the text search device 1 (1A) does not necessarily have to include the document storage device 40. Further, the text documents to be searched (search target documents) do not need to be stored in a single document storage device 40 and may be distributed across a plurality of storage devices on a network.

In each embodiment, three words were used as an example of the search conditions, but any plural number of words may be input as the search conditions. A sentence, not only a phrase of words, can also be used as the search expression. The words included in the search expression are not limited to nouns and particles and may be other parts of speech such as adjectives. The language of the words included in the search expression is not limited to Japanese and may be another language such as English, French, or Chinese.

In the second embodiment, the index construction device 10A creates only one transposed file 33 in document-specific score order, but it may create more than one. For example, it can create both a transposed file based on a per-document (Web page) score calculated from click-count data for each Web page obtained from search server logs and a transposed file based on PageRank (registered trademark) scores. In this case, the second processing means 252 reads, from each of the two transposed files in document-specific score order, the document-specific score corresponding to each document for which the first processing means 251 calculated a word-dependent document score, and searches for candidate documents using a linear sum of those document-specific scores and the word-dependent document score calculated by the first processing means 251. The two kinds of document-specific score can also be given separate weights before being added to the word-dependent document score. In this case as well, documents that contain the words in the specified search expression can be output as candidate documents from among Web documents and the like viewed and used by more users.
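
The variation with several document-specific scores can be sketched as a weighted sum over any number of scores. The function name, the weights, and the sample click-based and PageRank-like values are all hypothetical; the source states only that the scores may be weighted separately before being added to the word-dependent document score.

```python
def combined_score_multi(word_score, doc_scores, weights):
    """Add separately weighted document-specific scores to a word-dependent score.

    doc_scores: document-specific scores of one document, one per score file
    weights:    one weight per document-specific score (same order)
    """
    if len(doc_scores) != len(weights):
        raise ValueError("one weight is required per document-specific score")
    return word_score + sum(w * s for w, s in zip(weights, doc_scores))


# Made-up values: word-dependent score 10, click-based score 4, PageRank-like score 2.
score = combined_score_multi(10, [4, 2], weights=[0.5, 2.0])   # 10 + 2.0 + 4.0
```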

A configuration diagram schematically showing the text search device according to the first embodiment of the present invention.
An explanatory diagram schematically showing an outline of the text search device according to the first embodiment of the present invention.
An explanatory diagram schematically showing an outline of a conventional text search device.
A block diagram schematically showing an example of the index construction device shown in FIG. 1.
A flowchart showing the operation of the index construction device shown in FIG. 4.
A block diagram schematically showing an example of the search processing device shown in FIG. 1.
A flowchart showing the operation of the search processing device shown in FIG. 6.
A flowchart showing details of the candidate document search process shown in FIG. 7.
An explanatory diagram schematically showing an outline of the text search device according to the second embodiment of the present invention.
A block diagram schematically showing an example of the index construction device shown in FIG. 9.
A flowchart showing the operation of constructing the document-specific score transposed file by the index construction device shown in FIG. 10.
A block diagram schematically showing an example of the search processing device shown in FIG. 9.
A flowchart showing details of the candidate document search process performed by the search processing device shown in FIG. 12.
An explanatory diagram showing an example of a transposed file in document ID order.

Explanation of symbols

1 (1A, 1B) Text search device
10 (10A) Index construction device (index construction means)
11 Document reading means
12 Document feature amount calculation means
13 Appearance frequency determination means
14 First transposed file construction means
15 Second transposed file construction means
16 Unique score calculation means
17 Document-specific score transposed file construction means
20 (20A) Search processing device
21 Search condition input means
22 Word extraction means
23 Word group classification means
24 Candidate document search means
241 (241A) First search means
242 Number determination means
243 Second search means
244 Narrowing control means
25 Candidate document output means
251 First processing means
252 Second processing means
30 Index storage device
40 Document storage device
50 Word dictionary storage device
60 Terminal device
M Input device
D Output device

Claims

A text search device that searches text documents using, as indexes for document search, a plurality of transposed files including a transposed file in impact value order, constructed for each word using an impact value indicating a weight obtained from the word appearance frequency calculated in advance for each word in each text document having a document ID and from the total number of words included in that text document, and a transposed file in document ID order, constructed for each word included in each text document, the text search device comprising:
search condition input means for inputting a search expression including a plurality of words and a number of documents as specified search conditions;
word extraction means for extracting each word included in the input search expression based on a word dictionary;
word group classification means for classifying the extracted words, based on a document appearance frequency calculated in advance for each word across all the text documents to be searched, into a word group whose document appearance frequency is lower than a predetermined threshold and a word group whose document appearance frequency is not;
candidate document search means for searching, as a first-stage search, for candidate documents that are output candidates using the transposed file in impact value order for the word group whose document appearance frequency is lower than the predetermined threshold, and for searching, as a second-stage search, for candidate documents that are output candidates by looking up the document IDs of the candidate documents found in the first-stage search in the transposed file in document ID order for the word group whose document appearance frequency is equal to or higher than the predetermined threshold; and
candidate document output means for confirming and outputting the candidate documents when the input search conditions are satisfied.

The text search device according to claim 1, wherein the candidate document search means includes:
first search means for reading the transposed file in impact value order for the word group whose document appearance frequency is lower than the predetermined threshold, calculating, for each document in descending order of impact value, a word-dependent document score that depends on the search expression, and searching for the candidate documents based on the calculated word-dependent document score;
number determination means for determining whether the number of candidate documents found is larger than the specified number of documents;
second search means for, when the number of candidate documents found is larger than the specified number of documents, reading the transposed file in document ID order for the word group whose document appearance frequency is equal to or higher than the predetermined threshold, and calculating a score for each document while searching for candidate documents that match the document IDs of the candidate documents found by the first search means; and
narrowing control means for narrowing down the candidate documents by using the first search means and the second search means alternately, and for determining, as the ranking of the candidate documents, the ranks within the specified number of documents.

The text search device according to claim 2, wherein the first search means includes:
first processing means for reading the transposed file in impact value order for the word group whose document appearance frequency is lower than the predetermined threshold, and calculating, for each document in descending order of impact value, a word-dependent document score that depends on the search expression; and
second processing means for reading, from a transposed file in document-specific score order constructed in advance in order of a document-specific score calculated for each text document, the document-specific score corresponding to each document for which the word-dependent document score has been calculated, and searching for the candidate documents using a linear sum of the read document-specific score and the calculated word-dependent document score.

The text search device according to claim 3, further comprising:
unique score calculation means for calculating a document-specific score for each text document; and
document-specific score transposed file construction means for constructing the transposed file in document-specific score order based on the calculated document-specific scores.

The text search device according to any one of claims 1 to 4, wherein the transposed file in document ID order constructed for each word included in each text document holds, for that word, the maximum impact value over all the text documents to be searched.

The text search device according to claim 1, further comprising index construction means for constructing the transposed file in impact value order and the transposed file in document ID order, wherein the index construction means includes:
document reading means for reading a group of text documents to be searched;
document feature amount calculation means for extracting words from each read text document based on the word dictionary, calculating the total number of words and the word appearance frequency of each word as feature amounts of the words included in that text document, and calculating, for each word, a document appearance frequency across all the text documents to be searched;
appearance frequency determination means for comparing the document appearance frequency of each word across all the text documents to be searched with a predetermined threshold, and determining whether the document appearance frequency of each word is lower than the predetermined threshold;
first transposed file construction means for constructing, when the document appearance frequency of a word is lower than the predetermined threshold, a transposed file in impact value order for that word as an index for the document search; and
second transposed file construction means for calculating, when the document appearance frequency of a word is equal to or higher than the predetermined threshold, the maximum impact value for that word over all the text documents to be searched, and constructing a transposed file in document ID order that includes the calculated maximum impact value as an index for the document search.

A text search method of a text search device that searches text documents using, as indexes for document search, a plurality of transposed files including a transposed file in impact value order, constructed for each word using an impact value indicating a weight obtained from the word appearance frequency calculated in advance for each word in each text document having a document ID and from the total number of words included in that text document, and a transposed file in document ID order, constructed for each word included in each text document,
the text search device including search condition input means, word extraction means, word group classification means, candidate document search means, and candidate document output means, the method comprising:
a search condition input step of inputting, by the search condition input means, a search expression including a plurality of words and a number of documents as specified search conditions;
a word extraction step of extracting, by the word extraction means, each word included in the input search expression based on a word dictionary;
a word group classification step of classifying, by the word group classification means, the extracted words, based on a document appearance frequency calculated in advance for each word across all the text documents to be searched, into a word group whose document appearance frequency is lower than a predetermined threshold and a word group whose document appearance frequency is not;
a candidate document search step of searching, by the candidate document search means, as a first-stage search, for candidate documents that are output candidates using the transposed file in impact value order for the word group whose document appearance frequency is lower than the predetermined threshold, and searching, as a second-stage search, for candidate documents that are output candidates by looking up the document IDs of the candidate documents found in the first-stage search in the transposed file in document ID order for the word group whose document appearance frequency is equal to or higher than the predetermined threshold; and
a candidate document output step of confirming and outputting, by the candidate document output means, the candidate documents when the input search conditions are satisfied.

A text search program for realizing the function of the text search device according to any one of claims 1 to 6 by a computer.

A computer-readable recording medium on which the text search program according to claim 8 is recorded.