JP2004046438A

JP2004046438A - Text retrieval method and device, text retrieval program and storage medium storing text retrieval program

Info

Publication number: JP2004046438A
Application number: JP2002201561A
Authority: JP
Inventors: Takashi Inoue; 井上　孝史; Masayuki Sugizaki; 杉崎　正之; Nobuyuki Omori; 大森　信行; Hiroto Inagaki; 稲垣　博人
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2002-07-10
Filing date: 2002-07-10
Publication date: 2004-02-12

Abstract

PROBLEM TO BE SOLVED: To search for phrases quickly.
SOLUTION: Runs of two words are formed, a delimiter is inserted at the boundary between the words of each run, and an index table is created in which, with each run as a key, the IDs of the documents from which the run was extracted are arranged. At search time, words are extracted by morphological analysis of the input search request; if the extracted words form a phrase of two or more words, every two-word run contained in the phrase is extracted, the index table is searched using the extracted runs as keys, and the search result is obtained and output.
COPYRIGHT: (C)2004, JPO

Description

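[0001]
[Technical Field of the Invention]
The present invention relates to a text search method and device, a text search program, and a storage medium storing the text search program, and in particular to a text search method and device, a text search program, and a storage medium storing the text search program for retrieving documents that contain a specific word from a large document set.
[0002]
[Prior Art]
Text search is a technique in which a set of documents is registered in a database and the documents related to a search term given by the user are retrieved from that database.
[0003]
The search term is often a single word such as "communication" or a phrase made up of several words such as "communication device development". Here, a "related document" is taken to be roughly synonymous with a "document that contains the search term". When a phrase is given as the search input, the user is assumed to be looking for documents that contain that exact character string.
[0004]
Many text search systems that handle large numbers of documents speed up the search by keeping an index table called an inverted file (hereinafter simply the index table) and consulting it at search time. The index table is built by extracting in advance the words contained in the texts to be searched and recording the documents in which each word appears. FIG. 11 shows an example. The index table uses each word as a key and stores the set of IDs of the documents in which that word appears. The table is sorted by word key, so that when a word is specified, the set of IDs of the documents in which it appears can be retrieved quickly by consulting the table.
[0005]
First, the flow of a typical search using the index table is described. The index table is searched using the search term entered by the user, the row whose key matches the search term is located, and the set of document IDs listed in that row is output to the user as the search result. For example, if the search term is the word "communication", the document ID set (1, 5, 9, 11, 17, 35, 87, 132, 163) listed in the row of FIG. 11 whose key is "communication" is the search result.
[0006]
Next, the processing when the search input is a phrase made up of several words (a compound word, a short sentence, and so on) is described. First, morphological analysis is applied to the search input to split it into the words that make up the phrase. Next, the index table is searched for the rows keyed by each of the resulting words, giving the set of IDs of the documents in which each word appears. The intersection of these document sets is taken, giving the set of IDs of the documents that contain every word of the phrase. The text of each document whose ID is in that set is then scanned to check whether it actually contains a phrase matching the search input, and the IDs of the documents that do are output as the search result.
[0007]
For example, suppose the search input is "communication device". This phrase consists of the two words "communication" and "device". First, the document ID sets matching "communication" and "device" are each obtained from the index table of FIG. 11. The document IDs matching "communication" are (1, 5, 9, 11, 17, 35, 87, 132, 163), while the document ID set matching "device" is (1, 5, 7, 22, 47, 87, 132), so their intersection is (1, 5, 87, 132). The texts of the documents in this set are scanned to check whether they actually contain the character string "communication device", and the set of IDs of the documents that do is output as the search result.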
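[0008]
[Problems to Be Solved by the Invention]
However, in the conventional text search method using the index table described above, when a phrase made up of several words is given as the search input, the document texts must be scanned as a final step. In particular, for a phrase made up of frequently occurring words, the number of documents to be scanned is large and the search becomes slow.
[0009]
The present invention has been made in view of the above, and its object is to provide a text search method and device, a text search program, and a storage medium storing the text search program that make it possible to speed up phrase searches.
[0010]
[Means for Solving the Problems]
FIG. 1 is a diagram for explaining the principle of the present invention.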
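[0011]
The present invention is a text search method for retrieving documents that contain a specific word from a document set by using an index table, the method performing:
an index table creation step (step 1) of forming runs of two words, inserting a delimiter at the boundary between the words of each run, and creating in advance an index table in which, with each run as a key, the IDs of the documents from which that run was extracted are arranged;
a word extraction step (step 2) of, at search time, extracting words by morphological analysis of the input search request;
a run extraction step (step 3) of, when the extracted words form a phrase of two or more words, extracting all two-word runs contained in the phrase; and
a search processing step (step 4) of searching the index table using the extracted runs as keys, and obtaining and outputting the search result.
[0012]
In the search processing step of the present invention,
when a one-word search request is input, all rows in which the first word of the index table key, that is, the character string before the delimiter in the key, matches the word of the search request are extracted, and the union of the document ID lists in those rows is taken as the search result;
when a two-word phrase is input as the search request, the row whose index table key matches the phrase of the search request is extracted, and the set of document IDs in that row is taken as the search result;
when a three-word phrase is input as the search request, all two-word runs contained in the phrase are extracted, the rows in the index table matching each run are looked up to obtain the set of IDs of the documents in which each run appears, the intersection of these sets is taken to obtain the set of IDs of the documents that contain every run of the phrase, the text of each document whose ID is in that set is scanned for a phrase matching the input search request, and the IDs of the documents having a matching phrase are taken as the search result; and
for the case in which a phrase consisting of N words is input as the search request,
if the number of words is N-1 or fewer, the search request is compared with the index table and the union of the document IDs at the prefix-matching entries is taken as the search result;
if the number of words is N, the search request is compared with the index table and the matching document IDs are taken as the search result; and
if the number of words is N+1 or more, the word positions are shifted one at a time, as in 1 to N, 2 to N+1, 3 to N+2, and so on, the intersection of the document IDs is taken, the text of each document whose ID is in that intersection is scanned for a phrase matching the input search request, and the IDs of the documents having a matching phrase are taken as the search result.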
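[0013]
FIG. 2 is a diagram showing the principle configuration of the present invention.
[0014]
The present invention is a text search device for retrieving documents that contain a specific word from a document set by using an index table, the device comprising:
index table creation means 100 for forming runs of two words, inserting a delimiter at the boundary between the words of each run, and creating in advance an index table in which, with each run as a key, the IDs of the documents from which that run was extracted are arranged;
word extraction means 220 for, at search time, extracting words by morphological analysis of the input search request;
run extraction means 230 for, when the extracted words form a phrase of two or more words, extracting all two-word runs contained in the phrase; and
search processing means 240 for searching the index table using the extracted runs as keys, and obtaining and outputting the search result.
[0015]
The search processing means 240 of the present invention includes:
means for, when a one-word search request is input, extracting all rows in which the first word of the index table key, that is, the character string before the delimiter in the key, matches the word of the search request, and taking the union of the document ID lists in those rows as the search result;
means for, when a two-word phrase is input as the search request, extracting the row whose index table key matches the phrase of the search request, and taking the set of document IDs in that row as the search result;
means for, when a three-word phrase is input as the search request, extracting all two-word runs contained in the phrase, looking up the rows in the index table matching each run to obtain the set of IDs of the documents in which each run appears, taking the intersection of these sets to obtain the set of IDs of the documents that contain every run of the phrase, scanning the text of each document whose ID is in that set for a phrase matching the input search request, and taking the IDs of the documents having a matching phrase as the search result; and
means for, in the case in which a phrase consisting of N words is input as the search request,
if the number of words is N-1 or fewer, comparing the search request with the index table and taking the union of the document IDs at the prefix-matching entries as the search result,
if the number of words is N, comparing the search request with the index table and taking the matching document IDs as the search result, and
if the number of words is N+1 or more, shifting the word positions one at a time, as in 1 to N, 2 to N+1, 3 to N+2, and so on, taking the intersection of the document IDs, scanning the text of each document whose ID is in that intersection for a phrase matching the input search request, and taking the IDs of the documents having a matching phrase as the search result.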
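[0016]
The present invention is a text search program for retrieving documents that contain a specific word from a document set by using an index table, the program executing:
an index table creation step of forming runs of two words, inserting a delimiter at the boundary between the words of each run, and creating in advance an index table in which, with each run as a key, the IDs of the documents from which that run was extracted are arranged;
a word extraction step of, at search time, extracting words by morphological analysis of the input search request;
a run extraction step of, when the extracted words form a phrase of two or more words, extracting all two-word runs contained in the phrase; and
a search processing step of searching the index table using the extracted runs as keys, and obtaining and outputting the search result.
[0017]
The search processing step of the present invention includes:
a step of, when a one-word search request is input, extracting all rows in which the first word of the index table key, that is, the character string before the delimiter in the key, matches the word of the search request, and taking the union of the document ID lists in those rows as the search result;
a step of, when a two-word phrase is input as the search request, extracting the row whose index table key matches the phrase of the search request, and taking the set of document IDs in that row as the search result;
a step of, when a three-word phrase is input as the search request, extracting all two-word runs contained in the phrase, looking up the rows in the index table matching each run to obtain the set of IDs of the documents in which each run appears, taking the intersection of these sets to obtain the set of IDs of the documents that contain every run of the phrase, scanning the text of each document whose ID is in that set for a phrase matching the input search request, and taking the IDs of the documents having a matching phrase as the search result; and
a step of, in the case in which a phrase consisting of N words is input as the search request,
if the number of words is N-1 or fewer, comparing the search request with the index table and taking the union of the document IDs at the prefix-matching entries as the search result,
if the number of words is N, comparing the search request with the index table and taking the matching document IDs as the search result, and
if the number of words is N+1 or more, shifting the word positions one at a time, as in 1 to N, 2 to N+1, 3 to N+2, and so on, taking the intersection of the document IDs, scanning the text of each document whose ID is in that intersection for a phrase matching the input search request, and taking the IDs of the documents having a matching phrase as the search result.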
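[0018]
The present invention is a storage medium storing a text search program for retrieving documents that contain a specific word from a document set by using an index table, the storage medium storing a program that executes:
an index table creation step of forming runs of two words, inserting a delimiter at the boundary between the words of each run, and creating in advance an index table in which, with each run as a key, the IDs of the documents from which that run was extracted are arranged;
a word extraction step of, at search time, extracting words by morphological analysis of the input search request;
a run extraction step of, when the extracted words form a phrase of two or more words, extracting all two-word runs contained in the phrase; and
a search processing step of searching the index table using the extracted runs as keys, and obtaining and outputting the search result.
[0019]
The search processing step of the storage medium storing the text search program of the present invention includes:
a step of, when a one-word search request is input, extracting all rows in which the first word of the index table key, that is, the character string before the delimiter in the key, matches the word of the search request, and taking the union of the document ID lists in those rows as the search result;
a step of, when a two-word phrase is input as the search request, extracting the row whose index table key matches the phrase of the search request, and taking the set of document IDs in that row as the search result;
a step of, when a three-word phrase is input as the search request, extracting all two-word runs contained in the phrase, looking up the rows in the index table matching each run to obtain the set of IDs of the documents in which each run appears, taking the intersection of these sets to obtain the set of IDs of the documents that contain every run of the phrase, scanning the text of each document whose ID is in that set for a phrase matching the input search request, and taking the IDs of the documents having a matching phrase as the search result; and
a step of, in the case in which a phrase consisting of N words is input as the search request,
if the number of words is N-1 or fewer, comparing the search request with the index table and taking the union of the document IDs at the prefix-matching entries as the search result,
if the number of words is N, comparing the search request with the index table and taking the matching document IDs as the search result, and
if the number of words is N+1 or more, shifting the word positions one at a time, as in 1 to N, 2 to N+1, 3 to N+2, and so on, taking the intersection of the document IDs, scanning the text of each document whose ID is in that intersection for a phrase matching the input search request, and taking the IDs of the documents having a matching phrase as the search result.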
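[0020]
As described above, in the present invention the key of the index table is not a single word but a run of two words, and the places where each run appears are recorded. A delimiter (separator character) marking the boundary is inserted between the words of the key. The table is sorted by run key. When a one-word search request arrives, all rows in which the first word of the key (the character string before the delimiter) matches the word of the search request are extracted, and the union of their document ID lists is taken as the search result. Because the table is sorted by run key, the matching rows form one contiguous range. In this case the search speed is about the same as that of a search using a conventional index table.
[0021]
When a two-word phrase is input as the search request, the row whose index table key matches the phrase is extracted, and its set of document IDs is taken as the search result. In this case, unlike the conventional method, no text needs to be scanned, so the search is fast.
[0022]
When a phrase of three or more words is input as the search request, all two-word runs contained in the phrase are extracted, the rows in the table matching each run are located, and the set of IDs of the documents in which each run appears is obtained. The intersection of these document sets is taken, giving the set of IDs of the documents that contain every run of the phrase. The text of each document whose ID is in that set is scanned to check whether it actually contains a phrase matching the search input, and the IDs of the documents that do are output as the search result.
[0023]
With three or more words, the document texts must still be scanned as a final step, as in the conventional method, but because matching is done on two-word runs, far fewer texts need to be scanned at this stage than in the conventional method, and the search is fast.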
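[0024]
[Embodiments of the Invention]
First, when the texts to be searched are registered in the text database, words are extracted from each text, two-word runs are formed, and the IDs of the documents in which each run appears are registered in the index table with the run as the key.
[0025]
FIG. 3 shows the configuration of the index table creation device in one embodiment of the present invention.
[0026]
The index table creation device 100 shown in the figure consists of a word extraction unit 110 that extracts words by morphological analysis, a run extraction unit 120 that forms two-word runs from the word list obtained by the word extraction unit 110, and an index table registration unit 130 that registers in the index table 20, with each extracted run as the key, the ID of the document from which the run was extracted.
[0027]
At search time, words are extracted by morphological analysis of the search input, and if the analysis shows the input to be a phrase of two or more words, all two-word runs contained in the phrase are extracted. The index table is then searched using the extracted runs as keys, and the search result is obtained and output.
[0028]
FIG. 4 shows the configuration of the text search device in one embodiment of the present invention. The text search device 200 shown in the figure consists of a search request input unit 210 that receives the search request specified by the user; a word extraction unit 220 that extracts words from the input search request; a run extraction unit 230 that forms runs from the words; a search processing unit 240 that consults the index table created in advance to obtain the set of IDs of the documents in which each run (or word) appears and, when the search request consists of three or more words, scans the documents to find those matching the phrase of the search request; and a search result output unit 250 that outputs the set of document IDs obtained by the search processing unit 240.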
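[0029]
[Examples]
Examples of the present invention are described below with reference to the drawings.
[0030]
In the following, the operation of the index table creation device 100 and the text search device 200 described above is explained using examples.
[0031]
First, the case where the index table 20 is created by the index table creation device 100 of FIG. 3 when the text database is registered is described.
[0032]
FIG. 5 is a flowchart of the operation at index table creation time in one example of the present invention.
[0033]
Suppose there are 200 texts to be registered in the text database, numbered serially (IDs) from 1 to 200. Suppose also that the texts with IDs 1, 87, and 132 are as follows:
ID 1: 『Ａ社は通信機器開発を進めると発表した。』 ("Company A announced that it will pursue communication device development.")
ID 87: 『Ｃ社は精密機器開発に加えて通信機器の分野にも進出する。』 ("In addition to precision device development, Company C will also enter the communication device field.")
ID 132: 『Ａ社の通信機器の売上げは増加している。』 ("Company A's sales of communication devices are increasing.")
[0034]
The word extraction unit 110 reads the texts from the document set D in ID order. Here, the number of elements of the document set D is denoted i (step 101). When the number of texts read exceeds i, processing ends (step 102). The word extraction unit 110 morphologically analyzes each text read and extracts its words (step 103). For example, morphological analysis of the text with ID 1 yields the words
『Ａ社、は、通信、機器、開発、を、進める、と、発表、し、た』
(Company A / topic particle / communication / device / development / object particle / pursue / quotative particle / announce / do / past-tense ending).
[0035]
Next, the run extraction unit 120 forms two-word runs from the extracted word list W. For the text with ID 1 above, the result of run extraction is
『Ａ社＄は、は＄通信、通信＄機器、機器＄開発、開発＄を、を＄進める、進める＄と、と＄発表、発表＄し、し＄た』
Here "＄" is used as the delimiter between the words of a run. The reason for using a delimiter is as follows. Even without one, if "通信機器" ("communication device") were registered as a key, a query for the single word "通信" ("communication") could still be handled by prefix matching. That is fine for finding "通信機器" with "通信", but without a delimiter a search for the single character "通" would also match. Allowing such searches is conceivable, but because this method deals with search over a word-based index, it is appropriate to return only entries whose word boundaries agree between the search input and the index.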
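The run extraction step and the role of the delimiter can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names `extract_runs` and `first_word_matches` are hypothetical, the word list is taken as already tokenized (no morphological analyzer is called), and an ASCII "$" stands in for the full-width "＄" of the text.

```python
DELIM = "$"  # ASCII stand-in for the full-width delimiter used in the text


def extract_runs(words, delim=DELIM):
    """Join each pair of adjacent words into one delimited key."""
    return [a + delim + b for a, b in zip(words, words[1:])]


def first_word_matches(key, word, delim=DELIM):
    """True only when `word` is the complete first word of the key."""
    return key.startswith(word + delim)


# Word list quoted in the text for the document with ID 1.
words_id1 = ["Ａ社", "は", "通信", "機器", "開発", "を", "進める", "と", "発表", "し", "た"]
runs_id1 = extract_runs(words_id1)
```

Because the comparison includes the delimiter, the single character "通" does not match the key for "通信" + "機器", which is the behaviour the paragraph argues for.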
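[0036]
Finally, the index table registration unit 130 registers the document ID with each extracted run as the key.
[0037]
The above processing is performed for all 200 texts to be registered (steps 104 and 105).
[0038]
FIG. 6 shows an example of the index table created by the index table creation device 100 in this way.
Next, the processing of the text search device 200 is described.
[0039]
FIG. 7 is a flowchart of the run extraction processing of the text search device in one example of the present invention.
[0040]
When a search request is input to the search request input unit 210, the word extraction unit 220 extracts words from the search request to produce a word list (step 201), and runs are extracted over all elements of the word list (steps 202 and 203).
[0041]
Next, the processing when a search is executed is described.
[0042]
The description is divided into three cases: a one-word search input, a two-word phrase, and a phrase of three or more words.
[0043]
First, the processing in the search processing unit 240 when the search request is a single word is described.
[0044]
FIG. 8 is a flowchart (one-word case) of the processing of the search processing unit of the text search device in one example of the present invention.
[0045]
When the search request is a single word, all rows in which the first word of the key of the index table 20 (the character string before the delimiter in the key) matches the word of the search request are extracted (step 302), and the union of their document ID lists is taken as the search result. Because the index table 20 is sorted by run key, the matching rows form one contiguous range. For example, if the search request is the character string "communication", morphological analysis recognizes it as one word, so the index table 20 is consulted for the rows whose key has "communication" as its first word. In the example of FIG. 6, the rows "communication cost", "communication device", and "communication industry" match. Taking the union of the document ID sets of these rows gives the IDs [1, 5, 9, 11, 17, 35, 87, 132] (step 303). This is output from the search result output unit 250 as the search result.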
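The one-word case relies on the table being sorted, so that all rows sharing a first word are adjacent. A minimal sketch, assuming English renderings of the keys and a binary search to find the start of the range: only the IDs for "communication$device" are quoted in the text, and the other posting lists below are invented so that the union reproduces the result given above. The names `TABLE` and `one_word_search` are hypothetical.

```python
from bisect import bisect_left

DELIM = "$"

# Rows sorted by key. Only the "communication$device" IDs come from the text;
# the remaining posting lists are assumptions made for this illustration.
TABLE = [
    ("communication$cost", [9, 17, 87]),
    ("communication$device", [1, 5, 132]),
    ("communication$industry", [11, 35]),
    ("device$development", [1, 5, 87]),
]
KEYS = [key for key, _ in TABLE]


def one_word_search(word):
    """Union the posting lists of the contiguous rows whose first word is `word`."""
    prefix = word + DELIM
    result = set()
    i = bisect_left(KEYS, prefix)  # first row that could start with the prefix
    while i < len(KEYS) and KEYS[i].startswith(prefix):
        result.update(TABLE[i][1])
        i += 1
    return sorted(result)
```

The loop stops at the first row that no longer starts with the prefix, so the cost is one binary search plus the length of the matching range.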
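[0046]
Next, the case where the search request is a two-word phrase is described.
[0047]
FIG. 9 is a flowchart (two-word case) of the processing of the search processing unit of the text search device in one example of the present invention.
[0048]
When a two-word phrase is input as the search request, the row whose index table key matches the phrase is extracted (step 402), and its set of document IDs is taken as the search result (step 403).
[0049]
For example, if the search request is the character string "communication device", morphological analysis recognizes it as two words, so the index table 20 is searched for the key matching the two-word run "communication device", and the document ID set is obtained from the matching row. In the example of FIG. 6, the document ID set [1, 5, 132] is obtained and output as the search result.
[0050]
Next, the case where the search request is a three-word phrase is described.
[0051]
FIG. 10 is a flowchart (three-word case) of the processing of the search processing unit of the text search device in one example of the present invention.
[0052]
When a phrase of three or more words is input as the search request, all two-word runs contained in the phrase are extracted (steps 501 to 503), the rows in the index table 20 matching each run are located (step 505), and the set of IDs of the documents in which each run appears is obtained (step 506). The intersection of these document sets is taken (step 506), giving the set of IDs of the documents that contain every run of the phrase (step 507). The text of each document whose ID is in that set is scanned (step 509) to check whether it actually contains a phrase matching the search input (step 510), and the IDs of the documents that do are output as the search result (steps 511 and 508).
[0053]
For example, if the search request is the character string "communication device development", morphological analysis recognizes it as the three words "communication", "device", and "development", and the two-word runs obtained are "communication device" and "device development". Next, the index table 20 is consulted for the row whose key is "communication device" and the row whose key is "device development", and taking the intersection of the document ID sets in those rows gives (1, 5). Finally, each document in this set is scanned to check whether it actually contains the character string "communication device development". "Document 1" contains the string "communication device development", whereas "document 5" contains "communication" and "device development" but not the string "communication device development", so "document 1" is added to the search result and "document 5" is not.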
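The three-word example can be traced with a short Python sketch. It is an illustration under assumptions, not the patented implementation: the posting list for "device$development" and both document bodies are invented (the source gives only the intersection (1, 5) and the outcome of the scan), the texts are English stand-ins with words separated by spaces, and `phrase_search` is a hypothetical name.

```python
DELIM = "$"

RUN_INDEX = {
    "communication$device": [1, 5, 132],  # IDs quoted in the text
    "device$development": [1, 5, 87],     # assumed, so the intersection is (1, 5)
}

# Hypothetical document bodies used only to exercise the final scan.
TEXTS = {
    1: "company a will pursue communication device development",
    5: "communication device sales and precision device development",
}


def phrase_search(words, index, texts):
    """Intersect the posting lists of all two-word runs, then verify by scanning."""
    runs = [a + DELIM + b for a, b in zip(words, words[1:])]
    candidates = set(index.get(runs[0], ()))
    for run in runs[1:]:
        candidates &= set(index.get(run, ()))
    phrase = " ".join(words)
    hits = [d for d in sorted(candidates) if phrase in texts.get(d, "")]
    return sorted(candidates), hits


candidates, hits = phrase_search(
    ["communication", "device", "development"], RUN_INDEX, TEXTS
)
```

Document 5 survives the intersection because it contains both runs, but only the scan can tell that the runs are not adjacent, which is why the final verification step is still needed for three or more words.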
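[0054]
The set of document IDs obtained in this way is the search result.
[0055]
With three or more words, the document texts must still be scanned as a final step, as in the conventional method, but because matching is done on two-word runs, far fewer texts need to be scanned at this stage than in the conventional method, and the search is fast.
[0056]
In more detail, when the method is extended to runs of three or four words (N-word runs), the index table is created in the same way as for two-word runs. At search time:
・If the number of words is N-1 or fewer, the search term is compared with the keys of the index table and the union over the prefix-matching entries is taken.
[0057]
・If the number of words is N, the search term is compared with the keys of the index table and the exactly matching entry is extracted.
[0058]
・If the number of words is N+1 or more, the word positions are shifted one at a time, as in 1 to N, 2 to N+1, 3 to N+2, and the intersection is taken. After that, the document body is searched in the same way as for two-word runs.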
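The three cases for N-word runs amount to choosing a lookup mode and a list of keys from the query length. The sketch below shows only that choice; the function name `query_plan`, the mode labels, and the "$" delimiter are assumptions made for illustration, and the index lookup and text scan themselves are left out.

```python
def query_plan(words, n, delim="$"):
    """Return (mode, keys) for a query against an index keyed by n-word runs."""
    m = len(words)
    if m < n:
        # Fewer words than a key holds: prefix match, then union of the rows.
        return "prefix-union", [delim.join(words) + delim]
    if m == n:
        # Exactly one key: a single exact lookup, no text scan.
        return "exact", [delim.join(words)]
    # Longer query: windows 1..n, 2..n+1, ...; intersect, then scan the text.
    windows = [delim.join(words[i:i + n]) for i in range(m - n + 1)]
    return "intersect-then-scan", windows
```

For example, with N = 3 a five-word query yields three overlapping windows, each of which is looked up as an exact key before the posting lists are intersected.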
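[0059]
In principle, the longer the run, the faster the search for long search strings, but the index size grows accordingly. In fact, even with two-word runs the index becomes considerably large. Moreover, increasing the run length to three or four words does not add much narrowing effect. Most search requests consist of one or two words. Taken together, a run length of two or three is appropriate.
[0060]
The operation of the configurations in the above embodiment and examples can also be built as a program and installed on a computer used as the text search device.
[0061]
The program so built can also be stored on a hard disk device connected to the computer used as the text search device, or on a portable storage medium such as a flexible disk or CD-ROM, and installed when the present invention is carried out.
[0062]
The present invention is not limited to the above embodiment and examples, and various modifications and applications are possible within the scope of the claims.
[0063]
[Effects of the Invention]
As described above, according to the present invention, searches for phrases made up of several words can be executed at high speed.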
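[Brief Description of the Drawings]
[FIG. 1] A diagram for explaining the principle of the present invention.
[FIG. 2] A diagram showing the principle configuration of the present invention.
[FIG. 3] A configuration diagram of the index table creation device in one embodiment of the present invention.
[FIG. 4] A configuration diagram of the text search device in one embodiment of the present invention.
[FIG. 5] A flowchart of the operation at index table creation time in one example of the present invention.
[FIG. 6] An example of the index table in one example of the present invention.
[FIG. 7] A flowchart of the run extraction processing of the text search device in one example of the present invention.
[FIG. 8] A flowchart (one-word case) of the processing of the search processing unit of the text search device in one example of the present invention.
[FIG. 9] A flowchart (two-word case) of the processing of the search processing unit of the text search device in one example of the present invention.
[FIG. 10] A flowchart (three-word case) of the processing of the search processing unit of the text search device in one example of the present invention.
[FIG. 11] An example of a conventional index table.
[Description of Reference Numerals]
10 Document set
20 Index table
100 Index table creation means, index table creation device
110 Word extraction unit
120 Run extraction unit
130 Index table registration unit
210 Search request input unit
220 Word extraction means, word extraction unit
230 Run extraction means, run extraction unit
240 Search processing means, search processing unit
250 Search result output unit
[0001]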
[Technical Field of the Invention]
The present invention relates to a text search method and device, a text search program, and a storage medium storing the text search program, and in particular to a text search method and device, a text search program, and a storage medium storing the text search program for retrieving documents that contain a specific word from a large document set.
[0002]
[Prior Art]
Text search is a technique in which a set of documents is registered in a database and the documents related to a search term given by the user are retrieved from that database.
[0003]
The search term is often a single word such as "communication" or a phrase made up of several words such as "communication device development". Here, a "related document" is taken to be roughly synonymous with a "document that contains the search term". When a phrase is given as the search input, the user is assumed to be looking for documents that contain that exact character string.
[0004]
Many text search systems that handle large numbers of documents speed up the search by keeping an index table called an inverted file (hereinafter simply the index table) and consulting it at search time. The index table is built by extracting in advance the words contained in the texts to be searched and recording the documents in which each word appears. FIG. 11 shows an example. The index table uses each word as a key and stores the set of IDs of the documents in which that word appears. The table is sorted by word key, so that when a word is specified, the set of IDs of the documents in which it appears can be retrieved quickly by consulting the table.
[0005]
First, the flow of a typical search using the index table is described. The index table is searched using the search term entered by the user, the row whose key matches the search term is located, and the set of document IDs listed in that row is output to the user as the search result. For example, if the search term is the word "communication", the document ID set (1, 5, 9, 11, 17, 35, 87, 132, 163) listed in the row of FIG. 11 whose key is "communication" is the search result.
[0006]
Next, the processing when the search input is a phrase made up of several words (a compound word, a short sentence, and so on) is described. First, morphological analysis is applied to the search input to split it into the words that make up the phrase. Next, the index table is searched for the rows keyed by each of the resulting words, giving the set of IDs of the documents in which each word appears. The intersection of these document sets is taken, giving the set of IDs of the documents that contain every word of the phrase. The text of each document whose ID is in that set is then scanned to check whether it actually contains a phrase matching the search input, and the IDs of the documents that do are output as the search result.
[0007]
For example, suppose the search input is "communication device". This phrase consists of the two words "communication" and "device". First, the document ID sets matching "communication" and "device" are each obtained from the index table of FIG. 11. The document IDs matching "communication" are (1, 5, 9, 11, 17, 35, 87, 132, 163), while the document ID set matching "device" is (1, 5, 7, 22, 47, 87, 132), so their intersection is (1, 5, 87, 132). The texts of the documents in this set are scanned to check whether they actually contain the character string "communication device", and the set of IDs of the documents that do is output as the search result.
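The conventional procedure in this example (look up each word, intersect, then scan) can be sketched as follows. The two posting lists are the ones quoted above; the document bodies in `TEXTS` are hypothetical and exist only to exercise the scan step, and the function names are likewise assumptions.

```python
# Word-keyed inverted index; the two ID lists are those quoted for FIG. 11.
WORD_INDEX = {
    "communication": [1, 5, 9, 11, 17, 35, 87, 132, 163],
    "device": [1, 5, 7, 22, 47, 87, 132],
}

# Hypothetical document bodies, invented for this illustration.
TEXTS = {
    1: "company a announced communication device development",
    5: "communication costs and device prices",
    87: "the communication device field",
    132: "sales of communication device products",
}


def candidate_docs(words, index):
    """Intersect the posting lists of all query words."""
    sets = [set(index.get(w, ())) for w in words]
    return sorted(set.intersection(*sets)) if sets else []


def phrase_search(words, index, texts):
    """Conventional method: intersect, then scan every candidate text."""
    phrase = " ".join(words)
    return [d for d in candidate_docs(words, index) if phrase in texts.get(d, "")]


candidates = candidate_docs(["communication", "device"], WORD_INDEX)
hits = phrase_search(["communication", "device"], WORD_INDEX, TEXTS)
```

Every candidate has to be read in full during the scan, so the cost grows with the size of the intersection, which is the weakness the next section raises.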
[0008]
[Problems to Be Solved by the Invention]
However, in the conventional text search method using the index table described above, when a phrase made up of several words is given as the search input, the document texts must be scanned as a final step. In particular, for a phrase made up of frequently occurring words, the number of documents to be scanned is large and the search becomes slow.
[0009]
The present invention has been made in view of the above, and its object is to provide a text search method and device, a text search program, and a storage medium storing the text search program that make it possible to speed up phrase searches.
[0010]
[Means for Solving the Problems]
FIG. 1 is a diagram for explaining the principle of the present invention.
[0011]
The present invention is a text search method for retrieving documents that contain a specific word from a document set by using an index table, the method performing:
an index table creation step (step 1) of forming runs of two words, inserting a delimiter at the boundary between the words of each run, and creating in advance an index table in which, with each run as a key, the IDs of the documents from which that run was extracted are arranged;
a word extraction step (step 2) of, at search time, extracting words by morphological analysis of the input search request;
a run extraction step (step 3) of, when the extracted words form a phrase of two or more words, extracting all two-word runs contained in the phrase; and
a search processing step (step 4) of searching the index table using the extracted runs as keys, and obtaining and outputting the search result.
[0012]
In the search processing step of the present invention,
when a one-word search request is input, all rows in which the first word of the index table key, that is, the character string before the delimiter in the key, matches the word of the search request are extracted, and the union of the document ID lists in those rows is taken as the search result;
when a two-word phrase is input as the search request, the row whose index table key matches the phrase of the search request is extracted, and the set of document IDs in that row is taken as the search result;
when a three-word phrase is input as the search request, all two-word runs contained in the phrase are extracted, the rows in the index table matching each run are looked up to obtain the set of IDs of the documents in which each run appears, the intersection of these sets is taken to obtain the set of IDs of the documents that contain every run of the phrase, the text of each document whose ID is in that set is scanned for a phrase matching the input search request, and the IDs of the documents having a matching phrase are taken as the search result; and
for the case in which a phrase consisting of N words is input as the search request,
if the number of words is N-1 or fewer, the search request is compared with the index table and the union of the document IDs at the prefix-matching entries is taken as the search result;
if the number of words is N, the search request is compared with the index table and the matching document IDs are taken as the search result; and
if the number of words is N+1 or more, the word positions are shifted one at a time, as in 1 to N, 2 to N+1, 3 to N+2, and so on, the intersection of the document IDs is taken, the text of each document whose ID is in that intersection is scanned for a phrase matching the input search request, and the IDs of the documents having a matching phrase are taken as the search result.
[0013]
FIG. 2 is a diagram showing the principle configuration of the present invention.
[0014]
The present invention is a text search device for retrieving documents that contain a specific word from a document set by using an index table, the device comprising:
index table creation means 100 for forming runs of two words, inserting a delimiter at the boundary between the words of each run, and creating in advance an index table in which, with each run as a key, the IDs of the documents from which that run was extracted are arranged;
word extraction means 220 for, at search time, extracting words by morphological analysis of the input search request;
run extraction means 230 for, when the extracted words form a phrase of two or more words, extracting all two-word runs contained in the phrase; and
search processing means 240 for searching the index table using the extracted runs as keys, and obtaining and outputting the search result.
[0015]
The search processing means 240 of the present invention includes:
means for, when a one-word search request is input, extracting all rows in which the first word of the index table key, that is, the character string before the delimiter in the key, matches the word of the search request, and taking the union of the document ID lists in those rows as the search result;
means for, when a two-word phrase is input as the search request, extracting the row whose index table key matches the phrase of the search request, and taking the set of document IDs in that row as the search result;
means for, when a three-word phrase is input as the search request, extracting all two-word runs contained in the phrase, looking up the rows in the index table matching each run to obtain the set of IDs of the documents in which each run appears, taking the intersection of these sets to obtain the set of IDs of the documents that contain every run of the phrase, scanning the text of each document whose ID is in that set for a phrase matching the input search request, and taking the IDs of the documents having a matching phrase as the search result; and
means for, in the case in which a phrase consisting of N words is input as the search request,
if the number of words is N-1 or fewer, comparing the search request with the index table and taking the union of the document IDs at the prefix-matching entries as the search result,
if the number of words is N, comparing the search request with the index table and taking the matching document IDs as the search result, and
if the number of words is N+1 or more, shifting the word positions one at a time, as in 1 to N, 2 to N+1, 3 to N+2, and so on, taking the intersection of the document IDs, scanning the text of each document whose ID is in that intersection for a phrase matching the input search request, and taking the IDs of the documents having a matching phrase as the search result.
[0016]
The present invention is a text search program for retrieving documents that contain a specific word from a document set by using an index table, the program executing:
an index table creation step of forming runs of two words, inserting a delimiter at the boundary between the words of each run, and creating in advance an index table in which, with each run as a key, the IDs of the documents from which that run was extracted are arranged;
a word extraction step of, at search time, extracting words by morphological analysis of the input search request;
a run extraction step of, when the extracted words form a phrase of two or more words, extracting all two-word runs contained in the phrase; and
a search processing step of searching the index table using the extracted runs as keys, and obtaining and outputting the search result.
[0017]
The search processing step of the present invention includes:
a step of, when a one-word search request is input, extracting all rows in which the first word of the index table key, that is, the character string before the delimiter in the key, matches the word of the search request, and taking the union of the document ID lists in those rows as the search result;
a step of, when a two-word phrase is input as the search request, extracting the row whose index table key matches the phrase of the search request, and taking the set of document IDs in that row as the search result;
a step of, when a three-word phrase is input as the search request, extracting all two-word runs contained in the phrase, looking up the rows in the index table matching each run to obtain the set of IDs of the documents in which each run appears, taking the intersection of these sets to obtain the set of IDs of the documents that contain every run of the phrase, scanning the text of each document whose ID is in that set for a phrase matching the input search request, and taking the IDs of the documents having a matching phrase as the search result; and
a step of, in the case in which a phrase consisting of N words is input as the search request,
if the number of words is N-1 or fewer, comparing the search request with the index table and taking the union of the document IDs at the prefix-matching entries as the search result,
if the number of words is N, comparing the search request with the index table and taking the matching document IDs as the search result, and
if the number of words is N+1 or more, shifting the word positions one at a time, as in 1 to N, 2 to N+1, 3 to N+2, and so on, taking the intersection of the document IDs, scanning the text of each document whose ID is in that intersection for a phrase matching the input search request, and taking the IDs of the documents having a matching phrase as the search result.
[0018]
The present invention is a storage medium storing a text search program for retrieving documents that contain a specific word from a document set by using an index table, the storage medium storing a program that executes:
an index table creation step of forming runs of two words, inserting a delimiter at the boundary between the words of each run, and creating in advance an index table in which, with each run as a key, the IDs of the documents from which that run was extracted are arranged;
a word extraction step of, at search time, extracting words by morphological analysis of the input search request;
a run extraction step of, when the extracted words form a phrase of two or more words, extracting all two-word runs contained in the phrase; and
a search processing step of searching the index table using the extracted runs as keys, and obtaining and outputting the search result.
[0019]
The search processing step of the storage medium storing the text search program of the present invention includes:
a step of, when a one-word search request is input, extracting all rows in which the first word of the index table key, that is, the character string before the delimiter in the key, matches the word of the search request, and taking the union of the document ID lists in those rows as the search result;
a step of, when a two-word phrase is input as the search request, extracting the row whose index table key matches the phrase of the search request, and taking the set of document IDs in that row as the search result;
a step of, when a three-word phrase is input as the search request, extracting all two-word runs contained in the phrase, looking up the rows in the index table matching each run to obtain the set of IDs of the documents in which each run appears, taking the intersection of these sets to obtain the set of IDs of the documents that contain every run of the phrase, scanning the text of each document whose ID is in that set for a phrase matching the input search request, and taking the IDs of the documents having a matching phrase as the search result; and
a step of, in the case in which a phrase consisting of N words is input as the search request,
if the number of words is N-1 or fewer, comparing the search request with the index table and taking the union of the document IDs at the prefix-matching entries as the search result,
if the number of words is N, comparing the search request with the index table and taking the matching document IDs as the search result, and
if the number of words is N+1 or more, shifting the word positions one at a time, as in 1 to N, 2 to N+1, 3 to N+2, and so on, taking the intersection of the document IDs, scanning the text of each document whose ID is in that intersection for a phrase matching the input search request, and taking the IDs of the documents having a matching phrase as the search result.
[0020]
As described above, in the present invention the key of the index table is not a single word but a run of two words, and the places where each run appears are recorded. A delimiter (separator character) marking the boundary is inserted between the words of the key. The table is sorted by run key. When a one-word search request arrives, all rows in which the first word of the key (the character string before the delimiter) matches the word of the search request are extracted, and the union of their document ID lists is taken as the search result. Because the table is sorted by run key, the matching rows form one contiguous range. In this case the search speed is about the same as that of a search using a conventional index table.
[0021]
When a two-word phrase is input as the search request, the row whose index table key matches the phrase is extracted, and its set of document IDs is taken as the search result. In this case, unlike the conventional method, no text needs to be scanned, so the search is fast.
[0022]
When a phrase of three or more words is input as the search request, all two-word runs contained in the phrase are extracted, the rows in the table matching each run are located, and the set of IDs of the documents in which each run appears is obtained. The intersection of these document sets is taken, giving the set of IDs of the documents that contain every run of the phrase. The text of each document whose ID is in that set is scanned to check whether it actually contains a phrase matching the search input, and the IDs of the documents that do are output as the search result.
[0023]
With three or more words, the document texts must still be scanned as a final step, as in the conventional method, but because matching is done on two-word runs, far fewer texts need to be scanned at this stage than in the conventional method, and the search is fast.
[0024]
[Embodiments of the Invention]
First, when the texts to be searched are registered in the text database, words are extracted from each text, two-word runs are formed, and the IDs of the documents in which each run appears are registered in the index table with the run as the key.
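The registration step can be pictured as building a mapping from each two-word key to the IDs of the documents that contain it. The following is a minimal sketch under assumptions: the documents are supplied already tokenized (a real system would obtain the words from a morphological analyzer), the three sample documents are invented, and `build_index` is a hypothetical name.

```python
from collections import defaultdict


def build_index(tokenized_docs, delim="$"):
    """Map each two-word key to the sorted IDs of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, words in tokenized_docs.items():
        for first, second in zip(words, words[1:]):
            postings[first + delim + second].add(doc_id)
    # Sort by key so that rows sharing a first word end up adjacent.
    return {key: sorted(ids) for key, ids in sorted(postings.items())}


# Hypothetical pre-tokenized documents, invented for this illustration.
DOCS = {
    1: ["communication", "device", "development"],
    2: ["communication", "cost"],
    3: ["precision", "device", "development"],
}
index = build_index(DOCS)
```

Keeping the keys in sorted order is what later allows a one-word query to be answered from a single contiguous range of rows.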
[0025]
FIG. 3 shows the configuration of the index table creation device according to one embodiment of the present invention.
[0026]
The index table creating apparatus 100 shown in FIG. 1 includes a word extracting unit 110 that extracts words by morphological analysis processing, a run extracting unit 120 that forms a run of two words from the word list obtained by the word extracting unit 120, An index table registering unit 130 registers the document ID of the extracted series in the index table 20 using the selected series as a key.
[0027]
At the time of search, words are extracted by morphological analysis of the search input, and if the analysis shows that the input is a phrase composed of two or more words, all two-word runs included in the phrase are extracted. The index table is then searched using the extracted runs as keys, and the search result is obtained and output.
[0028]
FIG. 4 shows the configuration of a text search device according to an embodiment of the present invention. The text search device 200 shown in the figure includes a search request input unit 210 that receives a search request specified by a user; a word extraction unit 220 that extracts words from the input search request; a run extraction unit 230 that forms runs from the words; a search processing unit 240 that refers to the index table created in advance to acquire the set of IDs of documents in which a run (or a word) appears and, when the search request is composed of three or more words, scans the documents to find those that match the phrase of the request; and a search result output unit 250 that outputs the set of document IDs obtained by the search processing unit 240.
[0029]
【Example】
Hereinafter, embodiments of the present invention will be described with reference to the drawings.
[0030]
Hereinafter, the operations of the above-described index table creation device 100 and text search device 200 will be described using examples.
[0031]
First, the case where the index table 20 is created by the above-described index table creating apparatus 100 of FIG. 3 when registering the text database will be described.
[0032]
FIG. 5 is a flowchart of an operation at the time of creating an index table according to one embodiment of the present invention.
[0033]
Assume that there are 200 texts to be registered in the text database and that serial numbers (IDs) from 1 to 200 are assigned to the texts. Assume further that the texts of ID 1, ID 87, and ID 132 are as follows:
ID 1: "Company A announced that it will proceed with the development of communication equipment."
ID 87: "Company C will enter into the field of communication equipment in addition to precision equipment development."
ID 132: "Sales of communication equipment of company A are increasing."
[0034]
The word extracting unit 110 reads texts from the document set D in order of ID. Here, the number of elements of the document set D is denoted by i (step 101). When the number of texts read exceeds i, the process ends (step 102). The word extracting unit 110 performs morphological analysis on the read text and extracts words (step 103). For example, morphologically analyzing the text of ID 1 and extracting words gives the following result:
"Company A, communication, equipment, development, advance, announce"
[0035]
Next, the run extraction unit 120 forms runs of two words from the extracted word list W. For the text of ID 1 above, the result of run extraction is:
"Company A@communication, communication@equipment, equipment@development, development@advance, advance@announce"
Here, "@" is used as the delimiter between the words of a run. Without the delimiter, one could register "communication equipment" as a single string and handle a search for the word "communication" by forward (prefix) matching. A search for "communication" would then find "communication equipment", but a search on a single character of "communication" would match as well. Allowing such searches is conceivable, but because this method deals with search using a word-based index, it is appropriate to compare the search input with index keys in which the words are delimited.
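The role of the delimiter can be illustrated with a minimal Python sketch. The helper names `make_key` and `first_word_matches` are hypothetical and not part of the original disclosure; the sketch only assumes that a key is the two words joined by "@".

```python
DELIM = "@"  # delimiter placed at the word boundary inside a run


def make_key(first, second):
    """Join two adjacent words into an index key."""
    return f"{first}{DELIM}{second}"


def first_word_matches(key, word):
    """True only when `word` equals the whole first word of the key."""
    return key.startswith(word + DELIM)


key = make_key("通信", "機器")
print(key)                              # 通信@機器
print(first_word_matches(key, "通信"))   # True
print(first_word_matches(key, "通"))     # False: one character is not a whole word
print("通信機器".startswith("通"))        # True: without a delimiter the partial match gets through
```

Appending the delimiter to the search word before the prefix comparison is what restricts the match to word units.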
[0036]
Finally, the index table registration unit 130 registers a document ID using the extracted run as a key.
[0037]
The above processing is performed for all 200 texts to be registered (steps 104 and 105).
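The registration procedure described above can be sketched as follows. This is an illustration under stated assumptions, not the implementation of the embodiment: the morphological analysis is replaced by word lists prepared by hand, the toy documents are hypothetical, and the names `extract_runs` and `build_index` are introduced only for this example.

```python
from collections import defaultdict

DELIM = "@"


def extract_runs(words, n=2):
    """Return every run of n adjacent words, joined with the delimiter."""
    return [DELIM.join(words[i:i + n]) for i in range(len(words) - n + 1)]


def build_index(tokenized_docs):
    """Map each run to the sorted list of IDs of the documents it occurs in."""
    postings = defaultdict(set)
    for doc_id in sorted(tokenized_docs):           # read texts in ID order
        for run in extract_runs(tokenized_docs[doc_id]):
            postings[run].add(doc_id)
    # Sort rows by key so that rows sharing a first word are contiguous.
    return {key: sorted(ids) for key, ids in sorted(postings.items())}


# Toy word lists standing in for morphological-analysis output (hypothetical data).
docs = {
    1: ["company-a", "communication", "equipment", "development", "announce"],
    2: ["communication", "equipment", "sales", "increase"],
}
index = build_index(docs)
print(index["communication@equipment"])   # [1, 2]
print(index["equipment@development"])     # [1]
```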
[0038]
FIG. 6 shows an example of the index table created by the index table creation device 100 as described above.
Next, processing of the text search device 200 will be described.
[0039]
FIG. 7 is a flowchart of the run extraction processing of the text search device according to one embodiment of the present invention.
[0040]
When a search request is input to the search request input unit 210, the word extraction unit 220 extracts words from the search request to obtain a word list (step 201), and the run extraction unit 230 extracts runs over all elements of the word list (steps 202 and 203).
[0041]
Next, the processing at the time of executing the search processing will be described.
[0042]
The search input will be described for one word, for a two-word phrase, and for three or more words.
[0043]
First, the processing in the search processing unit 240 when the search request is one word will be described.
[0044]
FIG. 8 is a flowchart (in the case of one word) of the processing of the search processing unit of the text search device according to one embodiment of the present invention.
[0045]
If the search request is one word, all rows of the index table 20 in which the first word of the key (the character string before the delimiter) matches the word of the search request are extracted (step 302), and the union of their document ID lists is taken as the search result. Because the index table 20 is sorted by key, the matching rows form one continuous range. For example, if the search request is the character string "communication", morphological analysis recognizes it as one word, so the index table 20 is searched for rows in which the first word of the key is "communication". In the example of FIG. 6, the rows "communication cost", "communication equipment", and "communication industry" match. Taking the document ID set of each row and forming their union gives the IDs [1, 5, 9, 11, 17, 35, 87, 132] (step 303). This is output from the search result output unit 250 as the search result.
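A one-word lookup over a sorted table can be sketched as below. The rows are hypothetical toy data, not the table of FIG. 6, and `search_one_word` is a name chosen for this example. Binary search via the standard `bisect` module is one possible way to find the start of the contiguous range; the text does not prescribe it.

```python
from bisect import bisect_left

DELIM = "@"


def search_one_word(sorted_keys, postings, word):
    """Union the document IDs of all rows whose first word equals `word`."""
    prefix = word + DELIM
    result = set()
    # Keys are sorted, so the matching rows form one contiguous range.
    i = bisect_left(sorted_keys, prefix)
    while i < len(sorted_keys) and sorted_keys[i].startswith(prefix):
        result.update(postings[sorted_keys[i]])
        i += 1
    return sorted(result)


# Toy index (hypothetical rows).
postings = {
    "communication@cost": [5, 9],
    "communication@equipment": [1, 5],
    "equipment@development": [1],
}
keys = sorted(postings)
print(search_one_word(keys, postings, "communication"))  # [1, 5, 9]
print(search_one_word(keys, postings, "comm"))           # []
```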
[0046]
Next, a case where the search request is a phrase including two words will be described.
[0047]
FIG. 9 is a flowchart (in the case of two words) of the processing of the search processing unit of the text search device according to one embodiment of the present invention.
[0048]
When a phrase consisting of two words is input as a search request, the row whose index table key matches the phrase of the search request is extracted (step 402), and its set of document IDs is used as the search result (step 403).
[0049]
For example, if the search request is the character string "communication equipment", morphological analysis recognizes it as two words, so the index table 20 is searched for the row whose key matches the two-word run "communication equipment", and the set of document IDs is obtained from the matching row. In the example of FIG. 6, the document ID set [1, 5, 132] is obtained, and this is output as the search result.
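For a two-word phrase the lookup reduces to a single exact-key match, as the following sketch suggests. The dictionary stands in for the index table, its rows are hypothetical, and `search_two_words` is a name introduced only here.

```python
DELIM = "@"


def search_two_words(postings, first, second):
    """Return the document IDs of the row whose key equals the two-word run."""
    return list(postings.get(f"{first}{DELIM}{second}", []))


# Toy index (hypothetical rows, not the table of FIG. 6).
postings = {
    "communication@cost": [5, 9],
    "communication@equipment": [1, 5],
}
print(search_two_words(postings, "communication", "equipment"))  # [1, 5]
print(search_two_words(postings, "communication", "satellite"))  # []
```

No document text is read in this case, which is where the speed advantage for two-word phrases comes from.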
[0050]
Next, a case where the search request is a phrase including three or more words will be described.
[0051]
FIG. 10 is a flowchart (in the case of three words) of the processing of the search processing unit of the text search device according to one embodiment of the present invention.
[0052]
When a phrase consisting of three or more words is input as a search request, all two-word runs included in the phrase are extracted (steps 501 to 503), the row of the index table 20 corresponding to each run is looked up (step 505), and the set of IDs of the documents in which each run appears is obtained (step 506). The intersection of the obtained document sets is calculated (step 506), giving the set of IDs of documents that include all the runs constituting the phrase (step 507). The text of each document whose ID is in that set is scanned (step 509) to check whether it contains a phrase that matches the search input (step 510), and the result is output (steps 511 and 508).
[0053]
For example, if the search request is the character string "communication equipment development", morphological analysis recognizes the three words "communication", "equipment", and "development", and the two-word runs "communication equipment" and "equipment development" are obtained. Next, the index table 20 is searched for the row whose key is "communication equipment" and the row whose key is "equipment development", and taking the intersection of the document ID sets of the two rows gives (1, 5). Finally, each document in the obtained set is scanned to check whether it actually contains the character string "communication equipment development". Document 1 contains the character string "communication equipment development", whereas document 5 contains "communication equipment" and "equipment development" but not the character string "communication equipment development". Document 1 is therefore added to the search results, but document 5 is not.
[0054]
A set of document IDs obtained in this manner is used as a search result.
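The three-stage procedure for phrases of three or more words (run extraction, intersection, final scan) can be sketched as follows. The data are hypothetical and merely mimic the situation in which one candidate contains both runs but not the full phrase; documents are represented as word lists, and `contains_phrase` and `search_phrase` are names assumed for this example.

```python
DELIM = "@"


def contains_phrase(words, phrase):
    """Scan a document's word list for the phrase as a contiguous sequence."""
    n = len(phrase)
    return any(words[i:i + n] == phrase for i in range(len(words) - n + 1))


def search_phrase(postings, docs, phrase):
    """Phrase search for three or more words using a two-word-run index."""
    runs = [DELIM.join(phrase[i:i + 2]) for i in range(len(phrase) - 1)]
    # Intersect the document ID sets of all runs; a missing run yields no hits.
    candidates = set(postings.get(runs[0], ()))
    for run in runs[1:]:
        candidates &= set(postings.get(run, ()))
    # Only the narrowed-down candidates are scanned for the exact phrase.
    return sorted(d for d in candidates if contains_phrase(docs[d], phrase))


# Toy data (hypothetical): document 5 has both runs but not the full phrase.
docs = {
    1: ["communication", "equipment", "development", "announce"],
    5: ["equipment", "development", "and", "communication", "equipment"],
}
postings = {
    "communication@equipment": [1, 5],
    "equipment@development": [1, 5],
}
print(search_phrase(postings, docs, ["communication", "equipment", "development"]))  # [1]
```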
[0055]
In the case of three or more words, the document text must be scanned at the end, as in the conventional method. However, because two-word runs are used for matching, the number of texts to be scanned at this stage is far smaller, and the search is much faster than with the conventional method.
[0056]
More specifically, when the method is extended to runs of three or four words (runs of N words), the index table is created in the same way as for two-word runs. At search time:
- When the number of words is N-1 or less, the search words are compared with the keys of the index table, and the union of the document IDs of the rows that match by forward (prefix) match is taken.
[0056]
- When the number of words is N, the search words are compared with the keys of the index table, and the document IDs of the matching row are extracted.
[0057]
- When the number of words is N+1 or more, runs are taken while shifting the word position one at a time (words 1 to N, 2 to N+1, 3 to N+2, and so on), and the intersection of their document ID sets is obtained. The text is then scanned in the same way as for two-word runs.
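The three cases for runs of N words can be combined into one dispatch function, sketched below under simplifying assumptions: the index is a plain dictionary scanned linearly for the prefix case (a sorted table would allow a range lookup instead), documents are word lists, and all names and data are hypothetical.

```python
DELIM = "@"


def search_n(postings, docs, query, n):
    """Search with an index keyed on runs of n words (query is a word list)."""
    if len(query) < n:
        # Fewer words than a run: forward (prefix) match, then union.
        prefix = DELIM.join(query) + DELIM
        hits = [ids for key, ids in postings.items() if key.startswith(prefix)]
        return sorted(set().union(*hits))
    runs = [DELIM.join(query[i:i + n]) for i in range(len(query) - n + 1)]
    # Exactly n words gives one run; more words give several, which are intersected.
    candidates = set(postings.get(runs[0], ()))
    for run in runs[1:]:
        candidates &= set(postings.get(run, ()))
    if len(query) == n:
        return sorted(candidates)
    # Longer phrases still need a final scan of the candidate texts.
    m = len(query)
    return sorted(d for d in candidates
                  if any(docs[d][i:i + m] == query
                         for i in range(len(docs[d]) - m + 1)))


# Toy documents and a three-word-run index built from them (hypothetical data).
docs = {1: ["a", "b", "c", "d"], 2: ["b", "c", "d", "a", "b", "c"]}
postings = {}
for doc_id, words in docs.items():
    for i in range(len(words) - 2):
        postings.setdefault(DELIM.join(words[i:i + 3]), []).append(doc_id)

print(search_n(postings, docs, ["a", "b"], 3))            # [1, 2]
print(search_n(postings, docs, ["b", "c", "d"], 3))       # [1, 2]
print(search_n(postings, docs, ["a", "b", "c", "d"], 3))  # [1]
```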
[0059]
In principle, the longer the run, the faster the search for a long search string, but the index grows accordingly. In practice, the index is already considerably large even with two-word runs. Moreover, increasing the run length to three or four words does not improve the narrowing-down very much, and most search requests consist of one or two words. Taking these points together, a run length of two or three is appropriate.
[0060]
Further, it is possible to construct the operation of the configuration in the above-described embodiments and examples as a program and install it in a computer used as a text search device.
[0061]
Further, the constructed program may be stored in a hard disk device connected to the computer used as the text search device, or in a portable storage medium such as a flexible disk or a CD-ROM, and installed when the present invention is carried out.
[0062]
It should be noted that the present invention is not limited to the above embodiments and examples, and various changes and applications are possible within the scope of the claims.
[0063]
【The invention's effect】
As described above, according to the present invention, a search for a phrase including a plurality of words can be executed at high speed.
[Brief description of the drawings]
FIG. 1 is a diagram for explaining the principle of the present invention.
FIG. 2 is a principle configuration diagram of the present invention.
FIG. 3 is a configuration diagram of an index table creating device according to an embodiment of the present invention.
FIG. 4 is a configuration diagram of a text search device according to an embodiment of the present invention.
FIG. 5 is a flowchart of an operation when creating an index table according to an embodiment of the present invention.
FIG. 6 is an example of an index table according to an embodiment of the present invention.
FIG. 7 is a flowchart of the run extraction process of the text search device according to one embodiment of the present invention.
FIG. 8 is a flowchart (for one word) of a process of a search processing unit of the text search device according to one embodiment of the present invention.
FIG. 9 is a flowchart (in the case of two words) of processing of a search processing unit of the text search device according to one embodiment of the present invention.
FIG. 10 is a flowchart (for three words) of processing of a search processing unit of the text search device according to one embodiment of the present invention.
FIG. 11 is an example of a conventional index table.
[Explanation of symbols]
10 Document set
20 Index table
100 Index table creating means, index table creating device
110 Word extraction unit
120 Run extraction unit
130 Index table registration unit
210 Search request input unit
220 Word extracting means, word extraction unit
230 Run extracting means, run extraction unit
240 Search processing means, search processing unit
250 Search result output unit

Claims

A text search method for searching a set of documents for documents including a specific word by using an index table, the method comprising:
an index table creation process of forming runs of two words, inserting a delimiter at the boundary between the words constituting each run, and creating an index table in which the IDs of the documents from which each run was extracted are arranged using the run as a key;
a word extraction process of extracting words by morphologically analyzing an input search request at search time;
a run extraction process of extracting, when the extracted words form a phrase composed of two or more words, all runs of two words included in the phrase; and
a search process of searching the index table using the extracted runs as keys, and obtaining and outputting a search result.

The text search method according to claim 1, wherein, in the search process:
when a search request of one word is input, all rows in which the first word of the key of the index table, that is, the character string before the delimiter of the key, matches the word of the search request are extracted, and the union of the document ID lists included in those rows is used as the search result;
when a phrase consisting of two words is input as a search request, the row whose key in the index table matches the phrase of the search request is extracted, and the set of document IDs of that row is used as the search result;
when a phrase consisting of three words is input as a search request, all runs of two words included in the phrase are extracted, the row of the index table that matches each run is looked up to obtain the set of IDs of the documents in which the run appears, the intersection of the document ID sets is taken to obtain the set of IDs of documents including all the runs constituting the phrase, the text of each document whose ID is included in that set is scanned to check for the presence of a phrase that matches the input search request, and the IDs of the documents having a matching phrase are used as the search result; and
when the runs consist of N words and a phrase is input as a search request:
if the number of words is N-1 or less, the search request is compared with the index table, and the union of the document IDs of the forward-matching rows is used as the search result;
if the number of words is N, the search request is compared with the index table, and the matching document IDs are used as the search result; and
if the number of words is N+1 or more, the word position is shifted one at a time (1 to N, 2 to N+1, 3 to N+2, ...) to obtain the intersection of the document IDs, the text of each document having a matching document ID is scanned to check for the presence of a phrase that matches the input search request, and the IDs of the documents having a matching phrase are used as the search result.

A text search device for searching a set of documents for documents including a specific word by using an index table, the device comprising:
index table creating means for forming runs of two words, inserting a delimiter at the boundary between the words constituting each run, and creating an index table in which the IDs of the documents from which each run was extracted are arranged using the run as a key;
word extracting means for extracting words by morphologically analyzing an input search request at search time;
run extracting means for extracting, when the extracted words form a phrase composed of two or more words, all runs of two words included in the phrase; and
search processing means for searching the index table using the extracted runs as keys, and obtaining and outputting a search result.

The text search device, wherein the search processing means includes:
means for, when a search request of one word is input, extracting all rows in which the first word of the key of the index table, that is, the character string before the delimiter of the key, matches the word of the search request, and using the union of the document ID lists included in those rows as the search result;
means for, when a phrase consisting of two words is input as a search request, extracting the row whose key in the index table matches the phrase of the search request, and using the set of document IDs of that row as the search result;
means for, when a phrase consisting of three words is input as a search request, extracting all runs of two words included in the phrase, looking up the row of the index table that matches each run to obtain the set of IDs of the documents in which the run appears, taking the intersection of the document ID sets to obtain the set of IDs of documents including all the runs constituting the phrase, scanning the text of each document whose ID is included in that set to check for the presence of a phrase that matches the input search request, and using the IDs of the documents having a matching phrase as the search result; and
means for, when the runs consist of N words and a phrase is input as a search request:
if the number of words is N-1 or less, comparing the search request with the index table and using the union of the document IDs of the forward-matching rows as the search result;
if the number of words is N, comparing the search request with the index table and using the matching document IDs as the search result; and
if the number of words is N+1 or more, shifting the word position one at a time (1 to N, 2 to N+1, 3 to N+2, ...) to obtain the intersection of the document IDs, scanning the text of each document having a matching document ID to check for the presence of a phrase that matches the input search request, and using the IDs of the documents having a matching phrase as the search result.

A text search program for searching a set of documents for documents including a specific word by using an index table, the program comprising:
an index table creating step of forming runs of two words, inserting a delimiter at the boundary between the words constituting each run, and creating an index table in which the IDs of the documents from which each run was extracted are arranged using the run as a key;
a word extraction step of extracting words by morphologically analyzing an input search request at search time;
a run extraction step of extracting, when the extracted words form a phrase composed of two or more words, all runs of two words included in the phrase; and
a search processing step of searching the index table using the extracted runs as keys, and obtaining and outputting a search result.

The text search program, wherein the search processing step includes:
a step of, when a search request of one word is input, extracting all rows in which the first word of the key of the index table, that is, the character string before the delimiter of the key, matches the word of the search request, and using the union of the document ID lists included in those rows as the search result;
a step of, when a phrase consisting of two words is input as a search request, extracting the row whose key in the index table matches the phrase of the search request, and using the set of document IDs of that row as the search result;
a step of, when a phrase consisting of three words is input as a search request, extracting all runs of two words included in the phrase, looking up the row of the index table that matches each run to obtain the set of IDs of the documents in which the run appears, taking the intersection of the document ID sets to obtain the set of IDs of documents including all the runs constituting the phrase, scanning the text of each document whose ID is included in that set to check for the presence of a phrase that matches the input search request, and using the IDs of the documents having a matching phrase as the search result; and
a step of, when the runs consist of N words and a phrase is input as a search request:
if the number of words is N-1 or less, comparing the search request with the index table and using the union of the document IDs of the forward-matching rows as the search result;
if the number of words is N, comparing the search request with the index table and using the matching document IDs as the search result; and
if the number of words is N+1 or more, shifting the word position one at a time (1 to N, 2 to N+1, 3 to N+2, ...) to obtain the intersection of the document IDs, scanning the text of each document having a matching document ID to check for the presence of a phrase that matches the input search request, and using the IDs of the documents having a matching phrase as the search result.

A storage medium storing a text search program for searching a set of documents for documents including a specific word by using an index table, the program comprising:
an index table creating step of forming runs of two words, inserting a delimiter at the boundary between the words constituting each run, and creating an index table in which the IDs of the documents from which each run was extracted are arranged using the run as a key;
a word extraction step of extracting words by morphologically analyzing an input search request at search time;
a run extraction step of extracting, when the extracted words form a phrase composed of two or more words, all runs of two words included in the phrase; and
a search processing step of searching the index table using the extracted runs as keys, and obtaining and outputting a search result.

The storage medium storing a text search program, wherein the search processing step includes:
a step of, when a search request of one word is input, extracting all rows in which the first word of the key of the index table, that is, the character string before the delimiter of the key, matches the word of the search request, and using the union of the document ID lists included in those rows as the search result;
a step of, when a phrase consisting of two words is input as a search request, extracting the row whose key in the index table matches the phrase of the search request, and using the set of document IDs of that row as the search result;
a step of, when a phrase consisting of three words is input as a search request, extracting all runs of two words included in the phrase, looking up the row of the index table that matches each run to obtain the set of IDs of the documents in which the run appears, taking the intersection of the document ID sets to obtain the set of IDs of documents including all the runs constituting the phrase, scanning the text of each document whose ID is included in that set to check for the presence of a phrase that matches the input search request, and using the IDs of the documents having a matching phrase as the search result; and
a step of, when the runs consist of N words and a phrase is input as a search request:
if the number of words is N-1 or less, comparing the search request with the index table and using the union of the document IDs of the forward-matching rows as the search result;
if the number of words is N, comparing the search request with the index table and using the matching document IDs as the search result; and
if the number of words is N+1 or more, shifting the word position one at a time (1 to N, 2 to N+1, 3 to N+2, ...) to obtain the intersection of the document IDs, scanning the text of each document having a matching document ID to check for the presence of a phrase that matches the input search request, and using the IDs of the documents having a matching phrase as the search result.