JP4219125B2 - Full-text search device, full-text search method, program, and recording medium

Info

Publication number: JP4219125B2
Application number: JP2002214343A
Authority: JP
Inventors: 研策山本; 泰嗣小川; 哲也池田; 卓也平岡; 弘志竹川; 一繁浅田
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 2001-07-24
Filing date: 2002-07-23
Publication date: 2009-02-04
Anticipated expiration: 2022-07-23
Also published as: JP2003122794A

Description

【０００１】
【発明の属する技術分野】
本発明は、全文検索装置、全文検索方法、プログラム、及び記録媒体に関し、より詳細には、複数の文書データから指定された文字列を含む文書を検索する全文検索装置、全文検索方法、プログラム、及び記録媒体に関する。本発明は、例えば文書管理システム、電子図書館システム、特許公報検索システムなど、多量の文書データを管理するシステムに適用可能である。
【０００２】
【従来の技術】
近年、情報通信技術の発達により電子化された文書及びその文書に関する情報がインターネットなどを介して大量に流通している。この電子化文書及び情報の流通に際し、所望の文書を精度よく、さらには高速に検索する文書検索装置が提案されている。
【０００３】
そのような文書検索装置においてはキーワード検索手法や全文検索手法が用いられている。全文検索手法を用いた全文検索装置は、任意の検索文字列と検索対象の文書全てとの間で照合を行なって、検索文字列を含む文書を漏れなく抽出する装置であり、キーワード検索手法のように検索対象となる全ての文書に対してキーワードを予め付与するといった多大な人力が必要ない。全文検索装置としては、様々な種類のものが提案されているが、その１種として転置（索引）ファイル方式を採用した装置がある。転置ファイル方式では、検索のための補助ファイルとして、文字／単語／n-gram（n文字連接）などが出現する文書、或いはそれらの文書中の出現位置を記録する転置ファイルを予め構築し、全文検索時には、転置ファイルのみを用いて検索するもので非常に高速な検索を行なうことが可能であり大量文書の高速検索が要求されるシステムに対して有効である。
【０００４】
なお、全文検索方式一般、転置ファイル方式の詳細については、文献「情報検索アルゴリズム」（北研二、津田和彦、獅々堀正幹著、共立出版株式会社、pp.160-179）、特開平１１−０７３４２９号公報の従来技術、及び全文検索システム協議会平成１０年度活動報告(http://www.ftsanet.com/dbtokyo99/Db99.htm)などで述べられており、公知であるのでその説明を省略する。
【０００５】
転置ファイル方式を採用した従来技術として、特許第３０２４５４４号公報には、検索用インデックスファイルとは別にリアルタイム処理データを記憶することにより、検索用インデックスファイルを更新中であっても検索処理を行うことが可能な情報検索装置が記載されている。また、特開平７−１４６８８０号公報には、新規文書を登録する際に、主インデックスよりも小さな副インデックスに登録し、登録時間を短くすることが可能な文書検索装置及び方法が記載されている。
【０００６】
しかしながら、上述した公報も含め、転置ファイル方式では通常原データの数倍にも及ぶ転置ファイルを構築する必要があり、転置ファイル方式の全文索引は登録されている文書データ量が多くなるにしたがって登録・削除処理に時間を要するようになり、全文検索装置としては利用者側からみた登録・削除処理のレスポンスタイムが長くなる。
【０００７】
【発明が解決しようとする課題】
本発明は、上述のごとき実情に鑑みてなされたものであり、利用者側からみた登録及び削除処理のレスポンスタイムを短くすることが可能な、全文検索装置、全文検索方法、コンピュータをその装置として機能させるためのプログラム、コンピュータにその方法の手順を実行させるためのプログラム、及びそれらのプログラムを記録したコンピュータ読み取り可能な記録媒体を提供することをその目的とする。
【０００９】
【課題を解決するための手段】
請求項１の発明は、入力された文書データを記憶する文書データ記憶部から、入力された検索条件を含む文書データを、登録用の第１全文索引記憶部、削除用の第２全文索引記憶部、検索用の第３全文索引記憶部を用いて検索する全文検索装置であって、登録処理を行う場合は、前記文書データからトークンとトークンの出現位置に関する情報を取得し、該トークンと該トークンの出現位置に関する情報を前記登録用の第１全文索引記憶部に記憶する登録処理手段と、削除処理を行う場合は、前記トークンが前記登録用の第１全文索引記憶部に記憶されているかを判定し、該トークンの出現位置に関する情報が該第１全文索引記憶部に記憶されている場合には、該トークンの出現位置に関する情報を該第１全文索引記憶部から削除し、該トークンの出現位置に関する情報が該第１全文索引記憶部に記憶されていない場合には、該トークンと該トークンの出現位置に関する情報を前記削除用の第２全文索引記憶部に登録する削除処理手段と、検索処理を行う場合は、前記入力された検索条件を含む文書データの文書識別子の第１の集合を、前記第３全文索引記憶部の索引を用いて求め、前記入力された検索条件を含む文書データの文書識別子の第２の集合を、前記第１全文索引記憶部の索引を用いて求め、前記入力された検索条件を含む文書データの文書識別子の第３の集合を、前記第２全文索引記憶部の索引を用いて求め、その後、前記第１の集合及び前記第２の集合の和集合から前記第３の集合を差し引いた集合を検索結果として出力する検索処理手段と、前記第３全文索引記憶部の索引に記憶されているトークンの転置リストに対して、前記第１全文索引記憶部から取り出したトークンの転置リストを加えて、該第３全文索引記憶部に記憶されている前記第２全文索引記憶部から取り出したトークンの転置リストの出現位置に関する情報を削除するとともに、該第１全文索引記憶部の転置リスト及び該第２全文索引記憶部の転置リストをマージする処理を実行するマージ手段と、を有することを特徴としたものである。
【００１３】
請求項２の発明は、請求項１の発明において、前記マージ手段は、前記第１全文索引記憶部又は前記第２全文索引記憶部に記憶された文書データの件数が予め指定された件数に達したときに、前記マージする処理を実行することを特徴としたものである。
【００１４】
請求項３の発明は、請求項１の発明において、前記マージ手段は、前記第１全文索引記憶部又は前記第２全文索引記憶部の容量が予め指定された容量に達したときに、前記マージする処理を実行することを特徴としたものである。
【００１５】
請求項４の発明は、請求項２又は３の発明において、前記第１全文索引記憶部又は前記第２全文索引記憶部を複数含み、前記第３全文索引記憶部にデータをマージする処理を行っている第１全文索引記憶部又は第２全文索引記憶部とは異なる、他の第１全文索引記憶部又は第２全文索引記憶部を使用して、登録処理又は削除処理を行うことを特徴としたものである。
【００１６】
請求項５の発明は、請求項２又は３の発明において、前記第１全文索引記憶部又は第２全文索引記憶部を二つ含み、うち一つの第１全文索引記憶部又は第２全文索引記憶部から前記第３全文索引記憶部にデータをマージする処理を行っている間は、もう一つの第１全文索引記憶部又は第２全文索引記憶部を使用して、登録処理又は削除処理を行うことを特徴としたものである。
【００１７】
請求項６の発明は、入力された文書データを記憶する文書データ記憶部から、入力された検索条件を含む文書データを、登録用の第１全文索引記憶部、削除用の第２全文索引記憶部、検索用の第３全文索引記憶部を用いて検索する全文検索方法であって、登録処理手段が、登録処理を行う場合は、前記文書データからトークンとトークンの出現位置に関する情報を取得し、該トークンと該トークンの出現位置に関する情報を前記登録用の第１全文索引記憶部に記憶する登録処理ステップと、削除処理手段が、削除処理を行う場合は、前記トークンが前記登録用の第１全文索引記憶部に記憶されているかを判定し、該トークンの出現位置に関する情報が該第１全文索引記憶部に記憶されている場合には、該トークンの出現位置に関する情報を該第１全文索引記憶部から削除し、該トークンの出現位置に関する情報が該第１全文索引記憶部に記憶されていない場合には、該トークンと該トークンの出現位置に関する情報を前記削除用の第２全文索引記憶部に登録する削除処理ステップと、検索処理手段が、検索処理を行う場合は、前記入力された検索条件を含む文書データの文書識別子の第１の集合を、前記第３全文索引記憶部の索引を用いて求め、前記入力された検索条件を含む文書データの文書識別子の第２の集合を、前記第１全文索引記憶部の索引を用いて求め、前記入力された検索条件を含む文書データの文書識別子の第３の集合を、前記第２全文索引記憶部の索引を用いて求め、その後、前記第１の集合及び前記第２の集合の和集合から前記第３の集合を差し引いた集合を検索結果として出力する検索処理ステップと、マージ手段が、前記第３全文索引記憶部の索引に記憶されているトークンの転置リストに対して、前記第１全文索引記憶部から取り出したトークンの転置リストを加えて、該第３全文索引記憶部に記憶されている前記第２全文索引記憶部から取り出したトークンの転置リストの出現位置に関する情報を削除するとともに、該第１全文索引記憶部の転置リスト及び該第２全文索引記憶部の転置リストをマージする処理を実行するマージステップと、を含むことを特徴としたものである。
【００１８】
請求項７の発明は、請求項６の発明において、前記マージステップは、前記第１全文索引記憶部又は前記第２全文索引記憶部に記憶された文書データの件数が予め指定された件数に達したときに、前記マージする処理を実行することを特徴としたものである。
【００１９】
請求項８の発明は、請求項６の発明において、前記マージステップは、前記第１全文索引記憶部又は前記第２全文索引記憶部の容量が予め指定された容量に達したときに、前記マージする処理を実行することを特徴としたものである。
【００４９】
請求項９の発明は、コンピュータに、請求項６〜８のいずれか１項に記載の全文検索方法を実行させるためのプログラムである。
【００５０】
請求項１０の発明は、請求項９に記載のプログラムを記録したコンピュータ読み取り可能な記録媒体である。
【００５１】
【発明の実施の形態】
本発明は、小規模の全文索引を登録用及び削除用に別に用意し登録及び削除のレスポンスタイムの悪化を防ぎ、検索処理の際には大規模の全文索引の検索結果に、登録用の小規模全文索引の検索結果を加え、削除用の小規模全文索引の検索結果を除き、利用者に返す検索結果とする全文検索装置であり、これは、本出願人による特願２００１−７８０２６号明細書に記載の手法を全文検索装置に適用し、上述した課題を解決したものである。
【００５２】
なお、上述の特願２００１−７８０２６号明細書には、高度な検索要求に高速に応答できる性能を維持しつつ、システム稼働中の更新性能をさらに向上させることができるデータベース管理システム、プログラム、及び記録媒体が記載されており、登録・削除のためのデータ保持手段を検索向けデータ保持手段とは別に用意することによって、登録・削除のスループットを高くすることを特徴としている。しかしながら、上述の明細書に記載の手法では、登録用及び削除用の小規模な全文索引から検索用の大規模な全文索引へのデータ転送手段で小規模索引に登録されている文書データの識別子から元の文書データを取得し、大規模な索引に登録及び削除を行っている。上述のごとく、大規模な全文索引への登録・削除処理には時間がかかるので、データ転送処理の時間が長くなる。一般に全文索引への登録・削除処理の間は検索処理が行えないので、利用者から見た検索処理のレスポンスタイムが悪くなるという問題がある。
【００５３】
本発明では、小規模な全文索引から大規模な全文索引へのデータ転送手段において、元の文書データを用いるのではなく転置ファイル方式の全文索引を用いることによって、すなわち全文索引の構成要素である転置リストを用いることによって、データ転送に要する時間を短くするようにしている。
【００５４】
図１は、本発明の一実施形態に係る全文検索装置の機能を説明するためのブロック図、図２は、図１における全文検索装置をスタンドアロンで構成した場合のハードウェア構成例を示す図、図３は、図１における全文検索装置をサーバ／クライアントで構成した場合のハードウェア構成例を示す図である。
【００５５】
本発明に係る全文検索装置は、複数の文書データ（複数の電子化文書）から指定された文字列を含む文書を検索する装置である。なお、本明細書中、全文検索装置における「全文検索」とは、検索すべき全ての文字列を対象とした検索装置であることを意味しており、したがって、例えばＳＧＭＬ等のタグ付の文書であれば、適宜、所定のタグ間にある文字列のみを対象としてもよい。
【００５６】
図１を参照すると、本実施形態においては、入力手段１では、登録処理用のテキストデータ，削除処理用の文書識別子，検索処理用の検索条件などが入力され、夫々、登録処理手段３，削除処理手段４，検索処理手段５に渡される。登録処理手段３では文書データに関する登録処理を行う。登録処理手段３における登録処理は文書データ記憶部７及び登録用小規模全文索引記憶部９に対して行われる。削除処理手段４では文書データに関する削除処理を行う。削除処理手段４における削除処理は、入力手段１で入力された文書識別子に基づいて、文書データ記憶部７に記憶された文書データを読み出し、テキスト分割手段６を用い、登録用小規模全文索引記憶部９に登録された索引である場合にはそれを削除し、登録された索引でない場合には削除用小規模全文索引記憶部１０にその索引を記録する。なお、テキスト分割手段６では、登録処理手段３，削除処理手段４，検索処理手段５の各々で必要な、登録処理における文書データから部分文字列への分割処理、削除処理における文書データから部分文字列への分割処理、検索処理における検索条件（検索文字列）から部分文字列への分割処理を行う。
【００５７】
また、検索処理手段５における検索処理は、検索用大規模全文索引記憶部８，登録用小規模全文索引記憶部９，削除用小規模全文索引記憶部１０に対して実行し、記憶部８及び９の検索結果から記憶部１０における検索結果を差し引いた結果を求め、検索結果として出力手段２で出力する。マージ手段１１においては、検索用大規模全文索引記憶部８，登録用小規模全文索引記憶部９，削除用小規模全文索引記憶部１０間でのマージ処理（広義でデータ転送ともいえる）を行う。
【００５８】
なお、以降、特に説明はしないが、削除処理手段４における削除処理に関し、削除用小規模全文索引記憶部１０を使用しなくとも、例えば削除する文書データのみを削除して処理時間が得られる休日などに、文書データ記憶部７に存在する文書データと整合して検索用大規模全文索引記憶部８のデータを更新するなど、他の削除用の文書データ（及び索引）管理方法を用い、登録用小規模全文索引記憶部９を使用した登録処理のみを行う形態も採用できる。逆に、登録用小規模全文索引記憶部９を使用した登録処理を行わず、削除用小規模全文索引記憶部１０のみを使用した削除処理のみを行う形態も採用できる。
【００５９】
図２に示すスタンドアロンでのハードウェア構成においては、図１における入力手段１は入力装置２１に実現され、出力手段２は表示装置２２に実現される。各種処理手段３〜６，１１は主制御装置（ＣＰＵ，メモリ等）２４に、各種記憶部７〜１０は、例えば全てを記憶装置２５として、或いは個々の記憶装置として、さらには記憶装置２５におけるファイルとして実現される。例えば、１つの限られた記憶装置を用いて本発明に係る全文検索を行う場合、検索処理をメインに行うのか、登録・削除処理をメインに行うのかで、その使用する領域を上手く割り当てるとよい。また、入出力制御装置２３は主制御装置２４の制御信号に従って入力装置２１及び表示装置２２を制御する。
【００６０】
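In the server/client hardware configuration shown in FIG. 3, the input means 1 of FIG. 1 is realized by the input device 31 of the client 30, and the output means 2 is realized by the display device 32 of the client 30. The processing means 3 to 6 and 11 are realized on the main control devices (CPU, memory, etc.) 34 and 52 of the client 30 and the server 50. The storage units 7 to 10 are realized, for example, all together as the storage device 53 of the server 50, as individual storage devices connected to the server 50, or as files in the storage device 53. The network control devices 35 and 51 of the client 30 and the server 50 control data transmission and the like between the client 30 and the server 50 via the network 40. The input/output control device 33 of the client 30 controls the input device 31 and the display device 32 in accordance with control signals from the main control device 34.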
【００６１】
以下に、上述のごとく構成された本実施形態に係る全文検索装置の動作の一例を詳細に説明する。
図４乃至図６は、図１の全文検索装置における処理例を説明するためのフロー図である。
全文検索装置は、利用者からの処理要求を受け取ると（ステップＳ１）、まず、その処理が、登録処理であるのか（ステップＳ２）、削除処理であるのか（ステップＳ３）、検索処理であるのか（ステップＳ３でＮＯ）を判定する。全文検索装置は、この判定に基づいて以下の各処理を実行することとなる。
【００６２】
（登録処理）
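To execute registration processing, the user first creates document data and registers it through the input means 1. The registration processing means 3 saves the document data in the document data storage unit 7 and at the same time assigns an identifier (document identifier) to that document data (step S11). For a tagged document such as SGML, only the character strings between predetermined tags may be targeted as appropriate. The registration processing means 3 then uses the text dividing means 6 to obtain partial character strings (tokens) and the appearance position information of each token from the document data (step S12). Finally, the document identifier and the appearance position information of each token are recorded in the registration small-scale full-text index storage unit 9 (step S13). Here "recording" means recording in the full-text index of the storage unit (the same applies below), and processing such as step S13 is also called an index storing step. The text dividing means 6 may use N-character sequences as tokens, or may perform morphological analysis and use words as tokens. The examples below are limited to a text dividing means that uses N-character sequences as tokens, but they apply equally to tokens obtained as words by morphological analysis. As described later, merge processing is performed at an appropriate time after the recording in step S13 (steps S14 to S16).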
【００６３】
図７は、図１の全文検索装置における処理を説明するための図で、全文索引の一例を示す図である。図７の例を用いて転置ファイル方式の全文索引について詳細に説明する。
登録文書データを文書１，文書２とし、それらの内容（ここではテキスト分割手段６で分割することにより得た内容）がそれぞれ、図７の符号６１，６２で表されるものとする。ここで、各文書の左の数字は文字列の先頭からの文字数を表している。つまり、文書１では、「全文検索」は先頭から１１文字目、「方法」は２０，６０文字目、「全文検索方法」は３１文字目に出現していることを意味する。また文書２では、「探索方法」は先頭から１文字目、「方法」は２４文字目、「全文」は３０，４２文字目に出現していることを意味する。
【００６４】
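When two-character sequences are used as partial character strings, all partial character strings in a document are extracted, and their appearance positions within the document (character counts from the beginning) are collected per partial character string and recorded in the index. For example, in document 1, "全文" appears at positions 11 and 31 and "文検" appears at positions 12 and 32, so these are recorded in the index. The index records not only the appearance positions within a document but also the document identifier that identifies which document the token appeared in and the number of occurrences, so it takes the form indicated by reference numeral 63 in FIG. 7. For example, the inverted lists {1, 2, (11, 31)} and {2, 2, (30, 42)} for "全文" mean, respectively, that the token appears twice in document 1 at positions 11 and 31, and twice in document 2 at positions 30 and 42.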
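The two-character (bigram) index described here can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names `bigram_tokens`, `add_document`, and `inverted_list`, the dictionary layout, and the two sample texts are hypothetical and do not come from the patent. Only the entry shape {document identifier, occurrence count, (positions)} and the 1-based character positions follow the description.

```python
from collections import defaultdict


def bigram_tokens(text):
    """Yield (token, position) pairs; positions are 1-based character counts."""
    for i in range(len(text) - 1):
        yield text[i:i + 2], i + 1


def add_document(index, doc_id, text):
    """Record, per token, the positions at which it appears in doc_id."""
    positions = defaultdict(list)
    for token, pos in bigram_tokens(text):
        positions[token].append(pos)
    for token, plist in positions.items():
        index.setdefault(token, {})[doc_id] = plist


def inverted_list(index, token):
    """Return entries in the {doc_id, count, (positions)} form of FIG. 7."""
    return [(doc_id, len(p), tuple(p))
            for doc_id, p in sorted(index.get(token, {}).items())]


index = {}
add_document(index, 1, "全文検索と全文索引")  # hypothetical sample text
add_document(index, 2, "索引の全文")          # hypothetical sample text
print(inverted_list(index, "全文"))
```

Keeping positions per document inside each token's entry is what later allows both phrase matching and merging without re-reading the original documents.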
【００６５】
（削除処理）
削除処理を実行するには、まず利用者が入力手段１から削除する文書の文書識別子を入力する。次に、削除処理手段４において文書データ記憶部７から文書識別子に対応する文書データを読み出す（ステップＳ２１）。さらに削除処理手段４において、テキスト分割手段６を用いて文書データから部分文字列（トークン）とそのトークンの出現位置情報を得る（ステップＳ２２）。例えばＳＧＭＬ等のタグ付の文書であれば、適宜、所定のタグ間にある文字列のみを対象としてもよい。文書識別子が登録用小規模全文索引に登録されている文書識別子かを判定し（ステップＳ２３）、登録用小規模全文索引に登録されている文書識別子である場合には、各トークンの出現位置情報を登録用小規模全文索引記憶部９から削除する（ステップＳ２５）。文書識別子が登録用小規模全文索引に登録されていない場合（検索用大規模全文索引に登録されている場合）には、文書識別子と各トークンの出現位置情報を削除用小規模全文索引記憶部１０に記録する（ステップＳ２４）。そして、削除処理手段４において文書データ記憶部７から文書識別子に対応する文書データを削除する（ステップＳ２９）。また、後述するが、ステップＳ２４における記録の後、適時マージ処理が行われる（ステップＳ２６〜Ｓ２８）。
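The branching in steps S23 to S25 can be sketched as follows. This is a hedged illustration, not the patented implementation: `bigram_postings`, `delete_document`, the returned status strings, and the nested-dictionary index layout (token -> {document identifier: positions}) are all assumptions made for the example.

```python
def bigram_postings(text):
    """Split text into bigram tokens with 1-based positions (steps S21-S22)."""
    postings = {}
    for i in range(len(text) - 1):
        postings.setdefault(text[i:i + 2], []).append(i + 1)
    return postings


def delete_document(doc_id, text, reg_index, del_index):
    """Route a deletion to the registration index or the deletion index."""
    postings = bigram_postings(text)
    # Step S23: is this document still held only in the small registration index?
    registered = any(doc_id in reg_index.get(tok, {}) for tok in postings)
    if registered:
        # Step S25: remove its position information from the registration index.
        for tok in postings:
            docs = reg_index.get(tok)
            if docs is not None:
                docs.pop(doc_id, None)
                if not docs:
                    del reg_index[tok]
        return "removed from registration index"
    # Step S24: the document lives in the large search index, so record
    # its tokens and positions in the small deletion index instead.
    for tok, positions in postings.items():
        del_index.setdefault(tok, {})[doc_id] = positions
    return "recorded in deletion index"


reg_index = {"全文": {5: [1]}}
del_index = {}
first = delete_document(5, "全文", reg_index, del_index)
second = delete_document(1, "全文", reg_index, del_index)
```

Neither branch touches the large search index, which is why the deletion returns quickly to the user.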
【００６６】
（検索処理）
検索処理を実行するには、まず利用者が入力手段１から検索文字列を入力する。次に、検索処理手段５において、テキスト分割手段６を用いて検索文字列からトークンを得る（ステップＳ４）。また、検索処理手段５において検索用大規模全文索引記憶部８の検索用大規模全文索引を用いて、検索文字列を含む文書データの文書識別子の集合（Ｒｓ）を得る（ステップＳ５）とともに、登録用小規模全文索引記憶部９の登録用小規模全文索引を用いて、検索文字列を含む文書データの文書識別子の集合（Ｒｉ）を得る（ステップＳ６）。さらに、検索処理手段５において削除用小規模全文索引記憶部１０の削除用小規模全文索引を用いて、検索文字列を含む文書データの文書識別子の集合（Ｒｄ）を得る（ステップＳ７）。検索処理手段５は得られた文書識別子の集合（Ｒｓ，Ｒｉ，Ｒｄ）に対して下記の集合演算を行い、その結果を検索結果（Ｒ）とし（ステップＳ８）、出力手段２を通じて利用者に検索文字列を含む文書データの文書識別子の集合を出力する（ステップＳ９）。
Ｒ＝Ｒｓ＋Ｒｉ−Ｒｄ
ただし、＋を論理和演算子、−を論理差演算子とする。
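With sets of document identifiers, the operation R = Rs + Ri - Rd is a union followed by a set difference. A minimal sketch, assuming Python sets and a hypothetical helper name `combine_results`; the identifier values are made up for illustration:

```python
def combine_results(rs, ri, rd):
    """Return (Rs union Ri) minus Rd as a set of document identifiers."""
    return (set(rs) | set(ri)) - set(rd)


rs = {1, 2, 3}   # hits from the large search index
ri = {5, 8}      # hits from the small registration index
rd = {1}         # hits from the small deletion index
result = combine_results(rs, ri, rd)
```

The subtraction is applied last so that a document recorded in the deletion index is excluded from the result.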
【００６７】
図７の全文索引６３を例として検索処理について詳細に説明する。
検索文字列を「全文検索」とすると、テキスト分割手段が「全文」，「文検」，「検索」の３個のトークンを抽出する。次に全文索引６３の対応するトークンの３つの転置リストを調べる。それぞれのトークン出現位置の差が１であるものを探すと文書識別子１の１１文字目と３１文字目に「全文検索」が存在することがわかる。
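The adjacency check on token positions can be sketched as below. The function name `find_phrase` and the dictionary layout are assumptions for illustration. The positions for "全文" and "文検" are those given for FIG. 7; the positions 13 and 33 for "検索" in document 1 are inferred from the statement that "全文検索" occurs at characters 11 and 31.

```python
def find_phrase(index, phrase):
    """Return {doc_id: [start positions]} where consecutive bigrams of
    the phrase appear at positions that differ by exactly 1."""
    tokens = [phrase[i:i + 2] for i in range(len(phrase) - 1)]
    if not tokens:
        return {}
    hits = {}
    for doc_id, starts in index.get(tokens[0], {}).items():
        matches = [
            p for p in starts
            if all(p + k in index.get(tok, {}).get(doc_id, ())
                   for k, tok in enumerate(tokens))
        ]
        if matches:
            hits[doc_id] = matches
    return hits


index = {
    "全文": {1: [11, 31], 2: [30, 42]},
    "文検": {1: [12, 32]},
    "検索": {1: [13, 33]},   # inferred positions, see the note above
}
hits = find_phrase(index, "全文検索")
```

Document 2 contains "全文" but no "文検", so it is rejected without looking at the document text itself.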
【００６８】
（マージ処理）
Merge processing by the merge means 11 replaces the data transfer means of the above-mentioned specification of Japanese Patent Application No. 2001-78026.
元の文書データを用いて登録・削除処理を行う場合に比べて、処理開始時に既に作成されている転置リストを直接利用するのでテキスト分割処理によるトークンの切り出し及びその転置リスト作成に要する時間が不要となり、データ転送時間を短くできる。本発明においては転置リスト同士の処理であることからデータ転送処理（データ転送ステップ）のことをマージ処理（マージステップ）とも呼ぶ。全文検索装置における文書データの登録・削除処理を転置リスト同士の処理とすることにより、検索用全文索引へのデータ登録・削除の際に、既に作成されている転置リストを直接利用するので検索用全文索引へのマージ処理の時間を短縮でき、検索処理の待ち時間を短くすることができる。
【００６９】
マージ処理を実行するには、まず登録用小規模全文索引の全てのトークンに対して、（ａ）全文索引からそのトークンの転置リストを取り出す処理（ステップＳ１４）、及び（ｂ）検索用大規模全文索引の対応するトークンの転置リストの末尾に先の転置リストを加える処理（ステップＳ１５）を行う。次に登録用小規模全文索引を空にする（ステップＳ１６）。また、削除用小規模全文索引の全てのトークンに対して、（ｃ）全文索引からそのトークンの転置リストを取り出す処理（ステップＳ２６）、及び（ｄ）検索用大規模全文索引の対応するトークンの転置リストから（ｃ）で取り出した転置リスト中の出現位置情報を削除する処理（ステップＳ２７）を行う。次に削除用小規模全文索引を空にする（ステップＳ２８）。
【００７０】
図８は、図７における全文索引６３のトークン「全文」の転置リストを例にマージ処理の概要を説明するための図である。
検索用全文索引の転置リスト７１として、「全文」に対する転置リスト｛１，２，（１１，３１）｝，｛２，２，（３０，４２）｝、登録用全文索引の転置リスト７２として、「全文」に対する転置リスト｛５，２，（４，１６）｝，｛８，１，（３）｝をマージ処理７３する場合を説明する。マージ処理７３を実行することにより、「全文」に対する転置リスト｛１，２，（１１，３１）｝，｛２，２，（３０，４２）｝，｛５，２，（４，１６）｝，｛８，１，（３）｝（７４）が得られる。さらに、この転置リストと、削除用全文索引の転置リスト７６としての、「全文」に対する転置リスト｛１，２，（１１，３１）｝とをマージ処理７５することにより、「全文」に対する転置リスト｛２，２，（３０，４２）｝，｛５，２，（４，１６）｝，｛８，１，（３）｝（７７）が得られる。
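The FIG. 8 example can be reproduced with a short sketch that works on inverted lists only, without the original documents. The function names `merge_registration` and `merge_deletion` and the nested-dictionary layout are assumptions for illustration; the data values are the ones given for FIG. 8.

```python
def merge_registration(search_index, reg_index):
    """Steps S14-S16: append each registration inverted list to the end of
    the matching list in the search index, then empty the small index."""
    for token, docs in reg_index.items():
        search_index.setdefault(token, {}).update(docs)
    reg_index.clear()


def merge_deletion(search_index, del_index):
    """Steps S26-S28: remove the entries listed in the deletion index from
    the search index, then empty the small index."""
    for token, docs in del_index.items():
        target = search_index.get(token, {})
        for doc_id in docs:
            target.pop(doc_id, None)
        if token in search_index and not target:
            del search_index[token]
    del_index.clear()


search_index = {"全文": {1: [11, 31], 2: [30, 42]}}   # inverted list 71
reg_index = {"全文": {5: [4, 16], 8: [3]}}            # inverted list 72
del_index = {"全文": {1: [11, 31]}}                   # inverted list 76

merge_registration(search_index, reg_index)   # yields list 74
merge_deletion(search_index, del_index)       # yields list 77
```

Python dictionaries keep insertion order, so the registered entries land after the existing ones, matching the "append to the end" wording of step S15.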
【００７１】
（マージ処理の形態１）
マージ処理は、登録用小規模全文索引記憶部９における登録用小規模全文索引に登録されている文書識別子の数が予め指定されている数に達したときに登録処理手段３によって起動され、マージ手段１１により実行される。
【００７２】
（マージ処理の形態２）
マージ処理は、登録用小規模全文索引記憶部９における記憶容量（大きさ）が予め指定されているサイズになったときに登録処理手段３によって起動され、マージ手段１１により実行されるようにしてもよい。この形態により、利用者から登録される文書データの大きさにばらつきがあるような応用形態として使用される場合に、小さな文書データが連続して登録されたときに登録用小規模全文索引への登録時間が長くなる前にマージ処理が開始されることを防ぐことができる。サイズを起動条件にすることでマージの処理時間を均等にすることができる。さらに、前述のマージ処理（形態１）の場合には件数を起動条件にしており全文索引記憶部の大きさを管理する必要がないので処理が簡単になる利点がある。
【００７３】
（マージ処理の形態３）
削除用小規模全文索引のマージ処理は削除処理手段４によって起動され、マージ手段１１により実行されるようにしてもよい。起動条件は削除用小規模全文索引に登録されている文書識別子の数が予め指定されている数に達したときとしてもよい。
【００７４】
（マージ処理の形態４）
削除用小規模全文索引のマージ処理は削除処理手段４によって起動され、マージ手段１１により実行されるようにしてもよい。起動条件は削除用小規模全文索引記憶部１０の大きさが予め指定されているサイズに達したときとしてもよい。形態３，４では削除処理が多く発生しないような場合、マージ処理の時間を短縮できる利点がある。
【００７５】
上述のごときマージ処理の各形態により、全文検索装置においては登録・削除する文書データの特徴や利用分野の特徴に適した条件で全文索引のマージ処理を開始することが可能となり、マージ処理の発生回数を減らせ、システム全体のスループットを向上させることが可能となる。さらに、マージの開始条件は、マージ処理にかかる所要時間により可変としてもよいし、また、登録により生じるマージと、削除により生じるマージを、いずれかの開始条件のもとで同時に起動させてもよい。
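The start conditions of forms 1 to 4 reduce to two kinds of threshold on a small index: a number of registered document identifiers, or a storage size. A minimal sketch, assuming a hypothetical `MergeTrigger` class; the threshold values are arbitrary examples and are not taken from the patent:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MergeTrigger:
    """Start condition for merging one small index into the search index."""
    max_docs: Optional[int] = None    # forms 1 and 3: document count
    max_bytes: Optional[int] = None   # forms 2 and 4: index size

    def should_merge(self, doc_count, size_bytes):
        if self.max_docs is not None and doc_count >= self.max_docs:
            return True
        if self.max_bytes is not None and size_bytes >= self.max_bytes:
            return True
        return False


by_count = MergeTrigger(max_docs=1000)
by_size = MergeTrigger(max_bytes=64 * 1024 * 1024)
```

A count-based trigger needs no size bookkeeping, while a size-based trigger keeps the work per merge roughly even when document sizes vary widely.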
【００７６】
上述した実施形態に係る全文検索装置では、本出願人による特願２００１−７８０２６号明細書に記載の手法を転置ファイル方式の全文索引を用いた全文検索装置に適用し、小規模な全文索引から大規模な全文索引へのデータ転送手段において、元の文書データを用いるのではなく転置ファイル方式の全文索引の構成要素である転置リストを用いることによってデータ転送に要する時間を短くしている。
【００７７】
次に説明する本発明の他の実施形態に係る全文検索装置は、上述した実施形態に係る全文検索装置において、本出願人による特願２００１−１０１０２４号明細書に記載の書き込み遅延データベース管理方法、装置、プログラム、及び記録媒体で用いた手法を適用したものである。これにより、登録用或いは削除用の小規模全文索引から検索用の大規模全文索引へのデータ転送（転置リストのマージ処理）を行っている間は、その登録用或いは削除用の小規模全文索引記憶部を使用することができず、登録処理或いは削除処理を実行することができないという問題を解決することができる。
【００７８】
図９は、本発明の他の実施形態に係る全文検索装置の機能を説明するためのブロック図である。
本実施形態に係る全文検索装置では、登録用及び削除用の小規模全文索引を二つずつ用意し、大規模全文索引へのマージ（データ転送）を行っている間は、他方の小規模全文索引を使用して登録処理或いは削除処理を実行することにより、処理不能な期間を無くすようにしている。すなわち、本実施形態に係る全文検索装置においては、登録用小規模全文索引を二つ備えることでマージ処理を実行中でも登録処理を行うことが可能となり、また削除用小規模全文索引を二つ備えることでマージ処理を実行中でも削除処理を行うことが可能となる。本実施形態によれば、例えば書類をスキャナ等で読み取り、ＯＣＲ処理して、各書類を登録したいときなど、登録処理とそれによるマージ処理が頻繁に連続して行われるときなどに好適である。このようなイメージデータも通常のアプリケーションデータと同じように全文検索が高レスポンスで可能となる。
【００７９】
図１で説明した登録用小規模全文索引記憶部９が登録用小規模全文索引記憶部Ａ（９ａ）及び登録用小規模全文索引記憶部Ｂ（９ｂ）の二つの記憶部を有するものとして、図１で説明した削除用小規模全文索引記憶部１０が削除用小規模全文索引記憶部Ａ（１０ａ）及び削除用小規模全文索引記憶部Ｂ（１０ｂ）の二つの記憶部を有するものとして本実施形態を説明する。なお、図２及び図３で説明したようなハードウェア構成例を本実施形態に係る全文検索装置にも適用可能である。ただし、これらの記憶部の１又は複数を記憶装置２５，５３ではなく、メモリ上に設けても効果的である。
【００８０】
以下に、上述のごとく構成された本実施形態に係る全文検索装置の動作の一例を詳細に説明する。
図１０乃至図１２は、図９の全文検索装置における処理例を説明するためのフロー図である。
全文検索装置は、利用者からの処理要求を受け取ると（ステップＳ３１）、まず、その処理が、登録処理であるのか（ステップＳ３２）、削除処理であるのか（ステップＳ３３）、検索処理であるのか（ステップＳ３３でＮＯ）を判定する。全文検索装置は、この判定に基づいて以下の各処理を実行することとなる。
【００８１】
（登録処理）
登録処理を実行するには、まず利用者が文書データを作成し、入力手段１からその文書データを登録する。登録処理手段３において文書データを文書データ記憶部７に保存し、同時にその文書データを示す識別子（文書識別子）を定める（ステップＳ４１）。さらに登録処理手段３において、テキスト分割手段６を用いて文書データから部分文字列（トークン）とそのトークンの出現位置情報を得る（ステップＳ４２）。なお、テキスト分割手段６で使用される分割手法や全文索引については前述した通りである。文書識別子と各トークンの出現位置情報をその時点の登録用小規模全文索引記憶部（例えば登録用小規模全文索引記憶部Ａ（９ａ））に記録する（ステップＳ４３）。
【００８２】
ステップＳ４３における記録の後、適時マージ処理が行われるが、ここではマージ開始条件に基づいて行われるものとして説明する。まず、ステップＳ４３において記録した結果、マージ開始条件を満たすかを判定する（ステップＳ４４）。マージ開始条件を満たさなければ処理を終了する。なお、図１乃至図８の実施形態において説明したマージ処理開始の条件の各形態は、本実施形態においても適用可能である。また、他方の登録用小規模全文索引記憶部（ここでは登録用小規模全文索引記憶部Ｂ（９ｂ））がマージ処理を実行中であるかも判定する（ステップＳ４５）。実行中の場合にはそのマージ処理の終了を待つ。
【００８３】
マージ開始条件を満たし、且つもう一方の登録用小規模全文索引記憶部Ｂ（９ｂ）がマージ処理を実行中ではない場合に、登録用小規模全文索引記憶部Ａ（９ａ）における登録用小規模全文索引Ａに対して後述のマージ処理（ステップＳ４７〜Ｓ４９）を起動し、次の登録処理に対して記録を行うべき記憶部を登録用小規模全文索引記憶部Ａ（９ａ）からもう一方の登録用小規模全文索引記憶部Ｂ（９ｂ）に切り替える（ステップＳ４６）。マージ処理が起動された場合、マージ手段１１は登録処理手段３とは非同期にマージ処理を実行する。
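The switch between registration indexes A and B (steps S44 to S46) and the asynchronous merge can be sketched with a background thread. This is an illustration under assumptions: the class name `DoubleBufferedRegistration`, the count-based start condition, and the use of Python's `threading` module are choices made for the example, not details from the patent.

```python
import threading


class DoubleBufferedRegistration:
    def __init__(self, search_index, max_docs):
        self.search_index = search_index
        self.buffers = [{}, {}]         # small registration indexes A and B
        self.doc_counts = [0, 0]
        self.active = 0                 # index currently receiving registrations
        self.max_docs = max_docs
        self.merge_thread = None
        self.lock = threading.Lock()    # guards the large search index

    def register(self, doc_id, postings):
        buf = self.buffers[self.active]
        for token, positions in postings.items():
            buf.setdefault(token, {})[doc_id] = positions
        self.doc_counts[self.active] += 1
        if self.doc_counts[self.active] >= self.max_docs:   # step S44
            if self.merge_thread is not None:               # step S45
                self.merge_thread.join()    # wait for the other index's merge
            full = self.active
            self.active = 1 - full                          # step S46: switch
            self.merge_thread = threading.Thread(
                target=self._merge, args=(full,))
            self.merge_thread.start()

    def _merge(self, which):
        buf = self.buffers[which]
        with self.lock:
            for token, docs in buf.items():
                self.search_index.setdefault(token, {}).update(docs)
        buf.clear()
        self.doc_counts[which] = 0

    def wait(self):
        if self.merge_thread is not None:
            self.merge_thread.join()


search_index = {}
store = DoubleBufferedRegistration(search_index, max_docs=2)
store.register(1, {"全文": [1]})
store.register(2, {"全文": [3]})   # reaches the threshold: merge A, switch to B
store.register(3, {"全文": [7]})   # recorded in B without waiting for the merge
store.wait()
```

Registration only blocks when both small indexes are tied up, which is the situation the embodiments with three or more indexes are meant to reduce further.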
【００８４】
（削除処理）
削除処理を実行するには、まず利用者が入力手段１から削除する文書の文書識別子を入力する。次に、削除処理手段４において文書データ記憶部７から文書識別子に対応する文書データを読み出す（ステップＳ５１）。さらに削除処理手段４において、テキスト分割手段６を用いて文書データから部分文字列（トークン）とそのトークンの出現位置情報を得る（ステップＳ５２）。
【００８５】
次に、文書識別子が登録用小規模全文索引に登録されている文書識別子かを判定し（ステップＳ５３）、登録用小規模全文索引に登録されている文書識別子である場合には、各トークンの出現位置情報を登録用小規模全文索引記憶部９（９ａ及び９ｂ）から削除する（ステップＳ５５）。文書識別子が登録用小規模全文索引に登録されていない場合（検索用大規模全文索引に登録されている場合）には、文書識別子と各トークンの出現位置情報をその時点の削除用小規模全文索引記憶部（例えば削除用小規模全文索引記憶部Ａ（１０ａ））に記録する（ステップＳ５４）。そして、削除処理手段４において文書データ記憶部７から文書識別子に対応する文書データを削除する（ステップＳ６２）。
【００８６】
ステップＳ５４における記録の後、適時マージ処理が行われるが、ここではマージ開始条件に基づいて行われるものとして説明する。まず、ステップＳ５４において記録した結果、マージ開始条件を満たすかを判定する（ステップＳ５６）。マージ開始条件を満たさなければ処理を終了する（ステップＳ６２の処理は必要）。なお、図１乃至図８の実施形態において説明したマージ処理開始の条件の各形態は、本実施形態においても適用可能である。また、他方の削除用小規模全文索引記憶部（ここでは削除用小規模全文索引記憶部Ｂ（１０ｂ））がマージ処理を実行中であるかも判定する（ステップＳ５７）。実行中の場合にはそのマージ処理の終了を待つ。
【００８７】
マージ開始条件を満たし、且つもう一方の削除用小規模全文索引記憶部Ｂ（１０ｂ）がマージ処理を実行中ではない場合に、削除用小規模全文索引記憶部Ａ（１０ａ）における削除用小規模全文索引Ａに対して後述のマージ処理（ステップＳ５９〜Ｓ６１）を起動し、次の削除処理に対して記録を行うべき記憶部を削除用小規模全文索引記憶部Ａ（１０ａ）からもう一方の削除用小規模全文索引記憶部Ｂ（１０ｂ）に切り替える（ステップＳ５８）。マージ処理が起動された場合、マージ手段１１は削除処理手段４とは非同期にマージ処理を実行する。
【００８８】
（検索処理）
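To execute search processing, the user first enters a search character string through the input means 1. The search processing means 5 then uses the text dividing means 6 to obtain tokens from the search character string (step S34). Using the search large-scale full-text index in the search large-scale full-text index storage unit 8, the search processing means 5 obtains the set (Rs) of document identifiers of document data containing the search character string (step S35). Using registration small-scale full-text index A in the registration small-scale full-text index storage unit A (9a), it obtains the set (RiA) of document identifiers of document data containing the search character string, and using registration small-scale full-text index B in the registration small-scale full-text index storage unit B (9b), it obtains the corresponding set (RiB) (step S36). Further, using deletion small-scale full-text index A in the deletion small-scale full-text index storage unit A (10a), it obtains the set (RdA) of document identifiers of document data containing the search character string, and using deletion small-scale full-text index B in the deletion small-scale full-text index storage unit B (10b), it obtains the corresponding set (RdB) (step S37).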
【００８９】
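The search processing means 5 performs the following set operation on the obtained sets of document identifiers (Rs, RiA, RiB, RdA, RdB), takes the result as the search result (R) (step S38), and outputs the set of document identifiers of document data containing the search character string to the user through the output means 2 (step S39).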
Ｒ＝Ｒｓ＋ＲｉＡ＋ＲｉＢ−ＲｄＡ−ＲｄＢ
ただし、＋を論理和演算子、−を論理差演算子とする。
【００９０】
（マージ処理）
登録用小規模全文索引のマージ処理を実行するには、マージ処理の対象となっている登録用小規模全文索引（ここでは登録用小規模全文索引Ａ）の全てのトークンに対して、（ａ）全文索引からそのトークンの転置リストを取り出す処理（ステップＳ４７）、及び（ｂ）検索用大規模全文索引の対応するトークンの転置リストの末尾に先の転置リストを加える処理を行う（ステップＳ４８）。次に登録用小規模全文索引Ａを空にする（ステップＳ４９）。
【００９１】
削除用小規模全文索引のマージ処理を実行するには、マージ処理の対象となっている削除用小規模全文索引（ここでは削除用小規模全文索引Ａ）の全てのトークンに対して、（ｃ）全文索引からそのトークンの転置リストを取り出す処理（ステップＳ５９）、及び（ｄ）検索用大規模全文索引の対応するトークンの転置リストから（ｃ）で取り出した転置リスト中の出現位置情報を削除する処理（ステップＳ６０）を行う。次に削除用小規模全文索引Ａを空にする（ステップＳ６１）。なお、図８で説明した転置リストのマージ処理例が本実施形態においても適用可能である。
【００９２】
図９乃至図１２を参照して説明した実施形態において、三つ以上の登録用小規模全文索引記憶部及び／又は三つ以上の削除用全文索引記憶部を用いた全文検索装置の形態を、図１３乃至図１６を参照して次の実施形態として例示する。
【００９３】
図１３は、本発明の他の実施形態に係る全文検索装置の機能を説明するためのブロック図である。
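In the full-text search device according to this embodiment, three or more small-scale full-text indexes are prepared each for registration and for deletion (three each in this example). While two of the small-scale full-text indexes are being merged (data transfer) into the large-scale full-text index, the remaining small-scale full-text index is used to execute registration or deletion processing, which eliminates periods in which processing is impossible. That is, by providing a plurality of registration small-scale full-text indexes, registration processing can be performed even while merge processing is being performed on several registration small-scale full-text indexes or while another registration process is in progress. Likewise, by providing a plurality of deletion small-scale full-text indexes, deletion processing can be performed even while merge processing is being performed on several deletion small-scale full-text indexes or while another deletion process is in progress. In practice, registration and deletion take less time than merging, so overlapping merges are the more common case.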
【００９４】
図１で説明した登録用小規模全文索引記憶部９が登録用小規模全文索引記憶部Ａ（９ａ）及び登録用小規模全文索引記憶部Ｂ（９ｂ）及び登録用小規模全文索引記憶部Ｃ（９ｃ）の三つの記憶部を有するものとして、図１で説明した削除用小規模全文索引記憶部１０が削除用小規模全文索引記憶部Ａ（１０ａ）及び削除用小規模全文索引記憶部Ｂ（１０ｂ）及び削除用小規模全文索引記憶部Ｃ（１０ｃ）の三つの記憶部を有するものとして本実施形態を説明する。なお、図２及び図３で説明したようなハードウェア構成例を本実施形態に係る全文検索装置にも適用可能である。ただし、これらの記憶部の１又は複数を記憶装置２５，５３ではなく、メモリ上に設けても効果的である。
【００９５】
以下に、上述のごとく構成された本実施形態に係る全文検索装置の動作の一例を詳細に説明する。
図１４乃至図１６は、図１３の全文検索装置における処理例を説明するためのフロー図である。
全文検索装置は、利用者からの処理要求を受け取ると（ステップＳ７１）、まず、その処理が、登録処理であるのか（ステップＳ７２）、削除処理であるのか（ステップＳ７３）、検索処理であるのか（ステップＳ７３でＮＯ）を判定する。全文検索装置は、この判定に基づいて以下の各処理を実行することとなる。
【００９６】
（登録処理）
登録処理を実行するには、まず利用者が文書データを作成し、入力手段１からその文書データを登録する。登録処理手段３において文書データを文書データ記憶部７に保存し、同時にその文書データを示す識別子（文書識別子）を定める（ステップＳ８１）。さらに登録処理手段３において、テキスト分割手段６を用いて文書データから部分文字列（トークン）とそのトークンの出現位置情報を得る（ステップＳ８２）。なお、テキスト分割手段６で使用される分割手法や全文索引については前述した通りである。文書識別子と各トークンの出現位置情報をその時点の登録用小規模全文索引記憶部（例えば登録用小規模全文索引記憶部Ａ（９ａ））に記録する（ステップＳ８３）。
【００９７】
ステップＳ８３における記録の後、適時マージ処理が行われるが、ここではマージ開始条件に基づいて行われるものとして説明する。まず、ステップＳ８３において記録した結果、マージ開始条件を満たすかを判定する（ステップＳ８４）。マージ開始条件を満たさなければ処理を終了する。なお、図１乃至図８の実施形態において説明したマージ処理開始の条件の各形態は、本実施形態においても適用可能である。また、他方の登録用小規模全文索引記憶部（ここでは登録用小規模全文索引記憶部Ｂ（９ｂ））がマージ処理を実行中であるかも判定する（ステップＳ８５）。実行中の場合には、三つ目の登録用小規模全文索引記憶部（ここでは登録用小規模全文索引記憶部Ｃ（９ｃ））が実行中であるかを判定する（ステップＳ８６）。なお、記憶部Ｂ，Ｃが登録処理を実行中であるかも同時に判定しておく。ステップＳ８６において処理が実行中である場合には、その終了を待つ。なお、以降、最も想定されるマージ処理が実行中の場合のみ説明する。
【００９８】
マージ開始条件を満たし、且つ他のいずれかの登録用小規模全文索引記憶部Ｂ（９ｂ）／Ｃ（９ｃ）がマージ処理を実行中ではない場合に、登録用小規模全文索引記憶部Ａ（９ａ）における登録用小規模全文索引Ａに対して、図１１におけるステップＳ４７〜Ｓ４９と同様のマージ処理（ステップＳ８９〜Ｓ９１）を起動し、次の登録処理に対して記録を行うべき記憶部を登録用小規模全文索引記憶部Ａ（９ａ）から他の登録用小規模全文索引記憶部Ｂ（９ｂ）／Ｃ（９ｃ）（マージ処理を実行していない記憶部と同じ記憶部、以下同様の表現を用いる）に切り替える（ステップＳ８７／Ｓ８８）。マージ処理が起動された場合、マージ手段１１は登録処理手段３とは非同期にマージ処理を実行する。
【００９９】
（削除処理）
削除処理を実行するには、まず利用者が入力手段１から削除する文書の文書識別子を入力する。次に、削除処理手段４において文書データ記憶部７から文書識別子に対応する文書データを読み出す（ステップＳ１０１）。さらに削除処理手段４において、テキスト分割手段６を用いて文書データから部分文字列（トークン）とそのトークンの出現位置情報を得る（ステップＳ１０２）。
【０１００】
次に、文書識別子が登録用小規模全文索引に登録されている文書識別子かを判定し（ステップＳ１０３）、登録用小規模全文索引に登録されている文書識別子である場合には、各トークンの出現位置情報を登録用小規模全文索引記憶部９（９ａ及び９ｂ及び９ｃ）から削除する（ステップＳ１０５）。文書識別子が登録用小規模全文索引に登録されていない場合（検索用大規模全文索引に登録されている場合）には、文書識別子と各トークンの出現位置情報をその時点の削除用小規模全文索引記憶部（例えば削除用小規模全文索引記憶部Ａ（１０ａ））に記録する（ステップＳ１０４）。そして、削除処理手段４において文書データ記憶部７から文書識別子に対応する文書データを削除する（ステップＳ１１４）。
【０１０１】
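Merge processing is performed at an appropriate time after the recording in step S104; here it is described as being performed on the basis of a merge start condition. First, it is determined whether the result of the recording in step S104 satisfies the merge start condition (step S106). If the condition is not satisfied, processing ends (the processing of step S114 is still required). The forms of the merge start condition described for the embodiment of FIGS. 1 to 8 also apply to this embodiment. It is also determined whether the other deletion small-scale full-text index storage unit (here, deletion small-scale full-text index storage unit B (10b)) is executing merge processing (step S107). If it is, it is determined whether the third deletion small-scale full-text index storage unit (here, deletion small-scale full-text index storage unit C (10c)) is also executing processing (step S108). Whether storage units B and C are executing deletion processing is determined at the same time. If processing is in progress at step S108, its completion is awaited. In the following, only the most likely case, in which merge processing is in progress, is described.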
【０１０２】
マージ開始条件を満たし、且つ他のいずれかの削除用小規模全文索引記憶部Ｂ（１０ｂ）／Ｃ（１０ｃ）がマージ処理を実行中ではない場合に、削除用小規模全文索引記憶部Ａ（１０ａ）における削除用小規模全文索引Ａに対して、図１２におけるステップＳ５９〜Ｓ６１と同様のマージ処理（ステップＳ１１１〜Ｓ１１３）を起動し、次の削除処理に対して記録を行うべき記憶部を削除用小規模全文索引記憶部Ａ（１０ａ）から他の削除用小規模全文索引記憶部Ｂ（１０ｂ）／Ｃ（１０ｃ）に切り替える（ステップＳ１０９／Ｓ１１０）。マージ処理が起動された場合、マージ手段１１は削除処理手段４とは非同期にマージ処理を実行する。
【０１０３】
（検索処理）
本実施形態に係る検索処理は、図１０を参照して説明した検索処理と基本的に同様の処理であり、図１０におけるステップＳ３４〜Ｓ３９が、夫々ステップＳ７４〜Ｓ７９に対応している。ただし、ステップＳ７６において、検索処理手段５は、集合ＲｉＡ，ＲｉＢに加えて、登録用小規模全文索引記憶部Ｃ（９ｃ）の登録用小規模全文索引Ｃを用いて検索文字列を含む文書データの文書識別子の集合（ＲｉＣ）を得る。さらに、ステップＳ７７において、検索処理手段５は、集合ＲｄＡ，ＲｄＢに加えて、削除用小規模全文索引記憶部Ｃ（１０ｃ）の削除用小規模全文索引Ｃを用いて検索文字列を含む文書データの文書識別子の集合（ＲｄＣ）を得る。検索処理手段５は得られた文書識別子の集合（Ｒｓ，ＲｉＡ，ＲｉＢ，ＲｉＣ，ＲｄＡ，ＲｄＢ，ＲｄＣ）に対して下記の集合演算を行い、その結果を検索結果（Ｒ）とする（ステップＳ７８）。
Ｒ＝Ｒｓ＋ＲｉＡ＋ＲｉＢ＋ＲｉＣ−ＲｄＡ−ＲｄＢ−ＲｄＣ
ただし、＋を論理和演算子、−を論理差演算子とする。
【０１０４】
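The embodiments of FIGS. 9 to 12 and of FIGS. 13 to 16 described full-text search devices that use a plurality of registration small-scale full-text index storage units and/or a plurality of deletion full-text index storage units. The next embodiment, illustrated with reference to FIGS. 17 to 20, is a form applicable when these full-text index storage units (other than the search large-scale full-text index storage unit) are allocated to individual storage areas in the storage device 25 or 53 described with FIGS. 2 and 3 or in memory, or are treated as individual files stored in the storage device 25 or 53 or in memory.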
【０１０５】
図１７は、本発明の他の実施形態に係る全文検索装置の機能を説明するためのブロック図である。
本実施形態に係る全文検索装置では、登録用及び削除用の小規模全文索引を一つずつ予め用意し、その小規模全文索引が大規模全文索引へのマージ（データ転送）を行っている間など、登録（／削除）処理に際して登録用（／削除用）の全文索引を記憶する処理が可能な登録用（／削除用）全文索引記憶部が存在しない場合に、他の小規模全文索引を新規作成して、登録処理或いは削除処理を実行することにより、処理不能な期間を無くすようにしている。すなわち、本実施形態に係る全文検索装置においては、登録用小規模全文索引を適時、複数備えることでマージ処理が複数の登録用小規模全文索引に対して行われている場合でも他の登録処理が行われている場合でも登録処理を行うことが可能となり、また削除用小規模全文索引を適時、複数備えることでマージ処理が複数の登録用小規模全文索引に対して行われている場合でも他の削除処理が行われている場合でも削除処理を行うことが可能となる。なお、実際には、登録や削除にかかる時間はマージ時間よりも短いので、マージ処理が重なることの方が多い。
【０１０６】
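The full-text search device according to this embodiment has storage unit management means 12 that manages registration small-scale full-text index storage units other than the registration small-scale full-text index storage unit A (9a). For deletion processing, the storage unit management means 12 also manages deletion small-scale full-text index storage units other than the deletion small-scale full-text index storage unit A (10a). The storage unit management means 12 has means for newly creating another registration full-text index storage unit when, at registration time, no registration full-text index storage unit is available to store the registration full-text index. It also has means for deleting surplus registration (or deletion) full-text index storage units, that is, units not scheduled for use in subsequent processing.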
【０１０７】
また、図１で説明した登録用小規模全文索引記憶部９が登録用小規模全文索引記憶部Ａ（９ａ）のみから登録用小規模全文索引記憶部Ｂ（９ｂ），Ｃ（９ｃ），Ｄ（９ｄ），．．．へと適時増数していき（順不同）、それらを適時削除していくものとして、図１で説明した削除用小規模全文索引記憶部１０が削除用小規模全文索引記憶部Ａ（１０ａ）のみから削除用小規模全文索引記憶部Ｂ（１０ｂ），Ｃ（１０ｃ），Ｄ（１０ｄ），．．．へと適時増数していき（順不同）、それらを適時削除していくものとして、本実施形態を説明する。
【０１０８】
適時作成／削除が行われた登録用小規模全文索引記憶部を利用して、登録処理手段３は、登録用全文索引記憶部のうち一つの登録用全文索引記憶部から検索用大規模全文索引記憶部８へデータをマージする処理（或いは他の登録処理）を行っている間は、他の登録用全文索引記憶部を使用して、登録処理を行う。一方、適時作成／削除が行われた削除用小規模全文索引記憶部を利用して、削除処理手段４は、削除用全文索引記憶部のうち一つの削除用全文索引記憶部から検索用大規模全文索引記憶部８へデータをマージする処理（或いは他の削除処理）を行っている間は、他の削除用全文索引記憶部を使用して、削除処理を行う。なお、図２及び図３で説明したようなハードウェア構成例を本実施形態に係る全文検索装置にも適用可能である。ただし、これらの記憶部の１又は複数を記憶装置２５，５３ではなく、メモリ上に設けても効果的である。
【０１０９】
以下に、上述のごとく構成された本実施形態に係る全文検索装置の動作の一例を詳細に説明する。
図１８乃至図２０は、図１７の全文検索装置における処理例を説明するためのフロー図である。
全文検索装置は、利用者からの処理要求を受け取ると（ステップＳ１２１）、まず、その処理が、登録処理であるのか（ステップＳ１２２）、削除処理であるのか（ステップＳ１２３）、検索処理であるのか（ステップＳ１２３でＮＯ）を判定する。全文検索装置は、この判定に基づいて以下の各処理を実行することとなる。
【０１１０】
（登録処理）
登録処理を実行するには、まず利用者が文書データを作成し、入力手段１からその文書データを登録する。登録処理手段３において文書データを文書データ記憶部７に保存し、同時にその文書データを示す識別子（文書識別子）を定める（ステップＳ１３１）。さらに登録処理手段３において、テキスト分割手段６を用いて文書データから部分文字列（トークン）とそのトークンの出現位置情報を得る（ステップＳ１３２）。なお、テキスト分割手段６で使用される分割手法や全文索引については前述した通りである。
【０１１１】
記憶部管理手段１２は、登録処理手段３からの命令により或いは適時、現時点で使用できる登録用小規模全文索引記憶部が存在するかを判定する（ステップＳ１３３）。存在しなければ他の登録用小規模全文索引記憶部（例えば登録用小規模全文索引記憶部Ｃ）を新たに作成する（ステップＳ１３５）。使用できる登録用小規模全文索引記憶部が存在した時点で、文書識別子と各トークンの出現位置情報をその時点の登録用小規模全文索引記憶部（例えば登録用小規模全文索引記憶部Ａ（９ａ）／Ｃ）に記録する（ステップＳ１３４／Ｓ１３６）。
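The role of the storage unit management means 12 (check for a usable small index, create one if none exists, later remove surplus ones) can be sketched as a small pool. The class name `SmallIndexPool`, its method names, and the generated names `idx0`, `idx1`, ... are hypothetical; the patent does not prescribe how storage units are named or tracked.

```python
import itertools


class SmallIndexPool:
    """Hands out small indexes and creates new ones on demand."""

    def __init__(self):
        self._names = (f"idx{n}" for n in itertools.count())
        self.indexes = {next(self._names): {}}   # start with one small index
        self.busy = set()                        # indexes currently being merged

    def acquire(self):
        """Steps S133/S135: return a usable index, creating one if needed."""
        for name in self.indexes:
            if name not in self.busy:
                return name
        name = next(self._names)
        self.indexes[name] = {}
        return name

    def start_merge(self, name):
        self.busy.add(name)

    def finish_merge(self, name):
        self.busy.discard(name)
        self.indexes[name].clear()

    def drop_surplus(self):
        """Delete idle, empty indexes beyond the first one."""
        idle_empty = [n for n, idx in self.indexes.items()
                      if n not in self.busy and not idx]
        for name in idle_empty[1:]:
            del self.indexes[name]


pool = SmallIndexPool()
a = pool.acquire()
pool.start_merge(a)        # the only index is now busy with a merge
b = pool.acquire()         # so a second index is created on demand
pool.finish_merge(a)
pool.drop_surplus()
```

Creating indexes only when every existing one is busy keeps the number of small indexes that a search must consult as low as the workload allows.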
【０１１２】
ステップＳ１３４／Ｓ１３６における記録の後、適時マージ処理が行われるが、ここではマージ開始条件に基づいて行われるものとして説明する。まず、ステップＳ１３４／Ｓ１３６において記録した結果、マージ開始条件を満たすかを判定する（ステップＳ１３７）。マージ開始条件を満たさなければ処理を終了する。なお、図１乃至図８の実施形態において説明したマージ処理開始の条件の各形態は、本実施形態においても適用可能である。また、他方の登録用小規模全文索引記憶部（ここでは登録用小規模全文索引記憶部Ｂ（９ｂ）／Ａ（９ａ））がマージ処理を実行中であるかも判定する（ステップＳ１３８）。なお、記憶部Ｂ／Ａが登録処理を実行中であるかも同時に判定しておく。ステップＳ１３８において処理が実行中である場合には、その終了を待つ。なお、以降、最も想定されるマージ処理が実行中の場合のみ説明する。
【０１１３】
マージ開始条件を満たし、且つ他の登録用小規模全文索引記憶部Ｂ（９ｂ）／Ａ（９ａ）がマージ処理を実行中ではない場合に、登録用小規模全文索引記憶部Ａ（９ａ）／Ｃにおける登録用小規模全文索引Ａ／Ｃに対して、図１１におけるステップＳ４７〜Ｓ４９と同様のマージ処理（ステップＳ１４０〜Ｓ１４２）を起動し、次の登録処理に対して記録を行うべき記憶部を登録用小規模全文索引記憶部Ａ（９ａ）／Ｃから他の登録用小規模全文索引記憶部Ｂ（９ｂ）／Ａ（９ａ）に切り替える（ステップＳ１３９）。マージ処理が起動された場合、マージ手段１１は登録処理手段３とは非同期にマージ処理を実行する。また、記憶部管理手段１２は、余剰の（次の処理でも使用する予定のない）登録用全文索引記憶部をマージ処理時や適時、削除するようにすればよい。
【０１１４】
（削除処理）
削除処理を実行するには、まず利用者が入力手段１から削除する文書の文書識別子を入力する。次に、削除処理手段４において文書データ記憶部７から文書識別子に対応する文書データを読み出す（ステップＳ１５１）。さらに削除処理手段４において、テキスト分割手段６を用いて文書データから部分文字列（トークン）とそのトークンの出現位置情報を得る（ステップＳ１５２）。
【０１１５】
次に、文書識別子が登録用小規模全文索引に登録されている文書識別子かを判定し（ステップＳ１５３）、登録用小規模全文索引に登録されている文書識別子である場合には、各トークンの出現位置情報を存在する全ての登録用小規模全文索引記憶部９（９ａ等）から削除する（ステップＳ１５５）。文書識別子が登録用小規模全文索引に登録されていない場合（検索用大規模全文索引に登録されている場合）には、次に示す削除用小規模全文索引記憶部への記録を行う。
【０１１６】
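The storage unit management means 12 determines, on instruction from the deletion processing means 4 or at an appropriate time, whether a deletion small-scale full-text index storage unit that can currently be used exists (step S154). If none exists, it newly creates another deletion small-scale full-text index storage unit (for example, deletion small-scale full-text index storage unit C) (step S157). Once a usable deletion small-scale full-text index storage unit exists, the document identifier and the appearance position information of each token are recorded in the deletion small-scale full-text index storage unit in use at that time (for example, deletion small-scale full-text index storage unit A (10a) or C) (step S156/S158). The deletion processing means 4 then deletes the document data corresponding to the document identifier from the document data storage unit 7 (step S175).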
【０１１７】
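Merge processing is performed at an appropriate time after the recording in step S156/S158; here it is described as being performed on the basis of a merge start condition. First, it is determined whether the result of the recording in step S156/S158 satisfies the merge start condition (step S159). If the condition is not satisfied, processing ends (the processing of step S175 is still required). The forms of the merge start condition described for the embodiment of FIGS. 1 to 8 also apply to this embodiment. It is also determined whether the other deletion small-scale full-text index storage unit (here, deletion small-scale full-text index storage unit B (10b) or A (10a)) is executing merge processing (step S170). Whether storage unit B or A is executing deletion processing is determined at the same time. If processing is in progress at step S170, its completion is awaited. In the following, only the most likely case, in which merge processing is in progress, is described.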
【０１１８】
マージ開始条件を満たし、且つ他のいずれかの削除用小規模全文索引記憶部Ｂ（１０ｂ）／Ｃ（１０ｃ）がマージ処理を実行中ではない場合に、削除用小規模全文索引記憶部Ａ（１０ａ）／Ｃにおける削除用小規模全文索引Ａ／Ｃに対して、図１２におけるステップＳ５９〜Ｓ６１と同様のマージ処理（ステップＳ１７２〜Ｓ１７４）を起動し、次の削除処理に対して記録を行うべき記憶部を削除用小規模全文索引記憶部Ａ（１０ａ）／Ｃから他の削除用小規模全文索引記憶部Ｂ（１０ｂ）／Ａ（１０ａ）に切り替える（ステップＳ１７１）。マージ処理が起動された場合、マージ手段１１は削除処理手段４とは非同期にマージ処理を実行する。また、記憶部管理手段１２は、余剰の（次の処理でも使用する予定のない）削除用全文索引記憶部をマージ処理時や適時、削除するようにすればよい。
【０１１９】
（検索処理）
本実施形態に係る検索処理は、図１０を参照して説明した検索処理と基本的に同様の処理であり、図１０におけるステップＳ３４〜Ｓ３９が、夫々ステップＳ１２４〜Ｓ１２９に対応している。ただし、ステップＳ１２６において、検索処理手段５は、集合Ｒｉとして、現時点で存在する全ての登録用小規模全文索引記憶部の登録用小規模全文索引を用いて検索文字列を含む文書データの文書識別子の集合を得る。さらに、ステップＳ１２７において、検索処理手段５は、集合Ｒｄとして、現時点で存在する全ての削除用小規模全文索引記憶部の削除用小規模全文索引を用いて検索文字列を含む文書データの文書識別子の集合を得る。
【０１２０】
以上、本発明の全文検索装置を中心に各実施形態を説明してきたが、全文検索装置における処理手順としても説明したように全文検索のシステムにおける全文検索方法としての形態も採り得る。さらに、本発明は、これら全文検索装置として機能させるためのプログラム、又はその各手段として機能させるためのプログラムとしても、或いは、これら全文検索方法を実行するためのプログラム、又はその処理手順を実行するためのプログラム、さらにはそれらのいずれかのプログラムを記録したコンピュータ読み取り可能な記録媒体としての形態も採用可能である。
【０１２１】
本発明による全文検索の機能を実現するためのプログラムやデータを記憶した記録媒体の実施形態を説明する。記録媒体としては、具体的には、ＣＤ−ＲＯＭ、光磁気ディスク、ＤＶＤ−ＲＯＭ、ＦＤ、フラッシュメモリ、及びその他各種ＲＯＭやＲＡＭ等が想定でき、これら記録媒体に上述した本発明の各実施形態のシステムの機能をコンピュータに実行させ、全文検索の機能を実現するためのプログラムを記録して流通させることにより、当該機能の実現を容易にする。そしてコンピュータ等の情報処理装置に上記のごとくの記録媒体を装着して情報処理装置によりプログラムを読み出すか、若しくは情報処理装置が備えている記憶媒体に当該プログラムを記憶させておき、必要に応じて読み出すことにより、本発明に係わる全文検索機能を実行することができる。
【０１２２】
【発明の効果】
本発明によれば、全文検索装置における登録処理や削除処理を小規模な全文索引記憶部に対して行うので、その処理時間は短く抑えることが可能となり、利用者へのレスポンスタイムを短くすることが可能となる。
【０１２３】
また、本発明によれば、登録用小規模全文索引を複数備えることでマージ処理を実行中でも登録処理を行うことが可能となり、また削除用小規模全文索引を複数備えることでマージ処理を実行中でも削除処理を行うことが可能となる。
【図面の簡単な説明】
【図１】本発明の一実施形態に係る全文検索装置の機能を説明するためのブロック図である。
【図２】図１における全文検索装置をスタンドアロンで構成した場合のハードウェア構成例を示す図である。
【図３】図１における全文検索装置をサーバ／クライアントで構成した場合のハードウェア構成例を示す図である。
【図４】図１の全文検索装置における処理例を説明するためのフロー図である。
【図５】図１の全文検索装置における処理例を説明するためのフロー図である。
【図６】図１の全文検索装置における処理例を説明するためのフロー図である。
【図７】図１の全文検索装置における処理を説明するための図で、全文索引の一例を示す図である。
【図８】図７における全文索引のトークン「全文」の転置リストを例にマージ処理の概要を説明するための図である。
【図９】本発明の他の実施形態に係る全文検索装置の機能を説明するためのブロック図である。
【図１０】図９の全文検索装置における処理例を説明するためのフロー図である。
【図１１】図９の全文検索装置における処理例を説明するためのフロー図である。
【図１２】図９の全文検索装置における処理例を説明するためのフロー図である。
【図１３】本発明の他の実施形態に係る全文検索装置の機能を説明するためのブロック図である。
【図１４】図１３の全文検索装置における処理例を説明するためのフロー図である。
【図１５】図１３の全文検索装置における処理例を説明するためのフロー図である。
【図１６】図１３の全文検索装置における処理例を説明するためのフロー図である。
【図１７】本発明の他の実施形態に係る全文検索装置の機能を説明するためのブロック図である。
【図１８】図１７の全文検索装置における処理例を説明するためのフロー図である。
【図１９】図１７の全文検索装置における処理例を説明するためのフロー図である。
【図２０】図１７の全文検索装置における処理例を説明するためのフロー図である。
【符号の説明】
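1: input means, 2: output means, 3: registration processing means, 4: deletion processing means, 5: search processing means, 6: text dividing means, 7: document data storage unit, 8: search large-scale full-text index storage unit, 9: registration small-scale full-text index storage unit, 9a: registration small-scale full-text index storage unit A, 9b: registration small-scale full-text index storage unit B, 9c: registration small-scale full-text index storage unit C, 10: deletion small-scale full-text index storage unit, 10a: deletion small-scale full-text index storage unit A, 10b: deletion small-scale full-text index storage unit B, 10c: deletion small-scale full-text index storage unit C, 11: merge means, 12: storage unit management means, 21, 31: input device, 22, 32: display device, 23, 33: input/output control device, 24, 34, 52: main control device (CPU and memory), 25, 53: storage device, 30: client, 35, 51: network control device, 40: network, 50: server.
[0001]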
BACKGROUND OF THE INVENTION
The present invention relates to a full-text search device, a full-text search method, a program, and a recording medium, and more specifically to a full-text search device, full-text search method, program, and recording medium for searching a plurality of document data for documents that contain a specified character string. The present invention is applicable to systems that manage large amounts of document data, such as document management systems, electronic library systems, and patent publication search systems.
[0002]
[Prior art]
In recent years, with the development of information and communication technology, digitized documents and information about those documents have come to be distributed in large quantities via the Internet and similar channels. For the distribution of such digitized documents and information, document search devices that find a desired document accurately and at high speed have been proposed.
[0003]
Such document search devices use a keyword search method or a full-text search method. A full-text search device matches an arbitrary search character string against all of the documents to be searched and extracts every document that contains the search character string. Unlike the keyword search method, it does not require the considerable manual effort of assigning keywords in advance to all of the documents to be searched. Various types of full-text search devices have been proposed, one of which employs the inverted (index) file method. In the inverted file method, an inverted file is built in advance as an auxiliary file for searching; it records the documents in which each character, word, or n-gram (sequence of n characters) appears, or the appearance positions within those documents. At search time only the inverted file is consulted, so very fast searches are possible, which makes the method effective for systems that require high-speed searching of large numbers of documents.
[0004]
Full-text search methods in general and the details of the inverted file method are described in, for example, the book "Information Retrieval Algorithms" (Kenji Kita, Kazuhiko Tsuda, and Masami Shishibori, Kyoritsu Shuppan Co., Ltd., pp. 160-179), the prior-art section of Japanese Patent Laid-Open No. 11-073429, and the fiscal 1998 activity report of the Full-Text Search System Council (http://www.ftsanet.com/dbtokyo99/Db99.htm). They are well known, so their description is omitted here.
[0005]
As prior art employing the inverted file method, Japanese Patent No. 3024544 describes an information retrieval device that stores real-time processing data separately from the search index file and can thereby perform search processing even while the search index file is being updated. Japanese Patent Laid-Open No. 7-146880 describes a document search device and method in which a new document is registered in a sub-index smaller than the main index, which shortens the registration time.
[0006]
However, in the inverted file method, including the publications cited above, an inverted file that is usually several times the size of the original data must be built. As the amount of registered document data grows, registration and deletion on an inverted-file full-text index take longer, and the response time of registration and deletion as seen by the user of the full-text search device becomes longer.
[0007]
[Problems to be solved by the invention]
The present invention has been made in view of the circumstances described above. Its object is to provide a full-text search device and a full-text search method that can shorten the response time of registration and deletion processing as seen from the user side, a program for causing a computer to function as the device, a program for causing a computer to execute the procedure of the method, and a computer-readable recording medium on which these programs are recorded.
[0009]
[Means for Solving the Problems]
The invention of claim 1 is a full-text search device that searches a document data storage unit, which stores input document data, for document data including an input search condition, using a first full-text index storage unit for registration, a second full-text index storage unit for deletion, and a third full-text index storage unit for search. The device comprises the following means.
Registration processing means: when performing registration processing, it obtains tokens and information on the appearance positions of the tokens from the document data, and stores the tokens and the appearance position information in the first full-text index storage unit for registration.
Deletion processing means: when performing deletion processing, it determines whether the tokens are stored in the first full-text index storage unit for registration. If the appearance position information of the tokens is stored in the first full-text index storage unit, it deletes that information from the first full-text index storage unit; if not, it registers the tokens and their appearance position information in the second full-text index storage unit for deletion.
Search processing means: when performing search processing, it obtains a first set of document identifiers of document data including the input search condition using the index of the third full-text index storage unit, a second set of such document identifiers using the index of the first full-text index storage unit, and a third set of such document identifiers using the index of the second full-text index storage unit, and then outputs, as the search result, the set obtained by subtracting the third set from the union of the first set and the second set.
Merging means: it executes merge processing that merges the inverted lists of the first and second full-text index storage units by adding the inverted list of a token taken out of the first full-text index storage unit to the inverted list of that token stored in the index of the third full-text index storage unit, and by deleting, from the inverted list of a token stored in the third full-text index storage unit, the appearance position information contained in the inverted list of that token taken out of the second full-text index storage unit.
[0013]
The invention of claim 2 is characterized in that, in the invention of claim 1, the merging means executes the merge processing when the number of document data items stored in the first full-text index storage unit or the second full-text index storage unit reaches a predetermined number.
[0014]
The invention of claim 3 is characterized in that, in the invention of claim 1, the merging means executes the merge processing when the capacity of the first full-text index storage unit or the second full-text index storage unit reaches a predetermined capacity.
[0015]
The invention of claim 4 is characterized in that, in the invention of claim 2 or 3, a plurality of the first full-text index storage units or the second full-text index storage units are provided, and while data is being merged from one first or second full-text index storage unit into the third full-text index storage unit, the registration processing or the deletion processing is performed using another first or second full-text index storage unit different from the one being merged.
[0016]
The invention of claim 5 is characterized in that, in the invention of claim 2 or 3, two of the first full-text index storage units or the second full-text index storage units are provided, and while data is being merged from one of them into the third full-text index storage unit, the registration processing or the deletion processing is performed using the other.
[0017]
The invention of claim 6 is a full-text search method that searches a document data storage unit, which stores input document data, for document data including an input search condition, using a first full-text index storage unit for registration, a second full-text index storage unit for deletion, and a third full-text index storage unit for search. The method comprises the following steps.
Registration processing step: when performing registration processing, the registration processing means obtains tokens and information on the appearance positions of the tokens from the document data, and stores the tokens and the appearance position information in the first full-text index storage unit for registration.
Deletion processing step: when performing deletion processing, the deletion processing means determines whether the tokens are stored in the first full-text index storage unit for registration. If the appearance position information of the tokens is stored in the first full-text index storage unit, it deletes that information from the first full-text index storage unit; if not, it registers the tokens and their appearance position information in the second full-text index storage unit for deletion.
Search processing step: when performing search processing, the search processing means obtains a first set of document identifiers of document data including the input search condition using the index of the third full-text index storage unit, a second set of such document identifiers using the index of the first full-text index storage unit, and a third set of such document identifiers using the index of the second full-text index storage unit, and then outputs, as the search result, the set obtained by subtracting the third set from the union of the first set and the second set.
Merging step: the merging means executes merge processing that merges the inverted lists of the first and second full-text index storage units by adding the inverted list of a token taken out of the first full-text index storage unit to the inverted list of that token stored in the index of the third full-text index storage unit, and by deleting, from the inverted list of a token stored in the third full-text index storage unit, the appearance position information contained in the inverted list of that token taken out of the second full-text index storage unit.
[0018]
The invention of claim 7 is characterized in that, in the invention of claim 6, the merging step executes the merge processing when the number of document data items stored in the first full-text index storage unit or the second full-text index storage unit reaches a predetermined number.
[0019]
The invention of claim 8 is characterized in that, in the invention of claim 6, the merging step executes the merge processing when the capacity of the first full-text index storage unit or the second full-text index storage unit reaches a predetermined capacity.
[0049]
The invention of claim 9 is a program for causing a computer to execute the full-text search method according to any one of claims 6 to 8.
[0050]
The invention of claim 10 is a computer-readable recording medium on which the program according to claim 9 is recorded.
[0051]
DETAILED DESCRIPTION OF THE INVENTION
The present invention prepares separate small-scale full-text indexes for registration and for deletion so that the response time of registration and deletion does not deteriorate. In search processing, the full-text search device adds the search result of the small-scale full-text index for registration to the search result of the large-scale full-text index, excludes the search result of the small-scale full-text index for deletion, and returns the result to the user. The problem described above is solved by applying the technique described in the specification of Japanese Patent Application No. 2001-78026 by the present applicant to a full-text search device.
[0052]
The above-mentioned Japanese Patent Application No. 2001-78026 describes a database management system, a program, and a recording medium that can further improve update performance during system operation while keeping the ability to respond quickly to advanced search requests. Registration and deletion throughput is increased by preparing data holding units for registration and deletion separately from the data holding unit for search. However, with the method described in that specification, the data transfer means from the small-scale full-text indexes for registration and deletion to the large-scale full-text index for search acquires the original document data from the identifiers of the document data registered in the small-scale indexes, and then registers it in or deletes it from the large-scale index. As noted above, registration and deletion on a large-scale full-text index take time, so the data transfer takes a long time. Since search processing generally cannot be performed while registration or deletion on a full-text index is in progress, the response time of search processing as seen by the user deteriorates.
[0053]
In the present invention, the data transfer means from the small-scale full-text indexes to the large-scale full-text index does not use the original document data. Instead it uses the inverted lists that are the constituent elements of an inverted-file full-text index, which shortens the time required for data transfer.
[0054]
FIG. 1 is a block diagram for explaining the functions of a full-text search device according to an embodiment of the present invention. FIG. 2 is a diagram showing a hardware configuration example in which the full-text search device of FIG. 1 is configured as a stand-alone system. FIG. 3 is a diagram showing a hardware configuration example in which the full-text search device of FIG. 1 is configured as a server/client system.
[0055]
The full-text search device according to the present invention is a device that searches a plurality of document data items (a plurality of digitized documents) for documents including a specified character string. In this specification, "full-text search" means searching over all character strings that are subject to search. For a tagged document such as SGML, for example, only the character strings between predetermined tags may be targeted as appropriate.
[0056]
Referring to FIG. 1, in the present embodiment the input means 1 receives text data for registration processing, a document identifier for deletion processing, a search condition for search processing, and the like, and passes them to the registration processing means 3, the deletion processing means 4, and the search processing means 5, respectively. The registration processing means 3 performs registration processing for document data; this processing operates on the document data storage unit 7 and the registration small-scale full-text index storage unit 9. The deletion processing means 4 performs deletion processing for document data: it reads the document data stored in the document data storage unit 7 on the basis of the document identifier input through the input means 1 and, using the text dividing means 6, deletes the index entries if they are registered in the registration small-scale full-text index storage unit 9, or records them in the deletion small-scale full-text index storage unit 10 if they are not. The text dividing means 6 performs the dividing required by the registration processing means 3, the deletion processing means 4, and the search processing means 5: dividing document data into partial character strings for registration, dividing document data into partial character strings for deletion, and dividing a search condition (search character string) into partial character strings for search.
[0057]
The search processing in the search processing means 5 is executed on the search large-scale full-text index storage unit 8, the registration small-scale full-text index storage unit 9, and the deletion small-scale full-text index storage unit 10. The result of subtracting the search result for storage unit 10 from the combined search results for storage units 8 and 9 is obtained and output as the search result by the output means 2. The merging means 11 performs merge processing (also called data transfer in a broad sense) among the search large-scale full-text index storage unit 8, the registration small-scale full-text index storage unit 9, and the deletion small-scale full-text index storage unit 10.
[0058]
Although not described in detail below, the deletion processing in the deletion processing means 4 may use some other method of managing document data (and indexes) for deletion that does not use the deletion small-scale full-text index storage unit 10. For example, only the document data to be deleted may be deleted at first, and the data in the search large-scale full-text index storage unit 8 may be updated to match the document data present in the document data storage unit 7 on a holiday or at another time when processing time is available. In that form, only the registration processing using the registration small-scale full-text index storage unit 9 is performed as described here. Conversely, it is also possible to adopt a form in which only the deletion processing using the deletion small-scale full-text index storage unit 10 is performed, without the registration processing using the registration small-scale full-text index storage unit 9.
[0059]
In the stand-alone hardware configuration shown in FIG. 2, the input means 1 of FIG. 1 is realized by the input device 21 and the output means 2 by the display device 22. The processing means 3 to 6 and 11 are realized in the main control device (CPU, memory, etc.) 24. The storage units 7 to 10 are realized, for example, all in the storage device 25, as individual storage devices, or as files in the storage device 25. When a full-text search according to the present invention is performed with a single storage device of limited capacity, the areas should be allocated according to whether search processing or registration/deletion processing is the main workload. The input/output control device 23 controls the input device 21 and the display device 22 in accordance with control signals from the main control device 24.
[0060]
In the server/client hardware configuration shown in FIG. 3, the input means 1 of FIG. 1 is realized by the input device 31 of the client 30 and the output means 2 by the display device 32 of the client 30. The processing means 3 to 6 and 11 are realized in the main control devices (CPU, memory, etc.) 34 and 52 of the client 30 and the server 50. The storage units 7 to 10 are realized, for example, all in the storage device 53 of the server 50, as individual storage devices connected to the server 50, or as files in the storage device 53. The network control devices 35 and 51 of the client 30 and the server 50 control data transmission and the like between the client 30 and the server 50 via the network 40. The input/output control device 33 of the client 30 controls the input device 31 and the display device 32 in accordance with control signals from the main control device 34.
[0061]
An example of the operation of the full-text search device according to this embodiment, configured as described above, is explained in detail below.
FIGS. 4 to 6 are flowcharts for explaining a processing example in the full-text search device of FIG. 1.
When the full-text search device receives a processing request from the user (step S1), it first determines whether the request is for registration processing (step S2), deletion processing (step S3), or search processing (NO in step S3). The full-text search device then executes the corresponding processing described below.
[0062]
(registration process)
To execute registration processing, the user first creates document data and submits it for registration through the input means 1. The registration processing means 3 saves the document data in the document data storage unit 7 and at the same time determines an identifier (document identifier) for the document data (step S11). The registration processing means 3 then uses the text dividing means 6 to obtain partial character strings (tokens) and the appearance position information of each token from the document data (step S12). For a tagged document such as SGML, for example, only the character strings between predetermined tags may be targeted. Finally, the document identifier and the appearance position information of each token are recorded in the registration small-scale full-text index storage unit 9 (step S13). Here, "recording" means recording in the full-text index of the storage unit (the same applies hereinafter), and processing such as step S13 is also called an index storing step. The dividing method used by the text dividing means 6 may take sequences of N characters (N-grams) as tokens, or may take words obtained by morphological analysis as tokens. The following examples describe only the text dividing means that takes sequences of N characters as tokens, but the description applies equally to the method that takes morphologically analyzed words as tokens. As described later, merge processing is performed when appropriate after the recording in step S13 (steps S14 to S16).
[0063]
FIG. 7 is a diagram for explaining processing in the full-text search device of FIG. 1 and shows an example of the full-text index. The inverted-file full-text index is described in detail below using the example of FIG. 7.
The registered document data consists of document 1 and document 2, whose contents (here, as divided by the text dividing means 6) are indicated by reference numerals 61 and 62 in FIG. 7. The number to the left of each entry is the character position counted from the beginning of the document. That is, in document 1, "full text search" appears at the 11th character, "method" at the 20th and 60th characters, and "full text search method" at the 31st character. In document 2, "search method" appears at the 1st character, "method" at the 24th character, and "full text" at the 30th and 42nd characters.
[0064]
When sequences of 2 characters are used as partial character strings, every 2-character partial character string in a document is extracted, and its appearance positions (character positions counted from the beginning) are recorded together in the index under that partial character string. For example, in document 1 "full text" appears at positions 11 and 31 and "sentence check" appears at positions 12 and 32, so these are recorded in the index (each of these tokens is a two-character string in the original Japanese text). The index records not only the appearance positions within a document but also the document identifier, which identifies the document, and the number of appearances, so the format is as indicated by reference numeral 63 in FIG. 7. For example, the inverted lists {1, 2, (11, 31)} and {2, 2, (30, 42)} for "full text" mean that the token appears twice in document 1, at positions 11 and 31, and twice in document 2, at positions 30 and 42.
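The index construction described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names (`ngrams`, `add_document`, `inverted_list`), the nested-dictionary layout, and the ASCII sample texts are all assumptions made for the example; only the 2-character tokens, the 1-based positions, and the {document identifier, count, (positions)} entry format come from the text.

```python
from collections import defaultdict

def ngrams(text, n=2):
    """Yield (token, position) pairs; positions count characters from 1."""
    for i in range(len(text) - n + 1):
        yield text[i:i + n], i + 1

def add_document(index, doc_id, text, n=2):
    """Record each n-gram's appearance positions under its token and document."""
    for token, position in ngrams(text, n):
        index[token].setdefault(doc_id, []).append(position)

def inverted_list(index, token):
    """Return entries as (doc_id, occurrence_count, positions), as in the text."""
    entries = index.get(token, {})
    return [(doc_id, len(p), tuple(p)) for doc_id, p in sorted(entries.items())]

index = defaultdict(dict)          # token -> {doc_id: [positions]} (assumed layout)
add_document(index, 1, "abcab")    # hypothetical sample text for document 1
add_document(index, 2, "cab")      # hypothetical sample text for document 2
print(inverted_list(index, "ab"))  # [(1, 2, (1, 4)), (2, 1, (2,))]
```

Here the token "ab" plays the role that "full text" plays in FIG. 7: one entry per document, each carrying the occurrence count and the positions.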
[0065]
(Deletion process)
To execute deletion processing, the user first inputs the document identifier of the document to be deleted through the input means 1. The deletion processing means 4 then reads the document data corresponding to the document identifier from the document data storage unit 7 (step S21), and uses the text dividing means 6 to obtain partial character strings (tokens) and the appearance position information of each token from the document data (step S22). For a tagged document such as SGML, for example, only the character strings between predetermined tags may be targeted. Next, it is determined whether the document identifier is registered in the registration small-scale full-text index (step S23). If it is, the appearance position information of each token is deleted from the registration small-scale full-text index storage unit 9 (step S25). If it is not (that is, if the document is registered in the search large-scale full-text index), the document identifier and the appearance position information of each token are recorded in the deletion small-scale full-text index storage unit 10 (step S24). The deletion processing means 4 then deletes the document data corresponding to the document identifier from the document data storage unit 7 (step S29). As described later, merge processing is performed when appropriate after the recording in step S24 (steps S26 to S28).
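The branching in steps S21 to S29 can be sketched as below. This is an illustrative sketch under stated assumptions: the dictionaries `documents`, `reg_index`, and `del_index` stand in for storage units 7, 9, and 10, and the helper names `ngrams`, `record`, and `delete_document` are hypothetical. The timely merge (steps S26 to S28) is left out.

```python
def ngrams(text, n=2):
    """Return (token, position) pairs; positions count characters from 1."""
    return [(text[i:i + n], i + 1) for i in range(len(text) - n + 1)]

def record(index, doc_id, text):
    """Record every token position of `text` as token -> {doc_id: [positions]}."""
    for token, position in ngrams(text):
        index.setdefault(token, {}).setdefault(doc_id, []).append(position)

def delete_document(doc_id, documents, reg_index, del_index):
    text = documents[doc_id]                      # S21: read the stored document
    tokens = ngrams(text)                         # S22: tokens and positions
    in_registration = any(doc_id in reg_index.get(t, {}) for t, _ in tokens)  # S23
    if in_registration:
        for token, _ in tokens:                   # S25: drop it from the registration index
            entries = reg_index.get(token)
            if entries is not None:
                entries.pop(doc_id, None)
                if not entries:
                    del reg_index[token]
    else:
        record(del_index, doc_id, text)           # S24: note it in the deletion index
    del documents[doc_id]                         # S29: remove the document itself

documents = {1: "abab", 2: "bab"}    # hypothetical sample documents
reg_index, del_index = {}, {}
record(reg_index, 2, documents[2])   # document 2 is still only in the registration index
delete_document(2, documents, reg_index, del_index)
delete_document(1, documents, reg_index, del_index)  # document 1 is assumed to be in the search index
print(reg_index, del_index)
```

Deleting document 2 only shrinks the registration index, while deleting document 1 leaves the search index untouched and records its tokens in the deletion index for a later merge.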
[0066]
(Search process)
To execute search processing, the user first inputs a search character string through the input means 1. The search processing means 5 then obtains tokens from the search character string using the text dividing means 6 (step S4). Next, the search processing means 5 obtains the set (Rs) of document identifiers of document data including the search character string using the search large-scale full-text index in the search large-scale full-text index storage unit 8 (step S5), and the set (Ri) of such document identifiers using the registration small-scale full-text index in the registration small-scale full-text index storage unit 9 (step S6). It also obtains the set (Rd) of such document identifiers using the deletion small-scale full-text index in the deletion small-scale full-text index storage unit 10 (step S7). The search processing means 5 applies the following set operation to the obtained sets of document identifiers (Rs, Ri, Rd) and takes the result as the search result (R) (step S8). The resulting set of document identifiers of document data including the search character string is then output (step S9).
R = (Rs + Ri) − Rd
Here, + is the logical sum (set union) operator and − is the logical difference (set difference) operator.
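As a small illustration of the set operation above, Python's built-in sets can express it directly. The function name `combine_results` and the sample identifiers are assumptions made for this sketch.

```python
def combine_results(rs, ri, rd):
    """Return R = (Rs + Ri) - Rd as a set of document identifiers."""
    return (set(rs) | set(ri)) - set(rd)

# Hypothetical identifiers: 1-3 found via the search index, 5 and 8 via the
# registration index, and 2 via the deletion index.
print(sorted(combine_results({1, 2, 3}, {5, 8}, {2})))  # [1, 3, 5, 8]
```

Document 2 is still present in the large-scale index but is hidden from the result because the deletion index reports it.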
[0067]
The search process will be described in detail by taking the full text index 63 of FIG. 7 as an example.
If the search character string is “full text search”, the text dividing means extracts three tokens: “full text”, “sentence check”, and “search”. Next, the three inverted lists of the corresponding tokens in the full-text index 63 are examined. Looking for places where the appearance positions of successive tokens differ by 1 shows that “full text search” exists at the 11th and 31st characters of the document with document identifier 1.
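The position check described above (successive tokens whose appearance positions differ by 1) can be sketched as follows. The helper names `build_index` and `find_phrase`, the index layout, and the ASCII sample documents are assumptions for illustration; the string "abcd" stands in for the four-character search string, and document 1 is constructed so that it occurs at positions 11 and 31, mirroring the example.

```python
def build_index(documents, n=2):
    """Map token -> {doc_id: [1-based positions]} for every n-gram."""
    index = {}
    for doc_id, text in documents.items():
        for i in range(len(text) - n + 1):
            index.setdefault(text[i:i + n], {}).setdefault(doc_id, []).append(i + 1)
    return index

def find_phrase(index, query, n=2):
    """Return {doc_id: [start positions]} where the query's tokens are adjacent."""
    tokens = [query[i:i + n] for i in range(len(query) - n + 1)]
    if not tokens:
        return {}
    hits = {}
    for doc_id, starts in index.get(tokens[0], {}).items():
        matched = [
            start for start in starts
            # the k-th following token must appear exactly k characters later
            if all(start + k in index.get(token, {}).get(doc_id, ())
                   for k, token in enumerate(tokens[1:], start=1))
        ]
        if matched:
            hits[doc_id] = matched
    return hits

documents = {
    1: "." * 10 + "abcd" + "." * 16 + "abcd",  # "abcd" at positions 11 and 31
    2: "cdab.bc",                              # same tokens, but never adjacent
}
index = build_index(documents)
print(find_phrase(index, "abcd"))  # {1: [11, 31]}
```

Document 2 contains all three tokens ("ab", "bc", "cd") but not at consecutive positions, so it is not reported.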
[0068]
(Merge process)
The merge processing by the merging means 11 replaces the data transfer means of the above-mentioned Japanese Patent Application No. 2001-78026.
Compared with registration and deletion processing that uses the original document data, merge processing uses the inverted lists already created when the processing starts, so no time is needed to extract tokens by text division or to create inverted lists, and the data transfer time can be shortened. In the present invention, the data transfer processing (data transfer step) is also called merge processing (merge step) because it is processing between inverted lists. By carrying out registration and deletion of document data in the full-text search device as processing between inverted lists, the inverted lists already created are used directly when data is registered in or deleted from the full-text index for search. This shortens the time for merging into the full-text index and therefore the waiting time for search processing.
[0069]
To execute merge processing, first, for every token in the registration small-scale full-text index, (a) the inverted list of the token is extracted from that index (step S14), and (b) the extracted inverted list is added to the end of the inverted list of the corresponding token in the search large-scale full-text index (step S15). The registration small-scale full-text index is then emptied (step S16). Next, for every token in the deletion small-scale full-text index, (c) the inverted list of the token is extracted from that index (step S26), and (d) the appearance position information in the inverted list extracted in (c) is deleted from the inverted list of the corresponding token in the search large-scale full-text index (step S27). The deletion small-scale full-text index is then emptied (step S28).
[0070]
FIG. 8 is a diagram for explaining the outline of the merge process by taking the transposed list of the token “full text” in the full text index 63 in FIG. 7 as an example.
Consider the case where merge processing 73 is performed on the inverted list 71 of the full-text index for search, which is {1, 2, (11, 31)}, {2, 2, (30, 42)} for “full text”, and the inverted list 72 of the full-text index for registration, which is {5, 2, (4, 16)}, {8, 1, (3)} for “full text”. Executing merge processing 73 yields the inverted list {1, 2, (11, 31)}, {2, 2, (30, 42)}, {5, 2, (4, 16)}, {8, 1, (3)} (74) for “full text”. Further, performing merge processing 75 on this inverted list and the inverted list 76 of the full-text index for deletion, which is {1, 2, (11, 31)} for “full text”, yields the inverted list {2, 2, (30, 42)}, {5, 2, (4, 16)}, {8, 1, (3)} (77) for “full text”.
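The two merge steps can be traced on the inverted (transposed) lists of this example with a short sketch. It assumes that the deletion list 76 is {1, 2, (11, 31)} and that merged list 77 no longer contains document 1; the function names and the tuple representation are hypothetical, and emptying the small indexes afterwards (steps S16 and S28) is omitted.

```python
def merge_registration(search_list, registration_list):
    """Append the registration index's entries to the end of the search list."""
    return search_list + registration_list

def merge_deletion(search_list, deletion_list):
    """Remove the positions named in the deletion list; drop emptied entries."""
    doomed = {doc_id: set(positions) for doc_id, _count, positions in deletion_list}
    merged = []
    for doc_id, _count, positions in search_list:
        kept = tuple(p for p in positions if p not in doomed.get(doc_id, ()))
        if kept:
            merged.append((doc_id, len(kept), kept))
    return merged

# Entries are (document identifier, occurrence count, positions).
search_71 = [(1, 2, (11, 31)), (2, 2, (30, 42))]
registration_72 = [(5, 2, (4, 16)), (8, 1, (3,))]
deletion_76 = [(1, 2, (11, 31))]          # assumed content of list 76

list_74 = merge_registration(search_71, registration_72)   # merge processing 73
list_77 = merge_deletion(list_74, deletion_76)              # merge processing 75
print(list_74)
print(list_77)
```

Neither function looks at the original document text: both work only on lists that already exist, which is the source of the time saving the description attributes to merge processing.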
[0071]
(Merge processing mode 1)
The merge processing is started by the registration processing means 3 when the number of document identifiers registered in the registration small-scale full-text index in the registration small-scale full-text index storage unit 9 reaches a predetermined number, and is executed by the merging means 11.
[0072]
(Merge processing mode 2)
Alternatively, the merge processing may be started by the registration processing means 3 when the storage capacity (size) used in the registration small-scale full-text index storage unit 9 reaches a predetermined size, and executed by the merging means 11. In applications where the size of the document data registered by users varies, this form prevents merge processing from being started, when small document data is registered in succession, before registration into the registration small-scale full-text index has actually become slow. Using the size as the start condition also evens out the merge processing time. Form 1 above, on the other hand, has the advantage of simpler processing: because the number of documents is the start condition, the size of the full-text index storage unit need not be managed.
[0073]
(Merge processing mode 3)
The merge processing of the small-scale full-text index for deletion may be started by the deletion processing means 4 and executed by the merging means 11. The start condition may be that the number of document identifiers registered in the small-scale full-text index for deletion reaches a number specified in advance.
[0074]
(Merge processing mode 4)
The merge processing of the small-scale full-text index for deletion may be started by the deletion processing means 4 and executed by the merging means 11. The start condition may be that the size of the deletion small-scale full-text index storage unit 10 reaches a size specified in advance. Forms 3 and 4 have the advantage that the merge processing time can be shortened when deletion processing does not occur in large numbers.
[0075]
With the forms of merge processing described above, the full-text search device can start merging full-text indexes under conditions suited to the characteristics of the document data being registered and deleted and to the field of use. This reduces the number of merges and improves the throughput of the system as a whole. Furthermore, the merge start condition may be made variable according to the time required for merge processing, and under any start condition the merge triggered by registration and the merge triggered by deletion may be started at the same time.
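The count-based and size-based start conditions can be compared with a small sketch. The function names, the byte-based size measure, and the sample document sizes and thresholds are all assumptions for illustration; the text only specifies "a predetermined number" and "a predetermined size".

```python
from typing import Optional

def should_start_merge(doc_count: int, index_size: int,
                       max_docs: Optional[int] = None,
                       max_size: Optional[int] = None) -> bool:
    """True when the small index has reached the configured count or size."""
    by_count = max_docs is not None and doc_count >= max_docs   # forms 1 and 3
    by_size = max_size is not None and index_size >= max_size   # forms 2 and 4
    return by_count or by_size

def merge_points(doc_sizes, max_docs=None, max_size=None):
    """Return the 1-based document numbers after which a merge would start."""
    points, count, size = [], 0, 0
    for number, doc_size in enumerate(doc_sizes, start=1):
        count += 1
        size += doc_size
        if should_start_merge(count, size, max_docs, max_size):
            points.append(number)
            count, size = 0, 0        # the small index is emptied by the merge
    return points

sizes = [300, 300, 10, 10, 10, 10]          # hypothetical index growth per document
print(merge_points(sizes, max_docs=3))      # [3, 6]
print(merge_points(sizes, max_size=500))    # [2]
```

With these sample values the count condition merges after every third document regardless of size, while the size condition merges once after the two large documents and lets the four small ones accumulate.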
[0076]
In the full-text search device according to the embodiment described above, the technique described in the specification of Japanese Patent Application No. 2001-78026 by the present applicant is applied to a full-text search device that uses an inverted-file full-text index. In the data transfer means from the small-scale full-text indexes to the large-scale full-text index, the time required for data transfer is shortened by using the inverted lists that make up the inverted-file full-text index instead of the original document data.
[0077]
The full-text search device according to another embodiment of the present invention, described next, applies to the full-text search device of the above embodiment the method used in the write-delay database management method, apparatus, program, and recording medium described in Japanese Patent Application No. 2001-101024 by the present applicant. This solves the problem that, while data is being transferred from the small-scale full-text index for registration or deletion to the large-scale full-text index for search (merge processing of inverted lists), the small-scale full-text index storage unit for registration or deletion cannot be used and registration or deletion processing cannot be executed.
[0078]
FIG. 9 is a block diagram for explaining the function of the full-text search apparatus according to another embodiment of the present invention.
In the full-text search device according to the present embodiment, two small-scale full-text indexes are prepared for registration and two for deletion. While one of them is being merged (data transfer) into the large-scale full-text index, registration or deletion processing is executed using the other small-scale full-text index, which eliminates the period during which processing is impossible. That is, providing two small-scale full-text indexes for registration allows registration processing to be performed even during merge processing, and providing two small-scale full-text indexes for deletion allows deletion processing to be performed even during merge processing. According to the present embodiment, for example, when documents are read by a scanner or the like, subjected to OCR processing, and registered one by one, registration and merge processing occur frequently and in succession; even for such image data, full-text search can be performed with the same high responsiveness as for ordinary application data.
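The alternation between two registration indexes can be sketched as below. This is a single-threaded illustration under stated assumptions: the class `DoubleBufferedRegistration`, its method names, and the dictionary-based indexes are hypothetical, the merge is modeled as explicit `begin_merge` and `finish_merge` calls rather than a concurrent task, and the deletion side is omitted because it would follow the same pattern.

```python
class DoubleBufferedRegistration:
    """Two small registration indexes: one accepts writes while the other merges."""

    def __init__(self):
        self.buffers = {"A": {}, "B": {}}   # token -> {doc_id: [positions]}
        self.active = "A"                   # index that receives registrations
        self.merging = None                 # index being merged, if any
        self.search_index = {}              # stands in for the large-scale index

    def register(self, doc_id, text, n=2):
        index = self.buffers[self.active]
        for i in range(len(text) - n + 1):
            index.setdefault(text[i:i + n], {}).setdefault(doc_id, []).append(i + 1)

    def begin_merge(self):
        if self.merging is not None:
            # corresponds to waiting until the other index finishes merging
            raise RuntimeError("the other index is still merging; wait for it")
        self.merging = self.active
        self.active = "B" if self.active == "A" else "A"

    def finish_merge(self):
        if self.merging is None:
            return
        source = self.buffers[self.merging]
        for token, entries in source.items():
            self.search_index.setdefault(token, {}).update(entries)
        source.clear()                      # the merged small index is emptied
        self.merging = None

store = DoubleBufferedRegistration()
store.register(1, "abc")     # recorded in index A
store.begin_merge()          # A starts merging, B becomes the active index
store.register(2, "bcd")     # accepted by B while A is being merged
store.finish_merge()
print(store.active, store.search_index, store.buffers)
```

Registration of document 2 does not have to wait for the merge of index A, which is the unprocessable period the embodiment aims to remove.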
[0079]
The embodiment is described on the assumption that the registration small-scale full-text index storage unit 9 described with FIG. 1 consists of two storage units, a registration small-scale full-text index storage unit A (9a) and a registration small-scale full-text index storage unit B (9b), and that the deletion small-scale full-text index storage unit 10 described with FIG. 1 consists of two storage units, a deletion small-scale full-text index storage unit A (10a) and a deletion small-scale full-text index storage unit B (10b). The hardware configuration examples described with reference to FIGS. 2 and 3 can also be applied to the full-text search device according to the present embodiment. It is, however, effective to provide one or more of these storage units in memory rather than in the storage devices 25 and 53.
[0080]
An example of the operation of the full-text search apparatus according to this embodiment, configured as described above, is described in detail below.
FIGS. 10 to 12 are flowcharts for explaining a processing example in the full-text search apparatus of the present embodiment.
When the full-text search apparatus receives a processing request from the user (step S31), it first determines whether the request is for registration processing (step S32), deletion processing (step S33), or search processing (NO in step S33). The full-text search device then executes one of the following processes based on this determination.
[0081]
(registration process)
To execute the registration process, the user first creates document data and registers the document data from the input means 1. The registration processing means 3 saves the document data in the document data storage unit 7, and simultaneously determines an identifier (document identifier) indicating the document data (step S41). Further, the registration processing means 3 obtains a partial character string (token) and the appearance position information of the token from the document data by using the text dividing means 6 (step S42). Note that the division method and full-text index used in the text division means 6 are as described above. The document identifier and the appearance position information of each token are recorded in the registration small full-text index storage unit (for example, the registration small full-text index storage unit A (9a)) at that time (step S43).
[0082]
After the recording in step S43, a timely merge process is performed. Here, it is assumed that the process is performed based on the merge start condition. First, it is determined whether the merge start condition is satisfied as a result of recording in step S43 (step S44). If the merge start condition is not satisfied, the process ends. Each form of the conditions for starting the merge processing described in the embodiments of FIGS. 1 to 8 can also be applied to this embodiment. Further, it is also determined whether the other registration small full-text index storage unit (here, the registration small full-text index storage unit B (9b)) is executing the merge process (step S45). If it is being executed, it waits for the merge process to end.
[0083]
When the merge start condition is satisfied and the other registration small full-text index storage unit B (9b) is not executing the merge process, a merge process (steps S47 to S49), described later, is started for the registration small full-text index A in the registration small full-text index storage unit A (9a), and the storage unit to be used for recording in the next registration process is switched from the registration small full-text index storage unit A (9a) to the other registration small full-text index storage unit B (9b) (step S46). Once the merge process is started, the merge unit 11 executes it asynchronously with the registration processing means 3.
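The registration flow of steps S42 to S46 can be illustrated with a minimal sketch. The class name, the whitespace tokenizer, and the document-count threshold used as the merge start condition are all assumptions made for illustration; the source leaves the tokenization to the text dividing means 6 and allows several forms of merge start condition. Where the source waits for the other storage unit to finish merging, this sketch simply defers the merge.

```python
from collections import defaultdict


class DoubleBufferedRegistration:
    """Two small registration indexes: one takes writes while the other merges."""

    def __init__(self, merge_threshold=2):
        # Each small index maps token -> list of (document id, position).
        self.small = {"A": defaultdict(list), "B": defaultdict(list)}
        self.doc_counts = {"A": 0, "B": 0}
        self.active = "A"      # storage unit used for the next registration
        self.merging = set()   # storage units currently being merged
        # Hypothetical merge start condition: number of registered documents.
        self.merge_threshold = merge_threshold

    def register(self, doc_id, text):
        """Record a document; return the unit handed to the merger, if any."""
        index = self.small[self.active]
        for position, token in enumerate(text.split()):  # steps S42-S43
            index[token].append((doc_id, position))
        self.doc_counts[self.active] += 1

        if self.doc_counts[self.active] < self.merge_threshold:  # step S44
            return None
        other = "B" if self.active == "A" else "A"
        if other in self.merging:  # step S45
            # The source waits here; this sketch simply defers the merge.
            return None
        to_merge = self.active  # step S46: start merge and switch units
        self.merging.add(to_merge)
        self.active = other
        return to_merge

    def finish_merge(self, name):
        """Called by the merger once the unit has been merged and emptied."""
        self.small[name].clear()
        self.doc_counts[name] = 0
        self.merging.discard(name)
```

A caller would pass the returned unit name to a separate merger (for example a worker thread) and call `finish_merge` when that merger is done, which mirrors the asynchronous relationship between the merge unit 11 and the registration processing means 3.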
[0084]
(Deletion process)
To execute the deletion process, the user first inputs the document identifier of the document to be deleted from the input means 1. Next, the deletion processing means 4 reads out document data corresponding to the document identifier from the document data storage unit 7 (step S51). Further, the deletion processing unit 4 obtains a partial character string (token) and the appearance position information of the token from the document data using the text dividing unit 6 (step S52).
[0085]
Next, it is determined whether the document identifier is registered in the registration small full-text index (step S53). If it is, the appearance position information for that document is deleted from the registration small-scale full-text index storage unit 9 (9a and 9b) (step S55). If the document identifier is not registered in the registration small full-text index (that is, if it is registered in the search large-scale full-text index), the document identifier and the appearance position information of each token are recorded in the deletion small full-text index storage unit currently in use (for example, the deletion small full-text index storage unit A (10a)) (step S54). The deletion processing means 4 then deletes the document data corresponding to the document identifier from the document data storage unit 7 (step S62).
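The branch in steps S53 to S55 and S62 can be sketched as follows. This is an illustrative sketch only: the function name, the dictionary-based index layout (token mapped to a list of (document id, position) pairs), and the `tokenize` parameter are assumptions, not structures defined in the source.

```python
def delete_document(doc_id, documents, reg_indexes, del_index, tokenize):
    """Route a deletion to the registration indexes or to the deletion index.

    documents   : dict doc_id -> text (stands in for document data storage unit 7)
    reg_indexes : list of small registration indexes (e.g. A and B)
    del_index   : small deletion index currently in use
    Returns True if the document was still in a registration index.
    """
    text = documents[doc_id]                                  # step S51
    postings = [(tok, pos) for pos, tok in enumerate(tokenize(text))]  # S52

    in_registration = any(                                    # step S53
        any(d == doc_id for d, _ in plist)
        for index in reg_indexes
        for plist in index.values()
    )
    if in_registration:
        # Step S55: the document never reached the large index, so it is
        # enough to drop its postings from the small registration indexes.
        for index in reg_indexes:
            for token in list(index):
                index[token] = [(d, p) for d, p in index[token] if d != doc_id]
                if not index[token]:
                    del index[token]
    else:
        # Step S54: remember what must later be removed from the large index.
        for token, position in postings:
            del_index.setdefault(token, []).append((doc_id, position))

    del documents[doc_id]                                     # step S62
    return in_registration
```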
[0086]
After the recording in step S54, a timely merge process is performed. Here, it is assumed that the process is performed based on the merge start condition. First, it is determined whether the merge start condition is satisfied as a result of recording in step S54 (step S56). If the merge start condition is not satisfied, the process ends (the process in step S62 is necessary). Each form of the conditions for starting the merge processing described in the embodiments of FIGS. 1 to 8 can also be applied to this embodiment. Further, it is also determined whether or not the other small-scale full-text index storage unit for deletion (here, the small-scale full-text index storage unit B (10b) for deletion) is executing the merge process (step S57). If it is being executed, it waits for the merge process to end.
[0087]
When the merge start condition is satisfied and the other deletion small full-text index storage unit B (10b) is not executing the merge process, a merge process (steps S59 to S61), described later, is started for the deletion small full-text index A in the deletion small full-text index storage unit A (10a), and the storage unit to be used for recording in the next deletion process is switched from the deletion small full-text index storage unit A (10a) to the other deletion small full-text index storage unit B (10b) (step S58). Once the merge process is started, the merge unit 11 executes it asynchronously with the deletion processing means 4.
[0088]
(Search process)
To execute the search process, the user first inputs a search character string from the input means 1. Next, the search processing means 5 obtains tokens from the search character string using the text dividing means 6 (step S34). The search processing means 5 then obtains a set (Rs) of document identifiers of the document data including the search character string by using the search large-scale full-text index in the search large-scale full-text index storage unit 8 (step S35). The search processing means 5 obtains a set (RiA) of document identifiers of the document data including the search character string by using the registration small full-text index A in the registration small full-text index storage unit A (9a), and a set (RiB) by using the registration small full-text index B in the registration small full-text index storage unit B (9b) (step S36). Further, the search processing means 5 obtains a set (RdA) of document identifiers of the document data including the search character string by using the deletion small full-text index A in the deletion small full-text index storage unit A (10a), and a set (RdB) by using the deletion small full-text index B in the deletion small full-text index storage unit B (10b) (step S37).
[0089]
The search processing means 5 performs the following set operation on the obtained sets of document identifiers (Rs, RiA, RiB, RdA, RdB) and takes the result as the search result (R) (step S38). The set of document identifiers of the document data including the search character string is then output to the user through the output means 3 (step S39).
R = Rs + RiA + RiB − RdA − RdB
Here, + is the logical sum (set union) operator and − is the logical difference (set difference) operator.
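The set operation of step S38 maps directly onto Python's built-in set operators. The function name and the sample document identifiers below are hypothetical and serve only to illustrate the formula.

```python
def combine_results(rs, ri_a, ri_b, rd_a, rd_b):
    """R = Rs + RiA + RiB - RdA - RdB, with + as union and - as difference."""
    return (rs | ri_a | ri_b) - rd_a - rd_b


# Hypothetical example: documents 1-3 are in the large index, 4 and 5 were
# registered recently, and documents 2 and 3 were deleted recently.
result = combine_results(
    rs={1, 2, 3},
    ri_a={4},
    ri_b={5},
    rd_a={2},
    rd_b={3},
)
```

The deletion sets are subtracted last so that a document still present in the large-scale index but already deleted by the user does not appear in the result.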
[0090]
(Merge process)
To execute the merge process for a registration small full-text index, the following steps are performed for every token in the registration small full-text index to be merged (here, the registration small full-text index A): (a) the inverted (transposed) list of the token is extracted from that full-text index (step S47), and (b) the extracted list is appended to the end of the inverted list of the corresponding token in the search large-scale full-text index (step S48). The registration small full-text index A is then emptied (step S49).
[0091]
To execute the merge process for a deletion small full-text index, the following steps are performed for every token in the deletion small full-text index to be merged (here, the deletion small full-text index A): (c) the inverted (transposed) list of the token is extracted from that full-text index (step S59), and (d) the appearance position information contained in the list extracted in (c) is deleted from the inverted list of the corresponding token in the search large-scale full-text index (step S60). The deletion small full-text index A is then emptied (step S61). Note that the example of merge processing of the transposed list described with reference to FIG. 8 can also be applied to this embodiment.
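The two merge procedures, steps S47 to S49 and steps S59 to S61, can be sketched as below. The dictionary layout (token mapped to a list of (document id, position) pairs) and the function names are assumptions for illustration, and removing a token whose list becomes empty is a choice of this sketch rather than something the source specifies.

```python
def merge_registration(small, large):
    """Append every inverted list of the small index to the large index."""
    for token in list(small):
        extracted = small[token]                      # (a) step S47
        large.setdefault(token, []).extend(extracted)  # (b) step S48
    small.clear()                                     # step S49


def merge_deletion(small, large):
    """Remove from the large index every posting listed in the small index."""
    for token in list(small):
        to_remove = set(small[token])                 # (c) step S59
        if token in large:                            # (d) step S60
            large[token] = [p for p in large[token] if p not in to_remove]
            if not large[token]:
                # Sketch-only choice: drop tokens with no remaining postings.
                del large[token]
    small.clear()                                     # step S61
```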
[0092]
As a variation of the embodiment described with reference to FIGS. 9 to 12, a full-text search device that uses three or more registration small full-text index storage units and/or three or more deletion small full-text index storage units is described below as the next embodiment with reference to the drawings.
[0093]
FIG. 13 is a block diagram for explaining the function of the full-text search apparatus according to another embodiment of the present invention.
In the full-text search apparatus according to the present embodiment, three or more small full-text indexes are prepared for registration and for deletion (three each in the illustrated example). While the other two small full-text indexes are being merged (their data transferred) into the large-scale full-text index, registration or deletion processing is executed using the remaining small full-text index, which eliminates any period during which processing is impossible. That is, by providing a plurality of small-scale full-text indexes for registration, registration processing can be performed even while merge processing is being performed on a plurality of small-scale full-text indexes for registration or while other registration processing is in progress. Likewise, by providing a plurality of small-scale full-text indexes for deletion, deletion processing can be performed even while merge processing is being performed on a plurality of small-scale full-text indexes or while other deletion processing is in progress. Note that in practice the time required for registration and deletion is shorter than the merge time, so merge processes often overlap.
[0094]
This embodiment is described on the assumption that the registration small full-text index storage unit 9 described with reference to FIG. 1 has three storage units, a registration small full-text index storage unit A (9a), a registration small full-text index storage unit B (9b), and a registration small full-text index storage unit C (9c), and that the deletion small full-text index storage unit 10 described with reference to FIG. 1 has three storage units, a deletion small full-text index storage unit A (10a), a deletion small full-text index storage unit B (10b), and a deletion small full-text index storage unit C (10c). Note that the hardware configuration example described with reference to FIGS. 2 and 3 can also be applied to the full-text search apparatus according to the present embodiment. However, it is effective to provide one or more of these storage units in memory rather than in the storage devices 25 and 53.
[0095]
An example of the operation of the full-text search apparatus according to this embodiment, configured as described above, is described in detail below.
FIGS. 14 to 16 are flowcharts for explaining a processing example in the full-text search apparatus of the present embodiment.
When the full-text search apparatus receives a processing request from the user (step S71), it first determines whether the request is for registration processing (step S72), deletion processing (step S73), or search processing (NO in step S73). The full-text search device then executes one of the following processes based on this determination.
[0096]
(registration process)
To execute the registration process, the user first creates document data and inputs the document data from the input means 1. The registration processing means 3 saves the document data in the document data storage unit 7 and, at the same time, determines an identifier (document identifier) indicating the document data (step S81). Further, the registration processing means 3 obtains partial character strings (tokens) and the appearance position information of each token from the document data using the text dividing means 6 (step S82). Note that the division method and full-text index used in the text dividing means 6 are as described above. The document identifier and the appearance position information of each token are recorded in the registration small full-text index storage unit currently in use (for example, the registration small full-text index storage unit A (9a)) (step S83).
[0097]
After the recording in step S83, a merge process is performed as appropriate. Here, it is assumed that the merge is triggered by a merge start condition. First, it is determined whether the merge start condition is satisfied as a result of the recording in step S83 (step S84). If the merge start condition is not satisfied, the process ends. Each form of the merge start condition described in the embodiments of FIGS. 1 to 8 can also be applied to this embodiment. Further, it is determined whether the other registration small full-text index storage unit (here, the registration small full-text index storage unit B (9b)) is executing the merge process (step S85). If it is, it is determined whether the third registration small full-text index storage unit (here, the registration small full-text index storage unit C (9c)) is executing the merge process (step S86). At the same time, it is also determined whether storage units B and C are executing registration processing. If storage unit C is also found to be busy in step S86, the process waits until that processing ends. In the following, only the merge process, which is the most likely case, is considered.
[0098]
When the merge start condition is satisfied and at least one of the other registration small full-text index storage units B (9b) / C (9c) is not executing the merge process, a merge process (steps S89 to S91) similar to steps S47 to S49 in FIG. 11 is started for the registration small full-text index A in the registration small full-text index storage unit A (9a), and the storage unit to be used for recording in the next registration process is switched from the registration small full-text index storage unit A (9a) to the other registration small full-text index storage unit B (9b) / C (9c), whichever is not executing the merge process (the same applies below) (steps S87 / S88). Once the merge process is started, the merge unit 11 executes it asynchronously with the registration processing means 3.
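The selection made in steps S85 to S88, picking whichever other storage unit is not merging, can be sketched as a small helper. The function name and the convention of returning `None` when every other unit is busy (the case in which the source waits) are assumptions of this sketch.

```python
def pick_next_unit(current, units, merging):
    """Return the first storage unit other than `current` that is not merging.

    units   : ordered names of the small index storage units, e.g. A, B, C
    merging : set of names whose merge process is still running
    Returns None when every other unit is busy, i.e. the caller must wait.
    """
    for unit in units:
        if unit != current and unit not in merging:
            return unit
    return None


units = ["A", "B", "C"]
next_when_idle = pick_next_unit("A", units, merging=set())       # B is free
next_when_b_busy = pick_next_unit("A", units, merging={"B"})     # fall back to C
next_when_all_busy = pick_next_unit("A", units, merging={"B", "C"})
```

Because the helper takes the list of units as a parameter, the same logic covers two, three, or more storage units without change.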
[0099]
(Deletion process)
To execute the deletion process, the user first inputs the document identifier of the document to be deleted from the input means 1. Next, the deletion processing means 4 reads out document data corresponding to the document identifier from the document data storage unit 7 (step S101). Further, the deletion processing unit 4 obtains a partial character string (token) and the appearance position information of the token from the document data by using the text dividing unit 6 (step S102).
[0100]
Next, it is determined whether the document identifier is registered in the registration small full-text index (step S103). If it is, the appearance position information for that document is deleted from the registration small-scale full-text index storage unit 9 (9a, 9b, and 9c) (step S105). If the document identifier is not registered in the registration small full-text index (that is, if it is registered in the search large-scale full-text index), the document identifier and the appearance position information of each token are recorded in the deletion small full-text index storage unit currently in use (for example, the deletion small full-text index storage unit A (10a)) (step S104). The deletion processing means 4 then deletes the document data corresponding to the document identifier from the document data storage unit 7 (step S114).
[0101]
After the recording in step S104, a merge process is performed as appropriate. Here, it is assumed that the merge is triggered by a merge start condition. First, it is determined whether the merge start condition is satisfied as a result of the recording in step S104 (step S106). If the merge start condition is not satisfied, the process ends (the process in step S114 is still necessary). Each form of the merge start condition described in the embodiments of FIGS. 1 to 8 can also be applied to this embodiment. Further, it is determined whether the other deletion small full-text index storage unit (here, the deletion small full-text index storage unit B (10b)) is executing the merge process (step S107). If it is, it is determined whether the third deletion small full-text index storage unit (here, the deletion small full-text index storage unit C (10c)) is executing the merge process (step S108). At the same time, it is also determined whether storage units B and C are executing deletion processing. If storage unit C is also found to be busy in step S108, the process waits until that processing ends. In the following, only the merge process, which is the most likely case, is considered.
[0102]
When the merge start condition is satisfied and at least one of the other deletion small full-text index storage units B (10b) / C (10c) is not executing the merge process, a merge process (steps S111 to S113) similar to steps S59 to S61 in FIG. 12 is started for the deletion small full-text index A in the deletion small full-text index storage unit A (10a), and the storage unit to be used for recording in the next deletion process is switched from the deletion small full-text index storage unit A (10a) to the other deletion small full-text index storage unit B (10b) / C (10c) (steps S109 / S110). Once the merge process is started, the merge unit 11 executes it asynchronously with the deletion processing means 4.
[0103]
(Search process)
The search processing according to the present embodiment is basically the same as the search processing described with reference to FIG. 10, and steps S34 to S39 in FIG. 10 correspond to steps S74 to S79, respectively. However, in step S76, in addition to the sets RiA and RiB, the search processing means 5 obtains a set (RiC) of document identifiers of the document data including the search character string by using the registration small full-text index C in the registration small full-text index storage unit C (9c). Further, in step S77, in addition to the sets RdA and RdB, the search processing means 5 obtains a set (RdC) of document identifiers of the document data including the search character string by using the deletion small full-text index C in the deletion small full-text index storage unit C (10c). The search processing means 5 performs the following set operation on the obtained sets of document identifiers (Rs, RiA, RiB, RiC, RdA, RdB, RdC) and takes the result as the search result (R) (step S78).
R = Rs + RiA + RiB + RiC − RdA − RdB − RdC
Here, + is the logical sum (set union) operator and − is the logical difference (set difference) operator.
[0104]
In the embodiment of FIGS. 9 to 12 and the embodiment of FIGS. 13 to 16, full-text search devices using a plurality of registration small full-text index storage units and/or a plurality of deletion small full-text index storage units were described. A form that can be applied when these full-text index storage units (other than the search large-scale full-text index storage unit) are treated as individual files stored in the storage device 25 or 35 described with reference to FIGS. 2 and 3, or in memory, is described as the next embodiment with reference to FIGS. 17 to 20.
[0105]
FIG. 17 is a block diagram for explaining functions of a full-text search apparatus according to another embodiment of the present invention.
In the full-text search apparatus according to the present embodiment, one small full-text index for registration and one for deletion are prepared in advance. If, while a small full-text index is being merged (its data transferred) into the large-scale full-text index, no registration (or deletion) full-text index storage unit is available to store the registration (or deletion) full-text index during registration (or deletion) processing, another small-scale full-text index is newly created and the registration or deletion processing is executed with it, which eliminates any period during which processing is impossible. That is, by providing additional small-scale full-text indexes for registration as needed, registration processing can be performed even while merge processing is being performed on a plurality of small-scale full-text indexes for registration or while other registration processing is in progress. Likewise, by providing additional small-scale full-text indexes for deletion as needed, deletion processing can be performed even while merge processing is being performed on a plurality of small-scale full-text indexes or while other deletion processing is in progress. Note that in practice the time required for registration and deletion is shorter than the merge time, so merge processes often overlap.
[0106]
The full-text search apparatus according to the present embodiment includes a storage unit management unit 12 that manages registration small full-text index storage units other than the registration small full-text index storage unit A (9a). For deletion processing, the storage unit management unit 12 likewise manages deletion small full-text index storage units other than the deletion small full-text index storage unit A (10a). The storage unit management unit 12 includes a unit that newly creates another registration full-text index storage unit when, during registration processing, no registration full-text index storage unit capable of storing the registration full-text index exists. It also includes a unit that deletes surplus registration (or deletion) full-text index storage units, that is, those not scheduled to be used in the next processing.
[0107]
Further, the present embodiment is described on the assumption that the registration small full-text index storage unit 9 described with reference to FIG. 1 starts with only the registration small full-text index storage unit A (9a), that registration small full-text index storage units B (9b), C (9c), D (9d), ... are added as needed (in no particular order), and that they are deleted as needed. Similarly, the deletion small full-text index storage unit 10 described with reference to FIG. 1 starts with only the deletion small full-text index storage unit A (10a), and deletion small full-text index storage units B (10b), C (10c), D (10d), ... are added as needed (in no particular order) and deleted as needed.
[0108]
Using the registration small full-text index storage units that are created and deleted as needed, the registration processing means 3 performs registration processing with another registration full-text index storage unit while data is being merged from one registration full-text index storage unit into the search large-scale full-text index storage unit 8 (or while other registration processing is in progress). Similarly, using the deletion small full-text index storage units that are created and deleted as needed, the deletion processing means 4 performs deletion processing with another deletion full-text index storage unit while data is being merged from one deletion full-text index storage unit into the search large-scale full-text index storage unit 8 (or while other deletion processing is in progress). Note that the hardware configuration example described with reference to FIGS. 2 and 3 can also be applied to the full-text search apparatus according to the present embodiment. However, it is effective to provide one or more of these storage units in memory rather than in the storage devices 25 and 53.
[0109]
An example of the operation of the full-text search apparatus according to this embodiment, configured as described above, is described in detail below.
FIGS. 18 to 20 are flowcharts for explaining a processing example in the full-text search apparatus of the present embodiment.
When the full-text search apparatus receives a processing request from the user (step S121), it first determines whether the request is for registration processing (step S122), deletion processing (step S123), or search processing (NO in step S123). The full-text search device then executes one of the following processes based on this determination.
[0110]
(registration process)
To execute the registration process, the user first creates document data and registers the document data from the input means 1. The registration processing means 3 saves the document data in the document data storage unit 7, and simultaneously determines an identifier (document identifier) indicating the document data (step S131). Further, the registration processing means 3 obtains the partial character string (token) and the appearance position information of the token from the document data by using the text dividing means 6 (step S132). Note that the division method and full-text index used in the text division means 6 are as described above.
[0111]
The storage unit management unit 12 determines, in response to an instruction from the registration processing means 3 or at appropriate times, whether a registration small full-text index storage unit that can currently be used exists (step S133). If none exists, another registration small full-text index storage unit (for example, a registration small full-text index storage unit C) is newly created (step S135). The document identifier and the appearance position information of each token are then recorded in the usable registration small full-text index storage unit (for example, the existing registration small full-text index storage unit A (9a), or the newly created storage unit C) (steps S134 / S136).
[0112]
After the recording in steps S134 / S136, a timely merge process is performed. Here, the description will be made assuming that the process is performed based on the merge start condition. First, it is determined whether or not the merge start condition is satisfied as a result of recording in steps S134 / S136 (step S137). If the merge start condition is not satisfied, the process ends. Each form of the conditions for starting the merge processing described in the embodiments of FIGS. 1 to 8 can also be applied to this embodiment. Further, it is also determined whether or not the other small-scale full-text index storage unit for registration (here, the small-scale full-text index storage unit for registration B (9b) / A (9a)) is executing the merge process (step S138). It is also determined at the same time whether the storage unit B / A is executing the registration process. If the process is being executed in step S138, the process waits for the end. Hereinafter, only the case where the most likely merge process is being executed will be described.
[0113]
When the merge start condition is satisfied and the other registration small full-text index storage unit B (9b) / A (9a) is not executing the merge process, a merge process (steps S140 to S142) similar to steps S47 to S49 in FIG. 11 is started for the registration small full-text index A / C in the registration small full-text index storage unit A (9a) / C, and the storage unit to be used for recording in the next registration process is switched from the registration small full-text index storage unit A (9a) / C to the other registration small full-text index storage unit B (9b) / A (9a) (step S139). Once the merge process is started, the merge unit 11 executes it asynchronously with the registration processing means 3. Further, the storage unit management unit 12 may delete surplus registration full-text index storage units (those not scheduled to be used in the next processing) at the time of merge processing or at other appropriate times.
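The role of the storage unit management unit 12 (reuse a free storage unit if one exists, otherwise create one, and delete surplus units afterwards) can be sketched as follows. The class and method names, the `reg-N` naming scheme, and the rule that a unit counts as surplus when it is idle and empty are all assumptions of this sketch; the source only says that units not scheduled for use in the next processing may be deleted.

```python
import itertools


class StorageUnitManager:
    """Creates small index storage units on demand and removes surplus ones."""

    def __init__(self):
        self._counter = itertools.count()
        self.units = {}      # name -> small index (token -> postings)
        self.busy = set()    # names of units whose merge is in progress

    def acquire(self):
        """Return a usable unit, creating one if none exists (S133/S135)."""
        for name in self.units:
            if name not in self.busy:
                return name
        name = f"reg-{next(self._counter)}"   # hypothetical naming scheme
        self.units[name] = {}
        return name

    def start_merge(self, name):
        self.busy.add(name)

    def finish_merge(self, name):
        """The merged unit is emptied and becomes usable again."""
        self.units[name].clear()
        self.busy.discard(name)

    def drop_surplus(self, keep=1):
        """Delete idle, empty units beyond the first `keep` of them."""
        idle = [n for n in self.units if n not in self.busy and not self.units[n]]
        for name in idle[keep:]:
            del self.units[name]
```

Keeping at least one idle unit means the next registration does not have to pay the cost of creating a storage unit, while the number of units still shrinks once a burst of overlapping merges has finished.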
[0114]
(Deletion process)
To execute the deletion process, the user first inputs the document identifier of the document to be deleted from the input means 1. Next, the deletion processing means 4 reads out the document data corresponding to the document identifier from the document data storage unit 7 (step S151). Further, the deletion processing unit 4 obtains a partial character string (token) and the appearance position information of the token from the document data using the text dividing unit 6 (step S152).
[0115]
Next, it is determined whether the document identifier is registered in the registration small full-text index (step S153). If it is, the appearance position information for that document is deleted from all the registration small full-text index storage units 9 (9a and so on) (step S155). If the document identifier is not registered in the registration small full-text index (that is, if it is registered in the search large-scale full-text index), it is recorded in a deletion small full-text index storage unit as follows.
[0116]
The storage unit management unit 12 determines, in response to an instruction from the deletion processing means 4 or at appropriate times, whether a deletion small full-text index storage unit that can currently be used exists (step S154). If none exists, another deletion small full-text index storage unit (for example, a deletion small full-text index storage unit C) is newly created (step S157). The document identifier and the appearance position information of each token are then recorded in the usable deletion small full-text index storage unit (for example, the existing deletion small full-text index storage unit A (10a), or the newly created storage unit C) (steps S156 / S158). The deletion processing means 4 then deletes the document data corresponding to the document identifier from the document data storage unit 7 (step S175).
[0117]
After the recording in steps S156 / S158, a merge process is performed as appropriate. Here, it is assumed that the merge is triggered by a merge start condition. First, it is determined whether the merge start condition is satisfied as a result of the recording in steps S156 / S158 (step S159). If the merge start condition is not satisfied, the process ends (the process in step S175 is still necessary). Each form of the merge start condition described in the embodiments of FIGS. 1 to 8 can also be applied to this embodiment. Further, it is determined whether the other deletion small full-text index storage unit (here, the deletion small full-text index storage unit B (10b) / A (10a)) is executing the merge process (step S170). At the same time, it is also determined whether the storage unit B / A is executing deletion processing. If that storage unit is found to be busy in step S170, the process waits until that processing ends. In the following, only the merge process, which is the most likely case, is considered.
[0118]
When the merge start condition is satisfied and none of the other small-scale full-text index storage units B (10b) / C (10c) for deletion is executing merge processing, merge processing (steps S172 to S174) similar to steps S59 to S61 in FIG. 12 is started for the small-scale full-text index for deletion in the small-scale full-text index storage unit A (10a) / C for deletion, and the storage unit that will record the next deletion processing is switched from the small-scale full-text index storage unit A (10a) / C for deletion to another small-scale full-text index storage unit B (10b) / A (10a) for deletion (step S171). Once merge processing has been started, the merging means 11 executes it asynchronously with the deletion processing means 4. The storage unit management means 12 may also delete any surplus small-scale full-text index storage unit for deletion (one not scheduled for use in the next processing) during merge processing or at any other appropriate time.
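The switch of step S171 and the asynchronous merge can be sketched as follows. This is a minimal illustration in Python under several assumptions, not the implementation of the embodiment: the class `DeletionIndexSwitcher` and the callback `merge_fn` are hypothetical, the merge start condition is taken to be a document-count threshold, asynchrony is modelled with a thread, and only two storage units (A and B) are modelled.

```python
import threading


class DeletionIndexSwitcher:
    """Hypothetical two-unit model of the switch between deletion index storage units."""

    def __init__(self, merge_fn, threshold=2):
        self.units = {"A": [], "B": []}  # unit name -> recorded document identifiers
        self.active = "A"                # unit that records the next deletion
        self.merge_fn = merge_fn         # stands in for the merging means 11
        self.threshold = threshold       # assumed merge start condition
        self.merge_thread = None

    def record(self, doc_id):
        self.units[self.active].append(doc_id)
        # Step S159: stop here unless the merge start condition is satisfied.
        if len(self.units[self.active]) < self.threshold:
            return
        # Step S170: if the other unit is still being merged, wait for it to end.
        if self.merge_thread is not None:
            self.merge_thread.join()
        full = self.active
        # Step S171: later deletions are recorded in the other unit.
        self.active = "B" if full == "A" else "A"
        # Steps S172 to S174: merge the full unit asynchronously.
        self.merge_thread = threading.Thread(target=self._merge, args=(full,))
        self.merge_thread.start()

    def _merge(self, name):
        self.merge_fn(list(self.units[name]))
        self.units[name].clear()
```

Because the merge thread only touches the unit that has just been switched out, a deletion recorded while the merge is running goes to the other unit and does not have to wait.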
[0119]
(Search process)
The search processing according to the present embodiment is basically the same as the search processing described with reference to FIG. 10, and steps S34 to S39 in FIG. 10 correspond to steps S124 to S129, respectively. In step S126, however, the search processing means 5 obtains, as the set Ri, the set of document identifiers of document data containing the search character string by using the small-scale full-text indexes for registration in all of the small-scale full-text index storage units for registration that currently exist. Likewise, in step S127, the search processing means 5 obtains, as the set Rd, the set of document identifiers of document data containing the search character string by using the small-scale full-text indexes for deletion in all of the small-scale full-text index storage units for deletion that currently exist.
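The way the sets are combined can be sketched as follows. The claims state that the search result is the union of the set from the large-scale index and the set from the registration indexes, minus the set from the deletion indexes. In this minimal Python sketch the function name `search` and the variable `r` for the large-scale index result are assumptions, each index is simplified to a dictionary from a token to a set of document identifiers, and the search character string is assumed to be a single token (appearance positions are ignored).

```python
def search(query, large_index, registration_units, deletion_units):
    """Return the identifiers of documents that contain `query` and are not deleted."""
    # Set from the large-scale full-text index for search.
    r = set(large_index.get(query, set()))
    # Step S126: set Ri, gathered from every registration index that currently exists.
    ri = set()
    for unit in registration_units:
        ri |= unit.get(query, set())
    # Step S127: set Rd, gathered from every deletion index that currently exists.
    rd = set()
    for unit in deletion_units:
        rd |= unit.get(query, set())
    # Result: (R union Ri) minus Rd.
    return (r | ri) - rd
```

Looping over all units is what distinguishes this embodiment from the single-unit case: a newly created unit C takes part in the search as soon as it exists.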
[0120]
As described above, each embodiment has been described with a focus on the full-text search device of the present invention. However, as is clear from the processing procedures described for the device, the present invention can also take the form of a full-text search method in a full-text search system. Furthermore, the present invention can take the form of a program for causing a computer to function as the full-text search device or as each of its means, a program for causing a computer to execute the full-text search method or its processing procedure, or a computer-readable recording medium on which any of these programs is recorded.
[0121]
An embodiment of a recording medium storing a program and data for realizing the full-text search function according to the present invention will now be described. Specific examples of the recording medium include a CD-ROM, a magneto-optical disk, a DVD-ROM, an FD, a flash memory, and various other ROMs and RAMs. Recording on such a medium a program that causes a computer to execute the functions of the system described above and thereby realizes the full-text search function, and distributing the medium, makes the function easy to realize. The full-text search function according to the present invention can then be executed by mounting the recording medium on an information processing apparatus such as a computer and having the apparatus read the program, or by storing the program in a storage medium provided in the information processing apparatus and reading it out.
[0122]
[Effect of the invention]
According to the present invention, registration processing and deletion processing in the full-text search device are performed on small-scale full-text index storage units, so the processing time is kept short and the response time seen by the user can be shortened.
[0123]
In addition, according to the present invention, providing a plurality of small-scale full-text indexes for registration makes it possible to perform registration processing even while merge processing is in progress, and providing a plurality of small-scale full-text indexes for deletion likewise makes it possible to perform deletion processing even while merge processing is in progress.
[Brief description of the drawings]
FIG. 1 is a block diagram for explaining functions of a full-text search apparatus according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating an example of a hardware configuration when the full-text search apparatus in FIG. 1 is configured as a stand-alone.
FIG. 3 is a diagram illustrating a hardware configuration example when the full-text search apparatus in FIG. 1 is configured by a server / client.
FIG. 4 is a flowchart for explaining a processing example in the full-text search apparatus of FIG. 1.
FIG. 5 is a flowchart for explaining a processing example in the full-text search apparatus of FIG. 1.
FIG. 6 is a flowchart for explaining a processing example in the full-text search apparatus of FIG. 1.
FIG. 7 is a diagram for explaining processing in the full-text search apparatus of FIG. 1, showing an example of a full-text index.
FIG. 8 is a diagram for explaining the outline of the merge processing, taking the inverted list of the token "full text" in the full-text index of FIG. 7 as an example.
FIG. 9 is a block diagram illustrating functions of a full-text search device according to another embodiment of the present invention.
FIG. 10 is a flowchart for explaining a processing example in the full-text search device of FIG. 9.
FIG. 11 is a flowchart for explaining a processing example in the full-text search device of FIG. 9.
FIG. 12 is a flowchart for explaining a processing example in the full-text search device of FIG. 9.
FIG. 13 is a block diagram illustrating functions of a full-text search device according to another embodiment of the present invention.
FIG. 14 is a flowchart for explaining a processing example in the full-text search apparatus of FIG.
FIG. 15 is a flowchart for explaining a processing example in the full-text search apparatus of FIG.
FIG. 16 is a flowchart for explaining a processing example in the full-text search device of FIG.
FIG. 17 is a block diagram illustrating functions of a full-text search device according to another embodiment of the present invention.
FIG. 18 is a flowchart for explaining a processing example in the full-text search device of FIG.
FIG. 19 is a flowchart for explaining a processing example in the full-text search device of FIG. 17.
FIG. 20 is a flowchart for explaining a processing example in the full-text search device of FIG.
[Explanation of symbols]
1 ... Input means, 2 ... Output means, 3 ... Registration processing means, 4 ... Deletion processing means, 5 ... Search processing means, 6 ... Text division means, 7 ... Document data storage unit, 8 ... Large-scale full-text index storage unit for search, 9a ... Small-scale full-text index storage unit A for registration, 9b ... Small-scale full-text index storage unit B for registration, 9c ... Small-scale full-text index storage unit C for registration, 10 ... Small-scale full-text index storage unit for deletion, 10a ... Small-scale full-text index storage unit A for deletion, 10b ... Small-scale full-text index storage unit B for deletion, 10c ... Small-scale full-text index storage unit C for deletion, 11 ... Merging means, 12 ... Storage unit management means, 21, 31 ... Input device, 22, 32 ... Display device, 23, 33 ... Input/output control device, 24, 34, 52 ... Main control device (CPU / memory), 25, 53 ... Storage device, 30 ... Client, 35, 51 ... Network control device, 40 ... Network, 50 ... Server.

Claims

A full-text search device that searches a document data storage unit, which stores input document data, for document data containing an input search condition by using a first full-text index storage unit for registration, a second full-text index storage unit for deletion, and a third full-text index storage unit for search, the full-text search device comprising:
registration processing means that, when performing registration processing, acquires tokens and information on the appearance positions of the tokens from the document data, and stores the tokens and the information on their appearance positions in the first full-text index storage unit for registration;
deletion processing means that, when performing deletion processing, determines whether a token is stored in the first full-text index storage unit for registration, deletes the information on the appearance position of the token from the first full-text index storage unit if that information is stored in the first full-text index storage unit, and registers the token and the information on its appearance position in the second full-text index storage unit for deletion if that information is not stored in the first full-text index storage unit;
search processing means that, when performing search processing, obtains a first set of document identifiers of document data containing the input search condition by using the index of the third full-text index storage unit, obtains a second set of document identifiers of document data containing the input search condition by using the index of the first full-text index storage unit, obtains a third set of document identifiers of document data containing the input search condition by using the index of the second full-text index storage unit, and then outputs, as a search result, the set obtained by subtracting the third set from the union of the first set and the second set; and
merging means that executes a process of merging the inverted list of the first full-text index storage unit and the inverted list of the second full-text index storage unit, by adding the inverted list of a token taken from the first full-text index storage unit to the inverted list of the token stored in the index of the third full-text index storage unit, and by deleting, from the third full-text index storage unit, the stored information on the appearance positions contained in the inverted list of a token taken from the second full-text index storage unit.

The full-text search device according to claim 1, wherein the merging means executes the merging process when the number of document data items stored in the first full-text index storage unit or the second full-text index storage unit reaches a number specified in advance.

The full-text search device according to claim 1, wherein the merging means executes the merging process when the capacity of the first full-text index storage unit or the second full-text index storage unit reaches a capacity specified in advance.

The full-text search device according to claim 2, comprising a plurality of the first full-text index storage units or a plurality of the second full-text index storage units, wherein registration processing or deletion processing is performed using a first full-text index storage unit or second full-text index storage unit different from the first full-text index storage unit or second full-text index storage unit whose data is being merged into the third full-text index storage unit.

The full-text search device according to claim 2, comprising two of the first full-text index storage units or two of the second full-text index storage units, wherein, while data is being merged from one first full-text index storage unit or second full-text index storage unit into the third full-text index storage unit, registration processing or deletion processing is performed using the other first full-text index storage unit or second full-text index storage unit.

A full-text search method for searching a document data storage unit, which stores input document data, for document data containing an input search condition by using a first full-text index storage unit for registration, a second full-text index storage unit for deletion, and a third full-text index storage unit for search, the full-text search method comprising:
a registration processing step in which registration processing means, when performing registration processing, acquires tokens and information on the appearance positions of the tokens from the document data, and stores the tokens and the information on their appearance positions in the first full-text index storage unit for registration;
a deletion processing step in which deletion processing means, when performing deletion processing, determines whether a token is stored in the first full-text index storage unit for registration, deletes the information on the appearance position of the token from the first full-text index storage unit if that information is stored in the first full-text index storage unit, and registers the token and the information on its appearance position in the second full-text index storage unit for deletion if that information is not stored in the first full-text index storage unit;
a search processing step in which search processing means, when performing search processing, obtains a first set of document identifiers of document data containing the input search condition by using the index of the third full-text index storage unit, obtains a second set of document identifiers of document data containing the input search condition by using the index of the first full-text index storage unit, obtains a third set of document identifiers of document data containing the input search condition by using the index of the second full-text index storage unit, and then obtains the set produced by subtracting the third set from the union of the first set and the second set; and
a merging step in which merging means executes a process of merging the inverted list of the first full-text index storage unit and the inverted list of the second full-text index storage unit, by adding the inverted list of a token taken from the first full-text index storage unit to the inverted list of the token stored in the index of the third full-text index storage unit, and by deleting, from the third full-text index storage unit, the stored information on the appearance positions contained in the inverted list of a token taken from the second full-text index storage unit.

The full-text search method according to claim 6, wherein the merging step executes the merging process when the number of document data items stored in the first full-text index storage unit or the second full-text index storage unit reaches a number specified in advance.

The full-text search method according to claim 6, wherein the merging step executes the merging process when the capacity of the first full-text index storage unit or the second full-text index storage unit reaches a capacity specified in advance.

A program for causing a computer to execute the full-text search method according to any one of claims 6 to 8.

A computer-readable recording medium on which the program according to claim 9 is recorded.