JP3854684B2 - Information processing apparatus and method

Info

Publication number: JP3854684B2
Application number: JP10473897A
Authority: JP
Inventors: 史朗伊藤; 紀子大谷; 昇吾柴田; 隆也上田; 裕治池田
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 1997-04-22
Filing date: 1997-04-22
Publication date: 2006-12-06
Anticipated expiration: 2017-04-22
Also published as: JPH10301939A

Description

【０００１】
【発明の属する技術分野】
本発明は、テキストデータを検索あるいは管理する情報処理装置及びその方法に関するものである。
【０００２】
【従来の技術】
文書データ中の全てのテキストデータを対象として与えられた検索キーを含む文書データを検索する全文検索装置等の情報処理装置では、大量のテキストデータを高速に検索するために、検索対象文書のインデックスを予め作成して、インデックスを用いて検索を行なうインデックス技術が利用されている。インデックス技術の一例として、特開平４−２０５５６０公報では、文字位置インデックス技術について述べられている。
【０００３】
文字位置インデックス技術の基本的な考え方は、被検索テキストデータ中に出現する文字および文字列の位置を文字ごとに１ずつ増加する整数で表わすことにある。その上で、各文字および文字列ごとに、当該文字および文字列をキーとして、当該文字および文字列が現れる全ての位置を列挙する。このインデックスにおいて、ある検索文字列を被検索テキストデータから検索する場合には、当該検索文字列をインデックスのキーとなっている文字および文字列に分解する。そして、分解した文字および文字列の位置関係が、当該検索文字列における位置関係に一致する組み合わせを探すことで検索を行なう。
【０００４】
ここで、従来の情報処理装置の機能構成について、図１０を用いて説明する。
図１０は従来の情報処理装置の機能構成を示すブロック図である。
図１０において、５０１は被検索テキストデータを保持する被検索テキスト保持部である。５０２は被検索テキスト保持部５０１に保持されている被検索テキストデータに対して、被検索テキストデータ中の文字及び文字列ごとに、被検索テキストデータ中での当該文字の位置を保持したインデックスを作成するインデックス作成部である。５０３はインデックス作成部５０２で作成したインデックスを保持するインデックス保持部である。５０４は検索を行う文字列を保持する検索文字列保持部である。５０５はインデックス保持部５０３に保持されているインデックスを用いて、検索文字列保持部５０４に保持されている検索文字列に一致する被検索テキストデータ中の文字列を検索する検索部である。５０６は検索部による検索結果を保持する検索結果保持部である。
【０００５】
次に、従来の情報処理装置で実行されるインデックスを作成するインデックス作成処理について、図１１を用いて説明する。
図１１は従来の情報処理装置で実行されるインデックス作成処理を示すフローチャートである。
まず、ステップＳ６０１では、カウンタｃの初期化を行う。カウンタｃは、処理の対象となっている文字の位置を示すもので、これを０に初期化する。ステップＳ６０２では、ポインタｐの初期化を行う。ポインタｐは、処理の対象となっている文字を指し示すもので、これを被検索テキストデータの先頭文字を指し示すように初期化する。
【０００６】
ステップＳ６０３では、ポインタｐが被検索テキストデータの最後の文字に達したか否かを判定する。最後に達している場合（ステップＳ６０３でＹＥＳ）、インデックス作成処理を終了する。一方、最後に達していない場合（ステップＳ６０３でＮＯ）、ステップＳ６０４に進む。
ステップＳ６０４では、ポインタｐが指し示す位置にある文字について、インデックスの当該文字の位置リストにカウンタｃの値を追加する。ステップＳ６０５では、カウンタｃの値を１増やす。ステップＳ６０６では、ポインタｐが次の文字を指し示すようにポインタｐを進め、ステップＳ６０３に戻る。
【０００７】
以上のインデックス作成処理により、例えば、図１３に示す文書に対して、図１４に示すようなインデックスが作成される。尚、図１３及び図１４では、幾つかの文字以外については表示を省略している。また、図１４の各行が、各文字が現れる位置のリストとなっている。例えば、文字「プ」は、位置３、１５、３６、…に出現していることがわかる。
【０００８】
次に従来の情報処理装置で実行される文字列を検索する検索処理について、図１２を用いて説明する。
図１２は従来の情報処理装置で実行される検索処理を示すフローチャートである。
まず、ステップＳ７０１では、検索文字列保持部５０４に保持されている検索文字列の長さをレジスタｌに代入する。また、カウンタｎに１を代入する。例えば、検索文字列が「Ｃプログラム」である場合は、ｌ＝６、ｎ＝１となる。ステップＳ７０２では、検索文字列保持部５０４に保持されている検索文字列の１番目の文字について、インデックスの読み込みを行う。当該文字の文字位置全てを配列１に読み込む。図１５は、図１４に示したインデックスを用いて検索文字列「Ｃプログラム」を検索しているときの配列１の状態を示している。
【０００９】
ステップＳ７０３では、レジスタｌの内容とカウンタｎの内容を比較する。カウンタｎの内容＜レジスタｌの内容である場合（ステップＳ７０３でＹＥＳ）、ステップＳ７０４に進む。一方、カウンタｎの内容≧レジスタｌの内容である場合（ステップＳ７０３でＮＯ）、ステップＳ７０７に進む。
ステップＳ７０４では、カウンタｎの値を１増やす。ステップＳ７０５では、検索文字列保持部５０４に保持されている検索文字列のカウンタｎの内容が示すｎ番目の文字について、インデックスの読み込みを行う。当該文字の全ての文字位置から（ｎ−１）を減じた値を配列２に読み込む。
【００１０】
ステップＳ７０６では、配列１と配列２から、配列１と配列２の両方に存在している値を全て取り出し、これらの値だけを新たに配列１の値とする。そして、ステップＳ７０３に戻る。図１６は、図１５に示した配列１と、その配列１に対する配列２において、ｎ＝３の時の配列１の状態を示している。
ステップＳ７０７では、配列１が空でない場合は、検索文字列が検索されたことを示す値として１を検索結果保持部５０６に保持する。配列１が空の場合は、検索文字列が検索されなかったことを示す値として０を検索結果保持部５０６に保持する。そして、検索処理を終了する。
【００１１】
以上の検索処理により、上述の例である検索文字列「Ｃプログラム」を検索すると、位置２と３５に当該文字列があるので、このテキストは検索される。日本語の場合、語の区切りを容易に求められないため、このように文字列として一致するテキストを検索する検索方法は有効である。
【００１２】
【発明が解決しようとする課題】
しかしながら、上記従来の情報処理装置では、英語のように単語の区切りが明確な言語に対して、単語として一致するテキストだけを検索することができないという問題があった。これは、日本語における空白文字は語の区切りを表すものではないので、空白文字を読み飛ばしてインデックスを作成するためである。そのため、上記の例では、「Ｃプログラム」という検索語に対して、「ＲＰＣプログラム」という単語でも検索されてしまう。
【００１３】
一方、単語ごとにインデックスを作成して検索する方法もあるが、これでは日本語文書などのように単語を容易に区切ることができない言語では、正しいインデックスが作成されるとは限らず、検索結果に誤りが生じる問題がある。
本発明は上記の問題に鑑みてなされたものであり、テキストデータの検索精度を向上することができる情報処理装置及びその方法を提供することを目的とする。
【００１４】
【課題を解決するための手段】
上記の目的を達成するための本発明による情報処理装置は以下の構成を備える。即ち、
テキストデータを検索する情報処理装置であって、
テキストデータを保持する保持手段と、
前記保持手段で保持されているテキストデータ中の所定文字列に該所定文字列を識別するための識別文字を付加する付加手段と、
前記付加手段による付加がなされたテキストデータ中の各文字の位置に関する位置情報を作成する作成手段と、
前記作成手段で作成した位置情報を保持する位置情報保持手段と、
検索条件を入力する入力手段と、
前記入力手段によって入力された検索条件に対し前記付加手段による付加を行い、該検索条件に該当する前記保持手段に保持されているテキストデータを、前記位置情報保持手段で保持される位置情報を参照して検索する検索手段と
を備える。
【００１５】
また、好ましくは、前記保持手段に保持されているテキストデータ中の所定文字列の両端に区切り文字を挿入して該テキストデータを変換する変換手段と、
前記変換手段により変換されたテキストデータを保持する変換テキストデータ保持手段とを更に備え、
前記位置情報作成手段は、前記変換テキストデータ保持手段に保持されているテキストデータ中の各文字の位置に関する位置情報を作成する。
【００１６】
また、好ましくは、前記付加手段は、前記所定文字列の前方、あるいは後方に前記識別文字を付加する。
また、好ましくは、前記入力手段で入力された検索条件中の所定文字列の両端に区切り文字を挿入して該検索条件を変換する検索条件変換手段と、
前記検索条件変換手段により変換された検索条件を保持する変換検索条件保持手段とを更に備え、
前記検索手段は、前記変換検索条件保持手段に保持されている検索条件に該当する前記保持手段に保持されているテキストデータを検索する。
【００１７】
また、好ましくは、前記所定文字列を検出する検出手段を更に備える。
また、好ましくは、前記所定文字列は、所定の言語の文字が連続する文字列である。
上記の目的を達成するための本発明による情報処理装置は以下の構成を備える。即ち、
テキストデータを管理する情報処理装置であって、
前記テキストデータ中の所定文字列に該所定文字列を識別するための識別文字を付加する付加手段と、
前記付加手段による付加がなされたテキストデータ中の各文字の位置に関する位置情報を作成する作成手段と、
前記作成手段で作成した位置情報と対応づけて、前記テキストデータを管理する管理手段と
を備える。
【００１８】
上記の目的を達成するための本発明による情報処理方法は以下の構成を備える。即ち、
テキストデータを検索する情報処理方法であって、
テキストデータを第１記憶媒体に保持する保持工程と、
前記保持工程で前記記憶媒体に保持されているテキストデータ中の所定文字列に該所定文字列を識別するための識別文字を付加する付加工程と、
前記付加工程による付加がなされたテキストデータ中の各文字の位置に関する位置情報を作成する作成工程と、
前記作成工程で作成した位置情報を第２記憶媒体に保持する位置情報保持工程と、
検索条件を入力する入力工程と、
前記入力工程によって入力された検索条件に対し前記付加工程による付加を行い、該検索条件に該当する前記保持工程で前記第１記憶媒体に保持されているテキストデータを、前記位置情報保持工程で前記第２記憶媒体に保持される位置情報を参照して検索する検索工程と
を備える。
【００１９】
上記の目的を達成するための本発明による情報処理方法は以下の構成を備える。即ち、
テキストデータを管理する情報処理方法であって、
前記テキストデータ中の所定文字列に該所定文字列を識別するための識別文字を付加する付加工程と、
前記付加工程による付加がなされたテキストデータ中の各文字の位置に関する位置情報を作成する作成工程と、
前記作成工程で作成した位置情報と対応づけて、前記テキストデータを記憶媒体に管理する管理工程と
を備える。
【００２０】
上記の目的を達成するための本発明によるコンピュータ可読メモリは以下の構成を備える。即ち、
テキストデータを検索する情報処理のプログラムコードが格納されたコンピュータ可読メモリであって、
テキストデータを第１記憶媒体に保持する保持工程のプログラムコードと、
前記保持工程で前記記憶媒体に保持されているテキストデータ中の所定文字列に該所定文字列を識別するための識別文字を付加する付加工程のプログラムコードと、
前記付加工程による付加がなされたテキストデータ中の各文字の位置に関する位置情報を作成する作成工程のプログラムコードと、
前記作成工程で作成した位置情報を第２記憶媒体に保持する位置情報保持工程と、
検索条件を入力する入力工程のプログラムコードと、
前記入力工程によって入力された検索条件に対し前記付加工程による付加を行い、該検索条件に該当する前記保持工程で前記第１記憶媒体に保持されているテキストデータを、前記位置情報保持工程で前記第２記憶媒体に保持される位置情報を参照して検索する検索工程のプログラムコードと
を備える。
【００２１】
上記の目的を達成するための本発明によるコンピュータ可読メモリは以下の構成を備える。即ち、
テキストデータを管理する情報処理のプログラムコードが格納されたコンピュータ可読メモリであって、
前記テキストデータ中の所定文字列に該所定文字列を識別するための識別文字を付加する付加工程のプログラムコードと、
前記付加工程による付加がなされたテキストデータ中の各文字の位置に関する位置情報を作成する作成工程のプログラムコードと、
前記作成工程で作成した位置情報と対応づけて、前記テキストデータを記憶媒体に管理する管理工程のプログラムコードと
を備える。
【００２２】
【発明の実施の形態】
以下、図面を参照して本発明の好適な実施形態を詳細に説明する。
図１は本発明の実施形態に係る情報処理装置の機能構成を示すブロック図である。
図１において、１０１は被検索テキストデータを保持する被検索テキスト保持部である。１０２は被検索テキスト保持部１０１に保持されている被検索テキストデータ中の英単語の両端に区切り文字を挿入して被検索テキストデータを変換する被検索テキスト変換部である。１０３は被検索テキスト変換部により変換されたテキストデータを保持する変換テキスト保持部である。１０４は変換テキスト保持部１０３に保持されている変換されたテキストデータに対して、変換されたテキストデータ中の文字ごとに、変換されたテキストデータ中での当該文字の位置を列挙したインデックスを保持するインデックス作成部である。
【００２３】
１０５はインデックス作成部１０４で作成したインデックスを保持するインデックス保持部である。１０６は検索のキーとなる文字列あるいは単語を保持する検索キー保持部である。１０７は検索キー保持部１０６に保持されている検索キー中の英単語の両端に区切り文字を挿入して検索キーを変換する検索キー変換部である。１０８は検索キー変換部１０７で変換された変換キーを保持する変換キー保持部である。１０９はインデックス保持部１０５に保持されているインデックスを用いて、変換キー保持部１０８に保持されている変換キーに一致する文字列を検索する検索部である。１１０は検索部１０９による検索結果を保持する検索結果保持部である。
【００２４】
次に本発明の実施形態の情報処理装置の構成について、図２を用いて説明する。
図２は本発明の実施形態の情報処理装置の構成を示すブロック図である。
図２において、２０１はＣＰＵであり、後述する手順を実現するプログラムに従って動作する。２０２はＲＡＭであり、被検索テキスト保持部１０１、変換テキスト保持部１０３、検索キー保持部１０６、変換キー保持部１０８、検索結果保持部１１０と上記プログラムの動作に必要な記憶領域とを提供する。２０３はＲＯＭであり、後述する手順を実現するプログラムを保持する。２０４はディスク装置であり、インデックス保持部１０５を実現する。２０５は情報処理装置の各種構成要素を相互に接続するバスである。
【００２５】
以下、説明していく本発明の実施形態で実行される処理は、インデックスを作成するインデックス作成処理と文字列を検索する検索処理の２つに大きく分かれる。まず、インデックス作成処理について、図３を用いて説明する。
図３は本発明の実施形態で実行されるインデックス作成処理を示すフローチャートである。
【００２６】
まず、ステップＳ３０１では、テキストデータの変換処理を行う。被検索テキスト保持部１０１に保持されているテキストデータのうち、英単語の両端に区切り文字“＠”を挿入する。英単語とは、アルファベット文字（“Ａ”から“Ｚ”および“ａ”から“ｚ”まで）だけが連続する部分である。区切り文字を挿入したテキストデータを、変換テキスト保持部１０３に保持する。例えば、図１３に示す文書に対して、区切り文字を挿入すると、図５のようになる。ステップＳ３０２では、カウンタｃの初期化を行う。カウンタｃは、処理の対象となっている文字の位置を示すもので、これを０に初期化する。
【００２７】
ステップＳ３０３では、ポインタｐの初期化を行う。ポインタｐは、処理の対象となっている文字を指し示すもので、これを被検索テキストデータの先頭文字を指し示すように初期化する。ステップＳ３０４では、ポインタｐが被検索テキストデータの最後に達したか否かを判定する。最後に達している場合（ステップＳ３０４でＹＥＳ）、インデックス作成処理を終了する。一方、最後に達していない場合（ステップＳ３０４でＮＯ）、ステップＳ３０５に進む。
【００２８】
ステップＳ３０５では、ポインタｐが指し示す位置にある文字について、インデックスの当該文字の位置リストにカウンタｃの値を追加する。ステップＳ３０６では、カウンタｃの値を１増やす。ステップＳ３０７では、ポインタｐが次の文字を指し示すようポインタｐを進め、ステップＳ３０４に戻る。
以上のインデックス処理により、例えば、図１３に示す文書に対して、図６に示すインデックスが作成される。まず、図６の各行が、各文字が現れる位置のリストとなっている。リストの最初の行が、区切り文字が現れる位置のリストを示している。
次に本発明の実施形態で実行される検索処理について、図４を用いて説明する。
【００２９】
図４は本発明の実施形態で実行される検索処理を示すフローチャートである。
まず、ステップＳ４０１では、検索キーの変換処理を行う。検索キー保持部１０６に保持されている検索キーのうち、英単語の両端に区切り文字“＠”を挿入する。ここで英単語は、上述した図３のフローチャートのステップＳ３０１と同様にして判断する。区切り文字を挿入した検索キーを、変換キー保持部１０８に保持する。例えば、検索キーが「Ｃプログラム」である場合は、変換キーは、「＠Ｃ＠プログラム」となる。
【００３０】
ステップＳ４０２では、変換キー保持部１０８に保持されている変換キーの長さをレジスタｌに代入する。また、カウンタｎに１を代入する。例えば、変換キーが、「＠Ｃ＠プログラム」である場合は、ｌ＝８、ｎ＝１となる。ステップＳ４０３では、変換キー保持部１０８に保持されている変換キーの１番目の文字について、インデックスの読み込みを行う。当該文字の文字位置全てを配列１に読み込む。
【００３１】
ステップＳ４０４では、レジスタｌの内容とカウンタｎの内容を比較する。カウンタｎの内容＜レジスタｌの内容である場合（ステップＳ４０４でＹＥＳ）、ステップＳ４０５に進む。一方、カウンタｎの内容≧レジスタｌの内容である場合（ステップＳ４０４でＮＯ）、ステップＳ４０８に進む。
ステップＳ４０５では、カウンタｎの値を１増やす。ステップＳ４０６では、変換キー保持部１０８に保持されている変換キーのカウンタｎの内容が示すｎ番目の文字について、インデックスの読み込みを行う。当該文字の全ての文字位置から（ｎ−１）を減じた値を配列２に読み込む。
【００３２】
ステップＳ４０７では、配列１と配列２から、配列１と配列２の両方に存在している値を全て取り出し、これらの値だけを新たに配列１の値とする。そして、ステップＳ４０４に戻る。図７は、上述した検索列「Ｃプログラム」において、ｎ＝５の時の配列１の状態を示している。
ステップＳ４０８では、配列１が空でない場合は、検索キーが検索されたことを示す値として１を検索結果保持部１１０に保持する。配列１が空の場合は、検索キーが検索されなかったことを示す値として０を検索結果保持部１１０に保持する。そして、検索処理を終了する。
【００３３】
以上の検索処理により、上述の例である検索キー「Ｃプログラム」に対して、「ＲＰＣプログラム」を含む、検索結果として不適切な文字列は検索されない。
以上説明したように、本実施形態によれば、英語のように単語の区切りが明確な言語に対する検索において、単語と一致する文字列だけを正確に検索することができる。
【００３４】
尚、本実施形態においては、インデックス作成処理と検索処理を同一の情報処理装置で実行する場合について説明したが、これに限定されるものではない。インデックス作成処理と検索処理を異なる情報処理装置で行ってもよい。この場合の各情報処理装置の機能構成について、図８と図９を用いて説明する。尚、図８に示す情報処理装置と、図９に示す情報処理装置は、ネットワーク回線等で接続され互いにデータの授受を可能とする構成になっている。あるいは、ＣＤ−ＲＯＭ等の可搬記憶媒体により図８の情報処理装置で作成したインデックスを図９の情報処理装置で利用する。
【００３５】
図８は本発明の他の実施形態に係る情報処理装置の機能構成を示すブロック図である。
図８において、１５０１は被検索テキストデータを保持する被検索テキスト保持部である。１５０２は被検索テキスト保持部１５０１に保持されている被検索テキストデータ中の英単語の両端に区切り文字を挿入して被検索テキストを変換する被検索テキスト変換部である。１５０３は被検索テキスト変換部により変換されたテキストデータを保持する変換テキスト保持部である。１５０４は変換テキスト保持部１５０３に保持されている変換されたテキストデータに対して、変換されたテキストデータ中の文字ごとに、変換されたテキストデータ中での当該文字の位置を列挙したインデックスを保持するインデックス作成部である。１５０５はインデックス作成部１５０４で作成したインデックスを保持するインデックス保持部である。
【００３６】
図９は本発明の他の実施形態に係る情報処理装置の機能構成を示すブロック図である。
図９において、１６０１は図８に示す情報処理装置で作成されたインデックスを保持するインデックス保持部である。１６０２は検索のキーとなる文字列あるいは単語を保持する検索キー保持部である。１６０３は検索キー保持部１６０２に保持されている検索キー中の英単語の両端に区切り文字を挿入して検索キーを変換する検索キー変換部である。１６０４は検索キー変換部１６０３で変換された変換キーを保持する変換キー保持部である。１６０５はインデックス保持部１６０１に保持されているインデックスを用いて、変換キー保持部１６０４に保持されている変換キーに一致する文字列を検索する検索部である。１６０６は検索部１６０５による検索結果を保持する検索結果保持部である。
【００３７】
また、本実施形態では、被検索テキストデータ及び検索キー共に、区切り文字を挿入する変換を行った上で、インデックス作成処理や検索処理を行う場合について説明したが、これに限定されるものではない。例えば、インデックス作成処理や検索処理の途中で、区切り文字の挿入を行ってもよい。
また、英単語の両端に区切り文字を実際に挿入する場合について説明したが、これに限定されるものではない。例えば、文字の挿入に代わって、当該位置を列挙するインデックスを設けて、区切り文字を仮想的に実現してもよい。
【００３８】
また、検索キーの英単語の両端に区切り文字を挿入して検索を行う場合について説明したが、これに限定されるものではない。例えば、検索キーの前方だけに区切り文字を挿入すれば、前方一致検索が実現される。また、検索キーの後方だけに区切り文字を挿入すれば、後方一致検索が実現される。区切り文字を挿入しなければ、部分一致検索が実現される。
【００３９】
また、英単語の両端に常に区切り文字を挿入する場合について説明したが、これに限定されるものではない。英単語が連続する場合には、区切り文字を一つにまとめてもよい。例えば、“information retrieval”に対しては、“@information@retrieval@”とする。
また、アルファベットの連続を区切り文字を挿入する対象とする特定部分文字列とした場合について説明したが、これに限定されるものではない。例えば、数字やカタカナの連続など任意の文字集合の属する文字の連続を特定文字列としてもよい。
【００４０】
また、文字集合に属する文字の連続を特定部分文字列とする場合について説明したが、これに限定されるものではない。
また、一つのテキストデータに対して検索を行う場合について説明したが、これに限定されるものではない。複数のテキストデータに対して検索を行なってもよい。例えば、全テキストに連続させて位置を割り振り、テキストデータが切り替わる位置の情報を保持することで、検索された位置と各テキストデータの切り替わり位置を比較することで、複数のテキストデータの検索が可能になる。
【００４１】
また、被検索テキスト保持部１０１、変換テキスト保持部１０３、検索キー保持部１０６、変換キー保持部１０８、検索結果保持部１１０をＲＡＭで、インデックス保持部１０５をディスク装置で実現する場合について説明したが、これに限定されるものではなく、任意の記憶媒体を用いて実現してもよい。
また、情報処理装置の機能構成の各構成要素を同一の情報処理装置上で構成する場合について説明したが、これに限定されるものではなく、ネットワーク上に分散した情報処理装置に分かれて構成してもよい。
【００４２】
また、プログラムをＲＯＭに保持する場合について説明したが、これに限定されるものではなく、任意の記憶媒体を用いて実現してもよい。また、同様の動作をする回路で実現してもよい。
尚、本発明は、複数の機器（例えば、ホストコンピュータ、インタフェース機器、リーダ、プリンタ等）から構成されるシステムに適用しても、一つの機器からなる装置（例えば、複写機、ファクシミリ装置等）に適用してもよい。
【００４３】
また、本発明の目的は、前述した実施形態の機能を実現するソフトウェアのプログラムコードを記録した記憶媒体を、システムあるいは装置に供給し、そのシステムあるいは装置のコンピュータ（またはＣＰＵやＭＰＵ）が記憶媒体に格納されたプログラムコードを読出し実行することによっても、達成されることは言うまでもない。
【００４４】
この場合、記憶媒体から読出されたプログラムコード自体が上述した実施の形態の機能を実現することになり、そのプログラムコードを記憶した記憶媒体は本発明を構成することになる。
プログラムコードを供給するための記憶媒体としては、例えば、フロッピディスク、ハードディスク、光ディスク、光磁気ディスク、ＣＤ−ＲＯＭ、ＣＤ−Ｒ、磁気テープ、不揮発性のメモリカード、ＲＯＭなどを用いることができる。
【００４５】
また、コンピュータが読出したプログラムコードを実行することにより、前述した実施形態の機能が実現されるだけでなく、そのプログラムコードの指示に基づき、コンピュータ上で稼働しているＯＳ（オペレーティングシステム）などが実際の処理の一部または全部を行い、その処理によって前述した実施の形態の機能が実現される場合も含まれることは言うまでもない。
【００４６】
更に、記憶媒体から読出されたプログラムコードが、コンピュータに挿入された機能拡張ボードやコンピュータに接続された機能拡張ユニットに備わるメモリに書き込まれた後、そのプログラムコードの指示に基づき、その機能拡張ボードや機能拡張ユニットに備わるＣＰＵなどが実際の処理の一部または全部を行い、その処理によって前述した実施形態の機能が実現される場合も含まれることは言うまでもない。
【００４７】
本発明を上記記憶媒体に適用する場合、その記憶媒体には、先に説明したフローチャートに対応するプログラムコードを格納することになるが、簡単に説明すると、図１７、図１８のメモリマップ例に示す各モジュールを記憶媒体に格納することになる。
すなわち、図１７に示すように、少なくとも「保持モジュール」、「付加モジュール」、「作成モジュール」、「入力モジュール」および「検索モジュール」の各モジュールのプログラムコードを記憶媒体に格納すればよい。
【００４８】
尚、「保持モジュール」は、テキストデータを第１記憶媒体に保持する。「付加モジュール」は、保持されているテキストデータ中の所定文字列に該所定文字列を識別するための識別文字を付加する。「作成モジュール」は、付加がなされたテキストデータ中の各文字の位置に関する位置情報を作成する。「位置情報保持モジュール」は、作成した位置情報を第２記憶媒体に保持する。「入力モジュール」は、検索条件を入力する。「検索モジュール」は、入力された検索条件に対し付加を行い、該検索条件に該当する第１記憶媒体に保持されているテキストデータを、第２記憶媒体に保持される位置情報を参照して検索する。
【００４９】
すなわち、図１８に示すように、少なくとも「付加モジュール」、「作成モジュール」、および「管理モジュール」の各モジュールのプログラムコードを記憶媒体に格納すればよい。
尚、「付加モジュール」は、テキストデータ中の所定文字列に該所定文字列を識別するための識別文字を付加する。「作成モジュール」は、付加がなされたテキストデータ中の各文字の位置に関する位置情報を作成する。「管理モジュール」は、作成した位置情報と対応づけて、テキストデータを記憶媒体に管理する。
【００５０】
【発明の効果】
以上説明したように、本発明によれば、テキストデータの検索精度を向上することができる情報処理装置及びその方法を提供できる。
【図面の簡単な説明】
【図１】本発明の実施形態に係る情報処理装置の機能構成を示すブロック図である。
【図２】本発明の実施形態の情報処理装置の構成を示すブロック図である。
【図３】本発明の実施形態で実行されるインデックス作成処理を示すフローチャートである。
【図４】本発明の実施形態で実行される検索処理を示すフローチャートである。
【図５】本発明の実施形態の変換されたテキストデータの一例を示す図である。
【図６】本発明の実施形態のインデックスの一例を示す図である。
【図７】本発明の実施形態の配列１の状態の一例を示す図である。
【図８】本発明の他の実施形態の情報処理装置の機能構成を示すブロック図である。
【図９】本発明の他の実施形態の情報処理装置の機能構成を示すブロック図である。
【図１０】従来の情報処理装置の機能構成を示すブロック図である。
【図１１】従来の情報処理装置で実行されるインデックス作成処理を示すフローチャートである。
【図１２】従来の情報処理装置で実行される検索処理を示すフローチャートである。
【図１３】被検索テキストデータの一例を示す図である。
【図１４】従来のインデックスの一例を示す図である。
【図１５】従来の配列１の状態の一例を示す図である。
【図１６】従来の配列１の状態の一例を示す図である。
【図１７】本発明の実施形態を実現するプログラムコードを格納した記憶媒体のメモリマップの構造を示す図である。
【図１８】本発明の実施形態を実現するプログラムコードを格納した記憶媒体のメモリマップの構造を示す図である。
【符号の説明】
１０１被検索テキスト保持部
１０２被検索テキスト変換部
１０３変換テキスト保持部
１０４インデックス作成部
１０５インデックス保持部
１０６検索キー保持部
１０７検索キー変換部
１０８変換キー保持部
１０９検索部
１１０検索結果保持部

[0001]
BACKGROUND OF THE INVENTION
The present invention relates to an information processing apparatus and method for retrieving or managing text data.
[0002]
[Prior art]
In an information processing apparatus such as a full-text search apparatus, which searches all the text data in document data for documents containing a given search key, an index technique is used to search a large amount of text data at high speed: an index of the documents to be searched is created in advance, and the search is performed using that index. As an example of the index technique, Japanese Patent Laid-Open No. 4-205560 describes a character position index technique.
[0003]
The basic idea of the character position index technique is to represent the positions of the characters and character strings that appear in the searched text data by integers that increase by one for each character. Then, for each character and character string, all positions at which it appears are listed, with the character or character string itself as the key. To search the searched text data for a search character string using this index, the search character string is decomposed into characters and character strings that are keys of the index. The search is then performed by looking for combinations in which the positional relationship between the decomposed characters and character strings matches their positional relationship in the search character string.
[0004]
Here, the functional configuration of a conventional information processing apparatus will be described with reference to FIG. 10.
FIG. 10 is a block diagram showing a functional configuration of a conventional information processing apparatus.
In FIG. 10, reference numeral 501 denotes a searched text holding unit that holds the text data to be searched. Reference numeral 502 denotes an index creation unit that creates, for the searched text data held in the searched text holding unit 501, an index holding, for each character and character string in the searched text data, the positions of that character in the searched text data. Reference numeral 503 denotes an index holding unit that holds the index created by the index creation unit 502. Reference numeral 504 denotes a search character string holding unit that holds the character string to be searched for. Reference numeral 505 denotes a search unit that uses the index held in the index holding unit 503 to search the searched text data for a character string matching the search character string held in the search character string holding unit 504. Reference numeral 506 denotes a search result holding unit that holds the search result obtained by the search unit 505.
[0005]
Next, the index creation process executed by a conventional information processing apparatus will be described with reference to FIG. 11.
FIG. 11 is a flowchart showing an index creation process executed by a conventional information processing apparatus.
First, in step S601, the counter c is initialized. The counter c indicates the position of the character to be processed, and is initialized to 0. In step S602, the pointer p is initialized. The pointer p points to the character to be processed, and is initialized so as to point to the first character of the searched text data.
[0006]
In step S603, it is determined whether or not the pointer p has reached the last character of the searched text data. If it has reached the end (YES in step S603), the index creation process ends. On the other hand, if it has not reached the end (NO in step S603), the process proceeds to step S604.
In step S604, for the character at the position indicated by the pointer p, the value of the counter c is added to the position list of the character in the index. In step S605, the value of the counter c is incremented by one. In step S606, the pointer p is advanced so that the pointer p points to the next character, and the process returns to step S603.
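The loop of steps S601 to S606 can be sketched in Python as follows. This is only an illustrative sketch: the function name `build_position_index` and the dictionary-of-lists layout are assumptions made here, not something the publication specifies, and the skipping of whitespace characters mentioned elsewhere in the description is left out for brevity.

```python
from collections import defaultdict


def build_position_index(text):
    """Map each character to the ascending list of positions where it occurs.

    Hypothetical helper mirroring steps S601-S606; iterating over the string
    plays the role of the pointer p.
    """
    index = defaultdict(list)
    c = 0                        # S601: counter c starts at 0
    for ch in text:              # S602/S603/S606: walk p to the end of the text
        index[ch].append(c)      # S604: record position c for this character
        c += 1                   # S605: advance the position counter
    return dict(index)


index = build_position_index("RPCプログラム")
print(index["プ"])  # [3]
```

Each dictionary entry corresponds to one row of an index like the one in FIG. 14: a character followed by the list of positions where it appears.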
[0007]
By the above index creation processing, for example, an index as shown in FIG. 14 is created for the document shown in FIG. 13. In FIG. 13 and FIG. 14, all but a few characters are omitted from the display. Each row in FIG. 14 is the list of positions at which one character appears. For example, it can be seen that the character “プ” appears at positions 3, 15, 36, and so on.
[0008]
Next, the search processing for a character string executed by a conventional information processing apparatus will be described with reference to FIG. 12.
FIG. 12 is a flowchart showing search processing executed by a conventional information processing apparatus.
First, in step S701, the length of the search character string held in the search character string holding unit 504 is assigned to the register l, and 1 is assigned to the counter n. For example, if the search character string is “Cプログラム” (“C program”), l = 6 and n = 1. In step S702, the index is read for the first character of the search character string held in the search character string holding unit 504, and all positions of that character are read into array 1. FIG. 15 shows the state of array 1 while the search character string “Cプログラム” is being searched for using the index shown in FIG. 14.
[0009]
In step S703, the content of the register l is compared with the content of the counter n. If the content of the counter n < the content of the register l (YES in step S703), the process proceeds to step S704. If the content of the counter n ≧ the content of the register l (NO in step S703), the process proceeds to step S707.
In step S704, the value of the counter n is incremented by one. In step S705, the index is read for the n-th character of the search character string held in the search character string holding unit 504, where n is the content of the counter n. The values obtained by subtracting (n−1) from every position of that character are read into array 2.
[0010]
In step S706, all values present in both array 1 and array 2 are extracted, and only these values become the new contents of array 1. The process then returns to step S703. FIG. 16 shows the state of array 1 at n = 3, starting from the array 1 shown in FIG. 15.
In step S707, if the array 1 is not empty, 1 is held in the search result holding unit 506 as a value indicating that the search character string has been searched. When the array 1 is empty, 0 is held in the search result holding unit 506 as a value indicating that the search character string has not been searched. Then, the search process ends.
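Steps S701 to S707 amount to repeatedly intersecting shifted position lists. The following Python sketch illustrates this under stated assumptions: the names `build_position_index` and `search` are hypothetical, Python sets stand in for array 1 and array 2, and the function returns the surviving start positions rather than the 1/0 flag of step S707 (that flag would be `1 if result else 0`).

```python
from collections import defaultdict


def build_position_index(text):
    """Character -> list of positions (assumed index layout)."""
    index = defaultdict(list)
    for c, ch in enumerate(text):
        index[ch].append(c)
    return dict(index)


def search(index, key):
    """Return the start positions at which `key` occurs, per steps S701-S707."""
    if not key:
        return []
    l = len(key)                               # S701: register l
    n = 1                                      # S701: counter n
    array1 = set(index.get(key[0], []))        # S702: positions of 1st character
    while n < l:                               # S703
        n += 1                                 # S704
        # S705: positions of the n-th character, shifted back by (n - 1)
        array2 = {pos - (n - 1) for pos in index.get(key[n - 1], [])}
        array1 &= array2                       # S706: keep the common values
    return sorted(array1)                      # S707: non-empty means "found"


index = build_position_index("RPCプログラムとCプログラム")
print(search(index, "Cプログラム"))  # [2, 9]
```

In this made-up sample text, position 2 lies inside “RPCプログラム”, so the search reports a match there as well as at the standalone occurrence at position 9.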
[0011]
When the example search character string “Cプログラム” is searched for by the above search processing, the string is found at positions 2 and 35, so this text is retrieved. In Japanese, word boundaries cannot be determined easily, so a search method that retrieves text matching as a character string in this way is effective.
[0012]
[Problems to be solved by the invention]
However, the above conventional information processing apparatus cannot, for a language such as English in which word boundaries are clear, retrieve only the text that matches as a whole word. This is because whitespace characters in Japanese do not mark word boundaries, so the index is created with whitespace characters skipped. As a result, in the above example, the search word “Cプログラム” also matches the word “RPCプログラム” (“RPC program”).
[0013]
On the other hand, there is also a method that creates an index for each word and searches on it. In a language in which words cannot easily be separated, as in Japanese documents, however, a correct index is not always created, and errors can occur in the search results.
The present invention has been made in view of the above problems, and an object of the present invention is to provide an information processing apparatus and method that can improve text data search accuracy.
[0014]
[Means for Solving the Problems]
In order to achieve the above object, an information processing apparatus according to the present invention comprises the following arrangement. That is,
an information processing apparatus for retrieving text data comprises:
holding means for holding text data;
adding means for adding, to a predetermined character string in the text data held by the holding means, an identification character for identifying the predetermined character string;
creating means for creating position information regarding the position of each character in the text data to which the adding means has added the identification character;
position information holding means for holding the position information created by the creating means;
input means for inputting a search condition; and
search means for applying the addition by the adding means to the search condition input by the input means, and for searching, by referring to the position information held by the position information holding means, for text data held in the holding means that matches the search condition.
[0015]
Preferably, the apparatus further comprises conversion means for converting the text data by inserting delimiters at both ends of a predetermined character string in the text data held in the holding means, and
converted text data holding means for holding the text data converted by the conversion means,
wherein the position information creating means creates position information regarding the position of each character in the text data held in the converted text data holding means.
[0016]
Preferably, the adding means adds the identification character in front of or behind the predetermined character string.
Preferably, the apparatus further comprises search condition conversion means for converting the search condition by inserting delimiters at both ends of a predetermined character string in the search condition input by the input means, and
converted search condition holding means for holding the search condition converted by the search condition conversion means,
wherein the search means searches for text data held in the holding means that matches the search condition held in the converted search condition holding means.
[0017]
Preferably, the apparatus further comprises detecting means for detecting the predetermined character string.
Preferably, the predetermined character string is a character string in which characters of a predetermined language are continuous.
In order to achieve the above object, an information processing apparatus according to the present invention comprises the following arrangement. That is,
an information processing apparatus for managing text data comprises:
adding means for adding, to a predetermined character string in the text data, an identification character for identifying the predetermined character string;
creating means for creating position information regarding the position of each character in the text data to which the adding means has added the identification character; and
management means for managing the text data in association with the position information created by the creating means.
[0018]
In order to achieve the above object, an information processing method according to the present invention comprises the following arrangement. That is,
an information processing method for retrieving text data comprises:
a holding step of holding text data in a first storage medium;
an adding step of adding, to a predetermined character string in the text data held in the first storage medium in the holding step, an identification character for identifying the predetermined character string;
a creating step of creating position information regarding the position of each character in the text data to which the identification character has been added in the adding step;
a position information holding step of holding the position information created in the creating step in a second storage medium;
an input step of inputting a search condition; and
a search step of applying the addition of the adding step to the search condition input in the input step, and of searching, by referring to the position information held in the second storage medium in the position information holding step, for text data held in the first storage medium in the holding step that matches the search condition.
[0019]
In order to achieve the above object, an information processing method according to the present invention comprises the following arrangement. That is,
an information processing method for managing text data comprises:
an adding step of adding, to a predetermined character string in the text data, an identification character for identifying the predetermined character string;
a creating step of creating position information regarding the position of each character in the text data to which the identification character has been added in the adding step; and
a management step of managing the text data in a storage medium in association with the position information created in the creating step.
[0020]
In order to achieve the above object, a computer readable memory according to the present invention comprises the following arrangement. That is,
a computer readable memory storing program code for information processing that retrieves text data comprises:
program code of a holding step of holding text data in a first storage medium;
program code of an adding step of adding, to a predetermined character string in the text data held in the first storage medium in the holding step, an identification character for identifying the predetermined character string;
program code of a creating step of creating position information regarding the position of each character in the text data to which the identification character has been added in the adding step;
program code of a position information holding step of holding the position information created in the creating step in a second storage medium;
program code of an input step of inputting a search condition; and
program code of a search step of applying the addition of the adding step to the search condition input in the input step, and of searching, by referring to the position information held in the second storage medium in the position information holding step, for text data held in the first storage medium in the holding step that matches the search condition.
[0021]
In order to achieve the above object, a computer readable memory according to the present invention comprises the following arrangement. That is,
a computer readable memory storing program code for information processing that manages text data comprises:
program code of an adding step of adding, to a predetermined character string in the text data, an identification character for identifying the predetermined character string;
program code of a creating step of creating position information regarding the position of each character in the text data to which the identification character has been added in the adding step; and
program code of a management step of managing the text data in a storage medium in association with the position information created in the creating step.
[0022]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, preferred embodiments of the present invention will be described in detail with reference to the drawings.
FIG. 1 is a block diagram showing a functional configuration of an information processing apparatus according to an embodiment of the present invention.
In FIG. 1, reference numeral 101 denotes a searched text holding unit that holds the text data to be searched. Reference numeral 102 denotes a searched text conversion unit that converts the searched text data by inserting delimiters at both ends of each English word in the searched text data held in the searched text holding unit 101. Reference numeral 103 denotes a converted text holding unit that holds the text data converted by the searched text conversion unit 102. Reference numeral 104 denotes an index creation unit that creates, for the converted text data held in the converted text holding unit 103, an index listing, for each character in the converted text data, the positions of that character in the converted text data.
[0023]
Reference numeral 105 denotes an index holding unit that holds the index created by the index creation unit 104. Reference numeral 106 denotes a search key holding unit that holds the character string or word serving as the search key. Reference numeral 107 denotes a search key conversion unit that converts the search key by inserting delimiters at both ends of each English word in the search key held in the search key holding unit 106. Reference numeral 108 denotes a conversion key holding unit that holds the conversion key produced by the search key conversion unit 107. Reference numeral 109 denotes a search unit that uses the index held in the index holding unit 105 to search for a character string matching the conversion key held in the conversion key holding unit 108. Reference numeral 110 denotes a search result holding unit that holds the search result obtained by the search unit 109.
[0024]
Next, the configuration of the information processing apparatus according to the embodiment of the present invention will be described with reference to FIG. 2.
FIG. 2 is a block diagram showing the configuration of the information processing apparatus according to the embodiment of the present invention.
In FIG. 2, reference numeral 201 denotes a CPU, which operates according to a program that implements the procedures described later. Reference numeral 202 denotes a RAM, which provides the searched text holding unit 101, the converted text holding unit 103, the search key holding unit 106, the conversion key holding unit 108, the search result holding unit 110, and the storage area required for the operation of the program. Reference numeral 203 denotes a ROM, which holds the program that implements the procedures described later. Reference numeral 204 denotes a disk device, which implements the index holding unit 105. Reference numeral 205 denotes a bus that interconnects the components of the information processing apparatus.
[0025]
The processing executed in the embodiment of the present invention described below is broadly divided into two parts: index creation processing, which creates an index, and search processing, which searches for a character string. First, the index creation processing will be described with reference to FIG. 3.
FIG. 3 is a flowchart showing index creation processing executed in the embodiment of the present invention.
[0026]
First, in step S301, text data conversion processing is performed. In the text data held in the searched text holding unit 101, the delimiter “@” is inserted at both ends of each English word. An English word is a portion consisting only of consecutive alphabetic characters (“A” to “Z” and “a” to “z”). The text data with the delimiters inserted is held in the converted text holding unit 103. For example, inserting delimiters into the document shown in FIG. 13 yields the text shown in FIG. 5. In step S302, the counter c is initialized. The counter c indicates the position of the character being processed, and is initialized to 0.
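The conversion of step S301 can be illustrated with a regular expression. In this minimal sketch the function name `insert_delimiters` is a hypothetical choice, and the pattern covers only the ASCII letters “A” to “Z” and “a” to “z”; full-width letters, if they occur in the text, would have to be added to the character class.

```python
import re

# A run of ASCII letters is treated as one "English word" (assumption: ASCII only).
ENGLISH_WORD = re.compile(r"[A-Za-z]+")


def insert_delimiters(text, delimiter="@"):
    """Insert `delimiter` at both ends of every run of alphabetic characters."""
    return ENGLISH_WORD.sub(lambda m: delimiter + m.group(0) + delimiter, text)


print(insert_delimiters("RPCプログラム"))  # @RPC@プログラム
print(insert_delimiters("Cプログラム"))    # @C@プログラム
```

Because the same function is applied to the search key in step S401, the delimiter before “C” in the converted key can only line up with a position in the converted text where an English word actually begins.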
[0027]
In step S303, the pointer p is initialized. The pointer p points to the character to be processed, and is initialized so as to point to the first character of the searched text data. In step S304, it is determined whether or not the pointer p has reached the end of the searched text data. If it has reached the end (YES in step S304), the index creation process is terminated. On the other hand, if it has not reached the end (NO in step S304), the process proceeds to step S305.
[0028]
In step S305, for the character at the position indicated by the pointer p, the value of the counter c is added to the position list of the character in the index. In step S306, the value of the counter c is incremented by one. In step S307, the pointer p is advanced so that the pointer p points to the next character, and the process returns to step S304.
Through the above index creation processing, for example, the index shown in FIG. 6 is created for the document shown in FIG. 13. Each line in FIG. 6 is a list of the positions at which a given character appears. The first line shows the list of positions at which the delimiter appears.
Next, the search processing executed in the embodiment of the present invention will be described with reference to FIG. 4.
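Steps S302 to S307 can be sketched as a loop over the converted text. This is a hedged illustration: the function name build_index and the dictionary representation of the index are assumptions, and the input is taken to be text into which the delimiters have already been inserted.

```python
from collections import defaultdict
from typing import Dict, List


def build_index(converted_text: str) -> Dict[str, List[int]]:
    """Build a character position index (hypothetical helper)."""
    index = defaultdict(list)
    c = 0  # S302: counter c, position of the character being processed
    p = 0  # S303: pointer p, here an offset into the converted text
    while p < len(converted_text):           # S304: stop at the end of the text
        index[converted_text[p]].append(c)   # S305: record position c for this character
        c += 1                               # S306: increment the counter
        p += 1                               # S307: advance the pointer
    return dict(index)


index = build_index("@C@プログラム")
print(index["@"])   # [0, 2]
print(index["プ"])  # [3]
```

Each dictionary entry corresponds to one line of the index: a character followed by the list of positions at which it appears.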
[0029]
FIG. 4 is a flowchart showing search processing executed in the embodiment of the present invention.
First, in step S401, search key conversion processing is performed. The delimiter “@” is inserted at both ends of each English word in the search key held in the search key holding unit 106. Here, English words are identified in the same manner as in step S301 of the flowchart of FIG. 3. The search key into which the delimiters have been inserted (the conversion key) is held in the conversion key holding unit 108. For example, when the search key is “Cプログラム” (“C program”), the conversion key is “@C@プログラム”.
[0030]
In step S402, the length of the conversion key held in the conversion key holding unit 108 is assigned to the register l, and 1 is assigned to the counter n. For example, when the conversion key is “@C@プログラム” (the converted form of the search key “Cプログラム”, “C program”), l = 8 and n = 1. In step S403, the index is read for the first character of the conversion key held in the conversion key holding unit 108, and all positions of that character are read into array 1.
[0031]
In step S404, the content of the register l is compared with the content of the counter n. If the content of the counter n is greater than or equal to the content of the register l, that is, if every character of the conversion key has been processed, the process proceeds to step S408. Otherwise (the content of the counter n is less than the content of the register l), the process proceeds to step S405.
In step S405, the value of the counter n is incremented by one. In step S406, the index is read for the n-th character of the conversion key held in the conversion key holding unit 108, where n is the content of the counter n. The values obtained by subtracting (n−1) from each position of that character are read into array 2.
[0032]
In step S407, the values present in both array 1 and array 2 are extracted, and only these values are set as the new contents of array 1. Then, the process returns to step S404. FIG. 7 shows the state of array 1 when n = 5 for the above-described search key “Cプログラム” (“C program”).
In step S408, if array 1 is not empty, 1 is held in the search result holding unit 110 as a value indicating that the search key was found. If array 1 is empty, 0 is held in the search result holding unit 110 as a value indicating that the search key was not found. Then, the search processing ends.
[0033]
Through the above search processing, the character string inappropriate as the search result including the “RPC program” is not searched for the search key “C program” in the above example.
As described above, according to the present embodiment, it is possible to accurately search only a character string that matches a word in a search for a language with a clear word break such as English.
[0034]
In the present embodiment, the case where the index creation processing and the search processing are executed by the same information processing apparatus has been described. However, the present invention is not limited to this. The index creation processing and the search processing may be performed by different information processing apparatuses. The functional configuration of each information processing apparatus in this case will be described with reference to FIGS. 8 and 9. Note that the information processing apparatus shown in FIG. 8 and the information processing apparatus shown in FIG. 9 are connected by a network line or the like and are configured to be able to exchange data with each other. Alternatively, the index created by the information processing apparatus of FIG. 8 may be passed to the information processing apparatus of FIG. 9 on a portable storage medium such as a CD-ROM.
[0035]
FIG. 8 is a block diagram showing a functional configuration of an information processing apparatus according to another embodiment of the present invention.
In FIG. 8, reference numeral 1501 denotes a searched text holding unit that holds searched text data. Reference numeral 1502 denotes a searched text conversion unit that converts the searched text by inserting delimiters at both ends of English words in the searched text data held in the searched text holding unit 1501. Reference numeral 1503 denotes a converted text holding unit that holds the text data converted by the searched text conversion unit 1502. Reference numeral 1504 denotes an index creation unit that creates, for the converted text data held in the converted text holding unit 1503, an index listing the positions in the converted text data of each character in the converted text data. Reference numeral 1505 denotes an index holding unit that holds the index created by the index creation unit 1504.
[0036]
FIG. 9 is a block diagram showing a functional configuration of an information processing apparatus according to another embodiment of the present invention.
In FIG. 9, reference numeral 1601 denotes an index holding unit that holds an index created by the information processing apparatus shown in FIG. 8. Reference numeral 1602 denotes a search key holding unit that holds a character string or a word serving as a search key. Reference numeral 1603 denotes a search key conversion unit that converts the search key by inserting delimiters at both ends of English words in the search key held in the search key holding unit 1602. Reference numeral 1604 denotes a conversion key holding unit that holds the conversion key converted by the search key conversion unit 1603. Reference numeral 1605 denotes a search unit that searches for a character string matching the conversion key held in the conversion key holding unit 1604, using the index held in the index holding unit 1601. Reference numeral 1606 denotes a search result holding unit that holds search results from the search unit 1605.
[0037]
Further, in the present embodiment, the case has been described where the index creation processing and the search processing are performed after the conversion that inserts delimiters into the searched text data and the search key, respectively. However, the present invention is not limited to this. For example, the delimiters may be inserted during the index creation processing or the search processing itself.
Moreover, although the case where the delimiter is actually inserted at both ends of each English word has been described, the present invention is not limited to this. For example, instead of inserting characters, an index enumerating the corresponding positions may be provided so that the delimiters are realized virtually.
[0038]
Moreover, although the case where a search is performed by inserting delimiters at both ends of the English words of the search key has been described, the present invention is not limited to this. For example, if a delimiter is inserted only in front of the search key, a forward-match (prefix) search is realized. If a delimiter is inserted only behind the search key, a backward-match (suffix) search is realized. If no delimiter is inserted, a partial-match search is realized.
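The effect of where the delimiter is placed on the search key can be sketched as follows. This is an illustration only: the mode names "exact", "prefix", "suffix", and "partial" and all function names are assumptions, and the search key is taken to be a single English word.

```python
import re
from collections import defaultdict

DELIMITER = "@"


def build_index(text):
    """Insert delimiters around English words, then index character positions."""
    converted = re.sub(r"[A-Za-z]+", lambda m: DELIMITER + m.group(0) + DELIMITER, text)
    index = defaultdict(list)
    for position, ch in enumerate(converted):
        index[ch].append(position)
    return index


def found(index, key):
    """Return True if key occurs in the indexed text (position intersection)."""
    if not key:
        return False
    candidates = set(index.get(key[0], []))
    for offset in range(1, len(key)):
        candidates &= {p - offset for p in index.get(key[offset], [])}
    return bool(candidates)


def search(index, word, mode="exact"):
    """Choose where the delimiter is attached to the key according to mode."""
    front = DELIMITER if mode in ("exact", "prefix") else ""
    back = DELIMITER if mode in ("exact", "suffix") else ""
    return found(index, front + word + back)


index = build_index("program programming reprogram")
print(search(index, "prog", "exact"))    # False
print(search(index, "prog", "prefix"))   # True
print(search(index, "gram", "suffix"))   # True
print(search(index, "gram", "prefix"))   # False
print(search(index, "gram", "partial"))  # True
```

The index itself is unchanged across the four modes; only the conversion applied to the search key differs.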
[0039]
Moreover, although the case where delimiters are always inserted at both ends of each English word has been described, the present invention is not limited to this. When English words are consecutive, the delimiters between them may be combined into one. For example, “information retrieval” becomes “@information@retrieval@”.
Moreover, although the case where the specific partial character string into which delimiters are inserted is a run of alphabetic characters has been described, the present invention is not limited to this. For example, a run of characters belonging to any character set, such as a run of digits or of katakana, may be used as the specific partial character string.
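Generalizing the delimiter insertion to other character sets can be sketched by making the pattern a parameter. The table CHARACTER_SETS, its keys, and the function signature are assumptions for illustration; the katakana range used here is the Unicode Katakana block (U+30A0 to U+30FF).

```python
import re

# Hypothetical table of character sets; each pattern matches one run of characters.
CHARACTER_SETS = {
    "alphabet": r"[A-Za-z]+",
    "digits": r"[0-9]+",
    "katakana": r"[\u30A0-\u30FF]+",
}


def insert_delimiters(text, charset="alphabet", delimiter="@"):
    """Place delimiter at both ends of every run of characters in charset."""
    pattern = re.compile(CHARACTER_SETS[charset])
    return pattern.sub(lambda m: delimiter + m.group(0) + delimiter, text)


print(insert_delimiters("第123回プログラム大会", "digits"))    # 第@123@回プログラム大会
print(insert_delimiters("第123回プログラム大会", "katakana"))  # 第123回@プログラム@大会
```

The same character set must be used for the searched text and for the search key, as in steps S301 and S401.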
[0040]
Moreover, although the case where the specific partial character string is a run of characters belonging to one character set has been described, the present invention is not limited to this.
Moreover, although the case where a single piece of text data is searched has been described, the present invention is not limited to this. A plurality of pieces of text data may be searched. For example, positions are assigned consecutively across all the text data, and the positions at which one piece of text data switches to the next are held; comparing a found position with these switching positions then makes it possible to search a plurality of pieces of text data.
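Searching a plurality of pieces of text data with consecutive positions can be sketched as follows. The function names, the list starts of switching positions, and the use of bisect are assumptions. The check that a match does not run past the end of its text is also an assumption added here, since the description does not state how matches spanning a switching position are treated. Delimiter insertion is omitted; the texts are taken to be already converted.

```python
from bisect import bisect_right
from collections import defaultdict


def build_multi_index(texts):
    """Index several texts with positions assigned consecutively across all of them."""
    index = defaultdict(list)
    starts = []   # starts[i]: position at which texts[i] begins (switching position)
    position = 0
    for text in texts:
        starts.append(position)
        for ch in text:
            index[ch].append(position)
            position += 1
    return index, starts, position  # position is now the total length


def search_documents(index, starts, total, key):
    """Return the sorted numbers of the texts that contain key."""
    if not key:
        return []
    candidates = set(index.get(key[0], []))
    for offset in range(1, len(key)):
        candidates &= {p - offset for p in index.get(key[offset], [])}
    hits = set()
    for start in candidates:
        doc = bisect_right(starts, start) - 1   # last text starting at or before the match
        doc_end = starts[doc + 1] if doc + 1 < len(starts) else total
        if start + len(key) <= doc_end:         # assumed check: match stays inside one text
            hits.add(doc)
    return sorted(hits)


texts = ["@C@プログラム", "@RPC@プログラム", "ab", "c"]
index, starts, total = build_multi_index(texts)
print(search_documents(index, starts, total, "@C@プログラム"))  # [0]
print(search_documents(index, starts, total, "プログラム"))      # [0, 1]
print(search_documents(index, starts, total, "abc"))            # []
```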
[0041]
In addition, a case has been described in which the searched text holding unit 101, the converted text holding unit 103, the search key holding unit 106, the conversion key holding unit 108, and the search result holding unit 110 are implemented by a RAM, and the index holding unit 105 is implemented by a disk device. However, the present invention is not limited to this, and any storage medium may be used.
In addition, although the case where the components of the functional configuration are configured on the same information processing apparatus has been described, the present invention is not limited to this, and the components may be divided among information processing apparatuses distributed on a network.
[0042]
Moreover, although the case where the program is held in the ROM has been described, the present invention is not limited to this, and the program may be held in any storage medium. The same operation may also be realized by a circuit.
Note that the present invention may be applied to a system composed of a plurality of devices (for example, a host computer, an interface device, a reader, and a printer) or to an apparatus consisting of a single device (for example, a copier or a facsimile machine).
[0043]
It goes without saying that the object of the present invention can also be achieved by supplying a storage medium storing the program code of software that implements the functions of the above-described embodiments to a system or apparatus, and having the computer (or CPU or MPU) of the system or apparatus read and execute the program code stored in the storage medium.
[0044]
In this case, the program code itself read from the storage medium realizes the functions of the above-described embodiment, and the storage medium storing the program code constitutes the present invention.
As a storage medium for supplying the program code, for example, a floppy disk, a hard disk, an optical disk, a magneto-optical disk, a CD-ROM, a CD-R, a magnetic tape, a nonvolatile memory card, a ROM, or the like can be used.
[0045]
Further, it goes without saying that the present invention covers not only the case where the functions of the above-described embodiments are realized by the computer executing the read program code, but also the case where an OS (operating system) running on the computer performs part or all of the actual processing based on the instructions of the program code, and the functions of the above-described embodiments are realized by that processing.
[0046]
Further, it goes without saying that the present invention also covers the case where, after the program code read from the storage medium is written into a memory provided on a function expansion board inserted into the computer or in a function expansion unit connected to the computer, a CPU or the like provided on the function expansion board or in the function expansion unit performs part or all of the actual processing based on the instructions of the program code, and the functions of the above-described embodiments are realized by that processing.
[0047]
When the present invention is applied to the above storage medium, the program codes corresponding to the flowcharts described above are stored in the storage medium. Briefly, each module shown in the memory map examples of FIGS. 17 and 18 is stored in the storage medium.
That is, as shown in FIG. 17, it is only necessary to store at least the program codes of each of the “holding module”, “addition module”, “creation module”, “input module”, and “search module” in the storage medium.
[0048]
The “holding module” holds the text data in the first storage medium. The “addition module” adds, to a predetermined character string in the held text data, an identification character for identifying the predetermined character string. The “creation module” creates position information regarding the position of each character in the text data to which the identification character has been added. The “position information holding module” holds the created position information in the second storage medium. The “input module” inputs a search condition. The “search module” adds the identification character to the input search condition and, referring to the position information held in the second storage medium, searches the text data held in the first storage medium for text data corresponding to the search condition.
[0049]
Alternatively, as shown in FIG. 18, it is only necessary to store at least the program codes of each of the “addition module”, “creation module”, and “management module” in the storage medium.
The “addition module” adds, to a predetermined character string in the text data, an identification character for identifying the predetermined character string. The “creation module” creates position information regarding the position of each character in the text data to which the identification character has been added. The “management module” manages the text data in a storage medium in association with the created position information.
[0050]
[Effect of the invention]
As described above, according to the present invention, it is possible to provide an information processing apparatus and method that can improve text data search accuracy.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a functional configuration of an information processing apparatus according to an embodiment of the present invention.
FIG. 2 is a block diagram illustrating a configuration of the information processing apparatus according to the embodiment of this invention.
FIG. 3 is a flowchart showing index creation processing executed in the embodiment of the present invention.
FIG. 4 is a flowchart showing search processing executed in the embodiment of the present invention.
FIG. 5 is a diagram showing an example of converted text data according to the embodiment of the present invention.
FIG. 6 is a diagram illustrating an example of an index according to the embodiment of this invention.
FIG. 7 is a diagram illustrating an example of a state of the array 1 according to the embodiment of this invention.
FIG. 8 is a block diagram showing a functional configuration of an information processing apparatus according to another embodiment of the present invention.
FIG. 9 is a block diagram showing a functional configuration of an information processing apparatus according to another embodiment of the present invention.
FIG. 10 is a block diagram illustrating a functional configuration of a conventional information processing apparatus.
FIG. 11 is a flowchart showing index creation processing executed by a conventional information processing apparatus.
FIG. 12 is a flowchart showing search processing executed by a conventional information processing apparatus.
FIG. 13 is a diagram showing an example of searched text data.
FIG. 14 is a diagram illustrating an example of a conventional index.
FIG. 15 is a diagram illustrating an example of a state of array 1 in the conventional art.
FIG. 16 is a diagram illustrating an example of a state of array 1 in the conventional art.
FIG. 17 is a diagram showing the structure of a memory map of a storage medium that stores program codes for realizing an embodiment of the present invention.
FIG. 18 is a diagram showing the structure of a memory map of a storage medium storing program codes for realizing an embodiment of the present invention.
[Explanation of symbols]
101 Searched text holding unit
102 Searched text conversion unit
103 Converted text holding unit
104 Index creation unit
105 Index holding unit
106 Search key holding unit
107 Search key conversion unit
108 Conversion key holding unit
109 Search unit
110 Search result holding unit

Claims

An information processing apparatus for searching text data consisting of a first language that is delimited word by word and a second language that is not delimited word by word, comprising:
adding means for adding, to both ends of a portion of the text data in which characters of the first language are continuous, an identification character for identifying the continuous portion as a word;
creating means for creating position information regarding the position of each character in the text data to which the identification character has been added by the adding means; and
search means for adding the identification character, by the adding means, to a search condition input by input means, and for searching the text data to which the identification character has been added, by referring to the position information, for text data corresponding to the search condition to which the identification character has been added.

An information processing apparatus for managing text data consisting of a first language that is delimited word by word and a second language that is not delimited word by word, comprising:
adding means for adding, to both ends of a portion of the text data in which characters of the first language are continuous, an identification character for identifying the continuous portion as a word;
creating means for creating position information regarding the position of each character in the text data to which the identification character has been added by the adding means; and
management means for managing the text data in association with the position information created by the creating means.

An information processing method for an information processing apparatus that searches text data consisting of a first language that is delimited word by word and a second language that is not delimited word by word, comprising:
an adding step in which a processing unit adds, to both ends of a portion in which characters of the first language are continuous in the text data held in a storage medium, an identification character for identifying the continuous portion as a word;
a creating step in which the processing unit creates position information regarding the position of each character in the text data to which the identification character was added in the adding step;
a position information holding step in which the processing unit holds the position information created in the creating step in the storage medium; and
a search step in which the processing unit adds, to both ends of a portion of an input search condition in which characters of the first language are continuous, an identification character for identifying the continuous portion as a word, and searches the text data held in the storage medium, by referring to the position information held in the storage medium, for text data corresponding to the search condition to which the identification character has been added.

An information processing method for an information processing apparatus that manages text data consisting of a first language that is delimited word by word and a second language that is not delimited word by word, comprising:
an addition step in which a processing unit adds, to both ends of a portion of the text data in which characters of the first language are continuous, an identification character for identifying the continuous portion as a word;
a creation step in which the processing unit creates position information regarding the position of each character in the text data to which the identification character was added in the addition step; and
a management step in which the processing unit manages the text data to which the identification character has been added in association with the position information created in the creation step.

A computer-readable memory storing an information processing program for searching text data consisting of a first language that is delimited word by word and a second language that is not delimited word by word, the program causing a computer to realize:
an adding step in which a processing unit adds, to both ends of a portion in which characters of the first language are continuous in the text data held in a storage medium, an identification character for identifying the continuous portion as a word;
a creating step in which the processing unit creates position information regarding the position of each character in the text data to which the identification character was added in the adding step;
a position information holding step in which the processing unit holds the position information created in the creating step in the storage medium; and
a search step in which the processing unit adds, to both ends of a portion of an input search condition in which characters of the first language are continuous, an identification character for identifying the continuous portion as a word, and searches the text data held in the storage medium, by referring to the position information held in the storage medium, for text data corresponding to the search condition to which the identification character has been added.

A computer-readable memory storing an information processing program for managing text data consisting of a first language that is delimited word by word and a second language that is not delimited word by word, the program causing a computer to realize:
an addition step in which a processing unit adds, to both ends of a portion of the text data in which characters of the first language are continuous, an identification character for identifying the continuous portion as a word;
a creation step in which the processing unit creates position information regarding the position of each character in the text data to which the identification character was added in the addition step; and
a management step in which the processing unit manages the text data to which the identification character has been added in association with the position information created in the creation step.