JP2004199282A

JP2004199282A - Document retrieval device and document registration device

Info

Publication number: JP2004199282A
Application number: JP2002365654A
Authority: JP
Inventors: Takaaki Nakamura; 隆顕中村; Yoshinori Yamagishi; 義徳山岸; Mitsunori Kori; 光則郡
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 2002-12-17
Filing date: 2002-12-17
Publication date: 2004-07-15

Abstract

PROBLEM TO BE SOLVED: To execute efficient document retrieval by eliminating useless data input/output in retrieval processing, both when different notation characters are discriminated and when they are not.

SOLUTION: This device is provided with a character string conversion part for converting each character in an inputted retrieval keyword into either a character code dedicated to one character or a character code assigned to a set of different notation characters, based on different notation conditions that designate whether to identify or discriminate the different notation characters; a position information reading part for reading the occurrence position information of the respective character codes in a document to be retrieved by referring to a retrieval database; a collating part for collating the occurrence position information of the respective character codes with the character code string; and a collation result outputting part for outputting the occurrence position information of the retrieval keyword based on the collation result.

COPYRIGHT: (C)2004, JPO&NCIPI

Description

【０００１】
【発明の属する技術分野】
この発明は、文書検索装置および文書登録装置に関するものである。
【０００２】
【従来の技術】
近年、大量の電子文書の蓄積が進み、文書データベースから効率良く所望の文書を検索する方法が求められている。
【０００３】
一般の文書検索装置においては、文書データベースの作成時に、登録する文書中の各文字の出現位置情報を文字コード等によって検索する索引を作成し、記憶装置等に格納する。検索時には、索引を参照することにより、検索キーワードの出現位置を高速に検索することが出来る。
【０００４】
一方、文書検索において、例えば、「ウイスキー」というキーワードを含む文書を検索したい場合、他に「ウィスキー」或いは「ウヰスキー」という語句を含む文書も検索対象とした方が、ユーザの目的にかなう場合もあれば、含めない方がよい場合もある。この「イ／ィ／ヰ」のように用例によっては同一とみなすことが出来る文字（以下、異表記文字という）の例としては、他にも「渡辺」と「渡邊」における「辺／邊」などがある。
【０００５】
従来の文書検索方法では、「ウイスキー」、「ウィスキー」、「ウヰスキー」のように、２つ以上の異表記の存在する文字列については、固定的に異表記を区別して検索するか、もしくは区別しないで検索してきた。固定的に区別する検索方法では、異表記の語句を含む文書は検索対象とならないため、入力された検索キーワードの異表記を含む文書も必要なユーザにとっては検索漏れということになる。逆に、区別しないで検索する方法では、異表記の語句を含む文書もすべて検索対象となるので、異表記を区別したいユーザにとっては必要のない文書までが余計に得られるという問題がある。
【０００６】
この問題を解決するために、検索時に検索キーワードの異表記を区別するか区別しないかをユーザが選択できるようにしているものもある。例えば特許文献１に開示された従来の文書検索装置においては、文書データベースの索引中では、文字の全角と半角、英文字の大文字と小文字などの異表記文字同士を共通文字として扱う。索引には、各文字の文書データベース中での出現位置を表す場所情報と共に、各文字について大文字・小文字のような異表記文字の区別をする文字種別情報を保持する。検索時には、検索キーワードは索引中で用いられている共通文字に変換される。異表記を区別しない検索を行う場合には、文字種別情報は参照せず、共通文字に変換された検索キーワードと同一の共通文字の出現位置情報を索引中から取得する。一方、異表記を区別する検索が指定された場合には、共通文字を検索するだけでなく、文字種別情報を参照し、文字種別までが一致する場合にのみ出現位置情報を取得する。
【０００７】
【特許文献１】
特開平０８−７７１８８号公報（図１、図２）
【０００８】
【発明が解決しようとする課題】
以上に述べたように、文書検索装置においては、異表記文字を区別して検索するか、区別せずに検索するかをユーザの目的によって選択できる方が検索の精度も上がり、使い勝手もよい。
【０００９】
しかし、上記の文書検索装置では、検索時にキーワード中の文字の共通文字化を行っているため、異表記を区別する検索を行う場合でも、不要な文字種別情報の索引まで読み出すことになり、無駄なデータの入出力が発生するという問題がある。
【００１０】
また、上記の文書検索装置では、検索キーワードの共通文字化により、異表記を区別して検索する場合に、文字種別情報を参照するという処理が必要となり、検索処理のステップが増加してＣＰＵ負荷が増大する。
【００１１】
さらに、上記の文書検索装置では、文書索引には、出現位置情報のほかに文字種別情報が含まれるため、文書登録時および検索時における索引参照時のデータ入出力量が増加する。
【００１２】
この発明は、以上のような問題を解決するため、用途に応じて異表記文字を区別する場合と区別しない場合のどちらの検索処理においても、無駄なデータの入出力を防止すると共に処理ステップを可能な限り少なくし、効率のよい文書検索を行う文書検索装置を得ることを目的とする。
【００１３】
また、この発明は、検索対象文書を効率のよい検索が行える形式で登録する文書登録装置を得ることを目的とする。
【００１４】
【課題を解決するための手段】
この発明に係る文書検索装置は、１の文字とこの文字の異表記文字とを同一文字として照合する処理とこれらの文字を別の文字として照合する処理とのいずれかを指定する異表記区別条件に基づいて、入力された検索キーワード中の各文字を、１の文字専用の文字コードあるいは該文字の異表記文字候補の組に割り当てられた文字コードを保持する文字コード記憶部を参照することにより文字コードに変換し、文字コード列を生成する文字列変換部と、検索対象文書データベースから、文字コード列中の各文字コードの出現位置情報を読み出し、文字コード列と読み出した出現位置情報を照合する照合部と、照合部による照合結果を出力する照合結果出力部を備えたものである。
【００１５】
この発明に係る文書登録装置は、１の文字専用の文字コードあるいは該文字の異表記文字候補の組に割り当てられた文字コードを保持する文字コード記憶部を参照し、入力された文書中の各文字についてその文字専用の文字コードを抽出するとともに、各文字を含む異表記文字候補の組がある場合には、異表記文字候補の組に割り当てられた文字コードを抽出する文字コード割り付け部と、文書中の各文字に対して抽出された全ての文字コードの、文書中における出現位置情報を生成する文字コード出現位置索引生成部と、文字コード出現位置索引生成部により生成された索引を記憶装置に格納する索引格納部を備えたものである。
【００１６】
【発明の実施の形態】
以下、この発明の実施の形態を説明する。
実施の形態１．
図１は、この発明の実施の形態１による、文書検索装置と文書登録装置を備えたシステムの構成を示すブロック図である。
検索対象文書の登録時には、ユーザは、文書登録装置２０の入力処理部２４を介して文書を入力する。文書登録装置２０は、内部文字コード割り付け部２１（文字コード割り付け部）、出現位置情報作成部２２（文字コード出現位置索引生成部）、索引作成部２３（文字コード出現位置索引生成部、索引格納部）によって文書を処理し、検索のための索引を生成する。生成された索引は、索引データベース３０（検索対象文書データベース）に格納する。また、また、入力された文書も記憶装置等（図示せず）に格納される。なお、内部文字コード割り付け部２１、出現位置情報作成部２２、索引作成部２３、および入力処理部２４は、プログラムに従ってコンピュータの中央演算処理装置が行う動作のモジュールを表しており、これらは実際には一体として中央演算処理装置を構成する。
【００１７】
文書検索時には、ユーザは、文書検索装置１０の入力処理部１５を介して検索キーワードを入力する。文書検索装置１０は、入力された検索キーワードに基づいて、文字列変換部１１、位置情報読み出し部１２（照合部）、照合部１３、照合結果出力部１４によって検索処理を行い、索引データベース３０から、入力された検索キーワードを含む文書を検索し、検索結果を出力する。なお、文字列変換部１１、位置情報読み出し部１２、照合部１３、照合結果出力部１４、および入力処理部１５は、プログラムに従ってコンピュータの中央演算処理装置が行う動作のモジュールを表しており、これらは実際には一体として中央演算処理装置を構成する。
【００１８】
内部文字コードテーブル４０（文字コード記憶部）は、例えばリレーショナル型データベースによって実現されており、文字をキーとし、その文字専用の文字コードが関係付けられている。さらに、その文字を含む異表記文字候補の組がある場合には、その異表記文字候補の組に対して割り当てられた内部文字コードも関係付けられている。図２に示すように、内部文字コードテーブル４０に登録されている内部文字コードには、１つの文字に対して一義的に割り当てられる内部文字コードと、「イ／ィ／ヰ」や「辺／邊」などのような異表記文字の組に対して割り当てられる内部文字コードがある。
異表記文字の組は、内部文字コードテーブル４０の図示しない管理装置の入力手段（パーソナルコンピュータ等）を用いて、ユーザが任意に登録することが出来る。
【００１９】
なお、文書検索装置１０および文書登録装置２０は、異なるサーバ装置等であってもよいし、１つの装置であってもよい。また、索引データベース３０および内部文字コードテーブル４０は、同一の記憶装置に格納されていてもよいし、別々の装置に格納されていてもよく、文書検索装置１０および文書登録装置２０とは回線により接続される。あるいは、文書検索装置１０または文書登録装置２０と同一の装置に格納されていてもよい。
【００２０】
次に、文書登録処理について説明する。図３は、この発明の実施の形態１による文書登録装置２０が実行する文書登録処理のフローチャートである。まず、入力処理部２４は、登録文書の入力を受け付ける（ステップＳＴ２０１）。ここでは、図４に示す登録文書５１が入力されたとする。
【００２１】
次に、内部文字コード割り付け部２１は、内部文字コードテーブル４０を参照し、入力された登録文書中の各文字について内部文字コードの割り付けを行う。すなわち、各文字に関係付けられたその文字専用の文字コードを抽出すると共に、その文字を含む異表記文字の組の内部文字コードが登録されている場合には、その文字コードも抽出する（ステップＳＴ２０２）。例えば、図４の例では、登録文書５１中の「ウ」に対しては内部文字コード「Ａ２」、「ィ」に対しては文字コード「Ａ５２」と「Ａ１０１」、「イ」に対しては「Ａ１」および「Ａ１０１」が抽出される。
【００２２】
次に、出現位置情報作成部２２は、ステップＳＴ２０２で登録文書５１の各文字に対して割り付けられた内部文字コード毎に、登録文書５１中での出現位置情報を作成する（ステップＳＴ２０３）。例えば、「イ」を表す「Ａ１」については、「イ」は登録文書５１の７文字目に出現するので「７」が取得される。同様に、「ィ」を表す「Ａ５２」については「２」が取得される。また、「イ／ィ／ヰ」を表す「Ａ１０１」については２文字目の「ィ」と７文字目の「イ」が対象なので「２，７」が取得される。
【００２３】
次に、索引作成部２３は、登録文書５１の索引を生成し、索引データベース３０に格納する（ステップＳＴ２０４）。図４に、索引データベース３０に格納される内容の一部を示す。索引は、ステップＳＴ２０２で得られた内部文字コードを見出しとし、その内部文字コードに対応する文字と、ステップＳＴ２０３で得られた登録文書中での出現位置情報（登録文書番号―文書中での位置番号）が登録されている。
【００２４】
次に、文書検索処理について説明する。図５は、この発明の実施の形態１による文書検索装置１０が実行する文書検索処理のフローチャートである。まず、入力処理部１５は、ユーザから、検索キーワードと異表記条件（異表記区別条件）の入力を受け付ける（ステップＳＴ５０１）。
【００２５】
ここでは例として、検索キーワードが「ウィスキー」である場合について説明する。異表記条件とは、ユーザが検索キーワード中の文字ごとに指定する、該当文字を異表記区別するか、区別しない（異表記文字を同一とみなす）かの条件である。各文字について、異表記区別する場合には「０」を、区別しない場合には「１」を指定する。例えば、検索キーワード「ウィスキー」の「ウ」、「ス」、「キ」、「ー」については異表記区別し、「ィ」については区別しないのであれば、異表記条件は「０１０００」となる。これにより「ィ」については、「イ」および「ヰ」が使われていても同一の語句とみなされ、「ウイスキー」および「ウヰスキー」も検索対象となる。
【００２６】
文字列変換部１１は、入力された検索キーワードを１文字づつ読み込む（ステップＳＴ５０２）。
【００２７】
次に、文字列変換部１１は、読み込んだ文字について、入力された異表記条件を参照し、異表記区別するかしないかを判定する（ステップＳＴ５０３）。
【００２８】
ステップＳＴ５０３で異表記区別すると判定された文字については、文字列変換部１１は内部文字コードテーブル４０を参照し、当該文字をその文字専用の内部文字コードに変換する（ステップＳＴ５０４）。
【００２９】
一方、ステップＳＴ５０３で異表記区別しないと判定された文字については、文字列変換部１１は、内部文字コードテーブル４０を参照し、当該文字を異表記文字の組に対して割り当てられる内部文字コードに変換する（ステップＳＴ５０５）。
【００３０】
文字列変換部１１は、検索キーワードの全ての文字を読み込んだかどうか判定する（ステップＳＴ５０６）。すべての文字の読み込みが終了するまで、ステップＳＴ５０２からステップＳＴ５０６の処理を繰り返す。読み込みが終了すると、得られた変換結果が連結され、文字コード列が生成される（ステップＳＴ５０７）。検索キーワードの文字列変換処理を具体例で説明する。キーワード「ウィスキー」の「ウ」については、異表記条件より、異表記区別するので、図２より「Ａ２」に変換される。一方、異表記区別をしない「ィ」については、「イ／ィ／ヰ」を表す「Ａ１０１」に変換される。最終的に、文字コード列「Ａ２Ａ１０１Ａ１２Ａ６Ａ９０」が得られる。
【００３１】
次に、位置情報読み出し部１２は、索引データベース３０から図４に示すような索引データを読み出す。位置情報読み出し部１２は、ステップＳＴ５０７で得られた文字コード列中の各文字コードをキーにして、索引から該当する内部文字コードの出現位置情報を取得する（ステップＳＴ５０８）。文字コード列「Ａ２Ａ１０１Ａ１２Ａ６Ａ９０」については、図６に示すような各内部文字コードの出現位置情報が得られる。
【００３２】
次に、照合部１３は、ステップＳＴ５０５で取得した出現位置情報とステップＳＴ５０７で取得した文字コード列との照合を行う（ステップＳＴ５０９）。具体的には、各文字コードの出現位置情報を調べ、文字コード列と一致する並びがあるかどうか調べる。図６の例では、「１−１，１−２，１−３，１−４，１−５」の並びが該当する。
【００３３】
照合の結果、検索キーワードの文字列コードと、同じ長さで、連続した出現位置情報が存在した場合には、照合結果出力部１４は、出現位置情報を出力して終了する（ステップＳＴ５１０）。図６の例では、検索キーワードに相当する部分の先頭番号である「１−１」を出力する。
【００３４】
一方、一致する並びが存在しない場合には、照合結果出力部１４は、「該当キーワードなし」を返す（ステップＳＴ５１１）。
【００３５】
なお、内部文字コードテーブル４０に登録する異表記文字の組み合わせとしては、他にも英字の大文字と小文字の組、英字の全角と半角の組、ひらがなの長音と母音の大文字または母音の小文字との組、ひらがなの現代仮名遣いと歴史的仮名遣いの組、カタカナの長音と母音の大文字または母音の小文字との組、カタカナの現代仮名遣いと歴史的仮名遣いの組、カタカナの全角と半角の組、数字のアラビア数字と漢数字の組、アラビア数字の全角と半角の組、漢字の正字と略字または俗字の組等を登録してもよい。あるいは、これらの２種以上の組み合わせ等を登録してもよい。
【００３６】
以上のように、この実施の形態１によれば、異表記条件に基づいて、文書検索装置１０の文字列変換部１１が、検索キーワードを１つの文字専用の文字コードあるいは異表記文字の組に割り当てられた文字コードに変換し、索引データベース３０と照合するようにしたので、異表記文字を区別する場合と区別しない場合のどちらの検索処理においても、無駄なデータの入出力を防止し、少ない処理ステップで文書検索を行うことができるという効果がある。
【００３７】
また、この実施の形態１によれば、異表記条件は検索キーワード入力時にユーザが文字単位に指定するようにしたので、ユーザの使用目的に応じて検索条件を指定することが出来る。
【００３８】
また、この実施の形態１によれば、文書登録装置２０の内部文字コード割り付け部２１が、登録文書中に含まれる各文字に対し、１つの文字専用の文字コードと異表記文字の組に割り当てられた文字コードとをそれぞれ割り付け、出現位置情報とともに索引を生成するようにしたので、文書を効率のよい検索が行える形式で登録することが出来るという効果がある。
【００３９】
【発明の効果】
以上のように、この発明によれば、異表記区別条件に基づいて、入力された検索キーワード中の各文字を、その文字専用の文字コードあるいはその文字の異表記文字候補の組に割り当てられた文字コードに変換し、検索対象データベース中に存在するか否か照合するようにしたので、用途に応じて異表記文字を区別する場合と区別しない場合のどちらの検索処理においても、無駄なデータの入出力を防止すると共に処理ステップを可能な限り少なくし、効率のよい文書検索を行う文書検索装置を得られるという効果がある。
【００４０】
この発明によれば、登録する文書中の各文字に、その文字専用の文字コードを割り付けるとともに、各文字を含む異表記文字候補の組がある場合にはその異表記文字候補の組に割り当てられた文字コードをその文字に割り付け、文書中での出現位置情報とともに索引を作成するようにしたので、文書を効率のよい検索が行える形式で登録する文書登録装置を得られるという効果がある。
【図面の簡単な説明】
【図１】この発明の実施の形態１による、文書検索装置と文書登録装置を備えたシステムの構成を示すブロック図である。
【図２】この発明の実施の形態１による、内部文字コードテーブルに記憶されている内容の一部を示す図である。
【図３】この発明の実施の形態１による、文書登録装置が実行する文書登録処理のフローチャートである。
【図４】この発明の実施の形態１による、索引データベースに記憶されている内容の一部を示す図である。
【図５】この発明の実施の形態１による、文書検索装置が実行する文書検索処理のフローチャートである。
【図６】この発明の実施の形態１による、位置情報読み出し部の出力内容を示す図である。
【符号の説明】
１０文書検索装置、１１文字列変換部、１２位置情報読み出し部（照合部）、１３照合部、１４照合結果出力部、１５入力処理部、２０文書登録装置、２１内部文字コード割り付け部（文字コード割り付け部）、２２出現位置情報作成部（文字コード出現位置索引生成部）、２３索引作成部（文字コード出現位置索引生成部、索引格納部）、２４入力処理部、３０索引データベース（検索対象文書データベース）、４０内部文字コードテーブル（文字コード記憶部）、５１登録文書。
[0001]
[Technical field of the invention]
The present invention relates to a document search device and a document registration device.
[0002]
[Prior art]
In recent years, large numbers of electronic documents have been accumulated, and there is demand for a method of efficiently retrieving a desired document from a document database.
[0003]
In a typical document search device, when the document database is created, an index for looking up the appearance position information of each character in a registered document by character code or the like is created and stored in a storage device or the like. At search time, the appearance positions of a search keyword can be found quickly by referring to the index.
[0004]
On the other hand, when a user wants to search for documents containing the keyword "ウイスキー" (whiskey), for example, it sometimes serves the user's purpose to also search documents containing the variant spellings "ウィスキー" or "ウヰスキー", and sometimes it is better to exclude them. Characters such as "イ／ィ／ヰ" that can be regarded as identical depending on usage (hereinafter referred to as "different notation characters") also include, for example, "辺／邊" in "渡辺" and "渡邊" (both read "Watanabe").
[0005]
In conventional document search methods, character strings that have two or more different notations, such as "ウイスキー", "ウィスキー", and "ウヰスキー", have been searched either by always distinguishing the different notations or by never distinguishing them. With the method that always distinguishes, documents containing a differently written form of the keyword are not search targets, so a user who also needs those documents suffers search omissions. Conversely, with the method that does not distinguish, every document containing a differently written form becomes a search target, so a user who wants to distinguish the notations obtains documents that are not needed.
[0006]
To solve this problem, some devices let the user select at search time whether or not to distinguish the different notations of a search keyword. For example, in the conventional document search device disclosed in Patent Document 1, different notation characters such as full-width and half-width characters, or uppercase and lowercase English letters, are treated as a common character in the index of the document database. Together with location information indicating where each character appears in the document database, the index holds character type information that distinguishes different notation characters such as uppercase and lowercase. At search time, the search keyword is converted into the common characters used in the index. When a search that does not distinguish different notations is performed, the character type information is not referred to, and the appearance position information of the common characters identical to the converted search keyword is acquired from the index. When a search that distinguishes different notations is specified, the device not only looks up the common character but also refers to the character type information, and acquires the appearance position information only when the character type also matches.
[0007]
[Patent Document 1]
JP-A-08-77188 (FIGS. 1 and 2)
[0008]
[Problems to be solved by the invention]
As described above, a document search device that lets the user choose, according to the user's purpose, whether to search with or without distinguishing different notation characters offers better search accuracy and better usability.
[0009]
However, because the above document search device converts the characters of the keyword into common characters at search time, index entries for unneeded character type information are read out even when a search that distinguishes different notations is performed, which causes wasted data input/output.
[0010]
Furthermore, because the search keyword is converted into common characters, the above document search device must refer to the character type information when searching with different notations distinguished, which increases the number of search processing steps and therefore the CPU load.
[0011]
Further, in the above-described document search apparatus, the document index includes character type information in addition to the appearance position information, so that the data input / output amount when referring to the index at the time of document registration and search increases.
[0012]
To solve the above problems, an object of the present invention is to obtain a document search device that performs efficient document search by preventing wasted data input/output and keeping the number of processing steps as small as possible, both when different notation characters are distinguished and when they are not, depending on the application.
[0013]
Another object of the present invention is to provide a document registration device for registering a search target document in a format that allows efficient search.
[0014]
[Means for Solving the Problems]
A document search device according to the present invention includes: a character string conversion unit that, based on a different notation distinguishing condition specifying either a process of matching one character and its different notation characters as the same character or a process of matching them as separate characters, converts each character in an input search keyword into a character code by referring to a character code storage unit that holds a character code dedicated to one character or a character code assigned to a set of different notation character candidates for that character, and generates a character code string; a collation unit that reads the appearance position information of each character code in the character code string from a search target document database and collates the character code string with the read appearance position information; and a collation result output unit that outputs the collation result from the collation unit.
[0015]
A document registration device according to the present invention includes: a character code allocating unit that refers to a character code storage unit holding a character code dedicated to one character or a character code assigned to a set of different notation character candidates for that character, extracts, for each character in an input document, the character code dedicated to that character, and, when there is a set of different notation character candidates that includes the character, also extracts the character code assigned to that set; a character code appearance position index generation unit that generates the appearance position information, within the document, of all character codes extracted for the characters in the document; and an index storage unit that stores the index generated by the character code appearance position index generation unit in a storage device.
[0016]
[Embodiments of the invention]
Hereinafter, an embodiment of the present invention will be described.
Embodiment 1.
FIG. 1 is a block diagram showing a configuration of a system including a document search device and a document registration device according to Embodiment 1 of the present invention.
When registering a search target document, the user inputs the document via the input processing unit 24 of the document registration device 20. The document registration device 20 processes the document with an internal character code allocating unit 21 (character code allocating unit), an appearance position information creating unit 22 (character code appearance position index generation unit), and an index creating unit 23 (character code appearance position index generation unit, index storage unit), and generates an index for searching. The generated index is stored in the index database 30 (search target document database). The input document is also stored in a storage device or the like (not shown). The internal character code allocating unit 21, the appearance position information creating unit 22, the index creating unit 23, and the input processing unit 24 represent modules of operations that the central processing unit of a computer performs according to a program; in practice they together constitute the central processing unit as a single unit.
[0017]
At the time of document search, the user inputs a search keyword via the input processing unit 15 of the document search device 10. Based on the input search keyword, the document search device 10 performs search processing with the character string conversion unit 11, the position information reading unit 12 (collation unit), the collation unit 13, and the collation result output unit 14, searches the index database 30 for documents containing the input search keyword, and outputs the search result. The character string conversion unit 11, the position information reading unit 12, the collation unit 13, the collation result output unit 14, and the input processing unit 15 represent modules of operations that the central processing unit of a computer performs according to a program; in practice they together constitute the central processing unit as a single unit.
[0018]
The internal character code table 40 (character code storage unit) is realized by, for example, a relational database, in which each character serves as a key and is associated with the character code dedicated to that character. When there is a set of different notation character candidates that includes the character, the internal character code assigned to that set is also associated with it. As shown in FIG. 2, the internal character codes registered in the internal character code table 40 comprise internal character codes uniquely assigned to a single character and internal character codes assigned to a set of different notation characters such as "イ／ィ／ヰ" or "辺／邊".
The user can arbitrarily register the set of differently written characters using an input means (a personal computer or the like) of a management device (not shown) of the internal character code table 40.
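As a rough illustration of the table described above, the following Python sketch holds the two kinds of internal character code in memory. The dictionary layout and the function name `codes_for` are assumptions made for this example (the description only says the table may be a relational database), and the code "A53" for "ヰ" is a hypothetical placeholder because the description does not state it. The other codes are the ones quoted in the description.

```python
# Sketch of the internal character code table 40 (assumed in-memory layout).
# A1, A2, A6, A12, A52, A90 and A101 are taken from the description;
# "A53" for ヰ is a hypothetical placeholder, since its code is not stated.
DEDICATED_CODE = {
    "イ": "A1",
    "ウ": "A2",
    "キ": "A6",
    "ス": "A12",
    "ィ": "A52",
    "ヰ": "A53",  # hypothetical
    "ー": "A90",
}

# One shared code per set of different notation characters.
GROUP_CODE = {
    frozenset("イィヰ"): "A101",
}


def codes_for(ch: str) -> list[str]:
    """Return the dedicated code of ch, followed by its group code if any."""
    codes = [DEDICATED_CODE[ch]]
    for members, group_code in GROUP_CODE.items():
        if ch in members:
            codes.append(group_code)
    return codes
```

Registering a new set of different notation characters would, under this layout, amount to adding one entry to `GROUP_CODE`.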
[0019]
The document search device 10 and the document registration device 20 may be separate server devices or the like, or may be a single device. The index database 30 and the internal character code table 40 may be stored in the same storage device or in separate devices, and are connected to the document search device 10 and the document registration device 20 via a line. Alternatively, they may be stored in the same device as the document search device 10 or the document registration device 20.
[0020]
Next, the document registration process will be described. FIG. 3 is a flowchart of the document registration process executed by the document registration device 20 according to the first embodiment of the present invention. First, the input processing unit 24 receives input of a document to be registered (step ST201). Here, it is assumed that the registered document 51 shown in FIG. 4 has been input.
[0021]
Next, the internal character code allocating unit 21 refers to the internal character code table 40 and assigns internal character codes to each character in the input registered document. That is, it extracts the character code dedicated to each character and, if an internal character code for a set of different notation characters that includes the character is registered, extracts that character code as well (step ST202). In the example of FIG. 4, the internal character code "A2" is extracted for "ウ" in the registered document 51, the character codes "A52" and "A101" are extracted for "ィ", and "A1" and "A101" are extracted for "イ".
[0022]
Next, the appearance position information creating unit 22 creates the appearance position information within the registered document 51 for each internal character code assigned to the characters of the registered document 51 in step ST202 (step ST203). For example, for "A1", which represents "イ", the value "7" is acquired because "イ" appears as the seventh character of the registered document 51. Similarly, "2" is acquired for "A52", which represents "ィ". For "A101", which represents "イ／ィ／ヰ", both the second character "ィ" and the seventh character "イ" apply, so "2, 7" is acquired.
[0023]
Next, the index creating unit 23 creates an index of the registered document 51 and stores it in the index database 30 (step ST204). FIG. 4 shows a part of the contents stored in the index database 30. The index uses the internal character code obtained in step ST202 as its heading, and registers the character corresponding to that internal character code together with the appearance position information in the registered document obtained in step ST203 (registered document number - position number within the document).
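The registration steps ST202 to ST204 can be sketched as follows. This is a minimal illustration, not the implementation in the patent: the function names, the in-memory dictionary standing in for the index database 30, and the fallback code for characters whose code the description does not give are all assumptions. The sample text is hypothetical as well, chosen only so that "ィ" is the 2nd and "イ" the 7th character, as in the example above.

```python
from collections import defaultdict

# Codes quoted in the description; GROUP maps each member of イ/ィ/ヰ to A101.
DEDICATED = {"イ": "A1", "ウ": "A2", "キ": "A6", "ス": "A12", "ィ": "A52", "ー": "A90"}
GROUP = {"イ": "A101", "ィ": "A101", "ヰ": "A101"}


def dedicated_code(ch):
    # Assumption: characters without a stated code get a placeholder
    # derived from their Unicode code point.
    return DEDICATED.get(ch, f"U{ord(ch):04X}")


def build_index(doc_no, text, index=None):
    """Steps ST202-ST204: map each internal code to (doc_no, position) pairs."""
    index = defaultdict(list) if index is None else index
    for pos, ch in enumerate(text, start=1):  # positions are 1-based
        index[dedicated_code(ch)].append((doc_no, pos))
        if ch in GROUP:  # also index the code shared by the set
            index[GROUP[ch]].append((doc_no, pos))
    return index


# Hypothetical document text for registered document number 1.
index = build_index(1, "ウィスキーとイカ")
print(index["A101"])  # [(1, 2), (1, 7)]
```

Each character that belongs to a set is indexed twice, once under its dedicated code and once under the shared code, which is what lets the search side read only the posting list it needs.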
[0024]
Next, the document search processing will be described. FIG. 5 is a flowchart of a document search process executed by the document search device 10 according to the first embodiment of the present invention. First, the input processing unit 15 receives an input of a search keyword and a different notation condition (different notation discrimination condition) from the user (step ST501).
[0025]
Here, the case where the search keyword is "ウィスキー" will be described as an example. The different notation condition is a condition that the user specifies for each character in the search keyword, indicating whether that character is to be distinguished from its different notation characters or not distinguished (that is, the different notation characters are regarded as the same). For each character, "0" is specified to distinguish and "1" is specified not to distinguish. For example, if "ウ", "ス", "キ", and "ー" of the search keyword "ウィスキー" are to be distinguished and "ィ" is not, the different notation condition is "01000". As a result, a word is regarded as the same even if "イ" or "ヰ" is used in place of "ィ", so "ウイスキー" and "ウヰスキー" also become search targets.
[0026]
Character string conversion unit 11 reads the input search keyword one character at a time (step ST502).
[0027]
Next, the character string conversion unit 11 refers to the input different notation condition for the character that was read and determines whether or not to distinguish different notations for it (step ST503).
[0028]
For the character determined to be distinguished in the different notation in step ST503, the character string conversion unit 11 refers to the internal character code table 40 and converts the character into an internal character code dedicated to the character (step ST504).
[0029]
On the other hand, for a character for which it is determined in step ST503 that different notations are not to be distinguished, the character string conversion unit 11 refers to the internal character code table 40 and converts the character into the internal character code assigned to its set of different notation characters (step ST505).
[0030]
The character string conversion unit 11 determines whether all characters of the search keyword have been read (step ST506). The processing from step ST502 to step ST506 is repeated until all characters have been read. When reading is complete, the conversion results are concatenated to generate a character code string (step ST507). The conversion of the search keyword is explained here with a specific example. The "ウ" of the keyword "ウィスキー" is to be distinguished according to the different notation condition, so it is converted to "A2" based on FIG. 2. The "ィ", for which different notations are not distinguished, is converted to "A101", which represents "イ／ィ／ヰ". Finally, the character code string "A2A101A12A6A90" is obtained.
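The conversion loop of steps ST502 to ST507 might look like the sketch below. The function name `to_code_string` and the dictionary layout are assumptions for illustration. So is the fallback to the dedicated code when "1" is specified for a character that belongs to no set, a case the description does not address.

```python
# Codes quoted in the description (FIG. 2 itself is not reproduced here).
DEDICATED = {"イ": "A1", "ウ": "A2", "キ": "A6", "ス": "A12", "ィ": "A52", "ー": "A90"}
GROUP = {"イ": "A101", "ィ": "A101", "ヰ": "A101"}


def to_code_string(keyword, condition):
    """Steps ST502-ST507: one flag per character, '0' distinguish, '1' do not."""
    if len(keyword) != len(condition):
        raise ValueError("condition needs one flag per keyword character")
    codes = []
    for ch, flag in zip(keyword, condition):
        if flag == "1" and ch in GROUP:
            codes.append(GROUP[ch])      # ST505: code shared by the set
        else:
            codes.append(DEDICATED[ch])  # ST504: code dedicated to ch
    return codes


codes = to_code_string("ウィスキー", "01000")
print("".join(codes))  # A2A101A12A6A90
```

Keeping the result as a list of codes rather than one concatenated string avoids having to split "A2A101A12A6A90" back into individual codes when the index is consulted.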
[0031]
Next, the position information reading unit 12 reads index data such as that shown in FIG. 4 from the index database 30. Using each character code in the character code string obtained in step ST507 as a key, the position information reading unit 12 acquires the appearance position information of the corresponding internal character code from the index (step ST508). For the character code string "A2A101A12A6A90", the appearance position information of each internal character code is obtained as shown in FIG. 6.
[0032]
Next, the collation unit 13 collates the appearance position information acquired in step ST508 with the character code string obtained in step ST507 (step ST509). Specifically, it examines the appearance position information of each character code and checks whether there is a sequence of positions that matches the character code string. In the example of FIG. 6, the sequence "1-1, 1-2, 1-3, 1-4, 1-5" matches.
[0033]
If the collation finds consecutive appearance position information of the same length as the character code string of the search keyword, the collation result output unit 14 outputs the appearance position information and ends the process (step ST510). In the example of FIG. 6, "1-1", the position number of the first character of the part corresponding to the search keyword, is output.
[0034]
On the other hand, when there is no matching arrangement, the matching result output unit 14 returns “no corresponding keyword” (step ST511).
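The collation of steps ST509 to ST511 can be illustrated with the following sketch. The function name `collate` and the in-memory index are assumptions. The index contents are hypothetical values consistent with the examples above (the actual contents of FIG. 6 are not reproduced in the text). An empty result corresponds to the "no corresponding keyword" case.

```python
def collate(codes, index):
    """Step ST509: return (doc_no, start) for each run of consecutive
    positions whose codes match the character code string."""
    if not codes:
        return []
    # Posting lists of the 2nd and later codes, as sets for fast lookup.
    rest = [set(index.get(code, ())) for code in codes[1:]]
    hits = []
    for doc_no, start in index.get(codes[0], ()):
        if all((doc_no, start + k) in postings
               for k, postings in enumerate(rest, start=1)):
            hits.append((doc_no, start))
    return hits


# Hypothetical index for one document that begins with ウィスキー.
index = {
    "A2": [(1, 1)],
    "A52": [(1, 2)],
    "A101": [(1, 2), (1, 7)],
    "A12": [(1, 3)],
    "A6": [(1, 4)],
    "A90": [(1, 5)],
    "A1": [(1, 7)],
}

result = collate(["A2", "A101", "A12", "A6", "A90"], index)
print(result)  # [(1, 1)], i.e. position "1-1"
```

Only the posting lists named in the character code string are consulted, which reflects the point of the scheme: no separate character type information has to be read in either search mode.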
[0035]
Other sets of different notation characters that may be registered in the internal character code table 40 include: uppercase and lowercase alphabetic letters; full-width and half-width alphabetic letters; the hiragana long-vowel mark and a full-size or small vowel; modern and historical kana usage in hiragana; the katakana long-vowel mark and a full-size or small vowel; modern and historical kana usage in katakana; full-width and half-width katakana; Arabic numerals and kanji numerals; full-width and half-width Arabic numerals; and standard kanji forms and their abbreviated or popular variant forms. Combinations of two or more of these may also be registered.
[0036]
As described above, according to the first embodiment, the character string conversion unit 11 of the document search device 10 converts the search keyword, based on the different notation condition, into character codes dedicated to single characters or character codes assigned to sets of different notation characters, and the result is collated against the index database 30. Consequently, both when different notation characters are distinguished and when they are not, wasted data input/output is prevented and the document search can be performed with few processing steps.
[0037]
Further, according to the first embodiment, the user specifies the different notation condition character by character when inputting the search keyword, so the search condition can be specified according to the user's purpose.
[0038]
Further, according to the first embodiment, the internal character code allocating unit 21 of the document registration device 20 assigns to each character contained in the registered document both the character code dedicated to that character and the character code assigned to its set of different notation characters, and an index is generated together with the appearance position information. Consequently, the document can be registered in a format that allows efficient search.
[0039]
[Effects of the invention]
As described above, according to the present invention, each character in an input search keyword is converted, based on the different notation distinguishing condition, into a character code dedicated to that character or a character code assigned to the set of different notation character candidates for that character, and is collated to determine whether it exists in the search target database. This yields a document search device that performs efficient document search by preventing wasted data input/output and keeping the number of processing steps as small as possible, both when different notation characters are distinguished and when they are not, depending on the application.
[0040]
According to the present invention, each character in a document to be registered is assigned the character code dedicated to that character and, when there is a set of different notation character candidates that includes the character, is also assigned the character code of that set, and an index is created together with the appearance position information in the document. This yields a document registration device that registers documents in a format that allows efficient search.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a configuration of a system including a document search device and a document registration device according to a first embodiment of the present invention.
FIG. 2 is a diagram showing a part of contents stored in an internal character code table according to the first embodiment of the present invention.
FIG. 3 is a flowchart of a document registration process executed by the document registration device according to the first embodiment of the present invention.
FIG. 4 is a diagram showing a part of contents stored in an index database according to the first embodiment of the present invention.
FIG. 5 is a flowchart of a document search process executed by the document search device according to the first embodiment of the present invention.
FIG. 6 is a diagram showing output contents of a position information reading unit according to the first embodiment of the present invention.
[Explanation of symbols]
10 document search device, 11 character string conversion unit, 12 position information reading unit (collation unit), 13 collation unit, 14 collation result output unit, 15 input processing unit, 20 document registration device, 21 internal character code allocating unit (character code allocating unit), 22 appearance position information creating unit (character code appearance position index generation unit), 23 index creating unit (character code appearance position index generation unit, index storage unit), 24 input processing unit, 30 index database (search target document database), 40 internal character code table (character code storage unit), 51 registered document.

Claims

A document search device comprising: a character string conversion unit that, based on a different notation distinguishing condition specifying either a process of matching one character and a different notation character of that character as the same character or a process of matching these characters as separate characters, converts each character in an input search keyword into a character code by referring to a character code storage unit that holds a character code dedicated to one character or a character code assigned to a set of different notation character candidates for that character, and generates a character code string;
a collation unit that reads, from a search target document database, the appearance position information of each character code in the character code string, and collates the character code string with the read appearance position information; and
a collation result output unit that outputs a collation result from the collation unit.

The document search device according to claim 1, wherein the character string conversion unit receives, together with the search keyword, a different notation distinguishing condition specified for each character in the search keyword, and converts each character into a character code in accordance with that different notation distinguishing condition.

A document registration device comprising: a character code allocating unit that refers to a character code storage unit holding a character code dedicated to one character or a character code assigned to a set of different notation character candidates for that character, extracts, for each character in an input document, the character code dedicated to that character, and, when there is a set of different notation character candidates that includes the character, extracts the character code assigned to that set;
a character code appearance position index generation unit that generates the appearance position information, within the document, of all character codes extracted for each character in the document; and
an index storage unit that stores the index generated by the character code appearance position index generation unit in a storage device.