JP2005004626A

JP2005004626A - Different notation normalization processor, different notation normalization processing program, storage medium storing the same, document retrieval device, document retrieval program and storage medium storing the same

Info

Publication number: JP2005004626A
Application number: JP2003169484A
Authority: JP
Inventors: Hiroko Ida; 裕子井田; Yuichi Kojima; 裕一小島; Masayuki Kameda; 雅之亀田; Yasutsugu Ogawa; 泰嗣小川; Yukiko Hiraoka; 優希子平岡; Kensaku Yamamoto; 研策山本
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 2003-06-13
Filing date: 2003-06-13
Publication date: 2005-01-06
Anticipated expiration: 2023-06-13
Also published as: JP4294386B2

Abstract

PROBLEM TO BE SOLVED: To provide a different notation normalization processor that absorbs the various notation fluctuations among languages whose Chinese-character notations differ.

SOLUTION: The different notation normalization processor comprises: a document input means 101 that receives an input text; a Chinese character equivalence segmentation means 102a that extracts the Chinese-character notation from the input text received by the document input means 101; two or more different notation normalization rules 102c, each describing the corresponding characters between languages whose Chinese-character notations differ; a different notation normalization rule selection means 102d that selects the different notation normalization rule 102c suited to the input text received by the document input means 101; and a normalization processing means 102b that normalizes the Chinese-character notation extracted by the Chinese character equivalence segmentation means 102a in accordance with the different notation normalization rule 102c selected by the different notation normalization rule selection means 102d. In this way, the various notation fluctuations among languages whose Chinese-character notations differ are absorbed.

COPYRIGHT: (C)2005, JPO&NCIPI

Description

【０００１】
【発明の属する技術分野】
本発明は、異表記正規化処理装置、異表記正規化処理プログラム、これを記憶する記憶媒体、文書検索装置、文書検索プログラム及びこれを記憶する記憶媒体に関する。
【０００２】
【従来の技術】
中国において一般に用いられる漢字は簡体字と呼ばれ、台湾において一般に用いられる漢字は繁体字と呼ばれており、同じ意味を表すのに異なる漢字が用いられる場合がある。例えば、繁体字では「印刷機械」と表すものが、簡体字では「印刷机械」として表される。
【０００３】
そこで、簡体字で記載された文書と繁体字で記載された文書を変換する装置が研究、開発されている（例えば、特許文献１参照）。
【０００４】
【特許文献１】
特開平８−２６３４７８号公報
【０００５】
【発明が解決しようとする課題】
ところで、漢字を使用する言語としては、上述した中国語における簡体字や繁体字の他に、日本語、韓国語、ベトナム語等も挙げられる。これらの言語においては、漢字の発音は異なるが同じ意味で使用されることもあり、簡単な文章であれば理解することができる。また、言語が同一であっても、各言語においては旧字新字や異体字などがあり、同じ意味を表すのに異なる漢字が用いられる場合がある。
【０００６】
このように同じ意味を表すのに異なる漢字が用いられている文書について文書検索処理を行った場合には、漢字表記が異なるために検索漏れが発生するという問題がある。
【０００７】
そのため、近年においては、簡体字と繁体字との間のみならず、各種の言語間における多様な表記のゆれを吸収することが望まれている。
【０００８】
本発明の目的は、漢字表記が異なる各種の言語間における多様な表記のゆれを吸収することができる異表記正規化処理装置、異表記正規化処理プログラム、これを記憶する記憶媒体を提供することである。
【０００９】
本発明の目的は、漢字使用圏の言語（中国語（簡体字や繁体字）、日本語、韓国語、ベトナム語等）で記述された文書についての文書検索処理における検索漏れの発生を防止し、最も適切な文書を検索することができる文書検索装置、文書検索プログラム及びこれを記憶する記憶媒体を提供することである。
【００１０】
【課題を解決するための手段】
請求項１記載の発明の異表記正規化処理装置は、入力テキストを受け付ける文書入力手段と、この文書入力手段により受け付けた前記入力テキストから漢字表記を抽出する漢字相当切出手段と、漢字表記が異なる言語間の文字対応を記述した少なくとも２種以上の異表記正規化規則を格納する正規化規則格納手段と、前記文書入力手段により受け付けた前記入力テキストに適した前記異表記正規化規則を選択する異表記正規化規則選択手段と、この異表記正規化規則選択手段により選択した前記異表記正規化規則に応じて前記漢字相当切出手段により抽出した漢字表記を正規化する正規化処理手段と、を備える。
【００１１】
したがって、入力テキストから漢字表記が抽出され、漢字表記が異なる言語間の文字対応を記述した少なくとも２種以上の異表記正規化規則から入力テキストに適した異表記正規化規則が選択され、この選択された異表記正規化規則に基づいて抽出された漢字表記が正規化される。これにより、漢字表記が異なる各種の言語間における多様な表記のゆれを吸収することが可能になる。
【００１２】
請求項２記載の発明は、請求項１記載の異表記正規化処理装置において、前記異表記正規化規則選択手段は、前記入力テキストの言語情報を解析し、解析された言語情報に基づいて前記異表記正規化規則を選択する。
【００１３】
したがって、どの言語で入力するかを意識せずとも、漢字の入力テキストに対して適切な異表記正規化処理を施すことが可能になる。
【００１４】
請求項３記載の発明は、請求項２記載の異表記正規化処理装置において、前記異表記正規化規則選択手段は、解析された言語情報に基づいて複数の前記異表記正規化規則が選択された場合には、これらの異表記正規化規則の中から所望の前記異表記正規化規則の選択を許容する。
【００１５】
したがって、漢字の入力テキストに対して最も適切な異表記正規化処理を施すことが可能になる。
【００１６】
請求項４記載の発明は、請求項１ないし３のいずれか一記載の異表記正規化処理装置において、前記異表記正規化規則は、所定の言語の漢字表記とその異体字との間の文字対応が記述されたものである。
【００１７】
したがって、所定の言語の漢字表記とその異体字との間における表記のゆれを吸収することが可能になる。
【００１８】
請求項５記載の発明は、請求項１ないし３のいずれか一記載の異表記正規化処理装置において、前記異表記正規化規則は、所定の言語の漢字表記に対応する他の言語の漢字表記が記述されたものである。
【００１９】
したがって、所定の言語の漢字表記と他の言語の漢字表記との間における表記のゆれを吸収することが可能になる。
【００２０】
請求項６記載の発明は、請求項１ないし３のいずれか一記載の異表記正規化処理装置において、所定の言語の漢字表記とその異体字との間の文字対応が記述された前記異表記正規化規則と、所定の言語の漢字表記に対応する他の言語の漢字表記が記述された前記異表記正規化規則との両方を有している。
【００２１】
したがって、漢字表記が異なる各種の言語間における多様な表記のゆれを吸収することが可能になる。
【００２２】
請求項７記載の発明の異表記正規化処理プログラムは、コンピュータにインストールされ、入力テキストを受け付ける文書入力機能と、この文書入力機能により受け付けた前記入力テキストから漢字表記を抽出する漢字相当切出機能と、漢字表記が異なる言語間の文字対応を記述した少なくとも２種以上の異表記正規化規則を格納する正規化規則格納機能と、前記文書入力機能により受け付けた前記入力テキストに適した前記異表記正規化規則を選択する異表記正規化規則選択機能と、この異表記正規化規則選択機能により選択した前記異表記正規化規則に応じて前記漢字相当切出機能により抽出した漢字表記を正規化する正規化処理機能と、をコンピュータに実行させる。
【００２３】
したがって、入力テキストから漢字表記が抽出され、漢字表記が異なる言語間の文字対応を記述した少なくとも２種以上の異表記正規化規則から入力テキストに適した異表記正規化規則が選択され、この選択された異表記正規化規則に基づいて抽出された漢字表記が正規化される。これにより、漢字表記が異なる各種の言語間における多様な表記のゆれを吸収することが可能になる。
【００２４】
請求項８記載の発明は、請求項７記載の異表記正規化処理プログラムにおいて、前記異表記正規化規則選択機能は、前記入力テキストの言語情報を解析し、解析された言語情報に基づいて前記異表記正規化規則を選択する。
【００２５】
したがって、どの言語で入力するかを意識せずとも、漢字の入力テキストに対して適切な異表記正規化処理を施すことが可能になる。
【００２６】
請求項９記載の発明は、請求項８記載の異表記正規化処理プログラムにおいて、前記異表記正規化規則選択機能は、解析された言語情報に基づいて複数の前記異表記正規化規則が選択された場合には、これらの異表記正規化規則の中から所望の前記異表記正規化規則の選択を許容する。
【００２７】
したがって、漢字の入力テキストに対して最も適切な異表記正規化処理を施すことが可能になる。
【００２８】
請求項１０記載の発明は、請求項７ないし９のいずれか一記載の異表記正規化処理プログラムにおいて、前記異表記正規化規則は、所定の言語の漢字表記とその異体字との間の文字対応が記述されたものである。
【００２９】
したがって、所定の言語の漢字表記とその異体字との間における表記のゆれを吸収することが可能になる。
【００３０】
請求項１１記載の発明は、請求項７ないし９のいずれか一記載の異表記正規化処理プログラムにおいて、前記異表記正規化規則は、所定の言語の漢字表記に対応する他の言語の漢字表記が記述されたものである。
【００３１】
したがって、所定の言語の漢字表記と他の言語の漢字表記との間における表記のゆれを吸収することが可能になる。
【００３２】
請求項１２記載の発明は、請求項７ないし９のいずれか一記載の異表記正規化処理プログラムにおいて、所定の言語の漢字表記とその異体字との間の文字対応が記述された前記異表記正規化規則と、所定の言語の漢字表記に対応する他の言語の漢字表記が記述された前記異表記正規化規則との両方を有している。
【００３３】
したがって、漢字表記が異なる各種の言語間における多様な表記のゆれを吸収することが可能になる。
【００３４】
請求項１３記載の発明の記憶媒体は、請求項７ないし１２のいずれか一記載の異表記正規化処理プログラムを記憶する。
【００３５】
したがって、この記憶媒体に記憶された異表記正規化処理プログラムをコンピュータに読み取らせることにより、請求項７ないし１２のいずれか一記載の発明と同様の作用を得ることが可能になる。
【００３６】
請求項１４記載の発明の文書検索装置は、異表記正規化処理を行う請求項１ないし６のいずれか一記載の異表記正規化処理装置と、入力テキストを受け付ける文書入力手段と、この文書入力手段により受け付けた前記入力テキストが文書の場合、文書データベースに格納する文書格納手段と、この文書格納手段により前記文書データベースに格納された前記文書から文字列を抽出して抽出文字列毎の異表記正規化処理を前記異表記正規化処理装置で行った後、インデックス情報を抽出してインデックス情報格納部に保存するインデックス登録手段と、前記文書入力手段により受け付けた前記入力テキストが検索語の場合、当該検索語を前記異表記正規化処理装置に送出して当該検索語から文字列を抽出して異表記正規化処理を行った後、検索条件を作成する検索条件作成手段と、この検索条件作成手段により作成された前記検索条件に基づいて前記インデックス情報格納部のインデックス情報を検索する検索処理手段と、この検索処理手段による検索の結果を出力する結果出力手段と、を備える。
【００３７】
したがって、異表記正規化処理が施されたインデックス情報及び検索条件が生成されるので、漢字使用圏の言語（中国語（簡体字や繁体字）、日本語、韓国語、ベトナム語等）で記述された文書についての文書検索処理における検索漏れの発生を防止することが可能になるので、最も適切な文書を検索することが可能になる。
【００３８】
請求項１５記載の発明の文書検索プログラムは、コンピュータにインストールされ、請求項１ないし６のいずれか一記載の異表記正規化処理装置における異表記正規化処理を実現する異表記正規化処理機能と、入力テキストを受け付ける文書入力機能と、この文書入力機能により受け付けた前記入力テキストが文書の場合、文書データベースに格納する文書格納機能と、この文書格納機能により前記文書データベースに格納された前記文書から文字列を抽出して抽出文字列毎の異表記正規化処理を前記異表記正規化処理機能で行った後、インデックス情報を抽出してインデックス情報格納部に保存するインデックス登録機能と、前記文書入力機能により受け付けた前記入力テキストが検索語の場合、当該検索語を前記異表記正規化処理機能に送出して当該検索語から文字列を抽出して異表記正規化処理を行った後、検索条件を作成する検索条件作成機能と、この検索条件作成機能により作成された前記検索条件に基づいて前記インデックス情報格納部のインデックス情報を検索する検索処理機能と、この検索処理機能による検索の結果を出力する結果出力機能と、をコンピュータに実行させる。
【００３９】
したがって、異表記正規化処理が施されたインデックス情報及び検索条件が生成されるので、漢字使用圏の言語（中国語（簡体字や繁体字）、日本語、韓国語、ベトナム語等）で記述された文書についての文書検索処理における検索漏れの発生を防止することが可能になるので、最も適切な文書を検索することが可能になる。
【００４０】
請求項１６記載の発明の記憶媒体は、請求項１５記載のプログラムを記憶する。
【００４１】
したがって、この記憶媒体に記憶された文書検索プログラムをコンピュータに読み取らせることにより、請求項１５記載の発明と同様の作用を得ることが可能になる。
【００４２】
【発明の実施の形態】
本発明の第一の実施の形態を図１ないし図６に基づいて説明する。
【００４３】
図１は、本発明が適用される異表記正規化処理装置１のハードウェア構成を概略的に示すブロック図である。図１に示すように、異表記正規化処理装置１は、例えばパーソナルコンピュータやワークステーションであり、コンピュータの主要部であって各部を集中的に制御するＣＰＵ（ＣｅｎｔｒａｌＰｒｏｃｅｓｓｉｎｇＵｎｉｔ）２を備えている。このＣＰＵ２には、ＢＩＯＳなどを記憶した読出し専用メモリであるＲＯＭ（ＲｅａｄＯｎｌｙＭｅｍｏｒｙ）３と、各種データを書換え可能に記憶するＲＡＭ（ＲａｎｄｏｍＡｃｃｅｓｓＭｅｍｏｒｙ）４とがバス５で接続されている。
【００４４】
さらにバス５には、各種のプログラム等を格納するＨＤＤ（ＨａｒｄＤｉｓｋＤｒｉｖｅ）６と、配布されたプログラムであるコンピュータソフトウェアを読み取るための機構としてＣＤ（ＣｏｍｐａｃｔＤｉｓｃ）−ＲＯＭ７を読み取るＣＤ−ＲＯＭドライブ８と、異表記正規化処理装置１とネットワーク９との通信を司る通信制御装置１０と、キーボードやマウスなどの入力装置１１と、ＣＲＴ（ＣａｔｈｏｄｅＲａｙＴｕｂｅ）、ＬＣＤ（ＬｉｑｕｉｄＣｒｙｓｔａｌＤｉｓｐｌａｙ）などの表示装置１２とが、図示しないＩ／Ｏを介して接続されている。
【００４５】
ＲＡＭ４は、各種データを書換え可能に記憶する性質を有していることから、ＣＰＵ２の作業エリアとして機能し、例えば後述する文書バッファ等の役割を果たす。
【００４６】
図１に示すＣＤ−ＲＯＭ７は、この発明の記憶媒体を実施するものであり、ＯＳ（ＯｐｅｒａｔｉｎｇＳｙｓｔｅｍ）や各種のプログラムが記憶されている。ＣＰＵ２は、ＣＤ−ＲＯＭ７に記憶されているプログラムをＣＤ−ＲＯＭドライブ８で読み取り、ＨＤＤ６にインストールする。
【００４７】
なお、記憶媒体としては、ＣＤ−ＲＯＭ７のみならず、ＤＶＤなどの各種の光ディスク、各種光磁気ディスク、フレキシブル・ディスクなどの各種磁気ディスク等、半導体メモリ等の各種方式のメディアを用いることができる。また、通信制御装置１０を介してインターネットなどのネットワーク９からプログラムをダウンロードし、ＨＤＤ６にインストールするようにしてもよい。この場合に、送信側のサーバでプログラムを記憶している記憶装置も、この発明の記憶媒体である。なお、プログラムは、所定のＯＳ（ＯｐｅｒａｔｉｎｇＳｙｓｔｅｍ）上で動作するものであってもよいし、その場合に後述の各種処理の一部の実行をＯＳに肩代わりさせるものであってもよいし、所定のアプリケーションソフトやＯＳなどを構成する一群のプログラムファイルの一部として含まれているものであってもよい。
【００４８】
このシステム全体の動作を制御するＣＰＵ２は、このシステムの主記憶として使用されるＨＤＤ６上にロードされたプログラムに基づいて各種処理を実行する。
【００４９】
次に、異表記正規化処理装置１のＣＰＵ２がプログラムに基づいて実行する各種処理の内容について説明する。図２は、異表記正規化処理装置１を示す機能ブロック図である。図２に示すように、当該異表記正規化処理装置１においては、入力装置１１から入力された文書データや検索語等のテキスト（入力テキスト）を受け付ける文書入力手段１０１、入力テキストに対して正規化処理を施す異表記正規化手段１０２、異表記正規化手段１０２から出力された正規化表記に関する情報等の処理結果を表示装置１２に対して出力する出力手段１０３の各機能が、ＣＰＵ２がコンピュータソフトウェアに従って動作することにより実現される。なお、文書検索の用途などに用いる場合は、異表記正規化手段１０２における処理結果を、出力手段１０３を介することなく、そのまま直接インデキシング処理や検索処理に出力することも可能である。また、文書データや検索語等のテキストの入力は、入力装置１１からの入力に限るものではなく、ＯＣＲ（ＯｐｔｉｃａｌＣｈａｒａｃｔｅｒＲｅａｄｅｒ）等による入力であっても良い。
【００５０】
ここで、異表記正規化手段１０２について詳細に説明する。図２に示すように、異表記正規化手段１０２は、漢字相当切出手段１０２ａと、正規化処理手段１０２ｂと、少なくとも２種以上の異表記正規化規則１０２ｃを格納する正規化規則格納手段１０２ｅと、異表記正規化規則選択手段１０２ｄとを有している。
【００５１】
漢字相当切出手段１０２ａは、文書入力手段１０１を介して入力されたテキスト（入力テキスト）から漢字表記の文字を一文字ずつ抽出して、正規化処理手段１０２ｂに出力する。
【００５２】
正規化規則格納手段１０２ｅに格納される異表記正規化規則１０２ｃは、概略的には、所定の言語間の文字対応を記述したものである。図３は異表記正規化規則１０２ｃの一例を示す説明図である。図３に示す異表記正規化規則１０２ｃは、中国語繁体字と中国語簡体字との間の変換についての異表記正規化規則である。図３に示すように、異表記正規化規則１０２ｃには、中国語繁体字の表記及び文字コード（ユニコード）と、中国語簡体字の表記及び文字コード（ユニコード）とが対応付けられている。なお、異表記正規化規則１０２ｃは、図４に示すように、複数言語間の異表記正規化規則が記述されていても良く、文字コードもユニコード以外の文字コード、例えばＪＩＳ０２０８やＵＴＦ−８など複数種類の文字コードが記述されていても良い。これにより、所定の言語の漢字表記と他の言語の漢字表記との間における表記のゆれを吸収することが可能になる。また、異表記正規化規則１０２ｃとしては、図５に示すように、言語異体字と繁体字との間の変換についての異表記正規化規則であっても良い。これにより、所定の言語の漢字表記とその異体字との間における表記のゆれを吸収することが可能になる。このような異表記正規化規則１０２ｃは、異表記正規化規則選択手段１０２ｄにより選択される。
【００５３】
異表記正規化規則選択手段１０２ｄは、入力装置１１から入力されて文書入力手段１０１で受け付けた文書（入力テキスト）の言語情報を解析し、解析された言語情報に基づいて異表記正規化規則１０２ｃを正規化規則格納手段１０２ｅから抽出する。文書の言語情報は、入力装置１１から文書とは別に文書の言語情報を明示的に入力された情報もしくは、予め入力文書中に記載される言語情報が使用される。例えば、ＸＭＬ文書では、ｌａｎｇ属性を用いて言語情報が記述されている。このような言語情報の判別方法としては、既存の言語識別方法（例えば、特開平１０−３２０３９９号公報に示されている方法等）を用いることができる。したがって、どの言語で入力するかを意識せずとも、漢字の入力テキストに対して適切な異表記正規化処理を施すことが可能になる。なお、複数の異表記正規化規則１０２ｃが抽出された場合には、表示装置１２に抽出された異表記正規化規則１０２ｃのリストを一覧表示し、このリストの中から適当な異表記正規化規則１０２ｃを入力装置１１で選択するようにしても良い。これにより、漢字の入力テキストに対して最も適切な異表記正規化処理を施すことが可能になる。
【００５４】
正規化処理手段１０２ｂは、漢字相当切出手段１０２ａから出力された一文字の漢字に対し、異表記正規化規則選択手段１０２ｄにより選択された異表記正規化規則１０２ｃに基づいて、正規化表記を生成する。
【００５５】
次に、異表記正規化処理装置１における異表記正規化処理の全体的な流れについて図６を参照して説明する。図６は、異表記正規化規則１０２ｃとして中国語繁体字−中国語簡体字変換（図３参照）を異表記正規化規則選択手段１０２ｄにより選択し、入力テキストとして中国語繁体字で「印刷機械」と入力した場合を例として示している。
【００５６】
図６に示すように、まず、文書入力手段１０１を介して入力されたテキストを文書入力手段１０１で受付け（ステップＳ１のＹ）、漢字相当切出手段１０２ａにおいて入力されたテキスト中に含まれている漢字一文字を検出して切り出す（ステップＳ２）。ここでは、まず最初に、漢字「印」が切り出される。
【００５７】
切り出された漢字「印」は正規化処理手段１０２ｂに入力され、異表記正規化規則１０２ｃに基づいて正規化処理を実施する（ステップＳ３）。正規化処理は、切り出された漢字について異表記正規化規則１０２ｃ（中国語繁体字−中国語簡体字変換）に適合するか否かを確認し、異表記正規化規則１０２ｃ（中国語繁体字−中国語簡体字変換）に適合する場合には異表記正規化規則１０２ｃ（中国語繁体字−中国語簡体字変換）に従って正規化を実施するものである。ここで、切り出された漢字「印」は、図３に示す異表記正規化規則１０２ｃ（中国語繁体字−中国語簡体字変換）に該当する規則が存在しないため、そのままステップＳ４に進む。
【００５８】
入力されたテキスト中に含まれている全ての漢字の切り出しが終わっていないので（ステップＳ４のＮ）、ステップＳ２に戻り、次の文字「刷」を漢字相当切出手段１０２ａにより切り出す。
【００５９】
次に切り出された漢字「刷」も、図３に示す異表記正規化規則１０２ｃ（中国語繁体字−中国語簡体字変換）に該当する規則が存在しないため、正規化を実施せずに（ステップＳ３）、そのままステップＳ４に進む。そして、入力されたテキスト中に含まれている全ての漢字の切り出しが終わっていないので（ステップＳ４のＮ）、ステップＳ２に戻り、次の文字「機」を漢字相当切出手段１０２ａにより切り出す。
【００６０】
次に切り出された漢字「機」は、図３に示す異表記正規化規則１０２ｃ（中国語繁体字−中国語簡体字変換）に該当する規則が存在するため、正規化処理手段１０２ｂにより中国語繁体字「機」を中国語簡体字「机」に正規化し（ステップＳ３）。ステップＳ４に進む。そして、入力されたテキスト中に含まれている全ての漢字の切り出しが終わっていないので（ステップＳ４のＮ）、ステップＳ２に戻り、次の文字「械」を漢字相当切出手段１０２ａにより切り出す。
【００６１】
次に切り出された漢字「械」は、図３に示す異表記正規化規則１０２ｃ（中国語繁体字−中国語簡体字変換）に該当する規則が存在しないため、正規化を実施せずに（ステップＳ３）、そのままステップＳ４に進む。そして、入力されたテキスト中に含まれている全ての漢字の切り出しが終わったので（ステップＳ４のＹ）、ステップＳ５に進む。
【００６２】
したがって、元のテキストの文字列にある中国語繁体字である漢字文字列「印刷機械」が異表記正規化規則１０２ｃ（中国語繁体字−中国語簡体字変換）に従って中国語簡体字「印刷机械」に変換される。このようにして正規化が実施された文字列「印刷机械」は、正規化処理手段１０２ｂの処理結果として表示装置１２に出力される（ステップＳ５）。
【００６３】
このように本実施の形態によれば、入力テキストから漢字表記が抽出され、漢字表記が異なる言語間の文字対応を記述した少なくとも２種以上の異表記正規化規則１０２ｃから入力テキストに適した異表記正規化規則１０２ｃが選択され、この選択された異表記正規化規則１０２ｃに基づいて抽出された漢字表記が正規化される。これにより、漢字表記が異なる各種の言語間における多様な表記のゆれを吸収することが可能になる。
【００６４】
なお、本実施の形態においては、正規化対象の文字列を中国語繁体字と中国語簡体字との変換を例として説明したが、これに限るものではなく、日本語漢字と中国語繁体字の場合、あるいは、ベトナム漢字と中国語簡体字の場合等、漢字使用国間における同様の異表記正規化処理を行う場合に適応することができる。
【００６５】
また、異表記正規化規則選択手段１０２ｄにより異表記正規化規則１０２ｃとして中国語繁体字−中国語簡体字変換（図３参照）と言語異体字−中国繁体字変換（図５参照）との両方を選択し、例えば入力された漢字が言語異体字「国」である場合、言語異体字−中国繁体字変換（図５参照）に従って中国繁体字「國」に正規化し、さらに、中国語繁体字−中国語簡体字変換（図３参照）に従って中国繁体字「國」を中国語簡体字「国」に正規化するようにしても良い。
【００６６】
次に、本発明の第二の実施の形態を図７及び図８に基づいて説明する。なお、本発明の第一の実施の形態において説明した部分と同一部分については同一符号を用い、説明も省略する。本実施の形態は、第一の実施の形態の異表記正規化処理装置１を備えた文書検索装置に関するものである。
【００６７】
図７は本実施の形態の文書検索装置５０の概略構成を示す機能ブロック図、図８は文書検索装置５０における処理の流れを示し、（ａ）は文書に含まれる文字列からインデックス情報を抽出して登録するインデックス情報登録処理の流れを示すフローチャート、（ｂ）は検索語により文書を検索する文書検索処理の流れの概略を示すフローチャートである。
【００６８】
本実施の形態の文書検索装置５０は、第一の実施の形態で説明した異表記正規化処理装置１と同様のパーソナルコンピュータやワークステーションであって、そのハードウェア構成は異表記正規化処理装置１と同一である。すなわち、文書検索装置５０のＨＤＤ６上には、文書検索用のプログラムがロードされている点で、異表記正規化処理装置１とは異なっている。
【００６９】
以下においては、ＣＰＵ２が文書検索用のプログラムに基づいて実行する処理について、図７に示す機能ブロック図及び図８の処理の流れに沿って説明する。
【００７０】
まず、インデックス情報登録処理の流れについて説明する。キーボードやマウスなどの入力装置１１，ＯＣＲ等からなる入力装置５１から入力された文書を文書入力手段５２を介して文書格納手段５３が読み取り、読み取った文書を文書データベース５４に逐次登録していく（ステップＳ１１）。
【００７１】
次いで、インデックス登録手段５５は、登録されている文書のうち、まだインデックス情報を抽出していない文書があれば、文書データベース５４より読み出し（ステップＳ１２）、該文書を異表記正規化処理装置１に引き継いで、該文書から文字列を抽出して抽出文字列毎の異表記正規化処理を行う（ステップＳ１３）。なお、異表記正規化処理については、第一の実施の形態で説明したので、ここでの説明は省略する。
【００７２】
ステップＳ１３において異表記正規化された正規化表記情報は、インデックス登録手段５５に返送されてくるので、文書との対応をつけた形式（インデックス情報）でインデックス情報格納部５６に格納保存される（ステップＳ１４）。
【００７３】
次に、文書検索処理の流れについて説明する。入力装置５１から入力された検索語は、文書入力手段５２を介して検索条件作成手段５７に読み取られ、検索条件作成手段５７は、該検索語を異表記正規化処理装置１に送出して該検索語から文字列を抽出して、異表記正規化処理を行う（ステップＳ２１）。なお、異表記正規化処理については、第一の実施の形態で説明したので、ここでの説明は省略する。
【００７４】
続いて、検索条件作成手段５７は、異表記正規化処理された正規化表記情報に基づいて検索条件を作成し、検索処理手段５８において、インデックス情報格納部５６に格納保存されているインデックス情報との照合を行い、文書検索処理を行う（ステップＳ２２）。
【００７５】
文書検索された結果は、必要であれば、検索された文書が文書データベース５４から読み出されて、表示装置１２あるいはかかる処理結果を出力保存するＨＤＤ６等からなる出力装置５９において表示される（ステップＳ２３：結果出力手段）。
【００７６】
このように本実施の形態によれば、異表記正規化処理が施されたインデックス情報及び検索条件が生成されるので、漢字使用圏の言語（中国語（簡体字や繁体字）、日本語、韓国語、ベトナム語等）で記述された文書についての文書検索処理における検索漏れの発生を防止することが可能になるので、最も適切な文書を検索することが可能になる。
【００７７】
【発明の効果】
請求項１記載の発明の異表記正規化処理装置によれば、入力テキストを受け付ける文書入力手段と、この文書入力手段により受け付けた前記入力テキストから漢字表記を抽出する漢字相当切出手段と、漢字表記が異なる言語間の文字対応を記述した少なくとも２種以上の異表記正規化規則を格納する正規化規則格納手段と、前記文書入力手段により受け付けた前記入力テキストに適した前記異表記正規化規則を選択する異表記正規化規則選択手段と、この異表記正規化規則選択手段により選択した前記異表記正規化規則に応じて前記漢字相当切出手段により抽出した漢字表記を正規化する正規化処理手段と、を備え、入力テキストから漢字表記を抽出し、漢字表記が異なる言語間の文字対応を記述した少なくとも２種以上の異表記正規化規則から入力テキストに適した異表記正規化規則を選択し、この選択された異表記正規化規則に基づいて抽出された漢字表記を正規化することにより、漢字表記が異なる各種の言語間における多様な表記のゆれを吸収することができる。
【００７８】
請求項２記載の発明は、請求項１記載の異表記正規化処理装置において、前記異表記正規化規則選択手段によれば、前記入力テキストの言語情報を解析し、解析された言語情報に基づいて前記異表記正規化規則を選択することにより、どの言語で入力するかを意識せずとも、漢字の入力テキストに対して適切な異表記正規化処理を施すことができる。
【００７９】
請求項３記載の発明によれば、請求項２記載の異表記正規化処理装置において、前記異表記正規化規則選択手段は、解析された言語情報に基づいて複数の前記異表記正規化規則が選択された場合には、これらの異表記正規化規則の中から所望の前記異表記正規化規則の選択を許容することにより、漢字の入力テキストに対して最も適切な異表記正規化処理を施すことができる。
【００８０】
請求項４記載の発明によれば、請求項１ないし３のいずれか一記載の異表記正規化処理装置において、前記異表記正規化規則は、所定の言語の漢字表記とその異体字との間の文字対応が記述されたものであることにより、所定の言語の漢字表記とその異体字との間における表記のゆれを吸収することができる。
【００８１】
請求項５記載の発明によれば、請求項１ないし３のいずれか一記載の異表記正規化処理装置において、前記異表記正規化規則は、所定の言語の漢字表記に対応する他の言語の漢字表記が記述されたものであることにより、所定の言語の漢字表記と他の言語の漢字表記との間における表記のゆれを吸収することができる。
【００８２】
請求項６記載の発明によれば、請求項１ないし３のいずれか一記載の異表記正規化処理装置において、所定の言語の漢字表記とその異体字との間の文字対応が記述された前記異表記正規化規則と、所定の言語の漢字表記に対応する他の言語の漢字表記が記述された前記異表記正規化規則との両方を有していることにより、漢字表記が異なる各種の言語間における多様な表記のゆれを吸収することができる。
【００８３】
請求項７記載の発明の異表記正規化処理プログラムによれば、コンピュータにインストールされ、入力テキストを受け付ける文書入力機能と、この文書入力機能により受け付けた前記入力テキストから漢字表記を抽出する漢字相当切出機能と、漢字表記が異なる言語間の文字対応を記述した少なくとも２種以上の異表記正規化規則を格納する正規化規則格納機能と、前記文書入力機能により受け付けた前記入力テキストに適した前記異表記正規化規則を選択する異表記正規化規則選択機能と、この異表記正規化規則選択機能により選択した前記異表記正規化規則に応じて前記漢字相当切出機能により抽出した漢字表記を正規化する正規化処理機能と、をコンピュータに実行させることで、入力テキストから漢字表記を抽出し、漢字表記が異なる言語間の文字対応を記述した少なくとも２種以上の異表記正規化規則から入力テキストに適した異表記正規化規則を選択し、この選択された異表記正規化規則に基づいて抽出された漢字表記を正規化することにより、漢字表記が異なる各種の言語間における多様な表記のゆれを吸収することができる。
【００８４】
請求項８記載の発明によれば、請求項７記載の異表記正規化処理プログラムにおいて、前記異表記正規化規則選択機能は、前記入力テキストの言語情報を解析し、解析された言語情報に基づいて前記異表記正規化規則を選択することにより、どの言語で入力するかを意識せずとも、漢字の入力テキストに対して適切な異表記正規化処理を施すことができる。
【００８５】
請求項９記載の発明によれば、請求項８記載の異表記正規化処理プログラムにおいて、前記異表記正規化規則選択機能は、解析された言語情報に基づいて複数の前記異表記正規化規則が選択された場合には、これらの異表記正規化規則の中から所望の前記異表記正規化規則の選択を許容することにより、漢字の入力テキストに対して最も適切な異表記正規化処理を施すことができる。
【００８６】
請求項１０記載の発明によれば、請求項７ないし９のいずれか一記載の異表記正規化処理プログラムにおいて、前記異表記正規化規則は、所定の言語の漢字表記とその異体字との間の文字対応が記述されたものであることにより、所定の言語の漢字表記とその異体字との間における表記のゆれを吸収することができる。
【００８７】
請求項１１記載の発明によれば、請求項７ないし９のいずれか一記載の異表記正規化処理プログラムにおいて、前記異表記正規化規則は、所定の言語の漢字表記に対応する他の言語の漢字表記が記述されたものであることにより、所定の言語の漢字表記と他の言語の漢字表記との間における表記のゆれを吸収することができる。
【００８８】
請求項１２記載の発明によれば、請求項７ないし９のいずれか一記載の異表記正規化処理プログラムにおいて、所定の言語の漢字表記とその異体字との間の文字対応が記述された前記異表記正規化規則と、所定の言語の漢字表記に対応する他の言語の漢字表記が記述された前記異表記正規化規則との両方を有していることにより、漢字表記が異なる各種の言語間における多様な表記のゆれを吸収することができる。
【００８９】
請求項１３記載の発明の記憶媒体によれば、請求項７ないし１２のいずれか一記載の異表記正規化処理プログラムを記憶することにより、この記憶媒体に記憶された異表記正規化処理プログラムをコンピュータに読み取らせることで、請求項７ないし１２のいずれか一記載の発明と同様の作用効果を得ることができる。
【００９０】
請求項１４記載の発明の文書検索装置によれば、異表記正規化処理を行う請求項１ないし６のいずれか一記載の異表記正規化処理装置と、入力テキストを受け付ける文書入力手段と、この文書入力手段により受け付けた前記入力テキストが文書の場合、文書データベースに格納する文書格納手段と、この文書格納手段により前記文書データベースに格納された前記文書から文字列を抽出して抽出文字列毎の異表記正規化処理を前記異表記正規化処理装置で行った後、インデックス情報を抽出して保存するインデックス登録手段と、前記文書入力手段により受け付けた前記入力テキストが検索語の場合、当該検索語を前記異表記正規化処理装置に送出して当該検索語から文字列を抽出して異表記正規化処理を行った後、検索条件を作成する検索条件作成手段と、この検索条件作成手段により作成された前記検索条件に基づいて前記インデックス情報を検索する検索処理手段と、この検索処理手段による検索の結果を出力する結果出力手段と、を備え、異表記正規化処理が施されたインデックス情報及び検索条件を生成することにより、漢字使用圏の言語（中国語（簡体字や繁体字）、日本語、韓国語、ベトナム語等）で記述された文書についての文書検索処理における検索漏れの発生を防止することができるので、最も適切な文書を検索することができる。
【００９１】
請求項１５記載の発明の文書検索プログラムによれば、コンピュータにインストールされ、請求項１ないし６のいずれか一記載の異表記正規化処理装置における異表記正規化処理を実現する異表記正規化処理機能と、入力テキストを受け付ける文書入力機能と、この文書入力機能により受け付けた前記入力テキストが文書の場合、文書データベースに格納する文書格納機能と、この文書格納機能により前記文書データベースに格納された前記文書から文字列を抽出して抽出文字列毎の異表記正規化処理を前記異表記正規化処理機能で行った後、インデックス情報を抽出して保存するインデックス登録機能と、前記文書入力機能により受け付けた前記入力テキストが検索語の場合、当該検索語を前記異表記正規化処理機能に送出して当該検索語から文字列を抽出して異表記正規化処理を行った後、検索条件を作成する検索条件作成機能と、この検索条件作成機能により作成された前記検索条件に基づいて前記インデックス情報を検索する検索処理機能と、この検索処理機能による検索の結果を出力する結果出力機能と、をコンピュータに実行させ、異表記正規化処理が施されたインデックス情報及び検索条件を生成することにより、漢字使用圏の言語（中国語（簡体字や繁体字）、日本語、韓国語、ベトナム語等）で記述された文書についての文書検索処理における検索漏れの発生を防止することができるので、最も適切な文書を検索することができる。
【００９２】
請求項１６記載の発明の記憶媒体によれば、請求項１５記載のプログラムを記憶することにより、この記憶媒体に記憶された文書検索プログラムをコンピュータに読み取らせることで、請求項１５記載の発明と同様の作用効果を得ることができる。
【図面の簡単な説明】
【図１】本発明の第一の実施の形態の異表記正規化処理装置のハードウェア構成を概略的に示すブロック図である。
【図２】異表記正規化処理装置を示す機能ブロック図である。
【図３】異表記正規化規則の一例を示す説明図である。
【図４】異表記正規化規則の一例を示す説明図である。
【図５】異表記正規化規則の一例を示す説明図である。
【図６】異表記正規化処理装置における異表記正規化処理の全体的な流れを示すフローチャートである。
【図７】本発明の第二の実施の形態の文書検索装置の概略構成を示す機能ブロック図である。
【図８】文書検索装置における処理の流れを示し、（ａ）は文書に含まれる文字列からインデックス情報を抽出して登録するインデックス情報登録処理の流れを示すフローチャート、（ｂ）は検索語により文書を検索する文書検索処理の流れの概略を示すフローチャートである。
【符号の説明】
１異表記正規化処理装置
７記憶媒体
５０文書検索装置
５２文書入力手段
５３文書格納手段
５４文書データベース
５５インデックス登録手段
５６インデックス情報格納部
５７検索条件作成手段
５８検索処理手段
１０１文書入力手段
１０２ａ漢字相当切出手段
１０２ｂ正規化処理手段
１０２ｃ異表記正規化規則
１０２ｄ異表記正規化規則選択手段
１０２ｅ正規化規則格納手段
[0001]
[Technical field to which the invention belongs]
The present invention relates to a different notation normalization processing apparatus, a different notation normalization processing program, a storage medium storing the same, a document search apparatus, a document search program, and a storage medium storing the same.
[0002]
[Prior art]
The Chinese characters generally used in China are called simplified characters, and those generally used in Taiwan are called traditional characters; different characters may be used to express the same meaning. For example, what is written as “印刷機械” (printing machinery) in traditional characters is written as “印刷机械” in simplified characters.
[0003]
Therefore, apparatuses that convert between documents written in simplified characters and documents written in traditional characters have been researched and developed (for example, see Patent Document 1).
[0004]
[Patent Document 1]
JP-A-8-263478
[0005]
[Problems to be solved by the invention]
Languages that use kanji include, in addition to Chinese written in the simplified and traditional characters described above, Japanese, Korean, Vietnamese, and others. In these languages, kanji are pronounced differently but are sometimes used with the same meaning, so simple sentences can be understood across languages. Even within a single language, there are old and new character forms and variant characters, so different kanji may be used to express the same meaning.
[0006]
When document search processing is performed on documents in which different kanji are thus used to express the same meaning, search omissions occur because the kanji notations differ.
[0007]
For this reason, it has become desirable in recent years to absorb the diverse notation fluctuations not only between simplified and traditional characters but also among various languages.
[0008]
An object of the present invention is to provide a different notation normalization processing apparatus, a different notation normalization processing program, and a storage medium storing the same, which can absorb the diverse notation fluctuations among various languages having different kanji notations.
[0009]
Another object of the present invention is to provide a document search apparatus, a document search program, and a storage medium storing the same, which can prevent search omissions in document search processing for documents written in the languages of the kanji-using region (Chinese (simplified or traditional), Japanese, Korean, Vietnamese, etc.) and can retrieve the most appropriate documents.
[0010]
[Means for Solving the Problems]
The different notation normalization processing apparatus of the invention according to claim 1 comprises: document input means for receiving input text; kanji equivalent cutout means for extracting kanji notation from the input text received by the document input means; normalization rule storage means for storing at least two different notation normalization rules that describe character correspondence between languages having different kanji notations; different notation normalization rule selection means for selecting the different notation normalization rule suited to the input text received by the document input means; and normalization processing means for normalizing the kanji notation extracted by the kanji equivalent cutout means in accordance with the different notation normalization rule selected by the different notation normalization rule selection means.
[0011]
Therefore, the kanji notation is extracted from the input text, a different notation normalization rule suited to the input text is selected from at least two different notation normalization rules describing character correspondence between languages having different kanji notations, and the extracted kanji notation is normalized based on the selected rule. This makes it possible to absorb the diverse notation fluctuations among various languages having different kanji notations.
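The arrangement of claim 1 can be illustrated with a minimal Python sketch. This is not the patent's implementation: the rule names, the dictionary layout, and the use of the CJK Unified Ideographs block as a stand-in for "kanji notation" are all assumptions made for illustration. Only the 機 to 机 pair and the “印刷機械” input come from the description's own example.

```python
import re

# Hypothetical rule store holding two rule tables (claim 1 requires at least two).
# Each table maps a source character to its normalized form.
RULES = {
    "zh-Hant->zh-Hans": {"機": "机", "國": "国"},
    "variant->zh-Hant": {"国": "國"},
}

# Assumption: "kanji notation" is approximated by the CJK Unified Ideographs block.
KANJI = re.compile(r"[\u4e00-\u9fff]")


def select_rule(rule_name):
    """Rule selection step: return the rule table chosen for this input."""
    return RULES[rule_name]


def normalize(text, rule_name):
    """Cut out each kanji and normalize it with the selected rule."""
    rule = select_rule(rule_name)
    # Characters with no matching rule entry pass through unchanged.
    return KANJI.sub(lambda m: rule.get(m.group(0), m.group(0)), text)


print(normalize("印刷機械", "zh-Hant->zh-Hans"))
```

Non-kanji characters are never looked up, which mirrors the role of the cutout step: only characters matched by the kanji pattern reach the rule table.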
[0012]
The invention according to claim 2 is the different notation normalization processing apparatus according to claim 1, wherein the different notation normalization rule selection means analyzes language information of the input text and selects the different notation normalization rule based on the analyzed language information.
[0013]
Therefore, it is possible to perform an appropriate different notation normalization process for the input text in kanji without being conscious of which language is used for input.
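As a hedged illustration of language-based rule selection: the embodiment mentions language information that is either supplied explicitly alongside the document or recorded in the document itself, for example the lang attribute of an XML document. The sketch below reads `xml:lang` with Python's standard `xml.etree.ElementTree`; the `RULES_FOR_LANG` table and the rule names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Key under which ElementTree exposes the reserved xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical mapping from a language tag to the names of applicable rules.
RULES_FOR_LANG = {
    "zh-Hant": ["zh-Hant->zh-Hans"],
    "ja": ["variant->zh-Hant"],
}


def detect_language(xml_text, explicit_lang=None):
    """Prefer explicitly supplied language information; otherwise read xml:lang."""
    if explicit_lang is not None:
        return explicit_lang
    root = ET.fromstring(xml_text)
    return root.get(XML_LANG)  # None when the root element carries no xml:lang


def select_rules(xml_text, explicit_lang=None):
    """Return the names of the rules registered for the detected language."""
    lang = detect_language(xml_text, explicit_lang)
    return RULES_FOR_LANG.get(lang, [])


print(select_rules('<doc xml:lang="zh-Hant">印刷機械</doc>'))
```

A document with no usable language information yields an empty list here; a real system would presumably fall back to a statistical language identifier, which this sketch does not attempt.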
[0014]
The invention according to claim 3 is the different notation normalization processing apparatus according to claim 2, wherein, when a plurality of the different notation normalization rules are selected based on the analyzed language information, the different notation normalization rule selection means allows a desired different notation normalization rule to be chosen from among them.
[0015]
Therefore, it is possible to perform the most appropriate different notation normalization process on the input text of kanji.
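The choice among several candidate rules might be sketched as follows. In the embodiment the candidates are listed on the display device and one is picked with the input device; here a callback named `chooser` (a hypothetical stand-in for that interaction) plays the operator's role.

```python
def choose_rule(candidates, chooser):
    """Return one rule name, consulting `chooser` only when there is a real choice."""
    if not candidates:
        raise ValueError("no normalization rule matches the input text")
    if len(candidates) == 1:
        return candidates[0]
    choice = chooser(candidates)
    if choice not in candidates:
        raise ValueError(f"unknown rule: {choice!r}")
    return choice


def pick_first(names):
    """Stand-in for an operator selecting the first entry of the displayed list."""
    return names[0]


print(choose_rule(["zh-Hant->zh-Hans", "zh-Hant->ja"], pick_first))
```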
[0016]
The invention according to claim 4 is the different notation normalization processing apparatus according to any one of claims 1 to 3, wherein the different notation normalization rule describes the character correspondence between the kanji notation of a predetermined language and its variant characters.
[0017]
Therefore, it becomes possible to absorb the fluctuation of the notation between the kanji notation of a predetermined language and its variant characters.
[0018]
The invention according to claim 5 is the different notation normalization processing apparatus according to any one of claims 1 to 3, wherein the different notation normalization rule describes the kanji notation of another language that corresponds to the kanji notation of a predetermined language.
[0019]
Therefore, it becomes possible to absorb the fluctuation of the notation between the kanji notation of a predetermined language and the kanji notation of another language.
[0020]
The invention according to claim 6 is the different notation normalization processing apparatus according to any one of claims 1 to 3, wherein the apparatus has both a different notation normalization rule describing the character correspondence between the kanji notation of a predetermined language and its variant characters, and a different notation normalization rule describing the kanji notation of another language that corresponds to the kanji notation of a predetermined language.
[0021]
Therefore, it becomes possible to absorb the fluctuation of various notations between various languages having different kanji notations.
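Holding both kinds of rule allows them to be applied in sequence. The embodiment gives the example of the variant character 「国」 being normalized to traditional 「國」 and then to simplified 「国」. The following sketch assumes each rule is a plain character-to-character dictionary; the table names and the `apply_rules` helper are hypothetical.

```python
# Hypothetical rule tables; the 国/國 and 機/机 pairs follow the description's examples.
VARIANT_TO_TRADITIONAL = {"国": "國"}
TRADITIONAL_TO_SIMPLIFIED = {"國": "国", "機": "机"}


def apply_rule(text, rule):
    """Replace each character that has an entry in `rule`; keep the others."""
    return "".join(rule.get(ch, ch) for ch in text)


def apply_rules(text, rules):
    """Apply several rules in order, feeding each result into the next rule."""
    for rule in rules:
        text = apply_rule(text, rule)
    return text


chain = [VARIANT_TO_TRADITIONAL, TRADITIONAL_TO_SIMPLIFIED]
print(apply_rule("国", VARIANT_TO_TRADITIONAL))  # intermediate traditional form
print(apply_rules("国", chain))                   # final normalized form
```

Under this chain both 「国」 and 「國」 end in the same normalized form, which is the property a search index needs.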
[0022]
The different notation normalization processing program of the invention according to claim 7 is installed in a computer and causes the computer to execute: a document input function of receiving input text; a kanji equivalent cutout function of extracting kanji notation from the input text received by the document input function; a normalization rule storage function of storing at least two different notation normalization rules that describe character correspondence between languages having different kanji notations; a different notation normalization rule selection function of selecting the different notation normalization rule suited to the input text received by the document input function; and a normalization processing function of normalizing the kanji notation extracted by the kanji equivalent cutout function in accordance with the different notation normalization rule selected by the different notation normalization rule selection function.
[0023]
Therefore, the kanji notation is extracted from the input text, a different notation normalization rule suited to the input text is selected from at least two different notation normalization rules describing character correspondence between languages having different kanji notations, and the extracted kanji notation is normalized based on the selected rule. This makes it possible to absorb the diverse notation fluctuations among various languages having different kanji notations.
[0024]
The invention according to claim 8 is the different notation normalization processing program according to claim 7, wherein the different notation normalization rule selection function analyzes language information of the input text and selects the different notation normalization rule based on the analyzed language information.
[0025]
Therefore, it is possible to perform an appropriate different notation normalization process for the input text in kanji without being conscious of which language is used for input.
[0026]
The invention according to claim 9 is the different notation normalization processing program according to claim 8, wherein, when a plurality of the different notation normalization rules are selected based on the analyzed language information, the different notation normalization rule selection function allows a desired different notation normalization rule to be chosen from among them.
[0027]
Therefore, it is possible to perform the most appropriate different notation normalization process on the input text of kanji.
[0028]
The invention according to claim 10 is the different notation normalization processing program according to any one of claims 7 to 9, wherein the different notation normalization rule describes the character correspondence between the kanji notation of a predetermined language and its variant characters.
[0029]
Therefore, it becomes possible to absorb the fluctuation of the notation between the kanji notation of a predetermined language and its variant characters.
[0030]
The invention according to claim 11 is the different notation normalization processing program according to any one of claims 7 to 9, wherein the different notation normalization rule describes the kanji notation of another language that corresponds to the kanji notation of a predetermined language.
[0031]
Therefore, it becomes possible to absorb the fluctuation of the notation between the kanji notation of a predetermined language and the kanji notation of another language.
[0032]
The invention according to claim 12 is the different notation normalization processing program according to any one of claims 7 to 9, wherein the program has both a different notation normalization rule describing the character correspondence between the kanji notation of a predetermined language and its variant characters, and a different notation normalization rule describing the kanji notation of another language that corresponds to the kanji notation of a predetermined language.
[0033]
Therefore, it becomes possible to absorb the fluctuation of various notations between various languages having different kanji notations.
[0034]
The storage medium of the invention according to claim 13 stores the different notation normalization processing program according to any one of claims 7 to 12.
[0035]
Therefore, by causing the computer to read the different notation normalization processing program stored in the storage medium, it is possible to obtain the same operation as that of the invention according to any one of claims 7 to 12.
[0036]
The document search apparatus of the invention according to claim 14 comprises: the different notation normalization processing apparatus according to any one of claims 1 to 6, which performs different notation normalization processing; document input means for receiving input text; document storage means for storing the input text received by the document input means in a document database when that input text is a document; index registration means for extracting character strings from the document stored in the document database by the document storage means, having the different notation normalization processing apparatus perform different notation normalization processing on each extracted character string, and then extracting index information and saving it in an index information storage unit; search condition creation means for, when the input text received by the document input means is a search term, sending the search term to the different notation normalization processing apparatus so that a character string is extracted from the search term and subjected to different notation normalization processing, and then creating a search condition; search processing means for searching the index information in the index information storage unit based on the search condition created by the search condition creation means; and result output means for outputting the result of the search by the search processing means.
[0037]
Therefore, since index information and search conditions that have undergone different notation normalization processing are generated, search omissions can be prevented in document search processing for documents written in the languages of the kanji-using region (Chinese (simplified or traditional), Japanese, Korean, Vietnamese, etc.), and the most appropriate documents can be retrieved.
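The effect on retrieval can be sketched by normalizing on both the indexing side and the query side. Everything below is an illustrative assumption rather than the claimed design: the `DocumentSearch` class, the single traditional-to-simplified rule, and the use of overlapping character bigrams as index terms (the description does not specify how index information is formed).

```python
from collections import defaultdict

RULE = {"機": "机", "國": "国"}  # hypothetical traditional -> simplified rule


def normalize(text):
    """Apply the rule character by character; unmatched characters are kept."""
    return "".join(RULE.get(ch, ch) for ch in text)


class DocumentSearch:
    def __init__(self):
        self.documents = {}            # document database: id -> original text
        self.index = defaultdict(set)  # index information: bigram -> document ids

    @staticmethod
    def _terms(text):
        # Assumption: overlapping character bigrams of the normalized text are the terms.
        norm = normalize(text)
        return {norm[i:i + 2] for i in range(len(norm) - 1)}

    def register(self, doc_id, text):
        """Store the document, then index its normalized terms."""
        self.documents[doc_id] = text
        for term in self._terms(text):
            self.index[term].add(doc_id)

    def search(self, query):
        """Normalize the query the same way and return ids matching every term."""
        terms = self._terms(query)
        if not terms:
            return []  # queries shorter than two characters are not handled here
        hits = set.intersection(*(self.index.get(t, set()) for t in terms))
        return sorted(hits)


engine = DocumentSearch()
engine.register(1, "印刷機械の保守")
engine.register(2, "印刷机械维修")
print(engine.search("印刷機械"))
```

Because the stored documents keep their original text and only the index terms are normalized, a traditional-character query and a simplified-character query reach the same documents without altering what is returned to the user.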
[0038]
A document search program according to a fifteenth aspect of the invention is installed in a computer and causes the computer to execute: a different notation normalization processing function for realizing the different notation normalization processing of the different notation normalization processing device according to any one of claims 1 to 6; a document input function that accepts input text; a document storage function that stores the input text in a document database when the input text received by the document input function is a document; an index registration function that extracts character strings from the document stored in the document database by the document storage function, has the different notation normalization processing function perform the different notation normalization processing on each extracted character string, and then extracts index information and stores it in an index information storage unit; a search condition creating function that, when the input text received by the document input function is a search term, passes the search term to the different notation normalization processing function, has a character string extracted from the search term and subjected to the different notation normalization processing, and then creates a search condition; a search processing function that searches the index information in the index information storage unit based on the search condition created by the search condition creating function; and a result output function that outputs a search result of the search processing function.
[0039]
Therefore, because index information and search conditions that have undergone the different notation normalization processing are generated, search omissions can be prevented in document search processing for documents written in the languages of the Chinese character use area (Chinese (simplified and traditional), Japanese, Korean, Vietnamese, etc.), so that the most appropriate documents can be retrieved.
[0040]
A storage medium according to a sixteenth aspect stores the program according to the fifteenth aspect.
[0041]
Therefore, by causing the computer to read the document search program stored in the storage medium, it is possible to obtain the same operation as that of the invention of the fifteenth aspect.
[0042]
DETAILED DESCRIPTION OF THE INVENTION
A first embodiment of the present invention will be described with reference to FIGS.
[0043]
FIG. 1 is a block diagram schematically showing a hardware configuration of a different notation normalization processing apparatus 1 to which the present invention is applied. As shown in FIG. 1, the different notation normalization processing apparatus 1 is, for example, a personal computer or a workstation, and includes a CPU (Central Processing Unit) 2 that is the main part of the computer and centrally controls each part. The CPU 2 is connected by a bus 5 to a ROM (Read Only Memory) 3, which is a read-only memory storing the BIOS and the like, and a RAM (Random Access Memory) 4, which stores various data in a rewritable manner.
[0044]
Further, the following are connected to the bus 5 via an I/O (not shown): an HDD (Hard Disk Drive) 6 that stores various programs and the like; a CD-ROM drive 8 that reads a CD (Compact Disc)-ROM 7, serving as a mechanism for reading computer software distributed as a program; a communication control device 10 that controls communication between the different notation normalization processing device 1 and the network 9; an input device 11 such as a keyboard and a mouse; and a display device 12 such as a CRT (Cathode Ray Tube) or an LCD (Liquid Crystal Display).
[0045]
Since the RAM 4 has a property of storing various data in a rewritable manner, the RAM 4 functions as a work area for the CPU 2 and plays a role of, for example, a document buffer described later.
[0046]
A CD-ROM 7 shown in FIG. 1 implements the storage medium of the present invention, and stores an OS (Operating System) and various programs. The CPU 2 reads the program stored in the CD-ROM 7 with the CD-ROM drive 8 and installs it in the HDD 6.
[0047]
The storage medium is not limited to the CD-ROM 7; various types of media can be used, including various optical disks such as DVDs, various magneto-optical disks, various magnetic disks such as flexible disks, and semiconductor memories. Alternatively, the program may be downloaded from the network 9, such as the Internet, via the communication control device 10 and installed in the HDD 6. In this case, the storage device that stores the program in the transmitting server is also a storage medium of the present invention. Note that the program may operate on a predetermined OS (Operating System), in which case the OS may take over execution of some of the various processes described below, or the program may be included as part of a group of program files constituting predetermined application software or the OS.
[0048]
The CPU 2 that controls the operation of the entire system executes various processes based on a program loaded on the HDD 6 used as the main storage of the system.
[0049]
Next, the contents of the various processes that the CPU 2 of the different notation normalization processing apparatus 1 executes based on the program will be described. FIG. 2 is a functional block diagram showing the different notation normalization processing apparatus 1. As shown in FIG. 2, the different notation normalization processing apparatus 1 has the following functions, each realized by the CPU 2 operating according to computer software: document input means 101 that accepts text (input text) such as document data or a search term input from the input device 11; different notation normalizing means 102 that performs normalization processing on the input text; and output means 103 that outputs processing results, such as information on the normalized notation output from the different notation normalizing means 102, to the display device 12. When the apparatus is used for document search or the like, the processing result of the different notation normalizing means 102 may also be passed directly to the indexing process or the search process without going through the output means 103. Further, text such as document data and search terms need not be input from the input device 11; it may also be input by OCR (Optical Character Reader) or the like.
[0050]
Here, the different notation normalization means 102 will be described in detail. As shown in FIG. 2, the different notation normalizing means 102 includes a kanji equivalent cutout means 102a, a normalization processing means 102b, and a normalization rule storage means 102e for storing at least two kinds of different notation normalization rules 102c. And a different notation normalization rule selection means 102d.
[0051]
The kanji equivalent cutout means 102a extracts kanji characters one by one from the text (input text) input via the document input means 101 and outputs them to the normalization processing means 102b.
[0052]
The different notation normalization rule 102c stored in the normalization rule storage means 102e generally describes the character correspondence between predetermined languages. FIG. 3 is an explanatory diagram showing an example of the different notation normalization rule 102c. The rule shown in FIG. 3 is a different notation normalization rule for conversion between traditional Chinese characters and simplified Chinese characters. As shown in FIG. 3, the rule associates the notation and character code (Unicode) of each traditional Chinese character with the notation and character code (Unicode) of the corresponding simplified Chinese character. As shown in FIG. 4, the different notation normalization rule 102c may describe correspondences among a plurality of languages. The character code may be one other than Unicode, such as JIS0208 or UTF-8, and multiple types of character codes may be described. This makes it possible to absorb notation fluctuations between the kanji notation of a predetermined language and the kanji notation of another language. Further, as shown in FIG. 5, the different notation normalization rule 102c may be a rule for conversion between the variant characters of a language and traditional characters. This makes it possible to absorb notation fluctuations between the kanji notation of a predetermined language and its variant characters. The different notation normalization rule 102c to be used is selected by the different notation normalization rule selection means 102d.
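As an illustration of the kind of table FIG. 3 is described as showing, the following is a minimal sketch of one possible in-memory form of a rule, assuming Python. The names `RuleEntry`, `TRAD_TO_SIMP` and `as_lookup`, and the two sample character pairs, are assumptions introduced here and are not taken from the figures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RuleEntry:
    """One row of a rule: a source character and its normalized form."""
    source: str  # e.g. a traditional Chinese character
    target: str  # e.g. the corresponding simplified Chinese character

    @property
    def source_code(self) -> str:
        # Unicode code point of the source character, written as U+XXXX.
        return f"U+{ord(self.source):04X}"

    @property
    def target_code(self) -> str:
        # Unicode code point of the normalized character.
        return f"U+{ord(self.target):04X}"


# Hypothetical traditional-to-simplified rule with two illustrative rows.
TRAD_TO_SIMP = [
    RuleEntry("機", "机"),
    RuleEntry("國", "国"),
]


def as_lookup(entries):
    """Turn the rows into a dict for per-character lookup."""
    return {entry.source: entry.target for entry in entries}


if __name__ == "__main__":
    for entry in TRAD_TO_SIMP:
        print(entry.source, entry.source_code, "->", entry.target, entry.target_code)
```

Keeping the code point next to each character mirrors the description of the rule as holding both the notation and the character code; a further column per encoding could be added if several character codes are to be recorded.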
[0053]
The different notation normalization rule selection means 102d analyzes the language information of the document (input text) input from the input device 11 and received by the document input means 101, and extracts the different notation normalization rule 102c from the normalization rule storage means 102e based on the analyzed language information. The language information used is either information explicitly input from the input device 11 separately from the document, or language information described in advance in the input document. For example, in an XML document, language information is described using a lang attribute. When the language information must be discriminated, an existing language identification method (for example, the method disclosed in JP-A-10-320399) can be used. It is therefore possible to perform appropriate different notation normalization processing on kanji input text without the user being conscious of which language the text was input in. When a plurality of different notation normalization rules 102c are extracted, a list of the extracted rules may be displayed on the display device 12 so that an appropriate different notation normalization rule 102c can be selected from the list using the input device 11. As a result, the most appropriate different notation normalization processing can be performed on the kanji input text.
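The selection step can be sketched as follows, assuming Python and an XML input whose root element carries a lang attribute. The rule names, the language tags, and the helper names (`language_of`, `candidate_rules`, `select_rule`) are hypothetical; automatic language identification is outside the scope of this sketch.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the predefined xml: prefix under this namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical rule store: language tag -> names of applicable rules.
RULES_BY_LANGUAGE = {
    "zh-Hant": ["traditional_to_simplified", "variant_to_traditional"],
    "zh-Hans": ["variant_to_traditional"],
}


def language_of(xml_text, explicit_lang=None):
    """Prefer language information supplied separately from the document;
    otherwise read the lang attribute of the document's root element."""
    if explicit_lang:
        return explicit_lang
    root = ET.fromstring(xml_text)
    return root.get(XML_LANG) or root.get("lang")


def candidate_rules(lang):
    """Return every rule registered for the language (possibly none)."""
    return RULES_BY_LANGUAGE.get(lang, [])


def select_rule(candidates, choice=0):
    """Stand-in for the user picking one rule from the displayed list."""
    if not candidates:
        return None
    return candidates[choice]


if __name__ == "__main__":
    doc = '<doc xml:lang="zh-Hant">text</doc>'
    rules = candidate_rules(language_of(doc))
    print(rules, "->", select_rule(rules, choice=0))
```

Here `choice` plays the role of the selection made with the input device when more than one rule is extracted.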
[0054]
The normalization processing means 102b generates a normalized notation for each single kanji character output from the kanji equivalent cutout means 102a, based on the different notation normalization rule 102c selected by the different notation normalization rule selection means 102d.
[0055]
Next, the overall flow of the different notation normalization processing in the different notation normalization processing apparatus 1 will be described with reference to FIG. 6. FIG. 6 is described using an example in which the different notation normalization rule selection means 102d has selected the Traditional Chinese-Simplified Chinese conversion (see FIG. 3) as the different notation normalization rule 102c, and the traditional Chinese character string “printing machine” is given as the input text.
[0056]
As shown in FIG. 6, when input text is received by the document input means 101 (Y in step S1), the kanji equivalent cutout means 102a first detects and cuts out one kanji character included in the input text (step S2). Here, the kanji character “mark” is cut out first.
[0057]
The extracted kanji character “mark” is input to the normalization processing means 102b, and normalization processing is performed based on the different notation normalization rule 102c (step S3). The normalization processing checks whether the extracted kanji character matches an entry in the different notation normalization rule 102c (Traditional Chinese-Simplified Chinese conversion) and, if it does, normalizes the character according to that rule. Here, the different notation normalization rule 102c (Traditional Chinese-Simplified Chinese conversion) shown in FIG. 3 contains no entry for the extracted kanji character “mark”, so the process proceeds directly to step S4.
[0058]
Since not all the kanji characters included in the input text have yet been cut out (N in step S4), the process returns to step S2, and the next character “print” is cut out by the kanji equivalent cutout means 102a.
[0059]
The extracted kanji character “print” likewise has no corresponding entry in the different notation normalization rule 102c (Traditional Chinese-Simplified Chinese conversion) shown in FIG. 3, so it is not normalized (step S3) and the process proceeds directly to step S4. Since not all the kanji characters included in the input text have yet been cut out (N in step S4), the process returns to step S2, and the next character “machine” is cut out by the kanji equivalent cutout means 102a.
[0060]
The next extracted kanji character “machine” does have a corresponding entry in the different notation normalization rule 102c (Traditional Chinese-Simplified Chinese conversion) shown in FIG. 3, so the traditional Chinese character “machine” is normalized to the simplified Chinese character “desk” (step S3), and the process proceeds to step S4. Since not all the kanji characters included in the input text have yet been cut out (N in step S4), the process returns to step S2, and the next character “machine” is cut out by the kanji equivalent cutout means 102a.
[0061]
This last extracted kanji character “machine” has no corresponding entry in the different notation normalization rule 102c (Traditional Chinese-Simplified Chinese conversion) shown in FIG. 3, so it is not normalized (step S3) and the process proceeds directly to step S4. Since all the kanji characters included in the input text have now been cut out (Y in step S4), the process proceeds to step S5.
[0062]
As a result, the traditional Chinese character string “printing machine” in the original text is converted to the simplified Chinese character string “printing machine” in accordance with the different notation normalization rule 102c (Traditional Chinese-Simplified Chinese conversion). The character string “printing machine” normalized in this way is output to the display device 12 as the processing result of the normalization processing means 102b (step S5).
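The loop of steps S2 to S5 can be sketched as follows, assuming Python. The sample string 印刷機械 and the single rule entry 機 → 机 are assumptions chosen to match the glosses in the walkthrough above (the translated text does not give the characters themselves), and `is_kanji` is a simplification that checks only the basic CJK Unified Ideographs block.

```python
def is_kanji(ch):
    """Simplified kanji test: basic CJK Unified Ideographs block only."""
    return "\u4e00" <= ch <= "\u9fff"


def normalize(text, rule):
    """Normalize text one character at a time using a dict-based rule."""
    out = []
    for ch in text:                        # S2: cut out one character
        if is_kanji(ch) and ch in rule:    # S3: does the rule have an entry?
            out.append(rule[ch])           #     yes: emit the normalized form
        else:
            out.append(ch)                 #     no: pass the character through
    # S4 is the loop condition: stop once every character has been cut out.
    return "".join(out)                    # S5: output the normalized string


if __name__ == "__main__":
    rule = {"機": "机"}  # assumed traditional-to-simplified entry
    print(normalize("印刷機械", rule))
```

Non-kanji characters are passed through unchanged in this sketch so that the output keeps the same length and order as the input.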
[0063]
As described above, according to the present embodiment, kanji notation is extracted from the input text, a different notation normalization rule 102c suitable for the input text is selected from among the different notation normalization rules 102c describing character correspondence between languages having different kanji notations, and the extracted kanji notation is normalized based on the selected different notation normalization rule 102c. As a result, it is possible to absorb various notation fluctuations between various languages having different kanji notations.
[0064]
In this embodiment, conversion between traditional Chinese characters and simplified Chinese characters has been described as an example of the character strings to be normalized. However, the present invention is not limited to this. The same notation normalization processing can be applied between any countries or regions where kanji are used, for example between Vietnamese kanji and simplified Chinese characters.
[0065]
Further, the different notation normalization rule selection means 102d may select both the Traditional Chinese-Simplified Chinese conversion (see FIG. 3) and the variant character-Traditional Chinese conversion (see FIG. 5) as the different notation normalization rules 102c. In that case, for example, if the input kanji character is the variant character “country”, it may first be normalized to the traditional Chinese character “country” according to the variant character-Traditional Chinese conversion (see FIG. 5), and the traditional Chinese character “country” may then be normalized to the simplified Chinese character “country” according to the Traditional Chinese-Simplified Chinese conversion (see FIG. 3).
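Applying two rules in sequence can be sketched as below, assuming Python and dict-based rules. The characters 圀 (a variant), 國 (traditional) and 国 (simplified) are assumptions used to illustrate the “country” example; the translated text does not give the actual characters.

```python
def apply_rules(text, rules):
    """Apply each rule in order; the output of one rule feeds the next."""
    for rule in rules:
        # Characters without an entry in the current rule pass through.
        text = "".join(rule.get(ch, ch) for ch in text)
    return text


# Hypothetical rule contents, one entry each.
VARIANT_TO_TRAD = {"圀": "國"}   # variant character -> traditional Chinese
TRAD_TO_SIMP = {"國": "国"}      # traditional Chinese -> simplified Chinese

if __name__ == "__main__":
    print(apply_rules("圀", [VARIANT_TO_TRAD, TRAD_TO_SIMP]))
```

The order of the list matters: with the two rules reversed, the variant would be mapped only as far as the traditional form.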
[0066]
Next, a second embodiment of the present invention will be described with reference to FIGS. The same parts as those described in the first embodiment of the present invention are denoted by the same reference numerals, and description thereof is also omitted. The present embodiment relates to a document search apparatus provided with the different notation normalization processing apparatus 1 of the first embodiment.
[0067]
FIG. 7 is a functional block diagram showing a schematic configuration of the document search device 50 according to the present embodiment. FIG. 8 shows the processing flow in the document search device 50: (a) is a flowchart outlining the index information registration processing, which extracts index information from character strings included in a document and registers it, and (b) is a flowchart outlining the document search processing, which searches for documents using a search term.
[0068]
The document retrieval apparatus 50 of this embodiment is a personal computer or workstation similar to the different notation normalization processing apparatus 1 described in the first embodiment, and its hardware configuration is the same as that of the different notation normalization processing apparatus 1. That is, the document retrieval apparatus 50 differs from the different notation normalization processing apparatus 1 in that a document retrieval program is loaded on the HDD 6.
[0069]
In the following, processing executed by the CPU 2 based on the document search program will be described along the functional block diagram shown in FIG. 7 and the processing flow shown in FIG.
[0070]
First, the flow of the index information registration processing will be described. The document storage means 53 reads, via the document input means 52, documents input from the input device 51, such as a keyboard and mouse or an OCR, and sequentially registers the read documents in the document database 54 (step S11).
[0071]
Next, if any of the registered documents has not yet had its index information extracted, the index registration means 55 reads that document from the document database 54 (step S12) and passes it to the different notation normalization processing device 1, which extracts character strings from the document and performs the different notation normalization processing on each extracted character string (step S13). Since the different notation normalization processing has been described in the first embodiment, its description is omitted here.
[0072]
The normalized notation information produced in step S13 is returned to the index registration means 55, which stores it in the index information storage unit 56 in a format associated with the document (index information) (step S14).
[0073]
Next, the flow of the document search processing will be described. A search term input from the input device 51 is read by the search condition creation means 57 via the document input means 52. The search condition creation means 57 sends the search term to the different notation normalization processing device 1, which extracts a character string from the search term and performs the different notation normalization processing on it (step S21). Since the different notation normalization processing has been described in the first embodiment, its description is omitted here.
[0074]
Subsequently, the search condition creating unit 57 creates a search condition based on the normalized notation information subjected to the different notation normalization process, and the search processing unit 58 uses the index information stored and saved in the index information storage unit 56. The document search process is performed (step S22).
[0075]
If necessary, the documents found by the search are read from the document database 54, and the search result is output to the output device 59, such as the display device 12 or the HDD 6 that stores processing results (step S23: result output means).
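The registration flow (steps S11 to S14) and the search flow (steps S21 to S23) can be sketched together as follows, assuming Python. The class `DocumentSearch`, the single-character index, the one-entry rule 機 → 机 and the sample documents are all assumptions for illustration; the description does not specify the index format.

```python
from collections import defaultdict

RULE = {"機": "机"}  # assumed traditional-to-simplified entry


def normalize(text):
    """Per-character normalization with the assumed rule."""
    return "".join(RULE.get(ch, ch) for ch in text)


class DocumentSearch:
    def __init__(self):
        self.documents = {}             # stands in for document database 54
        self.index = defaultdict(set)   # stands in for index storage unit 56

    def register(self, doc_id, text):
        self.documents[doc_id] = text          # S11: store the document
        for ch in normalize(text):             # S12-S13: read and normalize
            self.index[ch].add(doc_id)         # S14: store index information

    def search(self, term):
        norm = normalize(term)                 # S21: normalize the search term
        if not norm:
            return []
        # S22: documents containing every normalized character of the term,
        # then confirmed by a substring check on the normalized document text.
        hits = set.intersection(*(self.index.get(ch, set()) for ch in norm))
        return sorted(d for d in hits if norm in normalize(self.documents[d]))


if __name__ == "__main__":
    ds = DocumentSearch()
    ds.register("d1", "印刷機械")   # traditional form
    ds.register("d2", "印刷机械")   # simplified form
    print(ds.search("印刷機"))      # S23: both documents are found
```

Because the documents and the search term pass through the same normalization, a term written in one notation also matches documents written in the other.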
[0076]
As described above, according to the present embodiment, index information and search conditions that have undergone the different notation normalization processing are generated, so search omissions can be prevented in document search processing for documents written in the languages of the kanji use area (Chinese (simplified and traditional), Japanese, Korean, Vietnamese, etc.), and the most appropriate documents can be retrieved.
[0077]
[Effects of the Invention]
The different notation normalization processing apparatus of claim 1 comprises: document input means for receiving input text; kanji equivalent cutout means for extracting kanji notation from the input text received by the document input means; normalization rule storage means for storing at least two different notation normalization rules describing character correspondence between languages with different kanji notations; different notation normalization rule selecting means for selecting the different notation normalization rule suitable for the input text received by the document input means; and normalization processing means for normalizing the kanji notation extracted by the kanji equivalent cutout means according to the different notation normalization rule selected by the different notation normalization rule selecting means. Kanji notation is extracted from the input text, a different notation normalization rule suitable for the input text is selected from the at least two different notation normalization rules describing character correspondence between languages having different kanji notations, and the extracted kanji notation is normalized based on the selected rule, so that various notation fluctuations between various languages having different kanji notations can be absorbed.
[0078]
According to the invention of claim 2, in the different notation normalization processing device according to claim 1, the different notation normalization rule selecting means analyzes the language information of the input text and selects the different notation normalization rule based on the analyzed language information, so that appropriate different notation normalization processing can be performed on kanji input text without the user being conscious of which language the text was input in.
[0079]
According to the invention of claim 3, in the different notation normalization processing device according to claim 2, when a plurality of the different notation normalization rules are selected based on the analyzed language information, the different notation normalization rule selecting means allows a desired different notation normalization rule to be selected from among them, so that the most appropriate different notation normalization processing can be performed on the kanji input text.
[0080]
According to a fourth aspect of the present invention, in the different notation normalization processing device according to any one of the first to third aspects, the different notation normalization rule is a kanji notation of a predetermined language and its variant characters. By describing the correspondence between the characters, it is possible to absorb the fluctuation of the notation between the kanji notation of the predetermined language and its variant characters.
[0081]
According to the invention of claim 5, in the different notation normalization processing device according to any one of claims 1 to 3, the different notation normalization rule describes the kanji notation of another language corresponding to the kanji notation of a predetermined language, so that notation fluctuations between the kanji notation of the predetermined language and the kanji notation of the other language can be absorbed.
[0082]
According to the invention of claim 6, the different notation normalization processing device according to any one of claims 1 to 3 has both a different notation normalization rule describing the character correspondence between the kanji notation of a predetermined language and its variant characters and a different notation normalization rule describing the kanji notation of another language corresponding to the kanji notation of a predetermined language, so that various notation fluctuations between various languages having different kanji notations can be absorbed.
[0083]
The different notation normalization processing program of the invention of claim 7 is installed in a computer and causes the computer to execute: a document input function that receives input text; a kanji equivalent cutout function that extracts kanji notation from the input text received by the document input function; a normalization rule storage function that stores at least two different notation normalization rules describing character correspondence between languages having different kanji notations; a different notation normalization rule selection function that selects the different notation normalization rule suitable for the input text received by the document input function; and a normalization processing function that normalizes the kanji notation extracted by the kanji equivalent cutout function according to the different notation normalization rule selected by the different notation normalization rule selection function. Kanji notation is extracted from the input text, a different notation normalization rule suitable for the input text is selected from the at least two different notation normalization rules describing character correspondence between languages having different kanji notations, and the extracted kanji notation is normalized based on the selected rule, so that various notation fluctuations between various languages having different kanji notations can be absorbed.
[0084]
According to the invention of claim 8, in the different notation normalization processing program according to claim 7, the different notation normalization rule selection function analyzes the language information of the input text and selects the different notation normalization rule based on the analyzed language information, so that appropriate different notation normalization processing can be performed on kanji input text without the user being conscious of which language the text was input in.
[0085]
According to the invention of claim 9, in the different notation normalization processing program according to claim 8, when a plurality of the different notation normalization rules are selected based on the analyzed language information, the different notation normalization rule selection function allows a desired different notation normalization rule to be selected from among them, so that the most appropriate different notation normalization processing can be performed on the kanji input text.
[0086]
According to a tenth aspect of the present invention, in the different notation normalization processing program according to any one of the seventh to ninth aspects, the different notation normalization rule is a kanji notation of a predetermined language and its variant characters. By describing the correspondence between the characters, it is possible to absorb the fluctuation of the notation between the kanji notation of the predetermined language and its variant characters.
[0087]
According to the invention of claim 11, in the different notation normalization processing program according to any one of claims 7 to 9, the different notation normalization rule describes the kanji notation of another language corresponding to the kanji notation of a predetermined language, so that notation fluctuations between the kanji notation of the predetermined language and the kanji notation of the other language can be absorbed.
[0088]
According to the invention of claim 12, the different notation normalization processing program according to any one of claims 7 to 9 has both a different notation normalization rule describing the character correspondence between the kanji notation of a predetermined language and its variant characters and a different notation normalization rule describing the kanji notation of another language corresponding to the kanji notation of a predetermined language, so that various notation fluctuations between various languages having different kanji notations can be absorbed.
[0089]
According to the storage medium of the invention of claim 13, the different notation normalization processing program according to any one of claims 7 to 12 is stored, so that by causing a computer to read the program stored in the storage medium, the same operational effects as those of the invention according to any one of claims 7 to 12 can be obtained.
[0090]
The document retrieval device of the invention of claim 14 comprises: the different notation normalization processing device according to any one of claims 1 to 6, which performs the different notation normalization processing; document input means for receiving input text; document storage means for storing the input text in a document database when the input text received by the document input means is a document; index registration means for extracting character strings from the document stored in the document database by the document storage means, having the different notation normalization processing device perform the different notation normalization processing on each extracted character string, and then extracting and storing index information; search condition creating means for, when the input text received by the document input means is a search term, having a character string extracted from the search term and subjected to the different notation normalization processing, and then creating a search condition; search processing means for searching the index information based on the search condition created by the search condition creating means; and result output means for outputting a search result of the search processing means. Because index information and search conditions that have undergone the different notation normalization processing are generated, search omissions can be prevented in document search processing for documents written in the languages of the kanji use area (Chinese (simplified and traditional), Japanese, Korean, Vietnamese, etc.), so that the most appropriate documents can be retrieved.
[0091]
The document search program of the invention of claim 15 is installed in a computer and causes the computer to execute: a different notation normalization processing function that realizes the different notation normalization processing of the different notation normalization processing device according to any one of claims 1 to 6; a document input function that accepts input text; a document storage function that stores the input text in a document database when the input text received by the document input function is a document; an index registration function that extracts character strings from the document stored in the document database by the document storage function, has the different notation normalization processing function perform the different notation normalization processing on each extracted character string, and then extracts and stores index information; a search condition creation function that, when the input text received by the document input function is a search term, sends the search term to the different notation normalization processing function, has a character string extracted from the search term and subjected to the different notation normalization processing, and then creates a search condition; a search processing function that searches the index information based on the search condition created by the search condition creation function; and a result output function that outputs a search result of the search processing function. Because index information and search conditions that have undergone the different notation normalization processing are generated, search omissions can be prevented in document search processing for documents written in the languages of the kanji use area (Chinese (simplified and traditional), Japanese, Korean, Vietnamese, etc.), so that the most appropriate documents can be retrieved.
[0092]
According to the storage medium of the invention described in claim 16, the document search program described in claim 15 is stored, and by causing a computer to read the document search program stored in the storage medium, effects similar to those of the invention described in claim 15 can be obtained.
[Brief description of the drawings]
FIG. 1 is a block diagram schematically showing a hardware configuration of a different notation normalization processing apparatus according to a first embodiment of this invention.
FIG. 2 is a functional block diagram showing a different notation normalization processing apparatus.
FIG. 3 is an explanatory diagram showing an example of a different notation normalization rule.
FIG. 4 is an explanatory diagram showing an example of a different notation normalization rule.
FIG. 5 is an explanatory diagram showing an example of a different notation normalization rule.
FIG. 6 is a flowchart showing the overall flow of different notation normalization processing in the different notation normalization processing apparatus.
FIG. 7 is a functional block diagram illustrating a schematic configuration of a document search apparatus according to a second embodiment of this invention.
FIG. 8 shows the flow of processing in the document search apparatus: (a) is a flowchart showing the flow of index information registration processing, which extracts and registers index information from character strings included in a document, and (b) is a flowchart outlining the flow of document search processing, which searches for documents based on a search word.
[Explanation of symbols]
1 Different notation normalization processing device
7 Storage medium
50 Document retrieval device
52 Document input means
53 Document storage means
54 Document database
55 Index registration means
56 Index information storage unit
57 Search condition creation means
58 Search processing means
101 Document input means
102a Kanji equivalent cutout means
102b Normalization processing means
102c Different notation normalization rules
102d Different notation normalization rule selection means
102e Normalization rule storage means

Claims

A document input means for receiving input text;
Kanji equivalent cutout means for extracting kanji notation from the input text received by the document input means;
Normalization rule storage means for storing at least two different notation normalization rules describing character correspondence between languages having different kanji notations;
Different notation normalization rule selection means for selecting the different notation normalization rule suitable for the input text received by the document input means;
A different notation normalization processing device comprising: normalization processing means for normalizing the kanji notation extracted by the kanji equivalent cutout means according to the different notation normalization rule selected by the different notation normalization rule selection means.
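
The arrangement recited above can be pictured with a minimal Python sketch. It is an illustration only, not the implementation disclosed in the specification: the rule tables, the language tags ("zh-Hans", "ja"), the function names and the choice of the traditional form as the canonical form are all assumptions made for this example, and the kanji cutout is approximated by matching the basic CJK Unified Ideographs block.

```python
import re

# Hypothetical rule storage: one table per language, each mapping a
# character to an assumed canonical (here: traditional) form.
RULES = {
    "zh-Hans": {"国": "國", "学": "學", "图": "圖"},
    "ja": {"国": "國", "学": "學", "図": "圖"},
}

# Kanji cutout approximated as runs of the basic CJK Unified Ideographs block.
KANJI_RUN = re.compile(r"[\u4e00-\u9fff]+")


def select_rule(language):
    """Rule selection: pick the table registered for the given language tag."""
    return RULES[language]


def normalize(text, language):
    """Normalize only the kanji runs of `text`; other characters pass through."""
    rule = select_rule(language)

    def normalize_run(match):
        return "".join(rule.get(ch, ch) for ch in match.group(0))

    return KANJI_RUN.sub(normalize_run, text)
```

Under these assumed tables, `normalize("図", "ja")` and `normalize("图", "zh-Hans")` both yield "圖", so the Japanese and simplified Chinese spellings collapse to one form.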

The different notation normalization rule selecting means analyzes language information of the input text, and selects the different notation normalization rule based on the analyzed language information.
The different notation normalization processing apparatus according to claim 1.

The different notation normalization rule selecting means, when a plurality of the different notation normalization rules are selected based on the analyzed language information, allows a desired different notation normalization rule to be selected from among these different notation normalization rules,
The different notation normalization processing apparatus according to claim 2.
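
One way to picture language-based rule selection with a choice among several candidates is the sketch below. The claims do not say how language information is analyzed, so the script heuristic here (kana implies Japanese, hangul implies Korean, otherwise either Chinese rule set) is purely an assumption, as are the rule names and the `choose` callback standing in for the user's or system's selection.

```python
def candidate_rules(text):
    """Return the names of rule sets plausible for `text` (crude script heuristic)."""
    has_kana = any("\u3040" <= ch <= "\u30ff" for ch in text)
    has_hangul = any("\uac00" <= ch <= "\ud7a3" for ch in text)
    if has_kana:
        return ["ja"]
    if has_hangul:
        return ["ko"]
    # Kanji-only text is ambiguous in this sketch: offer both Chinese rule sets.
    return ["zh-Hans", "zh-Hant"]


def select_rule_name(text, choose=lambda names: names[0]):
    """Pick a rule name; `choose` decides when several candidates remain."""
    names = candidate_rules(text)
    if len(names) == 1:
        return names[0]
    return choose(names)
```

For example, `select_rule_name("图书馆", choose=lambda names: names[-1])` lets the caller pick "zh-Hant" when the analysis alone cannot decide.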

The different notation normalization rule is a description of character correspondence between kanji notation of a predetermined language and its variant characters.
The different notation normalization processing device according to claim 1.

The different notation normalization rule is a description of kanji notations of other languages corresponding to kanji notations of a predetermined language.
The different notation normalization processing device according to claim 1.

Both the different notation normalization rule describing the character correspondence between the kanji notation of the predetermined language and its variant characters and the different notation normalization rule describing the kanji notations of other languages corresponding to the kanji notation of the predetermined language are provided,
The different notation normalization processing device according to claim 1.
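
The two kinds of rule, and their combination, can be sketched as follows. The tables are hypothetical examples chosen for illustration (a few Japanese variant characters and a few Japanese-to-traditional correspondences), and composing the tables into a single lookup is one possible design, not something the claims prescribe.

```python
# Hypothetical within-language table: variant character -> standard Japanese form.
VARIANTS_JA = {"髙": "高", "圀": "国"}
# Hypothetical cross-language table: Japanese form -> traditional Chinese form.
CROSS_JA = {"国": "國", "図": "圖"}


def compose(first, second):
    """Build one table equivalent to applying `first` and then `second`."""
    combined = {src: second.get(dst, dst) for src, dst in first.items()}
    for src, dst in second.items():
        combined.setdefault(src, dst)
    return combined


RULE_JA = compose(VARIANTS_JA, CROSS_JA)


def apply_rule(text, rule):
    """Replace each character that has an entry in `rule`; keep the rest."""
    return "".join(rule.get(ch, ch) for ch in text)
```

With these tables, the variant 圀 is first mapped to 国 and then to 國, so a single pass over the text with `RULE_JA` handles both kinds of fluctuation.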

Installed on the computer,
Document input function that accepts input text,
A kanji equivalent cutout function for extracting kanji notation from the input text received by the document input function;
A normalization rule storage function for storing at least two kinds of different notation normalization rules describing character correspondence between languages having different kanji notations;
A different notation normalization rule selection function for selecting the different notation normalization rule suitable for the input text received by the document input function;
A different notation normalization processing program that causes the computer to execute the above functions together with a normalization processing function for normalizing the kanji notation extracted by the kanji equivalent cutout function according to the different notation normalization rule selected by the different notation normalization rule selection function.

The different notation normalization rule selection function analyzes language information of the input text, and selects the different notation normalization rule based on the analyzed language information.
The different notation normalization processing program according to claim 7.

The different notation normalization rule selection function, when a plurality of the different notation normalization rules are selected based on the analyzed language information, allows a desired different notation normalization rule to be selected from among these different notation normalization rules,
The different notation normalization processing program according to claim 8.

The different notation normalization rule is a description of character correspondence between kanji notation of a predetermined language and its variant characters.
The different notation normalization processing program according to any one of claims 7 to 9.

The different notation normalization rule is a description of kanji notations of other languages corresponding to kanji notations of a predetermined language.
The different notation normalization processing program according to any one of claims 7 to 9.

Both the different notation normalization rule describing the character correspondence between the kanji notation of the predetermined language and its variant characters and the different notation normalization rule describing the kanji notations of other languages corresponding to the kanji notation of the predetermined language are provided,
The different notation normalization processing program according to any one of claims 7 to 9.

A storage medium for storing the different notation normalization processing program according to any one of claims 7 to 12.

The different notation normalization processing apparatus according to any one of claims 1 to 6, which performs different notation normalization processing;
A document input means for receiving input text;
Document storage means for storing the input text in a document database when the input text received by the document input means is a document;
Index registration means for extracting character strings from the document stored in the document database by the document storage means, having the different notation normalization processing device perform the different notation normalization processing on each extracted character string, and then extracting index information and storing it in the index information storage unit;
Search condition creating means for, when the input text received by the document input means is a search word, sending the search word to the different notation normalization processing device, extracting a character string from the search word, performing the different notation normalization processing, and then creating a search condition;
Search processing means for searching index information in the index information storage unit based on the search conditions created by the search condition creating means;
A result output means for outputting a result of the search by the search processing means;
A document search apparatus comprising:
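
The interplay of index registration and search condition creation can be sketched in Python as below. This is a minimal illustration under stated assumptions: the merged rule table, the class and method names, the in-memory dictionaries standing in for the document database and index information storage unit, and the use of character bigrams as index information are all hypothetical, since the claims do not specify what the index information consists of.

```python
import re
from collections import defaultdict

# Hypothetical merged normalization table (canonical form: traditional).
RULE = {"国": "國", "学": "學", "図": "圖", "图": "圖", "书": "書"}
TOKEN = re.compile(r"[\u4e00-\u9fff]+|[A-Za-z0-9]+")


def normalize(s):
    return "".join(RULE.get(ch, ch) for ch in s)


class DocumentSearch:
    def __init__(self):
        self.documents = {}            # stands in for the document database
        self.index = defaultdict(set)  # stands in for the index information storage

    @staticmethod
    def _grams(s):
        # Assumed index unit: character bigrams (the whole string if shorter).
        return [s] if len(s) < 2 else [s[i:i + 2] for i in range(len(s) - 1)]

    def register(self, doc_id, text):
        """Store the document, then index its normalized character strings."""
        self.documents[doc_id] = text
        for token in TOKEN.findall(text):
            for gram in self._grams(normalize(token)):
                self.index[gram].add(doc_id)

    def search(self, term):
        """Normalize the search word the same way, then intersect the hits."""
        result = None
        for token in TOKEN.findall(term):
            for gram in self._grams(normalize(token)):
                hits = self.index.get(gram, set())
                result = hits if result is None else result & hits
        return sorted(result or [])
```

Because documents and search words pass through the same `normalize` step, a query written as 图书 (simplified) or 図書 (Japanese) reaches documents containing either 図書館 or 图书馆 under the assumed table.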

Installed on the computer,
A different notation normalization processing function for realizing different notation normalization processing in the different notation normalization processing device according to any one of claims 1 to 6,
Document input function that accepts input text,
A document storage function for storing the input text in a document database when the input text received by the document input function is a document;
An index registration function for extracting character strings from the document stored in the document database by the document storage function, performing the different notation normalization processing on each extracted character string with the different notation normalization processing function, and then extracting index information and storing it in the index information storage unit;
A search condition creation function for, when the input text received by the document input function is a search word, sending the search word to the different notation normalization processing function, extracting a character string from the search word, performing the different notation normalization processing, and then creating a search condition;
A search processing function for searching index information in the index information storage unit based on the search conditions created by the search condition creation function;
A result output function for outputting the search results by this search processing function;
A document search program that causes the computer to execute the above functions.

A storage medium for storing the document search program according to claim 15.