JP3937741B2 - Document standardization

Info

Publication number: JP3937741B2
Application number: JP2001091888A
Authority: JP
Inventors: 泰男小山; 孝司山田; 庸雄河西; 達矢細田; 勝仁鈴木
Original assignee: Seiko Epson Corp
Current assignee: Seiko Epson Corp
Priority date: 2001-03-28
Filing date: 2001-03-28
Publication date: 2007-06-27
Anticipated expiration: 2021-03-28
Also published as: JP2002288175A

Description

【０００１】
【発明の属する技術分野】
本発明は、文書に対して処理を行なうことにより、文書を標準化する技術に関する。
【０００２】
【従来の技術】
テキストデータの検索は、特許公報や文献データベースの検索など、様々な分野で必要とされているが、大量のテキストデータを単にデータベースとして蓄積しておき、パターンマッチングの技術を用いて、蓄積した文書から目的の単語が含まれるものを検索するのが通常である。この場合、検索を容易にしようとすれば、シソーラスなどを用いて、検索しようとする単語とよく似た概念語の検索を行なったり（例えば、検索語が「自動車」である場合に、「車」や「車両」も検索語として検索を行なったり）、あるいは表記の相違を考慮して検索を行なったり（例えば、「車両」に対して「車輌」も検索語として検索を行なったり）することが提案されている。
【０００３】
かかる手法を実現するには、文書をデータベースに登録する際に、検索の対象となりそうな言葉を派生させて、文書のキーワードとして記憶しておいたり、検索を行なう際に、正規表現と呼ばれるような表現形式を用いて、一文字違いなどの単語などを検索の対象とするといった対応が採られていた。例えば、特開平１０−２４０７４２号では、文字列情報の蓄積時に、入力された原文字列以外の入力候補文字列を生成し、この入力候補文字列を蓄積文字列に変換して、データベースに蓄積している。また、検索時には、検索者が、検索するための検索文字列情報を入力すると、この検索文字列以外で検索可能な検索候補文字列を生成し、検索文字列情報と、蓄積されている蓄積文字列とを照合することにより、検索を行なっている。
【０００４】
【発明が解決しようとする課題】
しかしながら、かかる手法では、データベースへの登録時に、多数の単語についてそれぞれ派生語を生成せねばならず、処理に膨大な手間を要するという問題があった。例えば、「切換」という用語に「切り換え」「切替」「切り替え」「切替え」などの表記のゆれが存在する場合、これら全ての候補文字列を、蓄積しようとしている文書毎に発生され、かつ記憶したのでは、処理に時間を要し、しかも膨大な記憶容量が必要となってしまう。
【０００５】
また、異なる単語に異なる表現のゆれなどが存在する場合、例えば「切り替え」と「書き換え」という単語を考えると、一方を「切替」に統一することと、他方を「書替」に統一することは、それぞれ別の作業になるので、いちいち指定しなければならないという問題があった。更に、上記の「切り替え」の例のように、複数の表記が存在する場合、どの表記を用いるか、という指定を行なわねばならなかった。
【０００６】
本発明は、こうした問題を解決し、文書の標準化を行なうことで、その後の種々の文書処理、例えば検索の手間を減らすことを目的とする。
【０００７】
【課題を解決するための手段およびその作用・効果】
上記課題の少なくとも一部を解決する本発明の文書標準化方法は、
一定のまとまりを持った文書を入力し、
該文書を形態素解析して、文法情報を伴う単語を切り出し、
該切り出した単語に対して、予め定めた標準化の処理を行ない、
該標準化された後の単語から再構成された文書を出力すること
を要旨としている。
【０００８】
また、これに関連してなされた文書データベースの構築方法の発明は、
一定のまとまりを持った文書を入力し、
該文書を形態素解析して、文法情報を伴う単語を切り出し、
該切り出した単語に対して、予め定めた標準化の処理を行ない、
該標準化された後の単語から再構成された文書をデータベースとして蓄積すること
を要旨としている。
【０００９】
更に、これらに関連してなされた文書検索方法の発明は、
文書の検索に先立って、
一定のまとまりを持った文書を入力し、
該文書を形態素解析して、文法情報を伴う単語を切り出し、
該切り出した単語に対して、予め定めた標準化の処理を行ない、
該標準化された後の単語から再構成された文書を予めデータベースとして蓄積しておき、
文書の検索時に、
指定された検索用単語と前記データベースに蓄積された文書とを比較して、該検索用単語が含まれる文書を特定すること
を要旨としている。
【００１０】
かかる発明においては、文書を形態素解析することにより文法情報を伴って単語を切り出すので、これに対して適切な標準化を施すことができる。即ち、単語の切り出しを行なっていることから、単純な置き換えではなく、単語単位で適切な標準化を施すことができる。標準化した単語から再構成した文書は、例えばファイルとしてあるいはディスプレイに、出力しても良いし、再構成した文書としてデータベースの構築に用いても良い。かかるデータベースでは、文書は、原則として標準化されて蓄積されているから、検索を極めて容易に行なうこともできる。
【００１１】
かかる標準化において、
前記予め定めた標準化の処理としては、少なくとも
（ａ）予め定めた文字に置き換える文字の標準化、
（ｂ）共起関係を有する単語の関係を予め定めた関係に修正する連語化処理、
（ｃ）表記のゆれを予め定めた表記に統一する表記の統一処理、
（ｄ）自立語を、予め定めた置き換えの基準に従って、他の自立語に置き換える自立語処理、
（ｅ）付属語を、所定の規則に従って他の付属語に置き換える付属語処理
のうちの一つを含ませることができる。これらの処理のうち、少なくとも一つを採用することで、文書の標準化を様々なレベルで行なうことができる。
【００１２】
これらの標準化の処理は、予め用意した辞書を参照することにより、単語の置き換えを行なう処理として実現することができる。形態素解析により文法情報を伴って単語を切り出しているので、辞書を参照することは容易である。かかる形態素解析についても、予め用意した形態素解析用の辞書を用いて実現することができる。もとより、アルゴリズムに依拠して形態素解析を行なうことも可能である。
【００１３】
上記の複数の標準化処理は、様々な順序で実施可能であるが、例えば文字の標準化の処理（ａ）の後に自立語処理（ｄ）を行なうことも好適である。こうすれば、例えば半角の「WINDOWS」と「ウィンドウズ」、および全角の「ＷＩＮＤＯＷＳ」「ウィンドウズ」といった自立語のばらつきを、簡単な操作で確実に標準化することができる。
【００１４】
また、連語化処理（ｂ）の後に自立語処理を行なうことをも同様に好適である。連語化処理とは、共起関係にある単語の関係を予め定めた関係に修正するものであり、連語化処理を予めしておくことで、自立語処理をより確実に行なうことができる。例えば、「腹が」＋「立つ」という連語を「怒る」に置き換える自立語処理を行なうものとした場合、「腹が」＋「ひどく」＋「立つ」を、一旦連語化処理により「ひどく」＋「腹が」＋「立つ」に変換しておけば、次の自立語処理により、「ひどく」＋「怒る」に標準化することは容易である。更に、表記の統一処理（ｃ）を、少なくとも自立語処理（ｄ）の後に行なうことも好適である。こうすることで、自立語処理より、表記の統一が崩れると言うことがない。
【００１５】
また、前記標準化の処理の際に、標準化の結果が２以上存在する場合には、該２以上の結果のうちの一つを表示すると共に、複数の結果が存在することを表示することも望ましい。標準化の処理を行なっている使用者は、これにより、複数の結果が存在することを知ることができ、場合によっては、他の候補を選択することができるからである。使用者の操作に応じて、前記表示した結果以外の結果を次候補として順次表示することも、候補選択の面から望ましい。
【００１６】
なお、これらの発明は、いずれも上記の方法を実行する装置の発明、コンピュータ上で実行され、上記の機能を実現するプログラムの発明、こうしたプログラムを記録した記録媒体としての発明として、それぞれ把握することができる。装置は、コンピュータ上でプログラムが実行されることで、上記の文書の入力、形態素解析、標準化、出力、データベースの構築などを実現するものであっても良いし、ディスクリートな回路構成より実現するものであっても良い。また、プログラムは、Ｃ言語やパスカル、フォートラン、コボル、ＢＡＳＩＣ、等の周知のプログラム言語が採用可能であり、オブジェクト指向のプログラム言語、あるいはＪａｖａＳｃｒｉｐｔ等の言語などを利用することも可能である。記録媒体としては、フレキシブルディスク，ＣＤ−ＲＯＭ，ＤＶＤ−ＲＯＭ，半導体メモリ（ＲＯＭ，ＰＲＯＭ，ＥＥＰＲＯＭ，フラッシュメモリ等）など、種々の記録媒体を用いることができる。もとより、インターネットなどのネットワーク上に置かれたサーバにこれらのプログラムを記憶しておき、クライアントのコンピュータにダウンロードして利用することも可能である。
【００１７】
【発明の他の態様】
本願発明の標準化の技術は、例えば翻訳などにも用いることができる。翻訳では、翻訳例をデータベース化することが有効であり、こうしたデータベースを翻訳者の作成した文書の癖などから自由なプレーンなテキストにより構築することは、翻訳のための検索において極めて有用である。また、インターネットなどの検索エンジンがネット上の多数のウェブを検索し、これをデータベース化する際にも、同様の標準化を適用することは有効である。ウェブサイトなどの作成は、基本的には個人の責任に委ねられているので、文書の表現の統一がなされていないからである。
【００１８】
【発明の実施の形態】
以下、本発明の実施の形態を実施例に基づいて説明する。
（１）実施例の構成：
はじめに、実施例の構成について図１を用いて説明する。図１は本実施例のデータベース構築を行なうシステムを示す概略構成図である。このシステムは、インターネットのような大規模なネットワーク１０に接続されたデータベースサーバ２００として実現されている。ネットワーク１０には図示しないクライアントが接続されている。
【００１９】
データベースサーバ２００は、モデムやルータ２０を介してネットワーク１０とのデータのやり取りを制御するネットワークインタフェース（ＮＴ−Ｉ／Ｆ）２１、処理を行なうＣＰＵ２２、処理プログラムや固定的なデータを記憶するＲＯＭ２３、ワークエリアとしてのＲＡＭ２４、時間を管理するタイマ２５、モニタ３０への表示を司る表示回路２６、後述する各種のデータを蓄積するハードディスク（ＨＤ）２７、キーボード１１やマウス１２とのインタフェースを司る入力インタフェース（Ｉ／Ｆ）２８等を備える。なお、ハードディスク２７は、固定式のものとして記載したが、着脱式のものでも良いし、着脱式の記憶装置（例えばＣＤ−ＲＯＭ、ＣＤ−Ｒ、ＣＤ−ＲＷ、ＤＶＤ−ＲＯＭ、ＤＶＤ−ＲＡＭ、フレキシブルディスクなど）を併用することも可能である。また、この実施例では、サーバ２００の処理プログラムは、ＲＯＭ２３内に記憶されているものとしたが、ハードディスク２７に記憶しておき、起動時にＲＡＭ２４上に展開して実行するものとしても良い。あるいは、上述した着脱式の記録媒体から読み込むものとしても良い。更には、ネットワーク１０を介して、他のサーバから読み込んで実行するものとしても良い。
【００２０】
図１に示したサーバ２００は、キーボード１１から入力した文書（テキストデータ）や、ネットワーク１０を介して外部から取り込んだテキストデータを、標準化して、最終的にはハードディスク２７に文書データベースを構築する。その後、データベース化された文書データに対して、検索処理を行なうこともできるが、この検索処理は、サーバ２００から行なっても良いし、ネットワーク１０を介して接続された各クライアントから行なうこともできる。
【００２１】
サーバ２００内には、上述のように、ＣＰＵ２２やＲＯＭ２３などのハードウェアが設けられているが、かかるサーバ２００内において後述するプログラムを実行することにより、図２に示した構成を実現することができる。即ち、サーバ２００は、図２に示した機能実現手段をディスクリートに設けたのと同じ働きを実現する。サーバ２００は、図示するように、文書入力部２０５，形態素解析部２１０、辞書検索部２２０，形態素解析用辞書２３０，標準化ルールデータベース２４０，標準化処理部２５０，ログ管理部２６０，文書出力部２７０，ログ出力装置２８０などを備える。
【００２２】
ここで、文書入力部２０５は、文書を入力する処理を実現するものであり、キーボード１１から文書を入力したり、予めハードディスク２７などに記憶している文書を取り込んだりするものである。形態素解析部２１０は、入力した文書のテキストデータを形態素解析するものであり、漢字仮名混じりのテキストデータの形態素を解析して、テキストデータを構成する自立語や付属語などを、その文法情報と共に取得するものである。標準化処理部２５０は、形態素解析されたテキストデータに対して標準化の処理を実行するものであり、実行される標準化の処理としては、
（ａ）文字の標準化処理（予め定めた文字に置き換える文字の標準化）、
（ｂ）連語化処理（共起関係を有する単語の関係を予め定めた関係に修正する処理）、
（ｃ）表記の統一処理（表記のゆれを予め定めた表記に統一する処理）、
（ｄ）自立語処理（自立語を、予め定めた置き換えの基準に従って、他の自立語に置き換える処理）、
（ｅ）付属語処理（付属語を、所定の規則に従って他の付属語に置き換える処理）
がある。これらの処理は全て実行される必要はなく、使用者の設定により、必要な処理（少なくとも一つの処理）が実行される。
【００２３】
文書出力部２７０は、標準化されたテキストデータを外部に出力するものである。本実施例では、テキストデータは、ハードディスク２７にデータベースとして保存されるものとしたが、単純に標準化処理後のテキストデータをモニタ３０上に表示するものとしても良いし、図示しないプリンタなどに印字するものとしても良い。あるいは、ネットワーク１０を介して外部のクライアントマシンに出力するものとしても良い。
【００２４】
辞書検索部２２０は、形態素解析用辞書２３０と標準化ルールデータベース２４０を参照するためのものである。形態素解析部２１０や標準化処理部２５０は、辞書やデータベースを参照する必要が生じると、この辞書検索部２２０を介して、辞書２３０やデータベース２４０をアクセスし、必要な情報を取り出し、それぞれ形態素解析部２１０や標準化処理部２５０に渡す。なお、辞書検索部２２０は、形態素解析用辞書２３０や標準化ルールデータベース２４０毎に別々に設けても差し支えない。
【００２５】
ログ管理部２６０とログ出力部２８０は、標準化の処理のログを管理し、これを出力するものである。標準化の処理は、上述したように、文字の標準化から連語化処理まで、様々なレベルに及ぶので、どのような処理を行なったか、必要に応じて参照できるよう、ログを管理し出力するのである。ログには、処理対象となった文書、実施された標準化処理の内容、その結果、エラーなどの情報が保存される。
【００２６】
（２）実施例における処理の概要：
そこで、次に標準化処理部２５０において実現される標準化処理について、図３に依拠しつつ説明する。図３は、標準化処理部２５０が実行する処理の概要を示す説明図である。この図では、標準化処理部２５０は、全ての標準化処理を実行するものとして記載しているが、実際には、少なくともいずれか一つの標準化処理が実行される場合も存在する。いずれの標準化処理ないしそれらの任意の組合わせを実行するかは、使用者が初期設定（プロパティなど）により定めるものとなっている。図３に示した標準化処理ルーチンが起動されると、まず、文書を読み込む処理が実行される（ステップＳ３００）。この処理は、文書入力部２０５に相当する処理であり、キーボード１１から文書を入力するものとしても良いし、既に作られてハードディスク２７などに保存されている処理用の文書ＴＸＴ（テキストデータ）を読み出すものとしてもよい。従って、例えば標準化処理の実行を示すアイコンを、モニタ３０のいわゆるデスクトップに表示しておき、マウス１２によりテキストファイルをドラッグアンドドロップすることにより、図３に示した標準化処理が起動され、そのテキストファイルが、読み込まれるものとすることもできる。
【００２７】
文書の読み込みは、一括して全データを読み込むという形で実現しても良いし、テキストデータから改行などを区切りコードとして、順次読み込む形態としても良い。可能であれば、句読点などを用いて「文」単位で読み込んでも良い。いずれの場合でも、一つ一つの文には、識別番号を付与して、その後に管理に用いることが望ましい。なお、テキストデータは、ＲＡＭ２４上に実際に展開して処理可能な状態としても良いし、識別番号を付けてからハードディスク２７などにランダムアクセスあるいはシーケンシャルアクセス可能に保存してもよい。
【００２８】
こうして文書の読込を行なった後、まず形態素解析処理を行なう（ステップＳ３１０）。これは、形態素解析部２１０に相当する処理であり、辞書検索部２２０を介して形態素解析用辞書２３０を参照する処理に相当する。実際には、ハードディスク２７に記憶された逆引き辞書ＩＤＣを参照して、文書を構成する単語を形態素解析により定める。形態素解析処理の詳細を図４に示した。以下、図４に基づいて、形態素解析の処理について説明する。なお、逆引き辞書とは、通常の仮名漢字変換用辞書が、仮名文字を見出しにして漢字やカタカナ等の変換文字列が、文法情報と共に配列されているのに対して、図５に示すように、これが逆に配列されている辞書である。従って、例えば「学校」という文字列から「がっこう」という読みと名詞という文法情報などを取り出すことができる。
【００２９】
形態素解析処理が開始されると、まず識別番号をつけた一つの文が、解析の対象として特定され、この文の先頭からＭ文字目（Ｍ＝１，２，・・・・）からＬ文字分（Ｌ＝１，２，・・・）を取り出して逆引き辞書ＩＤＣを引く処理を行なう（ステップＳ１２）。Ｍは、着目している文字列の先頭位置を、Ｌは、取り出す文字数を、それぞれ示していることになる。逆引き辞書の参照の手法は、まずＭ＝１、即ち先頭位置から、Ｌ＝１、即ち１文字分の文字を取り出し、辞書を参照して該当語を取り出す処理から開始する。Ｌを順次インクリメントしながら辞書ＩＤＣを参照し、該当する見出し語がなくなれば、着目する文字列の先頭位置Ｍをインクリメントし、再度文字数Ｌを１に戻して、辞書の検索を行なう。こうして着目する文字の位置か、解析しようとする文の文字数を超えたところで、辞書の参照をうち切る。
【００３０】
例えば、「ＤＤという車は、品質を重視したセダンである。」という文章に対して、逆引き辞書ＩＤＣを参照すると、「ＤＤ」「と」「いう」「という」「い」「う」「車」「は」「品質」「を」「重視」「した」「し」「た」「セダン」「で」「ある」「である」「あ」といった語を切り出すことができる。ここで、「い」や「う」「あ」「し」「た」などの仮名一音も、語として切り出しているのは、「いう（言う）」の語幹「い」や「うる（売る）」の語幹「う」などが、文中に現れる可能性があるからである。
【００３１】
逆引き辞書ＩＤＣには、これらの語がその文法情報と共に記憶されている。そこで、切り出した語を次に文法情報に従って並べて、破綻しない配列を見い出す処理を行なう。かかる解析は、例えば複数文節最長一致法や最小コスト法といった手法が知られており、所定の語の組合わせのうちどれが最も日本語としてもっともらしいかを検定するのである。本実施例では、最小コスト法を採用しているので、こうして得られた多数の文字列を対象として、次にコスト計算を行なう（ステップＳ３１４）。コスト計算とは、文字列の配列に対して、日本語らしい配列ほど点数が低くなるように予め用意された文字列のコストを計算する処理である。その規則は大まかに言えば、自立語はコスト２、これに付属語が付属する場合はコスト０、といったものである。例えば、「品質を」を例にとると、「品質」＋「を」ではあれば、自立語＋付属語（助詞）の結びつきとなって、コスト２、「品」＋「質」＋「を」であれば、自立語＋自立語＋付属語（助詞）となってコストは４となるのである。最小コスト法のルールは、現実の日本語にあわせてチューニングされており、「まったく」＋「ない」などの共起関係にある単語が文中に生じる場合は、コスト「−１」など、様々な規則が用意されている。
【００３２】
こうして、逆引き辞書の参照により得られた全ての単語について、上記のコストを計算し、そのうちで最小のコストになる文を特定する処理を行なう（ステップＳ３１５）。上記の例では、「品」（自立語・名詞）＋「質」（自立語・名詞）＋「を」（付属語・助詞）よりも、「品質」（自立語・名詞）＋「を」（付属語・助詞）の方が、日本語として確からしいと判断するのである。もとより、この計算は、少なくとも文を単位として行なわれ、文全体で、コストが最小になるような単語の配列を選択する。従って、例えば共起関係によるコストの低減などがあれば、「品質」＋「を」に替えて、「品」＋「質」＋「を」が選択される場合も存在する。
【００３３】
こうして最小コスト法による形態素解析が完了すると、次に文構造の解析処理を行なう（ステップＳ３１６）。この処理は、文を構成している単語の結びつき方を、論理積と論理和により表現するものであり、例えば複文を、二つの文に分離する場合などに利用される。本実施例では、特にこの点については説明しない。以上の処理を行なった後、形態素解析されたデータを出力する処理を行なう（ステップＳ３１８）。データは、そのまま次の標準化処理に渡されても良いし、一旦ハードディスク２７に識別コード付きで保存されるものとしても良い。
【００３４】
こうして形態素解析された文に対して、次に各種の標準化の処理が実行される（図３参照）。標準化の処理としては、
▲１▼文字の標準化処理（ステップＳ３２０）
▲２▼連語化処理（ステップＳ３３０）
▲３▼自立語処理（ステップＳ３４０）
▲４▼表記の統一処理（ステップＳ３５０）
▲５▼付属語処理（ステップＳ３６０）
がある。なお、各標準化の処理は、既に説明したように、全てを実行する必要はなく、使用者の意図に合致した処理のみ実施しすればよい。また、複数の標準化処理を実施する場合、上記の順に限るものではなく、その他の順序で実施することも可能である。
【００３５】
まず、文字の標準化の処理について、図６を参照しつつ説明する。文字の標準化処理が起動されると、まず標準化規則ＣＳＤを参照する処理を実行する（ステップＳ３２２）。この標準化規則ＣＳＤは、予めハードディスク２７に記憶されているものであり、文字の標準化をどのような規則に沿って行なうかを定めたものである。こうした規則は、一応デフォルトが設定されているが、使用者により変更可能なものとなっている。この実施例における文字の標準化とは、図７に示したように、括弧、引用符、一般記号、英数字、句点、読点、半角カタカナ、名前の繋文字、長音記号を、一定の規則で置き換える処理を言う。このうち図７の欄Ａに「×」で示したものは、置き換えに際して周りの文字を考慮する必要がないことを、「○」は周囲の文字を考慮する必要があることを、それぞれ示している。また、欄Ｂは、置き換えの範囲を示しているが、ここで「文」が置き換えの範囲になる場合があるとされているので、例えば「−」（マイナス記号）と「−」（長音記号）とが相違している場合などには、長音記号に置き換えると、形態素解析の結果に影響を与える場合があるからである。従って、長音記号の置き換えなどを行なった場合には、逆引き辞書ＩＤＣを参照して、文構成を変更することがある。
【００３６】
文字の標準化の例として、句点や読点を取り上げると、まずこれらについては、デフォルトで「、」「。」に置き換えられるように設定されている。従って、「コーヒーは，うまい．」という文に対して、文字の置き換えが行なわれると、「コーヒーは、うまい。」となる。もっとも、この設定は、変更可能なので、句点として「。」が、読点として「，」に設定が変更されていれば、「コーヒーは，うまい。」となる。なお、欄Ａに示したように、周りの文字を考慮するとなっているが、周りの文字列が英文であれば、逆に「，」「．」への置き換えがデフォルトの設定となっている。
【００３７】
その他の文字の標準化を例示すると、
（Ａ）括弧：『』と「」の置き換えを行なうなど、
（Ｂ）引用符：“”と””の置き換えを行なうなど、
（Ｃ）一般記号：種々の記号（例えば「：，？！」など）について、半角／全角の置き換えを行なうなど、
（Ｄ）英数字：全角／半角や大文字／小文字の置き換えを行なうなど、
（Ｅ）半角カタカナ：カタカナについて全角／半角の置き換えを行なうなど、
（Ｆ）名前の繋文字：「クイーン＝エリザベス」を「クイーン・エリザベス」に置き換えるなど、
がある。
【００３８】
これらの規則を用いて、各文字を変更する処理を行なう（図６、ステップＳ３２４）。その後、全ての文字についての置き換えが完了したかを判断し（ステップＳ３２６）、全ての文字について完了するまで、規則に従う置き換えを実施する。
【００３９】
以上説明した文字の標準化処理を行なった後、次に、共起の連語化処理（図３、ステップＳ３３０）を実行する。この処理の詳細を、図８に示した。以下、この図８に従って説明する。共起の連語化処理が開始されると、まず形態素解析により得られた文の文節Ｎに着目する（ステップＳ３３１）。処理の開始時にはＮ＝１である。次に、共起辞書ＲＧＤを参照しつつ、文節列を後方に向かってサーチする処理を行なう（ステップＳ３３２）。このサーチの様子を図９に示した。図９は、「俺は学校に急いで行くよ」という文を対象に共起の連語化処理を行なう様子を示している。形態素解析により、「俺は」＋「学校に」＋「急いで」＋「行くよ」という文節が切り出されている。なお、詳しく言えば、各文節内は、自立語＋付属語（＋付属語・・・）として解析されている。
【００４０】
ここでまずＮ＝１、即ち、「俺は」という文節に着目し、この文節を起点としてＮ＝２、３、４、即ち「学校に」「急いで」「行くよ」などの文節がサーチされる、サーチは、共起辞書ＲＧＤに記載されている文節がないかを検証するものである。従って、正確には文節によるサーチではなく、文節とその語幹を用いたサーチである。こうしたサーチを行ないつつ、共起関係にある文節があるかを判断する（ステップＳ３３３）。図９に示した例では、「俺は」については共起辞書に該当する項目がなく、Ｎ＝２、即ち「学校に」について、「学校に行」という共起関係が、共起辞書ＲＧＤに見い出された。共起関係にある文節が見い出された場合には、次に文節の入れ替えが可能であるか否かを判断する（ステップＳ３３４）。共起関係にある二つの文節が連続していれば、入れ替えを行なう必要はない。また、離れた位置にある文節間に共起関係が見い出されても、文構造上、文節の入れ替えを行なうことができない場合も存在する。例えば、「俺は学校に電話し、それから行くよ」という例文では、「学校に」と「行く」という共起関係が見い出されても、「俺は電話し、それから学校に行くよ」と入れ替えることが必ずしもできない。文構造上の制約があるからである。
【００４１】
共起関係にあることが見い出された二つの文節が離れており、かつ文構造上、文節の入れ替えが可能であると判断された場合には、文節の位置を入れ替える処理を行なう（ステップＳ３３５）。この結果、図１０に示したように、文は、「俺は急いで学校に行くよ」となる。続いて、連語化処理を行なう（ステップＳ３３６）。即ち、連続する二つの文節に共起関係が認められるので、これを連語化して一つの文節扱いとするのである。この様子を図１１に示した。なお、共起関係に基づく連語化は、上記実施例では２文節を一つの文節に連語化するものとして説明したが、場合によっては３文節以上を一つの文節に連語化することも可能である。
【００４２】
その後、着目する文節を一つ進め（ステップＳ３３７）、全ての文節について共起関係の処理が完了したかを判断し（ステップＳ３３８）、未だ完了していなければ、ステップＳ３３２に戻って、処理を継続する。全ての文節について、共起関係の処理が完了すれば、「ＮＥＸＴ」に抜けて、本ルーチンを終了する。なお、上記のフローでは、共起関係にある文節の探索は、文の先頭の文節から順に行なうものとしたが、いわゆる「係り受け」の受け語を先に特定して探索を行なうという手法を採用すれば、文の後方から順に探索するものとすることもできる。いずれから探索するかは、辞書の構成や探索アルゴリズムに拠る。
【００４３】
こうして文字の標準化（図３，ステップＳ３２０）、共起の連語化処理（ステップＳ３３０）が完了すると、次に、自立語の標準化処理を行なう（ステップＳ３４０）。この処理の詳細を、図１２に示した。図１２に示した自立語の標準化処理が開始されると、まず標準化規則を参照する処理を行なう（ステップＳ３４２）。この処理は、文字の標準化で参照したものと同様に、デフォルトは予め設定してあるが、使用者により変更可能な設定を取得するものである。もとより、この規則は固定的なものとすることもできる。自立語の標準化は、基本的には同一意味の自立語間の異表現の置き換え処理である。かかる処理には、多数の類型が存在するが、例えば、
▲１▼より一般的な表現に置き換える：例、庭球→テニス
▲２▼平易な表現に置き換える：例、瑠璃色→青色
▲３▼常用漢字外の忌避：例、愛嬌→愛敬、挨拶→あいさつ
▲４▼慣用句の平易化：例、一挙手一投足→一つ一つの動作
▲５▼より使用される文字形態への置き換え：例、ウィンドウズ→Ｗｉｎｄｏｗｓ、スパイラルアップ→spiral up
▲６▼連語の置き換え：例、学校に行く→登校する
等を考えることができる。
【００４４】
これらの処理は、実際には、標準化の対象となっている文から順次自立語を取り出し、これを自立語用の標準化辞書ＩＷＤを検索することにより行なわれる（ステップＳ３４４）。自立語用の標準化辞書ＩＷＤは、上述した置き換え可能な自立語が、適用される規則と共に、参照可能に構成されている。従って、標準化の規則を取得した後、辞書を参照して、規則に合致した置き換え語を読み出し、各単語を変更する処理（ステップＳ３４６）を行なうことができる。図１３は、この置き換えの様子を模式的に示した説明図である。図示するように、まず規則の設定を参照する。図において、「◎」はその置き換えが設定（オン）されていることを、「○」は未設定（オフ）であることを、それぞれ示している。自立語の標準化処理において、上記の▲１▼ないし▲６▼を例にとれば、いずれの置き換えを行なうか否かが、標準化規則として記憶されているので、これを読み出し、次に自立語を順次読みだして、この自立語について、置き換えを行なう語が辞書ＩＷＤに登録されているか否かを検索し、仮に登録されていれば、現在オンになっている置き換え規則に合致するかを確認し、オンになっている置き換え規則に合致していれば、自立語の置き換えを行なうのである。以上の処理を全単語について繰り返す（ステップＳ３４８）。図１３に示した例は、▲３▼常用漢字外の忌避がオンになっているので、「俺は」が「僕は」に置き換えられている。また、共起関係があると認定されて連語化された言葉も、必要に応じて、他の言葉に置き換えられるので、この例では「学校に行」→「登校」といった置き換えが行なわれ、これに応じて、付属語の部分も、「くよ」→「するよ」と置き換えられた。
【００４５】
この結果、自立語の標準化処理が完了すると、標準化規則として予め定めた類型について、全ての単語が置き換えられ、自立語は、所望のレベルで標準化されることになる。
【００４６】
自立語の標準化を行なった後、次に表記のゆれの標準化処理を行なう（図３、ステップＳ３５０）。表記のゆれとは、日本語における表記の曖昧さ、許容幅を言い、例えば、
▲１▼長音記号のゆれ：例、ユーザー、ユーザ、
▲２▼送り仮名のゆれ：例、売上げ、売り上げ、
▲３▼拗音表記のゆれ：例、ウィザード、ウイザード、
▲４▼複合語のかな表記のゆれ：例、売り上げ、売りあげ、
▲５▼外来語表記のゆれ：エンゼル、エンジェル、
▲６▼繰り返し文字のゆれ：例、正正堂堂、正々堂々
などを例示することができる。
【００４７】
この処理の概要は、図１２に示した自立語の標準化処理と似ているので、フローチャートは示さないが、自立語の標準化同様、まず規則の設定を参照する。即ち、表記のゆれの標準化処理において、上記の▲１▼ないし▲６▼を例にとれば、いずれの置き換えを行なうか否かが、図１５に示したように、標準化規則ＤＡＤ（図３参照）として記憶されているので、これを読み出し、次に単語を順次読みだして、この単語が標準化規則ＤＡＤに記憶した規則が当てはまるものであれば、かな漢字変換用の通常の単語辞書ＤＩＣを検索する。この辞書には表記のゆれが広く登録されているので、標準化規則ＤＡＤで指定された規則に該当する単語が、辞書ＤＩＣに登録されていれば、その後を読み出して、表記の異なる単語に置き換えるのである。そして、以上の処理を全単語について繰り返す。
【００４８】
自立語の標準化と処理が若干異なるのは、自立語の標準化辞書が、一方向への標準化を行なうことを前提として構成されているのに対して、表記のゆれは、双方向に標準化を行なうことを前提としているためである。表記のゆれは、許容幅を大きく、いずれの表記がより正しいといった判断になじまないものだからである。こうした表記のゆれは、かな漢字変換用の単語辞書ＤＩＣに広く採取されており、互いに関連付けられているので、表記のゆれの標準化を行なう場合には、表記のゆれの標準化規則ＤＡＤを参照し、指定された表記となるよう、単語辞書ＤＩＣを検索するのである。
【００４９】
こうして表記のゆれの標準化を行なった後、付属語の標準化処理を行なう（ステップＳ３６０）。この処理の概要は、図１２に示した自立語の標準化処理とほぼ同一なので、フローチャートは示さないが、基本的には同一意味の付属語間の異表現の置き換え処理である。かかる処理には、多数の類型が存在するが、例えば、
▲１▼繰り返された丁寧表現の簡素化：例、「出られておられます」→「出られています」、
▲２▼古風な表現の現代化：例、「原因なのか否か」→「原因なのかどうか」、
▲３▼くだけた表現の通常表現化：例、「勉強しなくっちゃ」→「勉強しなくては」
などを考えることができる。
【００５０】
これらの処理は、実際には、標準化の対象となっている文から順次付属語を取り出し、これを付属語用の標準化辞書ＡＷＤを検索することにより行なわれる。付属語用の標準化辞書ＡＷＤは、上述した置き換え可能な付属語が、適用される規則と共に、参照可能に構成されている。従って、標準化の規則を取得した後、辞書を参照して、規則に合致した置き換え語を読み出し、各付属語を変更する処理を行なうことができる。図１６は、この置き換えの様子を模式的に示した説明図である。図示するように、まず規則の設定を参照する。即ち、付属語の標準化処理において、上記の▲１▼ないし▲３▼を例にとれば、いずれの置き換えを行なうか否かが、標準化規則として記憶されているので、これを読み出し、次に付属語を順次読みだして、この付属語について、置き換えを行なう語が辞書ＡＷＤに登録されているか否かを検索し、仮に登録されていれば、現在オンになっている置き換え規則に合致するかを確認し、オンになっている置き換え規則に合致していれば、付属語の置き換えを行なうのである。
【００５１】
この結果、付属語の標準化処理が完了すると、標準化規則として予め定めた類型について、全ての単語が置き換えられ、付属語は、所望のレベルで標準化されることになる。
【００５２】
こうして、図３に示した全ての標準化（ステップＳ３２０ないしＳ３６０）が完了すると、サーバ２００は、標準化の結果を、ハードディスク２７内の文書データベースＴＤＢに登録する処理を行なう（ステップＳ３７０）。このデータベースは、文書の全文データベースであり、後述する検索装置により、全文検索を行なうことができる。
【００５３】
（３）実施例の効果：
この文書データベースＴＤＢに登録された文書は、文字、自立語、表記のゆれ、付属語という態様で、標準化がなされているから、書き手の癖や言い回しの相違などがほとんど解消されている。従って、処理された文書は、極めてプレーンなテキストデータとなっており、様々な用途に用いることができる。例えば、特許公報や技術文献などの全文データベースの構築に用いれば、できあがったデータベースを検索する際の雑音や検索漏れなどを低減することができる。また、翻訳しようとする文を標準化すると、機械翻訳のための下訳の一つとして用いることができる。逆に翻訳例を蓄積した翻訳データベースを構築する場合には、訳出者の相違を解消することができる。更に、時代を隔てた著者の表現を比較するといった研究など、文書を対象とした広範な比較研究に用いることも可能である。また、本実施例では、標準化の処理に先立って、テキストデータを形態素解析し、必要な文法情報を入手している。このため、標準化が、単純な文字の置き換えにとどまらず、文法情報を利用した自立語の標準化、表記のゆれの標準化などとしてまとめて行なうことが可能となっている。このため、標準化のために用意するルールも数を低減することができる。文法情報が存在するので、かな漢字変換用の辞書や表記のゆれの辞書、自立語の置き換え辞書などを参照して、容易に標準化を行なうことができる。
【００５４】
実施例では、標準化処理は、▲１▼文字の標準化、▲２▼共起の連語化処理、▲３▼自立語の標準化処理、▲４▼表記のゆれの標準化処理、▲５▼付属語の標準化処理の順で行なったが、この処理は様々な順序で実施可能である。本実施例のように、文字の標準化の処理の後に自立語の標準化処理を行なえば、例えば文字の標準化で半角／全角変換を済ませておき、その後、「ＷＩＮＤＯＷＳ」「ウィンドウズ」といった自立語のばらつきを、標準化すればよいので、簡単な操作で確実に処理を行なうことができる。
【００５５】
また、連語化の処理の後に自立語の標準化処理を行なうことをも同様に好適である。連語化処理を予めしておくことで、自立語の標準化処理をより確実に行なうことができる。実施例では、「学校に」＋「行く」という連語を「登校する」に置き換える処理を行なうものとして説明したが、「学校に」＋「急いで」＋「行く」を、一旦連語化処理により「急いで」＋「学校に」＋「行く」に置き換えておけば、次の自立語処理において「すぐに」＋「登校する」に標準化することは容易であった。更に、表記のゆれの統一処理を、自立語の標準化処理の後に行なっているので、自立語の標準化処理より、一旦なされた表記の統一が崩れると言うことがない。
【００５６】
なお、上記実施例では、標準化の処理において、結果が２以上存在する場合について特に説明しなかったが、２以上の結果が存在する場合（例えば、「売り上げ」に対して、「売上げ」と「売りあげ」が存在する場合）、このうちの一つを優先的に表示し、複数の結果が存在することを、表示することも可能である。こうした表示は、標準化した文字のモニタ３０上での色を変えたり、「次候補あり」といった表示を行なうことで、容易に実現することができる。次候補があることを表示すれば、処理を行なっている使用者は、これにより、複数の結果が存在することを知ることができ、好適である。他の候補を選択する場合には、カーソルを表示されている文節に移動し、「次候補」キーを押すことで、次候補を表示し、必要があれば、複数の候補から所望の候補を選択すればよい。
【００５７】
この他、本実施例では、ログ管理部２５０により、標準化のログを管理しているので、入力した文書に対して行なわれた標準化の処理の詳細を残しておくことも可能である。入力した文章の何番目の文の何番目の単語に対して、どんな処理を行なったか、という形でログを記録しておければ、いつでも、標準化した後の文から元の文を復元することもできる。また、ログ出力部２８０から出力されたログを解析することにより、どのタイプの標準化が多用されたかといった解析を行なうこともでき、標準化を実施した対象である文章の趣（文語的な文か、くだけた口語文か等）や癖（長音を落としやすいか等）を分析することも可能である。
【００５８】
（４）第２実施例の説明：
次に、本発明の第２実施例として、文書の検索方法と検索を行なう装置について説明する。第１実施例として説明した文書の標準化の処理により完成された文書データベースＴＤＢは、外部に公開され、自由な使用、または登録した会員の使用に供される。このとき、文書データベースＴＤＢに直接アクセスするような構成も可能であるが、ネットワーク１０を介して不特定多数のクライアントからアクセス可能とするには、例えば、文書データベースＴＤＢをアクセスするためのＣＧＩを備えたサイトを、サーバ２００内に用意し、クライアント４０は、ネットワーク１０を経由して、いわゆるブラウザから、この文書データベースＴＤＢにアクセスできるようにするのが通常である。そこで、第２実施例として、文書データベースＴＤＢを用いて、ウェブページの検索を行なう手法について、説明する。図１７は、クライアント４０において実行される検索時の処理を示すフローチャートである。まず、検索を開始するクライアント４０は、検索用に用意されたサーバ２００内のサイトにアクセスする（ステップＳ４００）。この結果、図１８に示すような、検索画面が表示される。
【００５９】
そこで、クライアントは、この画面に用意された検索用の文字列を入力するボックスＫＢに、検索内容を、日本語による文章として入力する（ステップＳ４１０）。例えば、図１８に示したように、文字列を入力するボックスＫＢに、「俺が登校した」などと自然文で入力するのである。このとき、検索文の入力に並行して、「検索」ボタンＢＢが押されたかを監視し（ステップＳ４２０）、検索ボタンが押された時には、入力された文章を読み取り、図１８に示した入力の場合には、この文章を形態素解析して、第１実施例で説明した標準化処理を行なう（ステップＳ４３０）。なお、検索は、必ずしも文章による入力に基づいて行なう必要はなく、例えばキーワードを入力して、一または複数のキーワードにより検索するものとしても良いし、キーワードと検索分野を指定して検索するものとしても良い。
【００６０】
こうして得られた標準化された検索文から切り出された検索語（図１８の例では「僕」や「登校」）ＤＳ１，ＤＳ２を利用して、文書データベースＴＤＢの検索を行なう（ステップＳ４４０）。検索の結果、一致する文を有する文書が見つかればその検索結果を出力するのである（ステップＳ４５０）。出力された検索結果は、ネットワーク１０を介してクライアントに送られ、クライアント４０側の画面に表示される。
【００６１】
以上説明した第２実施例によれば、予め標準化されて登録された文書データベースに対して、自然な日本語文を用いて検索を行なうことができる。この場合、検索を行なうとする使用者の言葉の癖を標準化により低減してから検索を行なうので、検索により所望の文書を見い出し易くなっている。このため、検索語の入力について複雑な規則を熟知している必要がなく、特別な訓練を積んだサーチャでなくても容易に検索を行なうことができる。
【００６２】
以上、本発明の実施の形態について説明したが、本発明はこうした実施の形態に何等限定されるものではなく、本発明の要旨を逸脱しない範囲内において、更に種々なる形態で実施し得ることは勿論である。例えば、文書データベースは、全文データベースに替えて、キーワードを用いたデータベースとしても良い。また、翻訳装置に応用することも可能である。翻訳は、単に文法情報を用いて言語間の変換を行なおうとしても上手く行かず（必要な規則が無限に大きくなる）、むしろ豊富な用例を用意し、翻訳にマッチした用例を見い出して、これを適用するような形で訳した方が、意味的に正確な翻訳にできることが知られている。そこで、与えられたテキストデータに、本発明を適用して文書を標準化しておき、これを利用して用例を特定するのを容易にするといった使い方が可能である。
【図面の簡単な説明】
【図１】本発明の実施例における全体構成を示す概略構成図である。
【図２】第１実施例における標準化処理を実現する構成を示すブロック図である。
【図３】実施例における標準化処理ルーチンを示すフローチャートである。
【図４】形態素解析処理ルーチンを示すフローチャートである。
【図５】逆引き辞書の構成を例示する説明図である。
【図６】文字の標準化処理ルーチンを示すフローチャートである。
【図７】文字の標準化処理の内容を例示する説明図である。
【図８】共起の連語化処理ルーチンを示すフローチャートである。
【図９】連語化の処理様子を示す説明図である。
【図１０】同じく連語化における文節の入れ替えの様子を示す説明図である。
【図１１】同じく連語化の様子を示す説明図である。
【図１２】自立語の標準化処理ルーチンを示すフローチャートである。
【図１３】図１３は、自立語の置き換えの様子を模式的に示した説明図である。
【図１４】常用漢字外の忌避がオンになっている場合の自立語の置き換えの一例を示す説明図である。
【図１５】いずれの置き換えを行なうか否かを示す標準化規則ＤＡＤの一例を示す説明図である。
【図１６】付属語の置き換えの様子を模式的に示した説明図である。
【図１７】第２実施例として、クライアント４０において実行される検索時の処理を示すフローチャートである。
【図１８】第２実施例における検索画面の一例を示す説明図である。
【符号の説明】
１０…ネットワーク
１１…キーボード
１２…マウス
２０…ルータ
２２…ＣＰＵ
２３…ＲＯＭ
２４…ＲＡＭ
２５…タイマ
２６…表示回路
２７…ハードディスク
３０…モニタ
４０…クライアント
２００…データベースサーバ
２０５…文書入力部
２１０…形態素解析部
２２０…辞書検索部
２３０…形態素解析用辞書
２４０…データベース
２４０…標準化ルールデータベース
２５０…標準化処理部
２６０…データベース
２６０…ログ管理部
２７０…ハードディスク
２７０…文書出力部
２８０…ログ出力部

[0001]
[Technical field to which the invention belongs]
The present invention relates to a technique for standardizing a document by processing the document.
[0002]
[Prior art]
Text data search is needed in many fields, such as searching patent gazettes and literature databases. The usual approach is simply to accumulate a large amount of text data as a database and to use pattern matching to find the stored documents that contain the target word. To make such searches easier, it has been proposed to use a thesaurus or the like to also search for concept words similar to the search word (for example, when the search word is 「自動車」 (automobile), also searching for 「車」 (car) and 「車両」 (vehicle)), or to search with differences in notation taken into account (for example, also searching for the variant spelling 「車輌」 when the search word is 「車両」).
[0003]
To realize such techniques, two kinds of measures have been taken. One is, when registering a document in the database, to derive words that are likely to be searched for and store them as keywords of the document. The other is, at search time, to use an expression format such as a so-called regular expression so that words differing by, say, a single character are also matched. For example, in Japanese Patent Laid-Open No. 10-240742, when character string information is stored, input candidate character strings other than the entered original character string are generated, converted into storage character strings, and stored in a database. At search time, when the searcher enters search character string information, searchable candidate character strings other than the search character string are generated, and the search is performed by matching the search character string information against the stored character strings.
[0004]
[Problems to be solved by the invention]
However, with this approach, derivative words must be generated for each of a large number of words when documents are registered in the database, which requires an enormous amount of processing. For example, when the term 「切換」 (switching) has notational variants such as 「切り換え」, 「切替」, 「切り替え」, and 「切替え」, generating and storing all of these candidate character strings for every document to be stored takes processing time and requires a huge storage capacity.
[0005]
In addition, when different words each have their own notational variants, the unification has to be specified word by word. Taking 「切り替え」 (switching) and 「書き換え」 (rewriting) as examples, unifying the former to 「切替」 and unifying the latter to 「書替」 are separate tasks, so each had to be specified individually. Furthermore, when several notations exist, as in the 「切り替え」 example above, it was necessary to specify which notation to use.
[0006]
An object of the present invention is to solve these problems and, by standardizing documents, to reduce the effort required for various kinds of subsequent document processing, for example, searching.
[0007]
[Means for solving the problems and their functions and effects]
The gist of the document standardization method of the present invention, which solves at least part of the above problems, is to:
input a document that forms a coherent unit,
perform morphological analysis on the document to extract words together with their grammatical information,
apply predetermined standardization processing to the extracted words, and
output a document reconstructed from the standardized words.
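The four steps of the method can be sketched as a small pipeline. The following is only an illustration under stated assumptions, not the patented implementation: the toy lexicon, the greedy longest-match segmenter standing in for real morphological analysis, and the names `LEXICON`, `RULES`, `analyze`, and `standardize_document` are all hypothetical.

```python
from typing import NamedTuple


class Token(NamedTuple):
    surface: str  # the word as written in the document
    pos: str      # grammatical information (here just a part-of-speech label)


# Hypothetical toy lexicon: surface form -> part of speech.
LEXICON = {"庭球": "noun", "テニス": "noun", "を": "particle", "する": "verb"}

# Hypothetical standardization rules keyed by (surface, part of speech).
RULES = {("庭球", "noun"): "テニス"}


def analyze(text: str) -> list[Token]:
    """Greedy longest-match segmentation; a stand-in for morphological analysis."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in LEXICON:
                tokens.append(Token(text[i:j], LEXICON[text[i:j]]))
                i = j
                break
        else:
            # No lexicon entry starts here: pass the character through unchanged.
            tokens.append(Token(text[i], "unknown"))
            i += 1
    return tokens


def standardize_document(text: str) -> str:
    """Input -> analysis -> word-level standardization -> reconstructed output."""
    tokens = analyze(text)
    return "".join(RULES.get((t.surface, t.pos), t.surface) for t in tokens)


print(standardize_document("庭球をする"))
```

Because the rules are keyed by word and part of speech rather than by raw substring, a replacement fires only when the segmenter has recognized the whole word, which is the distinction from simple string replacement that the specification emphasizes.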
[0008]
The gist of the related invention of a method for constructing a document database is to:
input a document that forms a coherent unit,
perform morphological analysis on the document to extract words together with their grammatical information,
apply predetermined standardization processing to the extracted words, and
store the document reconstructed from the standardized words as a database.
[0009]
Furthermore, the gist of the related invention of a document search method is to:
prior to searching for documents,
input a document that forms a coherent unit,
perform morphological analysis on the document to extract words together with their grammatical information,
apply predetermined standardization processing to the extracted words, and
store the document reconstructed from the standardized words in advance as a database; and,
when searching for documents,
compare a designated search word with the documents stored in the database and identify the documents that contain the search word.
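The search method can be illustrated with a toy in-memory database that standardizes each document before storing it. This is a sketch under assumptions: the `STANDARD_FORM` table, the `standardize` function, and the `DocumentDatabase` class are hypothetical, and longest-match lookup stands in for morphological analysis. Standardizing the query with the same function follows the second embodiment described later in the specification; the method summarized above only requires comparing the search word with the stored documents.

```python
# Hypothetical lexicon mapping each known surface form to its standard form.
STANDARD_FORM = {
    "車両": "車両", "車輌": "車両",
    "切替": "切替", "切り替え": "切替", "切替え": "切替",
    "の": "の", "を": "を",
}


def standardize(text: str) -> str:
    """Longest-match segmentation, then map each word to its standard form."""
    out, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in STANDARD_FORM:
                out.append(STANDARD_FORM[text[i:j]])
                i = j
                break
        else:
            out.append(text[i])  # unknown character: keep as written
            i += 1
    return "".join(out)


class DocumentDatabase:
    def __init__(self) -> None:
        self.docs: dict[int, str] = {}

    def add(self, doc_id: int, text: str) -> None:
        # Documents are standardized once, at registration time.
        self.docs[doc_id] = standardize(text)

    def search(self, word: str) -> list[int]:
        # Assumption: the query is standardized the same way before matching.
        key = standardize(word)
        return sorted(i for i, text in self.docs.items() if key in text)


db = DocumentDatabase()
db.add(1, "車輌の切り替え")
db.add(2, "車両を切替え")
db.add(3, "自動車")
print(db.search("車両"), db.search("切替え"))
```

Only one form of each word is stored per document, so no list of derived variants has to be generated or kept, which is the saving the specification claims over the prior art.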
[0010]
In these inventions, words are extracted together with their grammatical information by morphologically analyzing the document, so appropriate standardization can be applied to them. That is, because words are extracted, standardization is applied appropriately word by word rather than as simple string replacement. The document reconstructed from the standardized words may be output, for example, as a file or to a display, or it may be used to build a database. In such a database the documents are, in principle, stored in standardized form, so searches can be performed very easily.
[0011]
In this standardization, the predetermined standardization processing can include at least one of:
(a) character standardization, which replaces characters with predetermined characters;
(b) collocation processing, which adjusts the relationship between words in a co-occurrence relationship to a predetermined relationship;
(c) notation unification, which unifies notational variants into a predetermined notation;
(d) independent word processing, which replaces an independent word with another independent word according to predetermined replacement criteria;
(e) attached word processing, which replaces an attached word with another attached word according to predetermined rules.
By adopting at least one of these processes, documents can be standardized at various levels.
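The idea that any subset of the processes may be enabled can be sketched as a table of stages selected by the user's settings. The sketch below is illustrative only and implements just three of the five processes on an already segmented word list; the tables `NOTATION` and `ATTACHED`, the stage keys, and the function names are assumptions, and Unicode NFKC normalization is used as a convenient stand-in for a character standardization rule set.

```python
import unicodedata
from typing import Callable

Words = list[str]


def character_standardization(words: Words) -> Words:
    # NFKC folds full-width alphanumerics and half-width katakana to their usual forms.
    return [unicodedata.normalize("NFKC", w) for w in words]


NOTATION = {"ユーザ": "ユーザー"}      # hypothetical table for notation unification
ATTACHED = {"なくっちゃ": "なくては"}  # hypothetical table for attached word processing


def notation_unification(words: Words) -> Words:
    return [NOTATION.get(w, w) for w in words]


def attached_word_processing(words: Words) -> Words:
    return [ATTACHED.get(w, w) for w in words]


# Stages in the order they would run; collocation and independent word
# processing are omitted from this sketch.
STAGES: dict[str, Callable[[Words], Words]] = {
    "character": character_standardization,
    "notation": notation_unification,
    "attached": attached_word_processing,
}


def standardize(words: Words, enabled: set[str]) -> Words:
    """Run only the stages the user has switched on."""
    for name, stage in STAGES.items():
        if name in enabled:
            words = stage(words)
    return words


sample = ["ＰＣ", "ユーザ", "は", "勉強", "し", "なくっちゃ"]
print(standardize(sample, {"character", "notation", "attached"}))
print(standardize(sample, {"notation"}))
```

Keeping each process as an independent function over the word list is one way to let the enabled set and the execution order vary without changing the processes themselves.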
[0012]
These standardization processes can be realized as word replacement performed by referring to dictionaries prepared in advance. Because morphological analysis extracts words together with their grammatical information, looking them up in a dictionary is straightforward. The morphological analysis itself can likewise be realized using a morphological analysis dictionary prepared in advance. Of course, morphological analysis can also be performed by relying on an algorithm.
[0013]
The standardization processes above can be carried out in various orders. For example, it is preferable to perform independent word processing (d) after character standardization (a). In this way, variations of an independent word such as half-width 「WINDOWS」 and 「ウィンドウズ」 and full-width 「ＷＩＮＤＯＷＳ」 and 「ウィンドウズ」 can be standardized reliably with a simple operation.
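Why this order helps can be seen in a short sketch: once width and case differences are folded away, the independent word table needs only one entry per spelling instead of one per width and case variant. The sketch is an assumption-laden illustration, not the embodiment: Unicode NFKC normalization plus case folding stands in for character standardization, and the table `INDEPENDENT_WORDS` and the choice of 「Windows」 as the standard form are hypothetical.

```python
import unicodedata

# Hypothetical independent-word table, keyed by the character-standardized form.
INDEPENDENT_WORDS = {"windows": "Windows", "ウィンドウズ": "Windows"}


def standardize_word(word: str) -> str:
    """Character standardization first, then independent word replacement."""
    # Fold full-width Latin letters and half-width katakana to their usual forms.
    folded = unicodedata.normalize("NFKC", word)
    # Look up with case folded so that WINDOWS, Windows, and windows share a key.
    return INDEPENDENT_WORDS.get(folded.casefold(), folded)


variants = ["WINDOWS", "ＷＩＮＤＯＷＳ", "ｳｨﾝﾄﾞｳｽﾞ", "ウィンドウズ"]
print([standardize_word(v) for v in variants])
```

Without the first step, the table would have to list every combination of width and case separately, which is the kind of rule growth the specification aims to avoid.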
[0014]
It is likewise preferable to perform independent word processing after collocation processing (b). Collocation processing adjusts the relationship between words in a co-occurrence relationship to a predetermined relationship, and performing it first makes the independent word processing more reliable. For example, suppose that independent word processing replaces the collocation 「腹が」 + 「立つ」 (literally "the belly stands", an idiom meaning to get angry) with 「怒る」 (to get angry). If 「腹が」 + 「ひどく」 + 「立つ」 is first converted by collocation processing into 「ひどく」 + 「腹が」 + 「立つ」, where 「ひどく」 means "terribly", the subsequent independent word processing can easily standardize it to 「ひどく」 + 「怒る」. It is also preferable to perform notation unification (c) at least after independent word processing (d), so that the unified notation is not broken again by the independent word processing.
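The 「腹が立つ」 example can be traced in code. The sketch below works on a list of already segmented phrases and is a simplification under assumptions: `COLLOCATIONS` and `INDEPENDENT` are hypothetical dictionaries, matching is on exact surface forms rather than stems, and the check on whether the sentence structure permits moving a phrase, which the embodiment performs, is omitted.

```python
# Hypothetical co-occurrence dictionary: (earlier phrase, later phrase).
COLLOCATIONS = {("腹が", "立つ")}

# Hypothetical independent-word table applied after collocation processing.
INDEPENDENT = {"腹が立つ": "怒る"}


def collocate(phrases: list[str]) -> list[str]:
    """Move each phrase next to its co-occurring partner and merge the pair."""
    phrases = list(phrases)
    n = 0
    while n < len(phrases):
        partner = next(
            (m for m in range(n + 1, len(phrases))
             if (phrases[n], phrases[m]) in COLLOCATIONS),
            None,
        )
        if partner is None:
            n += 1
            continue
        head = phrases.pop(n)  # the partner now sits at index partner - 1
        phrases[partner - 1] = head + phrases[partner - 1]
    return phrases


def replace_independent(phrases: list[str]) -> list[str]:
    return [INDEPENDENT.get(p, p) for p in phrases]


sentence = ["腹が", "ひどく", "立つ"]
print(replace_independent(collocate(sentence)))  # collocation first
print(replace_independent(sentence))             # replacement alone finds nothing
```

Running the replacement alone leaves the sentence unchanged because 「腹が」 and 「立つ」 are separated; only after the two phrases are brought together does the table entry match.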
[0015]
When the standardization processing yields two or more results, it is also desirable to display one of them together with an indication that multiple results exist. The user performing the standardization can then tell that there are multiple results and, if necessary, select another candidate. From the standpoint of candidate selection, it is also desirable to display the other results one after another as next candidates in response to the user's operation.
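A small data structure is enough to illustrate this behavior: show one result, flag that others exist, and step through them on request. The class name `Candidates`, the marker text, and the wrap-around when the last candidate is passed are assumptions made for this sketch; the two variants of 「売り上げ」 are the ones the specification itself gives as an example.

```python
from dataclasses import dataclass


@dataclass
class Candidates:
    options: list[str]  # all standardization results, preferred one first
    index: int = 0      # which result is currently shown

    @property
    def current(self) -> str:
        return self.options[self.index]

    def display(self) -> str:
        """Show the current result, flagged when other candidates exist."""
        marker = " (more candidates)" if len(self.options) > 1 else ""
        return self.current + marker

    def next_candidate(self) -> str:
        """Advance to the next result; wrapping around is an assumption."""
        self.index = (self.index + 1) % len(self.options)
        return self.current


uriage = Candidates(["売上げ", "売りあげ"])
print(uriage.display())
print(uriage.next_candidate())
```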
[0016]
Each of these inventions can also be understood as an apparatus that executes the above method, as a program that runs on a computer and realizes the above functions, or as a recording medium on which such a program is recorded. The apparatus may realize the document input, morphological analysis, standardization, output, database construction, and so on by executing a program on a computer, or it may be realized with a discrete circuit configuration. The program may be written in a well-known programming language such as C, Pascal, Fortran, COBOL, or BASIC; an object-oriented programming language or a language such as JavaScript can also be used. Various recording media can be used, such as a flexible disk, CD-ROM, DVD-ROM, or semiconductor memory (ROM, PROM, EEPROM, flash memory, etc.). Of course, the program can also be stored on a server on a network such as the Internet and downloaded to a client computer for use.
[0017]
[Other aspects of the invention]
The standardization technique of the present invention can also be used for translation, for example. In translation, it is effective to compile translation examples into a database, and building such a database from plain text that is free of the stylistic habits of the individual translators is extremely useful when searching it for translation purposes. Applying the same standardization is also effective when a search engine on the Internet or the like crawls a large number of web pages and compiles them into a database, because the creation of websites is basically left to individuals and the wording of the documents is therefore not unified.
[0018]
[Embodiments of the invention]
Hereinafter, embodiments of the present invention will be described based on examples.
(1) Configuration of the embodiment:
First, the configuration of the embodiment will be described with reference to FIG. 1. FIG. 1 is a schematic configuration diagram showing the system that constructs the database in this embodiment. The system is realized as a database server 200 connected to a large-scale network 10 such as the Internet. Clients (not shown) are connected to the network 10.
[0019]
The database server 200 includes a network interface (NT-I/F) 21 that controls data exchange with the network 10 via a modem or router 20, a CPU 22 that performs processing, a ROM 23 that stores processing programs and fixed data, a RAM 24 used as a work area, a timer 25 that manages time, a display circuit 26 that controls display on the monitor 30, a hard disk (HD) 27 that stores the various data described later, an input interface (I/F) 28 that handles the interface with the keyboard 11 and mouse 12, and so on. Although the hard disk 27 is described as a fixed disk, it may be removable, and a removable storage device (for example, a CD-ROM, CD-R, CD-RW, DVD-ROM, DVD-RAM, or flexible disk) can also be used alongside it. In this embodiment, the processing program of the server 200 is stored in the ROM 23, but it may instead be stored on the hard disk 27 and loaded into the RAM 24 at startup for execution. Alternatively, it may be read from the removable recording medium mentioned above, or read from another server via the network 10 and executed.
[0020]
The server 200 shown in FIG. 1 standardizes documents (text data) entered from the keyboard 11 and text data taken in from outside via the network 10, and ultimately builds a document database on the hard disk 27. Search processing can then be performed on the document data in the database, either from the server 200 itself or from any client connected via the network 10.
[0021]
As described above, the server 200 is provided with hardware such as the CPU 22 and the ROM 23. By executing the program described later on the server 200, the configuration shown in FIG. 2 is realized. In other words, the server 200 provides the same functions as the function realizing means shown in that figure. As illustrated, the server 200 includes a document input unit 205, a morphological analysis unit 210, a dictionary search unit 220, a morphological analysis dictionary 230, a standardization rule database 240, a standardization processing unit 250, a log management unit 260, a document output unit 270, and a log output unit 280.
[0022]
Here, the document input unit 205 handles document input: it accepts a document typed on the keyboard 11 or takes in a document stored in advance on the hard disk 27 or the like. The morphological analysis unit 210 performs morphological analysis on the text data of the input document. It analyzes the morphemes of text written in a mixture of kanji and kana, and obtains the independent words and attached words constituting the text together with their grammatical information. The standardization processing unit 250 performs standardization processing on the morphologically analyzed text data. The standardization processes that can be executed are:
(A) Character standardization processing (replacing characters with predetermined characters),
(B) Collocation processing (correcting the relationship of words having a co-occurrence relationship to a predetermined relationship),
(C) Notation unification processing (unifying variations of notation into a predetermined notation),
(D) Independent word processing (replacing independent words with other independent words in accordance with predetermined replacement criteria),
(E) Attached word processing (replacing an attached word with another attached word according to a predetermined rule).
Not all of these processes need to be executed; the required processes (at least one) are executed according to the user's settings.
[0023]
The document output unit 270 outputs the standardized text data to the outside. In this embodiment, the text data is stored as a database on the hard disk 27. However, the standardized text data may simply be displayed on the monitor 30 or printed on a printer (not shown) or the like. Alternatively, it may be output to an external client machine via the network 10.
[0024]
The dictionary search unit 220 is used to refer to the morphological analysis dictionary 230 and the standardization rule database 240. When the morphological analysis unit 210 or the standardization processing unit 250 needs to refer to the dictionary or the database, it accesses the dictionary 230 or the database 240 via the dictionary search unit 220, which retrieves the necessary information and passes it back to the morphological analysis unit 210 or the standardization processing unit 250. A separate dictionary search unit 220 may be provided for each of the morphological analysis dictionary 230 and the standardization rule database 240.
[0025]
The log management unit 260 and the log output unit 280 manage and output a log of the standardization processing. As described above, standardization is performed at various levels, from character standardization to collocation processing, so the log is managed and output so that it can be consulted whenever necessary to see what kind of processing has been performed. The log records information such as the document processed, the content of the standardization performed, and any errors that resulted.
[0026]
(2) Overview of processing in the embodiment:
Next, the standardization process realized by the standardization processing unit 250 will be described with reference to FIG. 3. FIG. 3 is an explanatory diagram showing an outline of the processing executed by the standardization processing unit 250. In this figure, the standardization processing unit 250 is depicted as executing all of the standardization processes, but in practice it suffices to execute at least one of them. Which standardization process, or which combination of processes, is executed is determined by the user through initial settings (properties or the like). When the standardization processing routine shown in FIG. 3 is started, a document reading process is executed first (step S300). This process corresponds to the document input unit 205; the document may be entered from the keyboard 11, or a document TXT (text data) already created and stored on the hard disk 27 or the like may be read out. For example, an icon representing execution of the standardization process may be displayed on the so-called desktop of the monitor 30, so that dragging and dropping a text file onto it with the mouse 12 starts the standardization process shown in FIG. 3 and causes the file to be read.
[0027]
The reading of the document may be realized by reading all data at once, or may be sequentially read from the text data by using a line feed or the like as a delimiter code. If possible, it may be read in “sentence” units using punctuation marks. In any case, it is desirable to assign an identification number to each sentence and use it for management thereafter. Note that the text data may be actually expanded on the RAM 24 so that it can be processed, or may be stored in the hard disk 27 or the like so that it can be accessed randomly or sequentially after an identification number is assigned.
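The sentence-by-sentence reading described above can be sketched as follows. This is only an illustration: the function name `split_into_sentences`, the set of sentence-ending marks, and the sample text are assumptions, not taken from the embodiment.

```python
def split_into_sentences(text, enders="。．！？"):
    """Split text at line feeds and after sentence-ending marks.

    Returns a dict mapping an identification number (starting at 1)
    to each sentence, so later steps can refer to sentences by number.
    """
    sentences, current = [], ""
    for ch in text:
        if ch == "\n":                      # a line feed always ends the unit
            if current.strip():
                sentences.append(current.strip())
            current = ""
            continue
        current += ch
        if ch in enders:                    # a punctuation mark ends the sentence
            sentences.append(current.strip())
            current = ""
    if current.strip():                     # trailing text without an end mark
        sentences.append(current.strip())
    return {i: s for i, s in enumerate(sentences, start=1)}


# Hypothetical sample text
numbered = split_into_sentences("今日は晴れ。\n明日は雨。傘が要る")
```

The returned numbers play the role of the identification numbers assigned to each sentence; whether the sentences are then held in memory or written to disk is left open, as in the text.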
[0028]
After the document has been read in this way, morphological analysis is performed first (step S310). This process corresponds to the morphological analysis unit 210 and involves referring to the morphological analysis dictionary 230 via the dictionary search unit 220. In practice, the words constituting the document are determined by morphological analysis with reference to the reverse lookup dictionary IDC stored on the hard disk 27. The details of the morphological analysis process are illustrated in the drawings and described below. The reverse lookup dictionary is derived from an ordinary dictionary for kana-kanji conversion, in which converted character strings such as kanji and katakana are listed together with grammatical information under kana headings; the reverse lookup dictionary arranges the same entries in reverse, so that the converted character strings serve as headings. Consequently, for a character string such as "school", for example, it is possible to obtain its reading and grammatical information such as the fact that it is a noun.
[0029]
When the morphological analysis process is started, a sentence with an identification number is first selected as the object of analysis, and L characters (L = 1, 2, ...) starting from the M-th character (M = 1, 2, ...) from the beginning of the sentence are extracted and looked up in the reverse lookup dictionary IDC (step S12). M indicates the head position of the character string of interest, and L indicates the number of characters to be extracted. The lookup starts with M = 1 and L = 1, that is, with the single character at the head of the sentence, and any corresponding word found in the dictionary is extracted. The dictionary IDC is then consulted repeatedly while L is incremented. When no corresponding entry exists, the head position M is incremented, the number of characters L is reset to 1, and the lookup continues. When the head position or the number of characters to be extracted goes beyond the end of the sentence being analyzed, the dictionary lookup ends.
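The lookup loop over head position M and length L can be sketched as below. This is a simplified illustration: it tries every length at every head position rather than stopping at the first miss, positions are 0-based, and the dictionary contents and names (`lookup_candidates`, the toy `IDC`) are assumptions made for the example.

```python
def lookup_candidates(sentence, dictionary):
    """Collect every substring of the sentence that is a dictionary heading.

    Returns a list of (start, length, word, grammar_info) tuples.
    """
    hits = []
    for start in range(len(sentence)):                      # head position M
        for length in range(1, len(sentence) - start + 1):  # character count L
            piece = sentence[start:start + length]
            if piece in dictionary:
                hits.append((start, length, piece, dictionary[piece]))
    return hits


# Toy stand-in for the reverse lookup dictionary: heading -> grammatical information
IDC = {"品": "noun", "質": "noun", "品質": "noun", "を": "particle"}
hits = lookup_candidates("品質を", IDC)
```

The result deliberately contains overlapping candidates; choosing among them is the job of the cost calculation that follows.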
[0030]
For example, looking up the reverse lookup dictionary IDC for the sentence "A car named DD is a sedan that emphasizes quality" yields words such as "DD", "to", "say", "to", "i", "u", "car", "ha", "quality", "do", "important", "do", "do", "ta", "sedan", "de", "are", "is", and "a". Single kana such as "i", "u", "a", "shi", and "ta" are also extracted as words here because they may appear in a sentence as word stems, for example as the stem of a verb such as "uru" (to sell).
[0031]
In the reverse lookup dictionary IDC, these words are stored together with their grammatical information. The extracted words are therefore next arranged according to the grammatical information, and a search is made for an arrangement that does not break down grammatically. Known methods for this analysis include the longest-match method over multiple phrases and the minimum cost method, which test which combination of candidate words is most plausible as Japanese. Since the present embodiment adopts the minimum cost method, the cost is next calculated for the many word sequences obtained in this way (step S314). The cost is defined in advance so that a sequence that is more plausible as Japanese receives a lower score. Roughly, the rule is that an independent word has a cost of 2 and an attached word joined to it has a cost of 0. Taking "quality" as an example, "quality" + "o" is a sequence of independent word + attached word (particle) and costs 2, whereas "goods" + "quality" + "o" is a sequence of independent word + independent word + attached word (particle) and costs 4. The rules of the minimum cost method are tuned to actual Japanese usage, and various additional rules are provided; for example, when words having a co-occurrence relationship such as "nothing" + "no" occur in the sentence, a cost of "-1" is applied.
[0032]
In this way, the cost is calculated for all the words obtained by referring to the reverse lookup dictionary, and the word sequence with the lowest cost is selected (step S315). In the above example, "quality" (independent word / noun) + "o" (attached word / particle) is judged more plausible as Japanese than "goods" (independent word / noun) + "quality" (independent word / noun) + "o" (attached word / particle). This calculation is performed at least sentence by sentence, and the arrangement of words is chosen so as to minimize the cost of the entire sentence. Therefore, if a cost reduction arises from a co-occurrence relationship, for example, "goods" + "quality" + "o" may be selected instead of "quality" + "o".
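The selection of the lowest-cost word sequence can be sketched as a small dynamic program. This sketch covers only the rough rule stated above (independent word = 2, attached word = 0); the co-occurrence discounts and the grammatical connection checks are omitted, and the dictionary contents and names are assumptions.

```python
def min_cost_segmentation(sentence, dictionary, cost_of):
    """Return (total_cost, words) for the cheapest way to cover the sentence
    with dictionary words, or None if no complete cover exists."""
    n = len(sentence)
    best = [None] * (n + 1)      # best[i] = (cost, words) covering sentence[:i]
    best[0] = (0, [])
    for end in range(1, n + 1):
        for start in range(end):
            if best[start] is None:
                continue
            piece = sentence[start:end]
            if piece not in dictionary:
                continue
            cost = best[start][0] + cost_of[dictionary[piece]]
            if best[end] is None or cost < best[end][0]:
                best[end] = (cost, best[start][1] + [piece])
    return best[n]


# Toy dictionary: heading -> word class; costs follow the rough rule in the text
IDC = {"品": "independent", "質": "independent", "品質": "independent", "を": "attached"}
COST = {"independent": 2, "attached": 0}
result = min_cost_segmentation("品質を", IDC, COST)
```

With this toy data the one-word reading of 品質 (cost 2) wins over the two-word reading 品 + 質 (cost 4), mirroring the "quality" example. Adding a negative cost for co-occurring words would be the natural place to extend the sketch.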
[0033]
When the morphological analysis by the minimum cost method is completed in this way, the sentence structure is analyzed (step S316). This process expresses how the words constituting a sentence are connected in terms of logical product and logical sum, and is used, for example, when a compound sentence is separated into two sentences. It is not described further in this embodiment. After the above processing, the morphologically analyzed data is output (step S318). The data may be passed directly to the subsequent standardization processing, or may be stored temporarily on the hard disk 27 with an identification code.
[0034]
Next, various standardization processes are executed on the sentence thus morphologically analyzed (see FIG. 3). The standardization processes are:
(1) Character standardization processing (step S320)
(2) Co-occurrence collocation processing (step S330)
(3) Independent word processing (step S340)
(4) Notation unification processing (step S350)
(5) Attached word processing (step S360)
As described above, it is not necessary to execute every standardization process; only the processes that match the user's intention need be performed. Further, when a plurality of standardization processes are performed, the order is not limited to the one above, and they may be performed in another order.
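One way to picture user-selected, reorderable standardization steps is a list of named stages with an on/off setting for each. The following is a minimal sketch under that assumption; the stage names, the `run_pipeline` helper, and the two toy stages are illustrative and are not the embodiment's own interfaces.

```python
def run_pipeline(words, stages, enabled):
    """Apply each enabled stage, in list order, to a list of words."""
    for name, stage in stages:
        if enabled.get(name, False):   # stages not switched on are skipped
            words = stage(words)
    return words


# Toy stages standing in for two of the five processes (hypothetical rules)
stages = [
    ("character", lambda ws: [w.replace("、", "，") for w in ws]),
    ("independent", lambda ws: ["car" if w == "automobile" else w for w in ws]),
]

only_characters = run_pipeline(["automobile", "、"], stages, {"character": True})
everything = run_pipeline(["automobile", "、"], stages,
                          {"character": True, "independent": True})
```

Reordering the processing then amounts to reordering the `stages` list, and the user's initial settings correspond to the `enabled` mapping.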
[0035]
First, character standardization processing will be described with reference to FIG. 6. When the character standardization process is started, the standardization rule CSD is consulted first (step S322). The standardization rule CSD is stored in advance on the hard disk 27 and defines the rules by which characters are standardized. These rules have default values but can be changed by the user. As shown in FIG. 7, character standardization in this embodiment means replacing parentheses, quotation marks, general symbols, alphanumeric characters, reading marks (commas), punctuation marks (periods), half-width katakana, name-joining characters, and long sound symbols according to fixed rules. In column A of FIG. 7, "x" indicates that surrounding characters need not be considered in the replacement, and "◯" indicates that they must be considered. Column B shows the scope of replacement. Where the scope is "sentence", a replacement can change the result of morphological analysis; for example, if a "−" (minus sign) is replaced with a long sound symbol, the word segmentation may differ from that obtained earlier. Therefore, when a long sound symbol is replaced, the sentence structure may be revised with reference to the reverse lookup dictionary IDC.
[0036]
As an example of character standardization, consider reading marks and punctuation marks. By default these are replaced with "," and ".". Thus, when the sentence "Coffee is delicious" undergoes character replacement, its reading marks and punctuation marks are rewritten as "," and ".". Because this setting can be changed, other marks can be specified for the reading mark and the punctuation mark, in which case the sentence is rewritten with the specified marks. In addition, as indicated in column A, surrounding characters are taken into account: if the surrounding character string is English, replacement with "," and "." is the default setting.
[0037]
Other examples of character standardization include:
(A) Parentheses: one form of bracket is replaced with another, etc.
(B) Quotation marks: one form of quotation mark is replaced with another, etc.
(C) General symbols: replacement of various symbols (for example, ":", "?", "!"), etc.
(D) Alphanumeric characters: replacement between full-width and half-width, between uppercase and lowercase, etc.
(E) Half-width katakana: replacement between full-width and half-width katakana, etc.
(F) Name-joining characters: "Queen = Elizabeth" is replaced with "Queen Elizabeth", etc.
[0038]
Using these rules, a process of changing each character is performed (FIG. 6, step S324). Thereafter, it is determined whether or not the replacement has been completed for all the characters (step S326), and the replacement according to the rule is performed until all the characters are completed.
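A rule-driven character replacement of this kind can be sketched with a translation table. In the sketch below the rule table `DEFAULT_RULES`, the choice of full-width "，" and "．" as targets, and the use of Unicode NFKC normalization to widen half-width katakana are all assumptions made for illustration; the sketch also ignores the surrounding-character conditions of column A.

```python
import unicodedata

# Hypothetical default rule table: source character -> replacement character
DEFAULT_RULES = {"、": "，", "。": "．"}


def standardize_characters(text, rules=DEFAULT_RULES, fold_width=True):
    """Replace characters one by one according to a user-changeable rule table."""
    if fold_width:
        # NFKC turns half-width katakana into full-width katakana
        # (it also turns full-width alphanumerics into ASCII).
        text = unicodedata.normalize("NFKC", text)
    # The rule table is applied after width folding so its targets are kept.
    return text.translate(str.maketrans(rules))


sample = standardize_characters("ｺｰﾋｰは、おいしい。")
custom = standardize_characters("はい、そう。", rules={"、": ","}, fold_width=False)
```

Passing a different `rules` dictionary corresponds to the user changing the defaults, as the second call shows.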
[0039]
After the character standardization processing described above, co-occurrence collocation processing is executed (FIG. 3, step S330). Details of this processing are illustrated in the drawings and described below. When co-occurrence collocation processing is started, attention is first focused on phrase N of the sentence obtained by morphological analysis (step S331). At the start of processing, N = 1. Next, with reference to the co-occurrence dictionary RGD, the phrases that follow are searched toward the end of the sentence (step S332). FIG. 9 shows the state of this search for the sentence "I go to school in a hurry". By morphological analysis, the phrases "I" + "to school" + "in a hurry" + "go" have been extracted. More precisely, each phrase has been analyzed as an independent word + an attached word (+ an attached word ...).
[0040]
First, attention is focused on phrase N = 1, that is, "I", and the following phrases N = 2, 3, 4, that is, "to school", "in a hurry", and "go", are searched. The search checks whether any of them forms, together with the phrase of interest, a combination listed in the co-occurrence dictionary RGD. The search is therefore not made on the phrase as a whole but on the word in the phrase and its stem. While this search is carried out, it is determined whether there is a phrase having a co-occurrence relationship (step S333). In the example shown in FIG. 9, no corresponding entry is found in the co-occurrence dictionary for "I"; next, for N = 2, the co-occurrence relationship "to school" + "go" is found in the co-occurrence dictionary RGD. When a phrase having a co-occurrence relationship is found, it is next determined whether the phrases can be rearranged (step S334). If the two phrases in the co-occurrence relationship are already adjacent, there is no need to rearrange them. Moreover, even when a co-occurrence relationship is found between phrases located apart from each other, the sentence structure may not permit rearrangement. For example, in the sentence "I will call the school and then go", even if a co-occurrence relationship between "to school" and "go" is found, the sentence cannot necessarily be rewritten as "I will call and then go to school". This is because of constraints imposed by the sentence structure.
[0041]
If it is determined that the two phrases found to be in a co-occurrence relationship are apart from each other and that the sentence structure permits rearrangement, the positions of the phrases are rearranged (step S335). As a result, as shown in FIG. 10, the sentence becomes "I" + "in a hurry" + "to school" + "go". Collocation is then performed (step S336). That is, since a co-occurrence relationship is recognized between two now-adjacent phrases, they are combined into a collocation and treated as one phrase. This state is also illustrated in the drawings. In this embodiment, collocation based on a co-occurrence relationship has been described as combining two phrases into one; in some cases, however, three or more phrases can be combined into one phrase.
[0042]
Thereafter, the phrase of interest is advanced by one (step S337), and it is determined whether the co-occurrence processing has been completed for all phrases (step S338). If it has not, the process returns to step S332 and continues. When the co-occurrence processing has been completed for all phrases, the process exits to "NEXT" and the routine ends. In the above flow, the search for phrases having a co-occurrence relationship proceeds in order from the first phrase of the sentence. If, however, a method is adopted in which the receiving side of the so-called "dependency" is identified first, the search can instead proceed sequentially from the end of the sentence. Which direction to search depends on the configuration of the dictionary and on the search algorithm.
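The loop of steps S331 to S338 can be sketched as follows. This is a simplified illustration: phrases are matched as whole strings rather than by word and stem, the sentence-structure check is reduced to a caller-supplied function `can_move` that defaults to always allowing the move, and the co-occurrence dictionary is a plain set of pairs. All names and data are assumptions.

```python
def collocate(phrases, cooccur, can_move=lambda phrases, i, j: True):
    """Bring co-occurring phrases together and merge each pair into one phrase."""
    phrases = list(phrases)
    n = 0
    while n < len(phrases):
        # Search the following phrases for a co-occurrence partner (S332, S333)
        partner = next((m for m in range(n + 1, len(phrases))
                        if (phrases[n], phrases[m]) in cooccur), None)
        if partner is None:
            n += 1
            continue
        if partner > n + 1:                          # the pair is separated
            if not can_move(phrases, n, partner):    # structure forbids the move (S334)
                n += 1
                continue
            moved = phrases.pop(n)                   # rearrange (S335)
            phrases.insert(partner - 1, moved)
            at = partner - 1
        else:
            at = n
        # Merge the now-adjacent pair into a single collocation (S336)
        phrases[at:at + 2] = [phrases[at] + " " + phrases[at + 1]]
        if at == n:
            n += 1          # after a move, position n holds an unexamined phrase
    return phrases


RGD = {("to school", "go")}     # toy co-occurrence dictionary
merged = collocate(["I", "to school", "in a hurry", "go"], RGD)
```

The example reproduces the rearrangement described for FIG. 9 and FIG. 10: "to school" is moved next to "go" and the two are then treated as one phrase.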
[0043]
When character standardization (FIG. 3, step S320) and co-occurrence collocation processing (step S330) have been completed in this way, independent word standardization processing is performed next (step S340). Details of this processing are shown in FIG. 12. When the independent word standardization process shown in FIG. 12 is started, the standardization rules are consulted first (step S342). As with the rules consulted in character standardization, default values are set in advance, and the settings acquired here are ones the user can change. The rules can, of course, also be fixed. Standardization of independent words is basically a process of unifying different expressions among independent words having the same meaning. There are many types of such processing, for example:
(1) Replacement with a more general expression: e.g., tennis ball → tennis
(2) Replacement with a plainer expression: e.g., scarlet → blue
(3) Avoidance of characters outside the common-use kanji: e.g., caress → love and greetings → greetings
(4) Simplification of idioms: e.g., every move, every move
(5) Replacement with the character form in actual use: e.g., Windows → Windows, spiral up → spiral up
(6) Replacement of collocations: e.g., going to school → going to school
and so on.
[0044]
In practice, these processes are carried out by sequentially extracting independent words from the sentence to be standardized and looking them up in the independent-word standardization dictionary IWD (step S344). The dictionary IWD is organized so that each replaceable independent word can be looked up together with its replacement and the rule to which the replacement belongs. Therefore, once the standardization rules have been obtained, the dictionary is consulted, the replacement word matching the rules is read out, and each word is changed (step S346). FIG. 13 is an explanatory diagram schematically showing this replacement. As shown in the figure, the rule settings are consulted first. In the figure, "◎" indicates that a replacement is enabled (ON), and "◯" indicates that it is disabled (OFF). Taking (1) to (6) above as an example, the standardization rules record which of these replacements are to be performed. The words are then read out one by one and looked up in the dictionary IWD. If a word is registered there, it is checked against the replacement rules that are currently ON, and if it matches one of them, the independent word is replaced. This is repeated for all words (step S348). In the example shown in FIG. 13, because rule (3), avoidance of characters outside the common-use kanji, is ON, the notation of "I" is replaced accordingly. Words that have been recognized as having a co-occurrence relationship and combined into a collocation are also replaced with other words as necessary. In this example, the collocation "to school" + "go" is replaced with a single independent word meaning "go to school", and the attached-word part is changed accordingly from "kuyo" to "suruyo".
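The dictionary lookup with ON/OFF rule settings can be sketched as below. The dictionary entries, the rule names, and the function name are hypothetical; the only point illustrated is that a registered word is replaced solely when the rule it belongs to is switched on.

```python
# Hypothetical independent-word dictionary: word -> (replacement, rule it belongs to)
IWD = {
    "automobile": ("car", "general_expression"),
    "WINDOWS": ("Windows", "form_in_use"),
}

# Hypothetical rule settings, corresponding to the ON/OFF marks in the rule table
RULES = {"general_expression": True, "form_in_use": False}


def standardize_independent_words(words, dictionary, rules):
    """Replace each registered word whose rule is currently ON."""
    result = []
    for word in words:
        entry = dictionary.get(word)
        if entry is not None and rules.get(entry[1], False):
            result.append(entry[0])    # registered and its rule is ON
        else:
            result.append(word)        # not registered, or rule is OFF
    return result


replaced = standardize_independent_words(["automobile", "WINDOWS", "runs"], IWD, RULES)
```

Because the dictionary maps each word to a single replacement, the sketch standardizes in one direction only, which matches how the independent-word dictionary is characterized later in the description.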
[0045]
As a result, when the independent word standardization process is completed, every word has been replaced for the types of replacement specified in the standardization rules, and the independent words have been standardized to the desired level.
[0046]
After the independent words have been standardized, standardization of notation variation is performed next (FIG. 3, step S350). Notation variation refers to the ambiguity and latitude of notation in Japanese. Examples include:
(1) Variation in long sound symbols: e.g., user, user
(2) Variation in okurigana: e.g., sales, sales
(3) Variation in written form: e.g., wizards, wizards
(4) Kana notation of compound words: e.g., sales, sales
(5) Notation of foreign words: e.g., Angel, Angel
(6) Variation in repeated characters: e.g., Shosho-do Hall
and so on.
[0047]
Since the outline of this process is similar to the independent word standardization process shown in FIG. 12, no flowchart is shown. As with independent word standardization, the rule settings are consulted first. That is, taking (1) to (6) above as an example, the standardization rule DAD (see FIG. 3) is read out as shown in FIG. 15, and the words are then read out one by one. If a rule stored in the standardization rule DAD applies to a word, the ordinary word dictionary DIC for kana-kanji conversion is searched. Notation variants are registered extensively in this dictionary, so if a word corresponding to the rule specified by the standardization rule DAD is registered in the dictionary DIC, it is read out and the original word is replaced with the differently written word. This is repeated for all words.
[0048]
The procedure differs slightly from the standardization of independent words. This is because the independent-word standardization dictionary is built on the assumption that standardization proceeds in one direction, whereas notation variation is assumed to be standardized in either direction. Notation variation has a wide range of acceptable forms, so it does not lend itself to a judgment that one notation is correct. These notation variants are collected extensively in the word dictionary DIC for kana-kanji conversion and are associated with one another there. Therefore, when notation variation is standardized, the notation variation standardization rule DAD is consulted and the word dictionary DIC is searched for the notation that the rule specifies.
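The either-direction character of notation unification can be sketched by storing variants as groups and letting the rule setting name the preferred form. The group layout, the rule name `long_vowel`, the form labels, and the sample word are assumptions made for the example.

```python
# Hypothetical variant groups as they might be associated in a word dictionary:
# (rule type, {form label: notation})
DIC = [
    ("long_vowel", {"with_mark": "ユーザー", "without_mark": "ユーザ"}),
]


def unify_notation(words, variant_groups, rules):
    """Rewrite every variant to the form that the rule setting names as preferred."""
    index = {}
    for rule, forms in variant_groups:
        preferred = rules.get(rule)
        if preferred is None:              # rule not set: leave this group alone
            continue
        for notation in forms.values():
            index[notation] = forms[preferred]
    return [index.get(w, w) for w in words]


# The same group can be unified in either direction by changing the setting
to_long = unify_notation(["ユーザ", "が"], DIC, {"long_vowel": "with_mark"})
to_short = unify_notation(["ユーザー", "が"], DIC, {"long_vowel": "without_mark"})
```

Unlike a one-way replacement table, nothing in the variant group itself marks one notation as correct; the direction comes entirely from the rule setting.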
[0049]
After notation variation has been standardized in this way, standardization of attached words is performed (step S360). Since the outline of this process is almost the same as the independent word standardization process shown in FIG. 12, no flowchart is shown. It is basically a process of unifying different expressions among attached words having the same meaning. There are many types of such processing, for example,
(1) Simplification of repeated polite expressions: Have been → "Out" Being ”
(2) Modernization of archaic expressions: example, “Cause Whether or not → “Cause Whether or not "
(3) Normalization of simplified expressions: e.g., "I have to study" → "I have to study"
and so on.
[0050]
In practice, these processes are carried out by sequentially extracting attached words from the sentence to be standardized and looking them up in the attached-word standardization dictionary AWD. The dictionary AWD is organized so that each replaceable attached word can be looked up together with its replacement and the rule to which the replacement belongs. Therefore, once the standardization rules have been obtained, the dictionary is consulted, the replacement matching the rules is read out, and each attached word is changed. FIG. 16 is an explanatory diagram schematically showing this replacement. As shown in the figure, the rule settings are consulted first. That is, taking (1) to (3) above as an example, the standardization rules record which of these replacements are to be performed. The words are then read out one by one and looked up in the dictionary AWD. If a word is registered there, it is checked against the replacement rules that are currently ON, and if it matches one of them, the attached word is replaced.
[0051]
As a result, when the attached word standardization process is completed, every attached word has been replaced for the types of replacement specified in the standardization rules, and the attached words have been standardized to the desired level.
[0052]
When all the standardization (steps S320 to S360) shown in FIG. 3 is completed in this way, the server 200 performs a process of registering the standardization result in the document database TDB in the hard disk 27 (step S370). This database is a full-text database of documents, and a full-text search can be performed by a search device described later.
[0053]
(3) Effects of the embodiment:
Documents registered in the document database TDB have been standardized with respect to characters, independent words, notation variation, and attached words, so that most of the differences arising from a writer's habits and wording have been eliminated. The processed document is therefore extremely plain text data and can be used for various purposes. For example, if it is used to construct a full-text database of patent gazettes, technical literature, and the like, noise and omissions in searches of the completed database can be reduced. If a sentence to be translated is standardized, the standardization can serve as a preparatory step for machine translation. Conversely, when a translation database that accumulates translation examples is constructed, differences between translators can be smoothed out. It can also be used in a wide range of comparative studies of documents, such as comparing authors' expressions across periods. In this embodiment, the text data is morphologically analyzed before standardization so as to obtain the necessary grammatical information. Standardization is therefore not limited to simple character replacement; independent words, notation variation, and so on can be standardized collectively using the grammatical information. This reduces the number of rules that must be prepared for standardization. Because grammatical information is available, standardization can be performed easily by referring to the dictionary for kana-kanji conversion, the dictionary of notation variants, the dictionary for replacing independent words, and the like.
[0054]
In the embodiment, standardization is performed in the order (1) character standardization, (2) co-occurrence collocation processing, (3) independent word standardization, (4) notation variation standardization, and (5) attached word standardization, but the processes can be performed in various orders. If independent word standardization is performed after character standardization as in this embodiment, then, for example, half-width / full-width conversion is completed by character standardization before variations in independent words such as "WINDOWS" and "Windows" are standardized, so that the processing can be carried out reliably with simple operations.
[0055]
Similarly, it is preferable to perform independent word standardization after collocation processing. Performing collocation processing beforehand allows independent words to be standardized more reliably. The embodiment described replacing the collocation "to school" + "go" with a single word meaning "go to school"; because "to school" + "in a hurry" + "go" had first been rearranged into "in a hurry" + "to school" + "go" by collocation processing, it was easy in the subsequent independent word processing to standardize the sentence to "immediately" + "go to school". Further, because notation unification is performed after independent word standardization, notation that has once been unified is not undone by the standardization of independent words.
[0056]
The above embodiment did not specifically address the case where standardization yields two or more results. When there are two or more results (for example, "sales" and "sales"), one of them may be displayed preferentially together with an indication that multiple results exist. Such an indication can easily be realized by changing the color of the standardized characters on the monitor 30 or by displaying "next candidate". Indicating that a next candidate exists is preferable because it lets the user performing the processing know that there are multiple results. To select another candidate, the user moves the cursor to the displayed phrase and presses the "next candidate" key to display the next candidate, and, if necessary, selects the desired candidate from among the multiple candidates.
[0057]
In addition, in this embodiment the log management unit 260 manages the standardization log, so a detailed record can be kept of the standardization performed on the input document. As long as the log records which word in the input sentence was changed into which word, the original sentence can always be restored from the standardized sentence. Furthermore, by analyzing the log output from the log output unit 280, it is possible to determine which types of standardization are used frequently. It is also possible to analyze the style of the original document (for example, whether it is written in plain colloquial language) or the writer's habits (for example, a tendency to drop long sound symbols).
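A log that supports both restoration and frequency analysis can be sketched as follows. The record fields (`index`, `before`, `after`, `rule`), the function names, and the sample replacement table are assumptions made for illustration.

```python
from collections import Counter


def standardize_with_log(words, replacements):
    """Replace words and record every change made.

    `replacements` maps a word to (new_word, rule_name).
    Returns the standardized words and the list of log records.
    """
    out, log = [], []
    for i, word in enumerate(words):
        if word in replacements:
            new_word, rule = replacements[word]
            log.append({"index": i, "before": word, "after": new_word, "rule": rule})
            out.append(new_word)
        else:
            out.append(word)
    return out, log


def restore(words, log):
    """Undo the logged changes to recover the original words."""
    words = list(words)
    for record in log:
        words[record["index"]] = record["before"]
    return words


original = ["automobile", "is", "fast"]
table = {"automobile": ("car", "general_expression")}   # hypothetical rule table
standardized, log = standardize_with_log(original, table)
usage = Counter(record["rule"] for record in log)        # which rules fired, and how often
```

This sketch records word positions, which is enough because the replacements here do not change the number of words; a process that merges or moves phrases would need to log those operations as well.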
[0058]
(4) Description of the second embodiment:
Next, a document search method and a search apparatus will be described as a second embodiment of the present invention. The document database TDB built by the document standardization process of the first embodiment is made available externally, either for free use or for use by registered members. The document database TDB could be accessed directly, but to make it accessible to an unspecified number of clients via the network 10, a site equipped with, for example, a CGI for accessing the document database TDB is usually prepared in the server 200, and the client 40 usually accesses the document database TDB from a so-called browser via the network 10. The second embodiment therefore describes a method of searching the document database TDB through a web page. FIG. 17 is a flowchart showing the search processing executed in the client 40. First, the client 40 that starts a search accesses the site in the server 200 prepared for searching (step S400). As a result, a search screen such as that shown in FIG. 18 is displayed.
[0059]
The user of the client 40 then enters the search contents as a Japanese sentence in the box KB provided on this screen for entering a search character string (step S410). For example, as shown in FIG. 18, a natural sentence such as “I attended school” is entered in the box KB. While the search text is being entered, whether the “search” button BB has been pressed is monitored (step S420). When the search button is pressed, the entered sentence is read and, in the same way as an input document in the first embodiment, is subjected to morphological analysis and then to the standardization process described in the first embodiment (step S430). The search need not be based on sentence input. For example, one or more keywords may be entered and used for the search, or the search may be performed by specifying a keyword together with a search field.
[0060]
The document database TDB is searched using the search terms DS1 and DS2 ("I" and "school attendance" in the example of FIG. 18) extracted from the standardized search sentence obtained in this way (step S440). If the search finds a document containing a matching sentence, the search result is output (step S450). The output search result is sent to the client 40 via the network 10 and displayed on the screen of the client 40.
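Steps S430 to S450 can be sketched as follows, under several assumptions. The query is taken to be already split into words, standardization is reduced to one-to-one replacement, the database is a plain dictionary of standardized word lists, and a document matches when it contains every search term. The names `standardize`, `search`, `RULES`, and `TDB`, and the matching criterion, are illustrative and not taken from the embodiment.

```python
def standardize(words, rules):
    """Stand-in for the standardization of the first embodiment (one-to-one rules only)."""
    return [rules.get(w, w) for w in words]


def search(query_words, documents, rules, stop_words=frozenset()):
    """Return the ids of stored documents that contain every standardized search term."""
    # Step S430: standardize the query exactly as the stored documents were standardized.
    terms = [w for w in standardize(query_words, rules) if w not in stop_words]
    # Step S440: compare the extracted search terms with the stored documents.
    return [doc_id for doc_id, doc_words in documents.items()
            if all(t in doc_words for t in terms)]


# Hypothetical notation rules and a toy database of already standardized documents.
RULES = {"車輌": "車両", "切り替え": "切替", "切り換え": "切替"}
TDB = {
    "doc1": ["車両", "の", "切替"],
    "doc2": ["書替", "の", "方法"],
}

# The query uses variant notations, yet still finds doc1 after standardization.
print(search(["車輌", "の", "切り換え"], TDB, RULES, stop_words={"の"}))
```

Because both the stored documents and the query pass through the same rules, no variant forms need to be generated or stored for each document.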
[0061]
According to the second embodiment described above, a document database in which documents have been registered in standardized form in advance can be searched using a natural Japanese sentence. Because the search is performed after standardization has reduced the effect of the searcher's wording habits, the desired document can easily be found. The searcher therefore need not be familiar with complicated rules for entering search words, and can search easily without special training.
[0062]
Although embodiments of the present invention have been described above, the present invention is in no way limited to these embodiments and can of course be implemented in various forms without departing from the gist of the invention. For example, the document database may be a keyword-based database instead of a full-text database. The invention can also be applied to a translation device. It is known that translation does not work well when one simply tries to convert between languages using grammatical information (the necessary rules grow without limit); rather, a semantically accurate translation is obtained by preparing abundant examples, finding an example that matches the text to be translated, and applying it. Accordingly, the present invention can be applied to the text data to be translated so that the document is standardized, which makes it easier to identify a matching example.
[Brief description of the drawings]
FIG. 1 is a schematic configuration diagram showing an overall configuration in an embodiment of the present invention.
FIG. 2 is a block diagram showing a configuration for realizing standardization processing in the first embodiment.
FIG. 3 is a flowchart showing a standardization processing routine in the embodiment.
FIG. 4 is a flowchart showing a morphological analysis processing routine.
FIG. 5 is an explanatory diagram illustrating the configuration of a reverse lookup dictionary.
FIG. 6 is a flowchart showing a character standardization processing routine.
FIG. 7 is an explanatory diagram illustrating the contents of character standardization processing.
FIG. 8 is a flowchart showing a co-occurrence collocation processing routine.
FIG. 9 is an explanatory diagram showing a process of collocation.
FIG. 10 is an explanatory diagram showing a state of phrase replacement in collocation.
FIG. 11 is an explanatory diagram showing a state of collocation.
FIG. 12 is a flowchart showing an independent word standardization processing routine.
FIG. 13 is an explanatory diagram schematically showing a state of replacement of independent words.
FIG. 14 is an explanatory diagram showing an example of replacement of independent words when the option to exclude kanji outside the common-use (Joyo) kanji is on.
FIG. 15 is an explanatory diagram showing an example of a standardization rule DAD indicating which replacement is performed.
FIG. 16 is an explanatory diagram schematically showing how an attached word is replaced.
FIG. 17 is a flowchart showing processing at the time of search executed in the client 40 as the second embodiment.
FIG. 18 is an explanatory diagram showing an example of a search screen in the second embodiment.
[Explanation of symbols]
10 ... Network
11 ... Keyboard
12 ... Mouse
20 ... Router
22 ... CPU
23 ... ROM
24 ... RAM
25 ... Timer
26 ... Display circuit
27 ... Hard disk
30 ... Monitor
40 ... Client
200 ... Database server
205 ... Document input section
210 ... Morphological analyzer
220 ... Dictionary search section
230 ... Dictionary for morphological analysis
240 ... Database
240 ... Standardized rule database
250 ... Standardization processing unit
260 ... Database
260 ... Log management section
270 ... Hard disk
270 ... Document output section
280 ... Log output part

Claims

A document standardization method for standardizing a document using a computer, comprising:
a process of inputting a document having a certain unity, either from a file or as a character string from a keyboard;
a process in which the computer extracts a plurality of words accompanied by grammatical information by performing morphological analysis with reference to a dictionary stored in advance in a storage device;
a process in which the computer performs predetermined standardization on the extracted words; and
a process of outputting a document reconstructed from the standardized words,
wherein the predetermined standardization process includes:
(b) collocation processing that, by referring to co-occurrence relationships stored in advance in the storage device, corrects the positional relationship of a plurality of the extracted words having a stored co-occurrence relationship so that those words occupy adjacent positions;
(d) independent word processing that replaces an independent word, including the plurality of adjacent words whose positional relationship has been corrected, with another independent word according to a predetermined replacement criterion; and
(e) ancillary word processing that replaces an ancillary word with another ancillary word according to a predetermined rule.

The standardization method according to claim 1,
wherein the predetermined standardization process includes at least one of:
(a) character standardization processing that replaces, among the characters constituting the input document, a character stored in advance with a predetermined character according to a predetermined rule; and
(c) notation unification processing that unifies fluctuations in notation into a predetermined notation according to a rule stored in advance.

The standardization method according to claim 1,
wherein the morphological analysis uses a morphological analysis dictionary prepared in advance.

The standardization method according to claim 2, wherein the independent word processing (d) is performed after the character standardization processing (a).

The standardization method according to claim 1, wherein the independent word processing (d) and the ancillary word processing (e) are performed after the collocation processing (b).

The standardization method according to claim 2, wherein the notation unification processing (c) is performed at least after the independent word processing (d).

The standardization method according to claim 1,
wherein, when two or more standardization results exist in the course of the standardization process, one of the two or more results is displayed together with an indication that a plurality of results exist.

The standardization method according to claim 7,
wherein results other than the displayed result are sequentially displayed as next candidates in response to a user operation.

A method for constructing, using a computer, a document database in which a plurality of documents are registered, comprising:
a process of inputting a document having a certain unity, either from a file or as a character string from a keyboard;
a process in which the computer extracts a plurality of words accompanied by grammatical information by performing morphological analysis with reference to a dictionary stored in advance in a storage device;
a process in which the computer performs predetermined standardization on the extracted words; and
a process of accumulating documents reconstructed from the standardized words as a database,
wherein the predetermined standardization process includes:
(b) collocation processing that, by referring to co-occurrence relationships stored in advance in the storage device, corrects the positional relationship of a plurality of the extracted words having a stored co-occurrence relationship so that those words occupy adjacent positions;
(d) independent word processing that replaces an independent word, including the plurality of adjacent words whose positional relationship has been corrected, with another independent word according to a predetermined replacement criterion; and
(e) ancillary word processing that replaces an ancillary word with another ancillary word according to a predetermined rule.

A document search method for searching for a document using a computer, wherein:
prior to searching for a document, standardization processing is performed that includes
a process of inputting a document having a certain unity, either from a file or as a character string from a keyboard,
a process in which the computer extracts a plurality of words accompanied by grammatical information by performing morphological analysis with reference to a dictionary stored in advance in a storage device,
a process in which the computer performs predetermined standardization on the extracted words, the predetermined standardization including
(b) collocation processing that, by referring to co-occurrence relationships stored in advance in the storage device, corrects the positional relationship of a plurality of the extracted words having a stored co-occurrence relationship so that those words occupy adjacent positions,
(d) independent word processing that replaces an independent word, including the plurality of words whose positional relationship has been corrected, with another independent word according to a predetermined replacement criterion, and
(e) ancillary word processing that replaces an ancillary word with another ancillary word according to a predetermined rule, and
a process of outputting a document reconstructed from the standardized words;
documents reconstructed from the standardized words are stored in advance as a database; and
when searching for a document, the computer compares a designated search word with the documents stored in the database and identifies a document including the search word.

An apparatus for standardizing, using a computer, a document composed of text data, comprising:
input means for inputting a document having a certain unity;
morphological analysis means for performing morphological analysis on the document and extracting words accompanied by grammatical information;
standardization processing means for performing predetermined standardization processing on the extracted words; and
document output means for outputting a document reconstructed from the standardized words,
wherein the standardization processing means includes:
(b) collocation processing means that, by referring to co-occurrence relationships stored in advance in the storage device, corrects the positional relationship of a plurality of the extracted words having a stored co-occurrence relationship so that those words occupy adjacent positions;
(d) independent word processing means that replaces an independent word, including the plurality of words whose positional relationship has been corrected, with another independent word according to a predetermined replacement criterion; and
(e) ancillary word processing means that replaces an ancillary word with another ancillary word according to a predetermined rule.

The document standardization apparatus according to claim 11,
wherein the standardization processing means includes at least one of:
(a) character standardization means for replacing a character with a predetermined character; and
(c) notation processing means for unifying fluctuations in notation into a predetermined notation.

An apparatus for constructing a document database in which a plurality of documents are registered, comprising:
input means for inputting a document having a certain unity;
morphological analysis means for performing morphological analysis on the document and extracting words accompanied by grammatical information;
standardization processing means for performing predetermined standardization processing on the extracted words; and
document storage means for storing documents reconstructed from the standardized words as a database,
wherein the standardization processing means includes:
(b) collocation processing means that, by referring to co-occurrence relationships stored in advance in the storage device, corrects the positional relationship of a plurality of the extracted words having a stored co-occurrence relationship so that those words occupy adjacent positions;
(d) independent word processing means that replaces an independent word, including the plurality of words whose positional relationship has been corrected, with another independent word according to a predetermined replacement criterion; and
(e) ancillary word processing means that replaces an ancillary word with another ancillary word according to a predetermined rule.

A document search apparatus for searching for a document, comprising:
means, operating prior to a document search, for inputting a document having a certain unity;
morphological analysis means for performing morphological analysis on the document and extracting words accompanied by grammatical information;
standardization processing means for performing predetermined standardization processing on the extracted words;
storage means for storing in advance, as a database, documents reconstructed from the standardized words; and
search means that operates when searching for a document and compares a designated search word with the documents stored in the database to identify a document including the search word,
wherein the standardization processing means includes:
(b) collocation processing means that, by referring to co-occurrence relationships stored in advance in the storage device, corrects the positional relationship of a plurality of the extracted words having a stored co-occurrence relationship so that those words occupy adjacent positions;
(d) independent word processing means that replaces an independent word, including the plurality of words whose positional relationship has been corrected, with another independent word according to a predetermined replacement criterion; and
(e) ancillary word processing means that replaces an ancillary word with another ancillary word according to a predetermined rule.

A program for causing a computer to execute processing for standardizing text data having a certain unity, the program causing the computer to realize:
a function of extracting a plurality of words accompanied by grammatical information by performing morphological analysis on the text data with reference to a dictionary stored in advance in a storage device;
a function of performing predetermined standardization on the extracted words; and
a function of outputting a document reconstructed from the standardized words,
wherein the predetermined standardization function includes:
(b) collocation processing that, by referring to co-occurrence relationships stored in advance in the storage device, corrects the positional relationship of a plurality of the extracted words having a stored co-occurrence relationship so that those words occupy adjacent positions;
(d) independent word processing that replaces an independent word, including the plurality of adjacent words whose positional relationship has been corrected, with another independent word according to a predetermined replacement criterion; and
(e) ancillary word processing that replaces an ancillary word with another ancillary word according to a predetermined rule.

A recording medium on which the program according to claim 15 is recorded.