JP2004178044A

JP2004178044A - Attribute extraction method, its device and attribute extraction program

Info

Publication number: JP2004178044A
Application number: JP2002340348A
Authority: JP
Inventors: Taizou Kameshiro; 泰三亀代; Takashi Hirano; 敬平野
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 2002-11-25
Filing date: 2002-11-25
Publication date: 2004-06-24

Abstract

PROBLEM TO BE SOLVED: To provide a means for precisely extracting document attributes from a document.
SOLUTION: This attribute extraction method comprises a morphological analysis step of performing morphological analysis on the character fields of an input document and outputting the morphological analysis result for each character field, and a part-of-speech pattern matching step of matching the character fields against document attribute part-of-speech patterns, in which the structures of the document attributes to be extracted are expressed at the morpheme level, and extracting as document attributes those matched character fields whose position in the document is within a predetermined range.
COPYRIGHT: (C)2004,JPO

Description

【０００１】
【発明の属する技術分野】
本発明は、コンピュータで読取可能なテキストから文書の属性を自動抽出する属性抽出方法及びその装置及び属性抽出プログラムに関するものである。
【０００２】
【従来の技術】
コンピュータの普及と処理の高速化、ディスク容量の大容量化・低価格化を背景に、コンピュータで文書を作成して管理する文書管理方法が増加している。文書を効率的に管理するには、各文書に対して作成日付、作者、会社名等の属性とともに保存することが必要になる。文書に属性を付与して保存するには、属性を人手でタイプ入力する必要があり、文書数が多くなるほど属性入力に手間がかかるため、従来からこれを効率化するための手法が考案されてきた。
【０００３】
その１つとして、文書中の決められたエリア内の文字を属性として抽出する方法（例えば特許文献１）や、記載位置情報を使用せずにパターンマッチングを用いる方法（例えば非特許文献１）などがある。
【０００４】
【特許文献１】
特開２００２−５５９８５号公報（第２頁−１１頁、図２、図４、図５）
【０００５】
【非特許文献１】
「情報処理４０巻４号」ＰＰ３７０〜３７３
【０００６】
【発明が解決しようとする課題】
文書中の決められたエリア内の文字を属性として抽出する方法では、例えばファクシミリから入力した文書のように、取り込んだイメージによって文字フィールド全体が傾いたり、上下左右にずれがある場合には、指定エリアと実際の文書中の記入エリアにずれが生じてうまく属性抽出できず、また文字フィールドの内容については解析せずに属性と対応付けるため、例えば企業・組織名と作成者の記入位置を間違えて記述した文書については作成者属性に企業・組織名を登録するなど、うまく属性を抽出できないという課題があった。
【０００７】
また、記載位置情報を使用せずにパターンマッチングを用いる方法では、例えば形態素解析誤りによって本来得るべき品詞とは異なる品詞が割当てられたり、形態素辞書に存在しない単語が入力文に含まれるとうまく属性を抽出できないという課題があった。
【０００８】
【課題を解決するための手段】
本発明は、文書を入力する入力ステップと、上記文書の文字フィールドごとに形態素解析結果を出力する形態素解析ステップと、文書属性の構造を形態素レベルで表現した文書属性品詞パタンと上記文字フィールドの形態素解析結果とを照合し、一致する上記文字フィールドを属性候補として出力する品詞パタン照合ステップと、上記属性候補の文書における出現位置が所定の範囲にあるか否かを判定する座標判定ステップと、上記座標判定ステップにおいて上記範囲に属すると判定された属性候補を文書属性として出力する出力ステップとを有するものである。
【０００９】
【発明の実施の形態】
実施の形態１．
図１は実施の形態１の構成図である。図において、入力手段１は文書を入力し、形態素解析手段２は入力した文書に対して日本語文法に基づく解析を行い、品詞再抽出手段３は形態素解析手段２の解析結果に対して指定単語の品詞の再抽出を行う。また、品詞パタン照合手段４は形態素解析結果から目的の属性に対応する文字列を抽出し、座標判定手段５は文書中の文字位置を抽出し、出力手段６は属性名とそれに対応する文字列を出力する。形態素辞書７は形態素解析に必要となる情報を記憶する辞書であって、形態素解析手段２が使用する。また、品詞パタンＤＢ８は品詞パタン照合手段４が使用する品詞パタンを記憶している。属性定義ファイルＤＢ９は属性定義情報を格納するものである。
【００１０】
なお、上記構成において、入力手段１はコンピュータシステムにおけるハードディスク上のファイルを入力するオペレーティングシステムによるファイルシステムにより実現され、またスキャナーやＯＣＲ、ＦＡＸＯＣＲのように印刷物を電子的に取り込むための入力装置の他、電子メールを受信する電子メールサーバやインターネット上のＷｅｂページを介した入力を受け付けるＷｅｂサーバ、その他の入力システムやデバイスによっても実現される。
【００１１】
また、形態素解析手段２、品詞再抽出手段３、品詞パタン照合手段４、座標判定手段５は、専用の電子回路を構成することによる他、コンピュータシステムにおける中央演算装置（以下ＣＰＵ）によって実現される。さらに、出力手段６は、オペレーティングシステムによるファイルシステム、プリンタ、ＦＡＸサーバなどによって実現される。形態素辞書７、品詞パタンＤＢ８、属性定義ファイルＤＢ９は本体システムに内部バスあるいはネットワークにより接続された不揮発性記憶装置あるいはハードディスク上に記憶される。
【００１２】
次に、図２及び図３は形態素辞書７が保持する情報の例である。図２に示すように、形態素辞書７は、各単語に対して単語と品詞、付加情報１〜３を含むレコードを有する。各単語の品詞及び付加情報１〜３については、その単語を階層的に分類した値を定義する。ここでいう単語の階層分類とは、たとえば単語を階層的に国文法上の品詞に分類したものをいう。図２の例では、単語をまず名詞や接頭詞、副詞などの国文法上の品詞分類に分類してその結果を品詞に格納し、続いてたとえば名詞を、一般名詞、固有名詞、代名詞などに分類してその結果を付加情報１に格納し、さらに固有名詞を人名、地名、組織名などの詳細分類に分類して、その結果を付加情報２に格納している。また人名の場合は姓名の区別を付加情報３に格納している。これらの分類は、直接的には形態素解析のための情報として用いる。したがってこのような分類方法以外にも、形態素解析のために用いることのできる語の分類方法であれば、どのような分類方法を用いてもよい。なお、実施の形態１による形態素辞書７では、分類の階層数が品詞から数えて最大で４階層までの分類方法を用いているため、付加情報は１から３まであれば十分であるが、単語の分類方法によって、付加情報が３個で不足する場合は、適宜付加情報の数を追加してもよい。
【００１３】
さらに形態素辞書７は、図３に示すような品詞間の文法的な接続条件を記憶している。これらのレコードは、連続する２つの品詞の接続が正しい組合せかどうかを示すものであって、例えば名詞と助詞の組合せは文法的に正しい組合せであることを意味している。
【００１４】
次に実施の形態１における属性抽出処理について説明する。図４は実施の形態１における属性抽出処理を表すフローチャートである。まず、図４のステップＳ１００で、入力手段１は文書を入力する。入力文書はコンピュータが読取り可能な形式のデータである。ここで、入力文書は本装置のコンピュータ上に限らず、別のコンピュータ上にある文書をネットワーク経由で入力してもよい。ここでは、入力文書の内容は、図６に示すようなテキストファイルとする。
【００１５】
次に、ステップＳ２００において形態素解析手段２は、入力手段１が入力した文書に対して形態素解析を実行する。ここでは、まず入力文書の先頭文字から形態素辞書７と照合処理を行う。図６に示す入力文書における１１５の「２０００／９／３０」に対しては、先頭の文字「２」から始まる単語との照合処理を行う。この場合、１１５の１文字目は形態素辞書７中の「２（数字）」とのみ一致する。次に１１５の２文字目と「０」から始まる単語と照合処理を行い、結果として「０（数字）」とのみ一致する。次に図３に示す文法的な接続条件を参照して、「２（数字）」と「０（数字）」の接続条件をチェックする。この接続条件によれば、数字と数字の接続が認められるため、「２」の品詞が「数字」に、「０」の品詞が「数字」に確定される。以下同様に処理を実行する。
【００１６】
さらに形態素解析手段２は、ある単語について候補となるべき品詞が複数存在する場合に、複数候補の存在を示すフラグを付して形態素解析結果を出力する。例えば、図６の入力文書の１１６における「川崎」については、人名を示す固有名詞（図２のレコード１２０）と地域を示す固有名詞（図２のレコード１２０−２）との２つのレコードが形態素辞書７に存在する。どちらも図３に示す接続条件を満たすので結果として出力可能であるが、ここでは一時的に地域を示す固有名詞（レコード１２０−２）を候補として出力し、かつ他に候補単語が存在することを示すフラグを付加する。
【００１７】
図６の入力文書の例に対する形態素解析手段２の出力結果の例を図８に示す。図８では、形態素解析処理の結果抽出した文字列が検出された順に出力されている。この入力文書を単語毎に品詞名、付加情報１〜３、フラグを格納している。ここにおいて、付加情報１〜３はその単語について形態素辞書において記憶されていた付加情報１〜３である。また、この例では「川崎」についてのレコード１１９にのみフラグを付加している。
【００１８】
次にステップＳ３００において、品詞パタン照合手段４が品詞パタン照合処理を行う。品詞パタンとは、各属性の構成を品詞や文字などの構成要素の組み合わせによって表現したものであり、予めユーザによって作成され、品詞パタンＤＢ８中に格納される。図９に品詞パタンの例を示す。この例では”属性＝構成要素のリスト”という形式で属性の構成を表現している。例えば「日付」という属性の構成は「数字」「記号」「数字」「記号」「数字」の組み合わせからなる。一つの属性が複数の異なる構成を有する場合には、”属性＝構成要素”という定義行を一つの属性について複数記述する。品詞パタン照合手段４は、形態素解析手段２が出力した結果と品詞パタンＤＢ８中の品詞パタンを照合して、品詞パタンに一致する文字列を属性候補として抽出する。
【００１９】
ここで品詞パタン照合処理の詳細について説明する。図５はパタン照合処理のフローチャートである。はじめに、図５のステップＳ３１０において、図７に示した属性定義ファイルＤＢ９から属性定義情報として属性数Ｎを取得する（図７の例では３となる）。
【００２０】
ここで、属性定義情報とは、文書から抽出しようとしている属性が文書にどのような形で存在するかを記述した情報であり、予め人手によって作成され、属性定義ファイルＤＢ９中に格納される。図７に実施の形態１で使用する属性定義情報の例を示す。この例では、属性名と文書内の座標値を用いて文書中の属性値を抽出する。一行目の”属性数＝３”という行は、抽出する属性数が３であることを示す。また、二行目以降の”［属性１］”では、１０１によって、属性名が「日付」であることが示され、その次の行においては、成分１０２、成分１０３、成分１０４、成分１０５を座標成分とする座標によって、属性１が文書中に存在する座標が（８０，１０，１０，５）であることが示される。ここで、成分１０２とは属性値が文書中に存在する矩形の左上点のｘ座標であり、成分１０３はそのｙ座標であって、成分１０４はその矩形の矩形幅、成分１０５はその矩形の高さである。この例によれば、成分１０２の値は８０，成分１０３の値は１０、成分１０４の値は１０、成分１０５の値は１０５となる。各成分の数値は文書の縦（ｙ）方向、横（ｘ）方向の長さをそれぞれ１００とし、左上を原点とした相対的な値としている。したがって、座標点（８０，１０）とは、原点からｘ方向に８０％、ｙ方向に１０％の位置である点を意味する。また幅１０はテキスト幅の１０％、高さ５はテキストの高さの５％を意味する。
【００２１】
次に、ステップＳ３２０においてＮ個の属性の一つ一つについてパタン照合を行うために、初期値ｉ＝１を代入する。続いてステップＳ３３０において、第ｉ番目（ここではｉ＝１）の属性定義情報を取得する。例えば属性名、属性の存在する座標を図７に示した属性定義情報として取得する。第１番目の属性名は「日付」であって、座標は（８０，１０，１０，５）である。
【００２２】
次に、ステップＳ３４０において品詞パタン照合手段４が、図８の形態素解析結果と図９の品詞パタンとの照合処理を行う。品詞パタンは、図９に示すとおり、日付と組織名、作者について定義されている。そこで、第１番目の属性名「日付」の品詞パタンと形態素解析結果を照合する。ここにおいて、日付についての品詞パタンは、［数字（２〜４）］［記号］［数字（１〜２）］［記号］［数字（１〜２）］となっている。その結果、形態素解析結果中の文字列とこの品詞パタンが一致する場合には該当文字列を日付として抽出する。ここで、図９の１１０に示した［数字２〜４］とは、「２〜４桁で構成される数字」を意味する。図８の形態素解析結果によれば、「２００２」は連続する数字４桁であるから、１１０の［数字（２〜４）］と一致する。次に、図８の形態素解析結果における符号１１７で示した「／」の品詞が記号であるため、図９の１１１の「記号」と一致する。以下同様に照合して、「９」が数字１桁、「／」１１８が記号、「３０」が数字２桁と一致し、パタン照合で一致する。
【００２３】
次に座標判定手段５は、一致した属性候補の文字フィールドの座標の評価を行う。そのためには、一致した文字フィールドの文書中の位置を抽出する必要があるが、そのために座標判定手段５は文書の全行数・最大文字列数を予め求め、これを文書の高さ・幅とする。ここでは高さ＝３０（行）、幅＝２５（文字）とする。幅の数値は全角を１として計算し、半角文字は０．５と計算する。続いて座標判定手段５は、指定の文字位置、ここでは「２００２／９／３０」の文字位置の行、列位置を取得する。ここでは例として、行、列の位置をそれぞれ２，１９とし、文字幅＝４．５、文字高さ＝１とする。これらに基づいて抽出座標を計算すると、１９×１００／２５＝７６，２×１００／３０＝７、４．５×１００／２５＝１８、１×１００／３０＝３となる。さらに、抽出結果の文字列について属性定義情報中の座標との距離を求める。距離計算は以下の式を用いる。
【００２４】
【数１】

【００２５】
Ｐ_ｄｎｉは属性定義情報中の座標、Ｐ_ｅｎｉは文書から抽出した座標である。ここではＰ_ｄ _１＝（８０，１０，１０，５）、Ｐ_ｅ _１＝（７６，７，１８，３）であるので、数式１に従い計算してＤ１＝１７となる。
【００２６】
次にステップＳ３５０において、第ｉ（ここではｉ＝１）番目の属性が存在したか否かを判定する。ここでは、パタン照合で一致した文字列の距離が一番近く、かつその距離が一定の閾値以内となる文字フィールドが存在する場合に、属性が存在すると判定する。ここで閾値＝３０とすると、属性１の文字列の距離は１７＜３０であるから属性が存在することとなり、ステップＳ３７０へ進み保存処理を行う。ステップＳ３７０では、図１０に例を示すように、抽出した属性をバッファに保持する。
【００２７】
次にステップＳ３８０において、ｉ＝１でありＮ＝３であり、ｉ＜Ｎを満たすことから（ステップＳ３８０：Ｙ）、Ｓ３９０によってｉをインクリメントし、Ｓ３３０へ進む。
【００２８】
ステップＳ３３０において、ｉ＝２より、第２番目の属性を属性定義ファイルＤＢ９から属性定義情報として抽出する。図７に示した属性定義情報を例とすれば、属性名は「組織名」、座標は（８０，２０，１０，５）となる。この場合、品詞パタンは図９より組織名＝［名詞―固有名詞―組織］である。ここで、１２２の「名詞」は品詞を、１２３の「固有名詞」は付加情報１を、１２４の「組織」は付加情報２を意味する。図８に示す形態素解析結果を探索すると、「ＸＸ建設」と「○×電気」が［名詞―固有名詞―組織］である。次に、座標判定手段５はそれぞれの座標値から距離を計算する。例えば、「ＸＸ建設」の座標を（７，１７，１６，３）、「○×電気」の座標を（７６，１７，１６，３）として、式（１）に当てはめてそれぞれの距離を計算すると「ＸＸ建設」が８４，「○×電気」が１５となる。ここでは「○×電気」の距離が最も小さいのでこれを第１候補とする。
【００２９】
次にステップＳ３５０に進み、「○×電気」の距離１５が閾値３０以下であるため、属性が存在することとなるので、ステップＳ３７０で保存処理を行う。続いてステップＳ３８０において、ｉ＝２、Ｎ＝３であって、ｉ＜Ｎを満たすことから、ステップＳ３９０でｉをインクリメントしてステップＳ３３０へ進む。
【００３０】
ステップＳ３３０において、ｉ＝３番目の属性を抽出する。図７の例に示した属性定義情報より属性名は「作者」、座標は（８０，３０，１０，５）であり、図９に示す品詞パタンは作者＝［名詞―固有名詞―人名］となる。図８に示す形態素解析結果から人名を探索すると、「西田」がこの条件を満たす。そこで、式（１）によって距離を求める。いま、「西田」の座標を（７，２３，８，３）とすると、Ｐ_ｄ _３＝（８０，３０，１０，５）、Ｐ_ｅ _３＝（７，２３，８，３）となり、Ｄ３＝８４＞３０となって条件を満たさない。したがってステップＳ３５０で属性値が存在しないこととなるため（ステップＳ３５０：Ｎ）、ステップＳ３６０に進み、再探索処理を行う。
【００３１】
ステップＳ３６０の再探索処理では、品詞再抽出手段３が、ステップＳ２００による形態素解析結果において他の候補が存在する文字列との抽出処理を行う。ここで、図８の形態素解析結果において「川崎」についてのレコード１１９にフラグが立っているので、レコード１１９の「川崎」の品詞が固有名詞（人名）となり得るか否かを形態素辞書７を用いて判定する。形態素辞書７より、「川崎」についてのレコード１２０の付加情報２が人名であるので、品詞再抽出手段３は、レコード１２０「川崎」の品詞「名詞」「固有名詞」「人名」「姓」を品詞パタン照合手段４に出力する。その結果の例を図１１に示す。図１１の形態素解析処理結果と図８の形態素解析処理結果とを比べると、「川崎」についてのレコード１２１の付加情報が変更されている。「川崎」についてのレコード１２１が変更されていることから、ステップＳ３６５において品詞が変更されたものと判定され、ステップＳ３４０へ進む（ステップＳ３６５：Ｙ）。品詞パタン照合手段４は、品詞再抽出手段３の出力結果を用いて再度品詞パタンと照合する。この結果、「川崎」についてのレコード１２１が照合に成功する。次に、座標判定手段５は座標を計算する。例えば「川崎」についてのレコード１２１の抽出座標をＰ_ｅ _３＝（８４，２３，８，３）とすると、Ｄ３＝１５＜３０となり条件を満たすので、「川崎」についてのレコード１２１を第１の候補として出力する。
【００３２】
最後にステップＳ４００で出力手段６は図１０に示すように属性名に対応する属性値を出力して終了する。
【００３３】
以上のように、実施の形態１では、形態素解析の結果に所望の品詞情報が含まれなくとも、品詞再抽出手段３が複数の品詞となり得る単語について品詞の再抽出を行うことで、所望の品詞の取得が可能となり、その結果属性抽出の精度が向上し、適切な内容の文書属性を抽出することができる。
【００３４】
また座標の位置情報を用いて検定することで、位置情報を用いない場合に比べて文書中に属性パタン条件を満たす単語が複数存在しても精度良く抽出可能となる。
【００３５】
また、座標と記述内容を用いて属性を抽出するため、例えば組織名と作成者の記述位置を入れ違いに記述してもそれぞれを正しく抽出することが可能となる。
【００３６】
なお、上記説明において、品詞再抽出手段３と品詞パタン照合手段４，座標判定手段５、出力手段６を別個の構成部位として説明したが、これらは品詞パタン照合手段に統合して構成することも可能である。
【００３７】
また、実施の形態１では、品詞パタン照合の判定に式（１）を使用したが、この他にもＰ_ｄｎｉとＰ_ｅｎｉの差を二乗して和をとる方法などが考えられるため、判定は式（１）に限ったものではない。また、判定の閾値などもこの限りではない。
【００３８】
また、実施の形態１では、品詞パタンの表現に「＝」および「［］」を使用して表現していたが、これに限らず、例えば公知のＢＮＦ（Ｂａｃｋｕｓ−ＮａｕｒＦｏｒｍ）などで表現してもよい。
【００３９】
また、実施の形態１では、形態素解析処理において候補となるべき品詞が複数存在する場合に、その品詞に関する出力結果にフラグを付し、さらに再探索処理において、フラグの存在するレコードを検索することによって、代替候補を求めたが、このような処理方法だけでなく、たとえば形態素解析処理において候補となるべき品詞が複数存在する場合には、その品詞ごとに複数のレコード（代替品詞情報）を出力し、再探索処理において、この複数の出力結果の品詞、付加情報に基づいて代替候補を求めるようにしてもよい。
【００４０】
また、実施の形態１では、文字フィールド単位あるいは語単位に形態素解析結果を出力して、再探索処理によって代替候補を求めて文書属性を抽出する構成とした。しかし形態素解析処理において、一つの文字フィールドあるいは語に対して複数の品詞が得られる場合にそれらの品詞をすべて列挙した表現形式、例えば、川崎＝［地名｜人名］（［Ａ｜Ｂ］はＡとＢのうちのいずれか一つの要素を意味する）のような表現形式によって形態素解析結果を表現し、品詞パタン照合処理でこの結果を用いて、一回の照合処理で文書属性を抽出するようにしても同様の効果が得られる。
【００４１】
また、上記構成はコンピュータプログラムによって実現することも可能である。この場合は、入力手段１は入力プログラム、形態素解析手段２は形態素解析プログラム、品詞再抽出手段３は品詞再抽出プログラム、品詞パタン照合手段４は品詞パタン照合プログラム、座標判定手段５は座標抽出プログラム、出力手段６は出力プログラムとして実現する。
【００４２】
実施の形態２．
実施の形態１においては、形態素辞書中に正しい品詞、付加情報が定義されていることを前提として、属性抽出を行う手段について説明したが、さらに実施の形態２においては、形態素辞書中に正しい品詞や付加情報が与えられていない場合であっても、属性抽出を行う手段について説明する。実施の形態２による属性抽出装置の構成図としては図１を用いる。また図１における各構成要素は実施の形態１と同様であるので、符号の説明については省略する。
【００４３】
実施の形態２で使用する文書を図１２に示す。また実施の形態２で使用する属性定義情報の例を図１３に示す。ここでは、抽出する属性は「宛先人名」と「宛先組織名」である。図４のフローチャートに従って実施の形態１と同様に処理を実行する。まずステップＳ１００において、入力手段１は文書を入力する。次にステップＳ２００で形態素解析手段２は形態素解析を実行する。形態素解析手段２の出力結果の例を図１５に示す。図１５においても、実施の形態１における形態素解析処理の出力結果と同様、形態素解析処理の結果抽出した文字列が検出された順に出力される。したがって、ある文字列に隣接する文字列が抽出された場合、これらの文字列は隣接するレコードとして出力されることになる。
【００４４】
さらに、実施の形態２では、形態素解析結果が「名詞、接尾、人名」や、「名詞、接尾、一般」、「名詞、接尾、地域」等の付加情報を利用する場合に有効である。ここで、「名詞、接尾、人名」とは、人名の後に続けて用いることのできる単語の属する分類であり、例えば「殿」、「様」、「先生」、「著」などの語が該当する。また「名詞、接尾、一般」とは、一般名詞の後に続けて用いることのできる単語の属する分類であり、例えば「額」、「用」、「性」などの語が該当する。さらに「名詞、接尾、地域」は地域名に続けて用いることのできる単語が属する分類であり、「市」、「町」、「村」、「駅」、「支店」などが該当する。なお、ステップＳ１００及びステップＳ２００における処理の詳細については、実施の形態１と同じであるため、ここでは説明を省略する。
【００４５】
続いてステップＳ３００において、品詞パタン照合手段４はパタン照合処理を行う。ここでは、図１３に示した属性定義情報の例を用いて、図５のフローチャートをもとに実行する。
【００４６】
図５のステップＳ３１０で属性数Ｎ＝２を取得し、ステップＳ３２０でｉ＝１をセットした後、ステップＳ３３０で属性定義情報を取得する。ここでは「属性名」＝「宛名人名」、「座標」＝（２０，３０，１０，５）である。次に品詞パタン照合手段４は図１５の形態素解析結果と図１４の品詞パタンを用いて照合処理を行う。図１５によれば、「中河原」についてのレコード１２５は、品詞＝”名詞”、付加情報１＝”固有名詞”、付加情報２＝”地域”、付加情報３＝”一般”であり（［名詞−固有名詞−地域−一般］）、図１３の品詞パタンの［名詞―固有名詞―人名］と一部異なる。また、「中河原」は［人名］として形態素辞書７に存在しないので図１５内でフラグが立っていない。このため品詞再抽出手段３を用いても、所望の品詞を得ることができない。そこで実施の形態２においては、品詞パタン照合手段４は、付加情報２および付加情報３の内容が品詞パタンと完全に一致しない場合でも、その次のレコードの単語を用いてパタン照合処理を行う。ここでは次の単語「殿」についてのレコード１２６が、品詞＝”名詞”、付加情報１＝”接尾”、付加情報２＝”人名”（［名詞―接尾―人名］）としており、その結果品詞パタンの記述［名詞―接尾―人名］と一致する。このことから、品詞パタン照合手段４は品詞照合における一致度を計算して、その一致度を閾値と判定することで品詞パタンと一致するか否かを判定する。
【００４７】
一致度計算は品詞パタン中の各品詞と形態素解析結果がどの程度一致したかを計算する。具体的には以下の数式（２）を用いる。
【００４８】
【数２】

【００４９】
ここで、ｋは品詞パタン中の品詞数、Ｓ_ｊは第ｊ番目の品詞において一致した品詞の付加情報数、Ｐ_ｊは品詞パタン中の第ｊ番目の品詞の付加情報数である。ただしＰ_ｊ＝０は無視する。ｉは図５中のｉと同一である。
【００５０】
この場合、図１４から属性名「宛名人名」は［名詞―固有名詞―人名］および［名詞―接尾―人名］の２つの品詞の組合せで構成され、ｋ＝２となる。また、第１番目の品詞の付加情報数は［固有名詞―人名］の２つであるからＰ_１＝２となり、品詞パタン照合で一致する第１番目の付加情報数は［固有名詞］の１つであるからＳ_１＝１であって、第２番目の品詞について、付加情報数は［接尾―人名］であるからＰ_２＝２、図１５で品詞パタンに一致する付加情報数は２であるからＳ_２＝２である。以上を式（２）に当てはめると、
ＤＭ_１（中河原殿）＝１／２×（１／２＋２／２）＝０．７５
となる。
【００５１】
ここで、ＤＭ_１の判定の閾値を０．５と設定すると、ＤＭ_１＝０．７５＞０．５となるので、品詞パタン照合に成功し、次の処理を行う。品詞パタン照合手段は、他に一致する文字列を探索する。図１５からは「××建設」、「○×電気」、「川崎」が固有名詞であるために一部一致するが、それぞれに続く品詞の付加情報がどれも一致しないので、
ＤＭ_１（××建設中河原）＝１／２×（１／２＋０／２）＝０．２５、
ＤＭ_１（○×電気担当）＝１／２×（１／２＋０／２）＝０．２５、
ＤＭ_１（川崎新規）＝１／２×（１／２＋０／２）＝０．２５
といずれも閾値以下となり一致するとみなさない。
【００５２】
次にステップＳ３５０において、「中河原殿」の座標を例えば（８，２３，１６，３）とすると、属性定義情報による座標との距離は２７となり、閾値３０以下となる。そこでステップＳ３７０に進んで保存処理を行い、図１６に示すように中河原殿をバッファに保持する。
【００５３】
次にステップＳ３８０においてｉ＝１，Ｎ＝２であって、ｉ＜Ｎであるから、ステップＳ３９０においてｉをインクリメントした後、ステップＳ３３０へと進む。ｉ＝２において、宛先組織名を抽出すると、実施の形態１における属性”組織名”と同じく「××建設」、「○×電気」が候補として抽出され、距離がそれぞれ（７，１７，１６，３）、（７６，１７，１６，３）となる。属性定義情報による座標との距離はそれぞれ２４，６７となり、「××建設」との距離が小さいので候補として抽出される。続いてステップＳ３５０、ステップＳ３７０と進みステップＳ３８０でＮとなって終了する。最後にステップＳ４００で出力手段は図１６に示す属性抽出結果を出力する。
【００５４】
以上のように、実施の形態２によれば、形態素辞書に所望の単語が存在しないために形態素解析で誤りとなった場合であっても、品詞パタン照合手段４が一致度を使用して照合を行うので、属性の抽出が可能となる。
【００５５】
更に品詞パタン照合手段４が、品詞情報が完全に一致しなくてもさらにあいまい照合を行うことで、形態素辞書中に正しい品詞、付加情報が存在しないなどの理由により、形態素解析処理が誤りとなる場合であっても、属性抽出を可能とする。
【００５６】
また、形態素解析で複数の品詞である可能性がある単語については、代替候補から文書属性を抽出することとしたので、従来は形態素解析誤りで抽出もれとなっていた属性を抽出することが可能となる。また座標の位置情報を用いて検定することで、より正確な属性抽出が可能となる。
【００５７】
なお、実施の形態２では、品詞パタン照合の判定を式（２）を用いていたが、これは式（２）に限ったものではない。判定のための閾値も同様である。
【００５８】
【発明の効果】
本発明は、抽出すべき文書属性を形態素によるパターンで表現し、さらに入力文書の文字フィールドの形態素解析を行ってパターン照合を行い、文字フィールドの入力文書における出現位置についての判定を行うこととしたので、文書属性として内容が適切な文字フィールドを抽出することができ、かつ、文字フィールドの位置のずれがあっても文書属性を抽出することができるという効果を有する。
【図面の簡単な説明】
【図１】実施の形態１及び２の構成図である。
【図２】形態素辞書の内容例を示す図である。
【図３】形態素辞書の内容例を示す図である。
【図４】属性抽出処理を示すフローチャートである。
【図５】パタン照合処理を示すフローチャートである。
【図６】実施の形態１における入力文書の例を示す図である。
【図７】実施の形態１で用いる属性定義情報の例を示す図である。
【図８】実施の形態１による形態素解析結果の例を示す図である。
【図９】実施の形態１における品詞パタンの例を示す図である。
【図１０】実施の形態１による属性抽出結果の例を示す図である。
【図１１】実施の形態１による修正後の形態素解析結果の例を示す図である。
【図１２】実施の形態２における入力文書の例を示す図である。
【図１３】実施の形態２で用いる属性定義情報の例を示す図である。
【図１４】実施の形態２における品詞パタンの例を示す図である。
【図１５】実施の形態２による形態素解析結果の例を示す図である。
【図１６】実施の形態２による属性抽出結果の例を示す図である。
【符号の説明】
１：入力手段、２：形態素解析手段、３：品詞再抽出手段
４：品詞パタン照合手段、５：座標判定手段、６：出力手段
７：形態素辞書、８：品詞パタンＤＢ、９：属性定義ファイルＤＢ

[0001]
[Technical field of the invention]
The present invention relates to an attribute extraction method for automatically extracting document attributes from a computer-readable text, an apparatus therefor, and an attribute extraction program.
[0002]
[Prior art]
With the spread of computers, faster processing, and larger and cheaper disks, more and more documents are being created and managed on computers. To manage documents efficiently, each document must be saved together with attributes such as its creation date, author, and company name. Saving a document with attributes requires typing the attributes in by hand, and the more documents there are, the more effort attribute entry takes, so methods for making this more efficient have long been devised.
[0003]
These include a method of extracting the characters within a predetermined area of a document as an attribute (for example, Patent Document 1) and a method that uses pattern matching without using written-position information (for example, Non-Patent Document 1).
[0004]
[Patent Document 1]
JP-A-2002-55985 (pages 2 to 11, FIGS. 2, 4 and 5)
[0005]
[Non-patent document 1]
"Information Processing, Vol. 40, No. 4" PP370-373
[0006]
[Problems to be solved by the invention]
In the method of extracting the characters within a predetermined area of a document as attributes, when the whole character field is skewed or shifted vertically or horizontally in the captured image, as with a document input from a facsimile, the specified area and the actual entry area in the document no longer line up and the attributes cannot be extracted properly. In addition, because a character field is associated with an attribute without its content being analyzed, a document in which, for example, the company/organization name and the author are written in each other's positions ends up with the company/organization name registered as the author attribute, so again the attributes cannot be extracted properly.
[0007]
In the method that uses pattern matching without written-position information, attributes cannot be extracted properly when, for example, a morphological analysis error assigns a part of speech different from the one that should have been obtained, or when the input sentence contains a word that is not in the morphological dictionary.
[0008]
[Means for Solving the Problems]
The present invention comprises: an input step of inputting a document; a morphological analysis step of outputting a morphological analysis result for each character field of the document; a part-of-speech pattern matching step of matching a document attribute part-of-speech pattern, which expresses the structure of a document attribute at the morpheme level, against the morphological analysis result of each character field and outputting each matching character field as an attribute candidate; a coordinate determination step of determining whether the position at which an attribute candidate appears in the document is within a predetermined range; and an output step of outputting, as a document attribute, each attribute candidate determined in the coordinate determination step to be within that range.
[0009]
[Embodiments of the invention]
Embodiment 1.
FIG. 1 is a configuration diagram of the first embodiment. In the figure, an input unit 1 inputs a document, a morphological analysis unit 2 analyzes the input document on the basis of Japanese grammar, and a part-of-speech re-extraction unit 3 re-extracts the part of speech of a specified word from the analysis result of the morphological analysis unit 2. A part-of-speech pattern matching unit 4 extracts the character string corresponding to the target attribute from the morphological analysis result, a coordinate determination unit 5 extracts character positions in the document, and an output unit 6 outputs each attribute name and its corresponding character string. The morphological dictionary 7 stores the information needed for morphological analysis and is used by the morphological analysis unit 2. The part-of-speech pattern DB 8 stores the part-of-speech patterns used by the part-of-speech pattern matching unit 4. The attribute definition file DB 9 stores attribute definition information.
[0010]
In the above configuration, the input unit 1 is realized by the file system of an operating system, which reads files from a hard disk in a computer system. It may also be realized by an input device that captures printed matter electronically, such as a scanner, OCR, or FAX OCR, by an e-mail server that receives e-mail, by a Web server that accepts input through a Web page on the Internet, or by other input systems and devices.
[0011]
The morphological analysis unit 2, the part-of-speech re-extraction unit 3, the part-of-speech pattern matching unit 4, and the coordinate determination unit 5 may be implemented as dedicated electronic circuits or by the central processing unit (hereinafter, CPU) of a computer system. The output unit 6 is realized by the file system of an operating system, a printer, a FAX server, or the like. The morphological dictionary 7, the part-of-speech pattern DB 8, and the attribute definition file DB 9 are stored in a non-volatile storage device or on a hard disk connected to the main system by an internal bus or a network.
[0012]
Next, FIGS. 2 and 3 show examples of the information held by the morphological dictionary 7. As shown in FIG. 2, the morphological dictionary 7 has, for each word, a record containing the word, its part of speech, and additional information 1 to 3. The part of speech and additional information 1 to 3 hold values that classify the word hierarchically. Hierarchical classification here means, for example, classifying words hierarchically according to the parts of speech of Japanese grammar. In the example of FIG. 2, a word is first classified into a grammatical part of speech such as noun, prefix, or adverb, and the result is stored as the part of speech. Nouns, for example, are then classified into general nouns, proper nouns, pronouns, and so on, and the result is stored in additional information 1. Proper nouns are further classified into detailed categories such as person name, place name, and organization name, and the result is stored in additional information 2. For a person name, the distinction between family name and given name is stored in additional information 3. These classifications serve directly as information for morphological analysis, so any other word classification that can be used for morphological analysis may be used instead. In the morphological dictionary 7 of the first embodiment, the classification has at most four levels counting from the part of speech, so additional information 1 to 3 is sufficient; if three items of additional information are not enough for a given classification method, more may be added as needed.
[0013]
Further, the morphological dictionary 7 stores grammatical connection conditions between parts of speech as shown in FIG. These records indicate whether or not the connection between two consecutive parts of speech is a correct combination. For example, a combination of a noun and a particle is a grammatically correct combination.
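The dictionary records and connection conditions described above could be represented roughly as follows. This is a minimal sketch, not the structure of FIGS. 2 and 3: the class name `DictionaryEntry`, the field names, the English category labels, and the sample records are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DictionaryEntry:
    """One dictionary record: a word plus its hierarchical classification."""
    word: str
    pos: str                      # top-level part of speech, e.g. "noun"
    info1: Optional[str] = None   # additional information 1, e.g. "proper noun"
    info2: Optional[str] = None   # additional information 2, e.g. "person name"
    info3: Optional[str] = None   # additional information 3, e.g. "family name"


# Hypothetical sample records; the real contents of FIG. 2 are not reproduced.
DICTIONARY = [
    DictionaryEntry("川崎", "noun", "proper noun", "person name", "family name"),
    DictionaryEntry("川崎", "noun", "proper noun", "region", "general"),
    DictionaryEntry("2", "number"),
    DictionaryEntry("0", "number"),
]

# Connection conditions: ordered pairs of parts of speech that may be adjacent.
# Only the two pairs mentioned in the text are listed here.
ALLOWED_CONNECTIONS = {("number", "number"), ("noun", "particle")}


def lookup(word):
    """Return every dictionary record whose surface form equals `word`."""
    return [entry for entry in DICTIONARY if entry.word == word]


def can_connect(left_pos, right_pos):
    """Check whether `right_pos` may directly follow `left_pos`."""
    return (left_pos, right_pos) in ALLOWED_CONNECTIONS
```

A word with several readings, such as the hypothetical "川崎" records above, simply yields more than one record from `lookup`.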
[0014]
Next, the attribute extraction process in the first embodiment will be described. FIG. 4 is a flowchart of the attribute extraction process in the first embodiment. First, in step S100 of FIG. 4, the input unit 1 inputs a document. The input document is data in a computer-readable format. The input document need not reside on the computer of this apparatus; a document on another computer may be input over a network. Here, the input document is assumed to be a text file such as the one shown in FIG. 6.
[0015]
Next, in step S200, the morphological analysis unit 2 performs morphological analysis on the document input by the input unit 1. The input document is first matched against the morphological dictionary 7 starting from its first character. For the string “2000/9/30” labeled 115 in the input document shown in FIG. 6, matching starts with words that begin with the first character, “2”. In this case, the first character of 115 matches only “2 (number)” in the morphological dictionary 7. The second character of 115 is then matched against words beginning with “0”, and matches only “0 (number)”. Next, the connection condition between “2 (number)” and “0 (number)” is checked against the grammatical connection conditions shown in FIG. 3. Under these conditions a number may follow a number, so the part of speech of “2” is fixed as “number” and that of “0” as “number”. Processing continues in the same way.
[0016]
Further, when a word has more than one candidate part of speech, the morphological analysis unit 2 outputs the morphological analysis result with a flag indicating that multiple candidates exist. For example, for “Kawasaki” at 116 in the input document of FIG. 6, the morphological dictionary 7 contains two records: a proper noun indicating a person name (record 120 in FIG. 2) and a proper noun indicating a region (record 120-2 in FIG. 2). Both satisfy the connection conditions shown in FIG. 3 and could be output as the result; here, the proper noun indicating a region (record 120-2) is provisionally output as the candidate, and a flag indicating that another candidate word exists is added.
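The flagging behavior described here can be sketched as below. The sketch assumes the text has already been segmented into words and omits the connection-condition check; the function name `analyze`, the dictionary layout, and the sample entries are hypothetical, and the order of candidates in the dictionary stands in for whatever rule picks the provisional reading.

```python
# Hypothetical dictionary: each word maps to its candidate classifications,
# listed in the order in which they would be tried.
SAMPLE_DICTIONARY = {
    "川崎": [
        ("noun", "proper noun", "region"),
        ("noun", "proper noun", "person name", "family name"),
    ],
    "西田": [("noun", "proper noun", "person name", "family name")],
}


def analyze(words, dictionary):
    """Return one result record per word, flagging words with alternatives."""
    results = []
    for word in words:
        candidates = dictionary.get(word, [])
        if not candidates:
            # Word is missing from the dictionary; no classification available.
            results.append({"word": word, "pos": ("unknown",), "flag": False})
            continue
        results.append({
            "word": word,
            "pos": candidates[0],           # provisional choice
            "flag": len(candidates) > 1,    # True when other readings exist
        })
    return results
```

A later stage can then revisit only the flagged records when no unflagged word satisfies a pattern.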
[0017]
FIG. 8 shows an example of the output of the morphological analysis unit 2 for the example input document of FIG. 6. In FIG. 8, the character strings extracted by the morphological analysis are output in the order in which they were detected. For each word of the input document, the part-of-speech name, additional information 1 to 3, and a flag are stored. Additional information 1 to 3 is the additional information 1 to 3 stored for that word in the morphological dictionary. In this example, a flag is added only to record 119 for “Kawasaki”.
[0018]
Next, in step S300, the part-of-speech pattern matching unit 4 performs a part-of-speech pattern matching process. The part-of-speech pattern expresses the configuration of each attribute by a combination of constituent elements such as parts of speech and characters, and is created in advance by a user and stored in the part-of-speech pattern DB8. FIG. 9 shows an example of the part of speech pattern. In this example, the attribute configuration is expressed in the format of “attribute = list of components”. For example, the configuration of the attribute “date” includes a combination of “number”, “symbol”, “number”, “symbol”, and “number”. When one attribute has a plurality of different configurations, a plurality of definition lines “attribute = component” are described for one attribute. The part-of-speech pattern matching unit 4 compares the result output by the morphological analysis unit 2 with the part-of-speech pattern in the part-of-speech pattern DB 8, and extracts a character string that matches the part-of-speech pattern as an attribute candidate.
[0019]
Here, the details of the part-of-speech pattern matching processing will be described. FIG. 5 is a flowchart of the pattern matching process. First, in step S310 of FIG. 5, the number of attributes N is acquired as attribute definition information from the attribute definition file DB 9 shown in FIG. 7 (3 in the example of FIG. 7).
[0020]
Here, the attribute definition information describes the form in which each attribute to be extracted appears in the document; it is created manually in advance and stored in the attribute definition file DB 9. FIG. 7 shows an example of the attribute definition information used in the first embodiment. In this example, an attribute value is extracted from the document using an attribute name and coordinate values within the document. The first line, “number of attributes = 3”, indicates that three attributes are to be extracted. In the “[Attribute 1]” section that begins on the second line, item 101 indicates that the attribute name is “date”, and the next line gives the coordinates at which attribute 1 appears in the document, (80, 10, 10, 5), as the four components 102, 103, 104, and 105. Component 102 is the x coordinate of the upper-left corner of the rectangle in which the attribute value appears in the document, component 103 is the y coordinate of that corner, component 104 is the width of the rectangle, and component 105 is its height. In this example, component 102 is 80, component 103 is 10, component 104 is 10, and component 105 is 5. Each component is a relative value in which the vertical (y) and horizontal (x) extents of the document are each taken as 100 and the upper-left corner is the origin. The coordinate point (80, 10) therefore means the point 80% of the way along the x direction and 10% of the way along the y direction from the origin. Likewise, a width of 10 means 10% of the text width, and a height of 5 means 5% of the text height.
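To make the relative coordinate convention concrete, the following sketch holds the three attribute definitions used in the first embodiment and converts one of them to character-grid units. The class `AttributeDefinition`, the function `to_absolute`, and the 25-column by 30-line document size passed in the usage note are illustrative assumptions; the on-disk layout of FIG. 7 is not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttributeDefinition:
    """Attribute name plus its expected rectangle, in percent of the document."""
    name: str
    x: float       # upper-left x, percent of document width
    y: float       # upper-left y, percent of document height
    width: float   # rectangle width, percent of document width
    height: float  # rectangle height, percent of document height


# The three attributes and coordinates named in the first embodiment.
DEFINITIONS = [
    AttributeDefinition("date", 80, 10, 10, 5),
    AttributeDefinition("organization name", 80, 20, 10, 5),
    AttributeDefinition("author", 80, 30, 10, 5),
]


def to_absolute(definition, doc_width, doc_height):
    """Convert a percent-based rectangle to columns and lines."""
    return (
        definition.x * doc_width / 100,
        definition.y * doc_height / 100,
        definition.width * doc_width / 100,
        definition.height * doc_height / 100,
    )
```

For a document 25 characters wide and 30 lines tall, `to_absolute(DEFINITIONS[0], 25, 30)` places the date rectangle at column 20, line 3, with a width of 2.5 characters and a height of 1.5 lines.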
[0021]
Next, in step S320, an initial value i = 1 is substituted for performing pattern matching for each of the N attributes. Subsequently, in step S330, the i-th (here, i = 1) attribute definition information is acquired. For example, the attribute name and the coordinates where the attribute exists are acquired as the attribute definition information shown in FIG. The first attribute name is “date” and the coordinates are (80, 10, 10, 5).
[0022]
Next, in step S340, the part-of-speech pattern matching unit 4 matches the morphological analysis result in FIG. 8 against the part-of-speech patterns in FIG. 9. As shown in FIG. 9, part-of-speech patterns are defined for the date, the organization name, and the author. The part-of-speech pattern for the first attribute name, “date”, is therefore matched against the morphological analysis result. The part-of-speech pattern for the date is [number (2-4)] [symbol] [number (1-2)] [symbol] [number (1-2)]. When a character string in the morphological analysis result matches this pattern, that character string is extracted as a date. Here, [number (2-4)], shown at 110 in FIG. 9, means “a number consisting of 2 to 4 digits”. In the morphological analysis result of FIG. 8, “2002” is four consecutive digits and so matches [number (2-4)] at 110. Next, the part of speech of “/”, indicated by reference numeral 117 in the morphological analysis result of FIG. 8, is symbol, so it matches [symbol] at 111 in FIG. 9. Matching continues in the same way: “9” matches as a one-digit number, “/” (118) as a symbol, and “30” as a two-digit number, so the pattern match succeeds.
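The date pattern walk-through can be sketched as a small matcher over (surface, part of speech) pairs. It assumes the analysis result contains one token per digit or symbol, so a digit count equals a token count; the names `DATE_PATTERN`, `match_at`, and `find_matches` and the tuple encoding of the pattern are illustrative, not the format of FIG. 9.

```python
# Each element: (part of speech, minimum length, maximum length).
DATE_PATTERN = [
    ("number", 2, 4),
    ("symbol", 1, 1),
    ("number", 1, 2),
    ("symbol", 1, 1),
    ("number", 1, 2),
]


def match_at(tokens, start, pattern):
    """Return the matched text if `pattern` matches at tokens[start:], else None."""
    index = start
    matched = []
    for pos, min_len, max_len in pattern:
        run = []
        # Consume consecutive tokens of the required part of speech.
        while (index < len(tokens) and tokens[index][1] == pos
               and len(run) < max_len):
            run.append(tokens[index][0])
            index += 1
        if len(run) < min_len:
            return None
        matched.extend(run)
    return "".join(matched)


def find_matches(tokens, pattern):
    """Collect every string in `tokens` that satisfies `pattern`."""
    matches = []
    for start in range(len(tokens)):
        # Do not start in the middle of a run (e.g. at the second digit).
        if start > 0 and tokens[start - 1][1] == pattern[0][0]:
            continue
        text = match_at(tokens, start, pattern)
        if text is not None:
            matches.append(text)
    return matches


def tokenize_date_like(text):
    """Helper for the example: tag digits as numbers, everything else as symbols."""
    return [(ch, "number" if ch.isdigit() else "symbol") for ch in text]
```

With the tokens for "2002/9/30", the matcher consumes four digits, a symbol, one digit, a symbol, and two digits, mirroring the sequence described in the text.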
[0023]
Next, the coordinate determination unit 5 evaluates the coordinates of the character field of each matching attribute candidate. This requires the position of the matched character field in the document, so the coordinate determination unit 5 first obtains the total number of lines and the maximum line length in characters, and takes these as the height and width of the document. Here, height = 30 (lines) and width = 25 (characters). Width is counted with a full-width character as 1 and a half-width character as 0.5. The coordinate determination unit 5 then obtains the row and column position of the character string in question, here “2002/9/30”. As an example, let the row and column be 2 and 19, the character width 4.5, and the character height 1. The extracted coordinates are then 19 × 100 / 25 = 76, 2 × 100 / 30 = 7, 4.5 × 100 / 25 = 18, and 1 × 100 / 30 = 3, each rounded to the nearest integer. Further, the distance between the extracted character string and the coordinates in the attribute definition information is obtained, using the following formula.
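The coordinate arithmetic in this paragraph can be reproduced with the sketch below. Two points are assumptions rather than statements from the text: results are rounded to the nearest integer (which matches the quoted values 7 and 3), and full-width characters are detected with Unicode East Asian Width classes "F" and "W". The function names are hypothetical.

```python
import unicodedata


def display_width(text):
    """Width in full-width units: full-width characters count 1, others 0.5."""
    return sum(
        1.0 if unicodedata.east_asian_width(ch) in ("F", "W") else 0.5
        for ch in text
    )


def relative_rect(row, col, width, height, doc_width, doc_height):
    """Scale a character-grid rectangle to the 0-100 relative coordinate system.

    Returns (x, y, width, height), each rounded to the nearest integer.
    """
    return (
        round(col * 100 / doc_width),
        round(row * 100 / doc_height),
        round(width * 100 / doc_width),
        round(height * 100 / doc_height),
    )
```

For the example in the text, `display_width("2002/9/30")` gives 4.5, and `relative_rect(2, 19, 4.5, 1, 25, 30)` gives (76, 7, 18, 3).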
[0024]
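(Equation 1)
$$D_i = \sum_{n=1}^{4} \left| P_{dni} - P_{eni} \right|$$
[0025]
Here, P_dni is the n-th coordinate component of attribute i in the attribute definition information, and P_eni is the corresponding component extracted from the document. With P_d1 = (80, 10, 10, 5) and P_e1 = (76, 7, 18, 3), Equation 1 gives D1 = |80 − 76| + |10 − 7| + |10 − 18| + |5 − 3| = 17.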
[0026]
Next, in step S350, it is determined whether the i-th (here, i = 1) attribute exists. The attribute is judged to exist when, among the character fields matched in pattern matching, the one at the smallest distance is also within a given threshold. With a threshold of 30, the distance for attribute 1 is 17 < 30, so the attribute exists, and the process proceeds to step S370 to save it. In step S370, the extracted attribute is held in a buffer, as in the example in FIG. 10.
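A minimal sketch of the distance and threshold test follows. It assumes the distance is the sum of absolute differences of the four coordinate components, which reproduces the values 17 and 84 quoted in the text; the function names and the choice of an inclusive comparison (the text uses both "less than" and "no greater than" for the threshold of 30) are assumptions.

```python
THRESHOLD = 30  # example threshold used in the text


def distance(defined, extracted):
    """Sum of absolute differences between two (x, y, width, height) tuples."""
    return sum(abs(d - e) for d, e in zip(defined, extracted))


def attribute_exists(defined, extracted, threshold=THRESHOLD):
    """True when the extracted rectangle is close enough to the defined one."""
    return distance(defined, extracted) <= threshold
```

For the date attribute, `distance((80, 10, 10, 5), (76, 7, 18, 3))` is 17, so `attribute_exists` returns True.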
[0027]
Next, in step S380, since i = 1 and N = 3, i < N is satisfied (step S380: Y), so i is incremented in step S390, and the process returns to step S330.
[0028]
In step S330, the second attribute is extracted as attribute definition information from the attribute definition file DB 9 based on i = 2. Taking the attribute definition information shown in FIG. 7 as an example, the attribute name is “organization name” and the coordinates are (80, 20, 10, 5). In this case, the part-of-speech pattern is organization name = [noun-proper noun-organization] from FIG. Here, “noun” 122 indicates the part of speech, “proper noun” 123 indicates additional information 1, and “organization” 124 indicates additional information 2. When the morphological analysis results shown in FIG. 8 are searched, “XX Construction” and “○× Electric” are [noun-proper noun-organization]. Next, the coordinate determination means 5 calculates the distance for each from its coordinate values. For example, assuming that the coordinates of “XX Construction” are (7, 17, 16, 3) and the coordinates of “○× Electric” are (76, 17, 16, 3), applying Equation (1) gives a distance of 84 for “XX Construction” and 15 for “○× Electric”. Since “○× Electric” has the smallest distance, it is set as the first candidate.
[0029]
Next, the process proceeds to step S350, where the attribute is determined to exist because the distance 15 of “○× Electric” is equal to or less than the threshold value 30, and the saving process is performed in step S370. Subsequently, in step S380, since i = 2 and N = 3, i < N is satisfied, so i is incremented in step S390, and the process proceeds to step S330.
[0030]
In step S330, the third attribute (i = 3) is extracted. From the attribute definition information shown in the example of FIG. 7, the attribute name is “author” and the coordinates are (80, 30, 10, 5), and the part-of-speech pattern shown in FIG. 9 is author = [noun-proper noun-person name]. When the morphological analysis result shown in FIG. 8 is searched for a person name, “Nishida” satisfies this condition, so its distance is obtained by Equation (1). Assuming that the coordinates of “Nishida” are (7, 23, 8, 3), P_d₃ = (80, 30, 10, 5) and P_e₃ = (7, 23, 8, 3), so D3 = 84 > 30, which does not satisfy the condition. Therefore, it is determined in step S350 that the attribute value does not exist (step S350: N), the process proceeds to step S360, and the re-search process is performed.
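The selection in steps S340 to S350 can be illustrated with a short sketch. It assumes the sum-of-absolute-differences distance that reproduces the worked values, treats a distance equal to the threshold as acceptable, and uses a hypothetical helper name `pick_candidate`. A result of `None` stands for the case in which the process moves on to the re-search of step S360.

```python
from typing import Dict, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height) in percent


def distance(defined: Box, extracted: Box) -> int:
    """Sum of absolute component differences (assumed form of Equation 1)."""
    return sum(abs(d - e) for d, e in zip(defined, extracted))


def pick_candidate(defined: Box, candidates: Dict[str, Box],
                   threshold: int = 30) -> Optional[Tuple[str, int]]:
    """Return (text, distance) for the nearest candidate, or None when no
    candidate lies within the threshold."""
    if not candidates:
        return None
    text, box = min(candidates.items(),
                    key=lambda item: distance(defined, item[1]))
    d = distance(defined, box)
    return (text, d) if d <= threshold else None


# Attribute 2 ("organization name"): two strings matched the pattern.
organization = pick_candidate(
    (80, 20, 10, 5),
    {"XX Construction": (7, 17, 16, 3), "○× Electric": (76, 17, 16, 3)},
)

# Attribute 3 ("author"): the only match is too far from the defined position.
author = pick_candidate((80, 30, 10, 5), {"Nishida": (7, 23, 8, 3)})

print(organization, author)
```

With the example coordinates, the organization name resolves to “○× Electric” at distance 15, while the author lookup returns no candidate because the distance 84 exceeds the threshold 30.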
[0031]
In the re-search processing in step S360, the part-of-speech re-extraction means 3 performs extraction on character strings for which another candidate exists in the morphological analysis result of step S200. Here, in the morphological analysis result of FIG. 8, a flag is set in the record 119 for “Kawasaki”, so the morphological dictionary 7 is used to judge whether the part of speech of “Kawasaki” in the record 119 can be a proper noun (person name). In the morphological dictionary 7, additional information 2 of the record 120 for “Kawasaki” is a person name, so the part-of-speech re-extraction means 3 extracts the parts of speech “noun”, “proper noun”, “person name”, and “surname” of the record 120 for “Kawasaki” and outputs them to the part-of-speech pattern matching means 4. FIG. 11 shows an example of the result. Comparing the morphological analysis result of FIG. 11 with that of FIG. 8, the additional information of the record 121 for “Kawasaki” has been changed. Because the record 121 has been changed, it is determined in step S365 that the part of speech has been changed, and the process proceeds to step S340 (step S365: Y). The part-of-speech pattern matching means 4 uses the output of the part-of-speech re-extraction means 3 to collate again with the part-of-speech pattern. As a result, the record 121 for “Kawasaki” is successfully collated. Next, the coordinate determination means 5 calculates the coordinates. For example, if the extracted coordinates of the record 121 for “Kawasaki” are P_e₃ = (84, 23, 8, 3), then D3 = 15 < 30, which satisfies the condition, so the record 121 for “Kawasaki” is output as the first candidate.
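The re-search of steps S360 and S365 might be sketched as below. The record layout, the dictionary excerpt, and the initial place-name reading of “Kawasaki” are illustrative assumptions, since the contents of FIG. 8 and of the morphological dictionary are not reproduced in the text. The names `Record`, `matches`, and `re_extract` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Pos = Tuple[str, ...]  # (part of speech, additional information 1, 2, ...)


@dataclass
class Record:
    surface: str
    pos: Pos
    has_alternatives: bool = False  # flag set during morphological analysis


# Hypothetical dictionary excerpt: every reading registered for a word.
DICTIONARY: Dict[str, List[Pos]] = {
    "Kawasaki": [
        ("noun", "proper noun", "area", "general"),
        ("noun", "proper noun", "person name", "surname"),
    ],
}

AUTHOR: Pos = ("noun", "proper noun", "person name")


def matches(pattern: Pos, pos: Pos) -> bool:
    """A reading matches when its leading tags equal the pattern tags."""
    return pos[:len(pattern)] == pattern


def re_extract(records: List[Record], pattern: Pos,
               dictionary: Dict[str, List[Pos]]) -> bool:
    """For flagged records, swap in a dictionary reading that fits the
    pattern. Returns True when at least one record was changed, which
    corresponds to the 'Y' branch of step S365."""
    changed = False
    for rec in records:
        if not rec.has_alternatives or matches(pattern, rec.pos):
            continue
        for alternative in dictionary.get(rec.surface, []):
            if matches(pattern, alternative):
                rec.pos = alternative
                changed = True
                break
    return changed


records = [
    Record("Kawasaki", ("noun", "proper noun", "area", "general"),
           has_alternatives=True),
]
changed = re_extract(records, AUTHOR, DICTIONARY)
print(changed, records[0].pos)
```

After the swap, “Kawasaki” carries the person-name reading and can be collated again with the author pattern, after which the coordinate check proceeds as before.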
[0032]
Finally, in step S400, the output means 6 outputs the attribute value corresponding to the attribute name as shown in FIG. 10, and ends.
[0033]
As described above, in the first embodiment, even if the desired part-of-speech information is not included in the result of the morphological analysis, the part-of-speech re-extraction means 3 re-extracts the part of speech for a word that can take a plurality of parts of speech, so that the desired part of speech can be acquired. As a result, the accuracy of attribute extraction is improved, and document attributes having appropriate contents can be extracted.
[0034]
In addition, because the determination uses coordinate position information, even if a plurality of words satisfying the attribute pattern condition exist in the document, the intended word can be extracted with higher accuracy than when position information is not used.
[0035]
In addition, since the attribute is extracted using both the coordinates and the description content, each attribute can be extracted correctly even if, for example, the organization name and the author are written in each other's positions.
[0036]
In the above description, the part-of-speech re-extraction means 3, the part-of-speech pattern matching means 4, the coordinate determination means 5, and the output means 6 have been described as separate components, but they may also be integrated into the part-of-speech pattern matching means.
[0037]
In the first embodiment, Equation (1) is used for the determination in the part-of-speech pattern matching process. However, other methods are conceivable, such as squaring the difference between P_dni and P_eni and taking the sum, so the determination is not limited to Equation (1). Likewise, the threshold value for the determination is not limited to the value used here.
[0038]
In the first embodiment, the part-of-speech pattern is expressed using “=” and “[]”. However, the present invention is not limited to this. For example, the part-of-speech pattern may be expressed using the known BNF (Backus-Naur Form) or the like.
[0039]
Further, in the first embodiment, when there are a plurality of candidate parts of speech in the morphological analysis processing, a flag is added to the output result for those parts of speech, and records in which the flag is set are searched in the re-search processing. As an alternative to this processing method, when there are a plurality of candidate parts of speech in the morphological analysis processing, a plurality of records (alternative part-of-speech information) may be output, one for each part of speech. Then, in the re-search processing, alternative candidates may be obtained based on the parts of speech and the additional information of the plurality of output results.
[0040]
In the first embodiment, the morphological analysis result is output in units of character fields or words, and alternative candidates are obtained by re-searching in order to extract document attributes. However, when a plurality of parts of speech are obtained for one character field or word in the morphological analysis processing, the morphological analysis result may instead be expressed in a form that lists all of the parts of speech, for example Kawasaki = [place name | person name] (where [A | B] denotes A or B), and the document attribute may be extracted in a single matching pass using this result in the part-of-speech pattern matching process. The same effect can be obtained in this way as well.
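As a rough illustration of the [A | B] form, each tag position of a word can hold a set of alternatives, and a pattern matches when every pattern tag appears among the alternatives at its position. This is only a sketch under that assumption; the function name `matches_alternatives` and the set-based data layout are hypothetical and do not appear in the original description.

```python
from typing import Sequence, Set


def matches_alternatives(pattern: Sequence[str],
                         tags: Sequence[Set[str]]) -> bool:
    """tags[n] holds every alternative for tag position n; the pattern
    matches when each pattern tag is among the alternatives there."""
    if len(tags) < len(pattern):
        return False
    return all(p in alternatives for p, alternatives in zip(pattern, tags))


# "Kawasaki" = [place name | person name], stored as one record.
kawasaki = [{"noun"}, {"proper noun"}, {"place name", "person name"}]

is_author = matches_alternatives(
    ["noun", "proper noun", "person name"], kawasaki)
is_organization = matches_alternatives(
    ["noun", "proper noun", "organization"], kawasaki)
print(is_author, is_organization)
```

Because both readings are carried in one record, the author pattern matches in a single pass and no re-search step is needed.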
[0041]
Further, the above configuration can be realized by a computer program. In this case, the input unit 1 is an input program, the morphological analysis unit 2 is a morphological analysis program, the part-of-speech re-extraction unit 3 is a part-of-speech re-extraction program, the part-of-speech pattern matching unit 4 is a part-of-speech pattern matching program, and the coordinate determination unit 5 is a coordinate extraction program. The output means 6 is realized as an output program.
[0042]
Embodiment 2.
In the first embodiment, the means for performing attribute extraction has been described on the assumption that the correct part of speech and additional information are defined in the morphological dictionary. The second embodiment describes means for extracting attributes even when the correct part of speech and additional information are not given in the morphological dictionary. FIG. 1 is used as the configuration diagram of the attribute extraction device according to the second embodiment. The reference numerals in FIG. 1 are the same as those in the first embodiment, and their description is omitted.
[0043]
FIG. 12 shows a document used in the second embodiment. FIG. 13 shows an example of attribute definition information used in the second embodiment. Here, the attributes to be extracted are “destination person name” and “destination organization name”. The processing is executed according to the flowchart of FIG. First, in step S100, the input unit 1 inputs a document. Next, in step S200, the morphological analysis unit 2 performs a morphological analysis. FIG. 15 shows an example of the output result of the morphological analysis means 2. In FIG. 15, similarly to the output result of the morphological analysis processing in the first embodiment, the character strings extracted as a result of the morphological analysis processing are output in the order in which they are detected. Therefore, when character strings adjacent to a certain character string are extracted, these character strings are output as adjacent records.
[0044]
Furthermore, Embodiment 2 is effective when the morphological analysis result uses additional information such as “noun, suffix, person name”, “noun, suffix, general”, and “noun, suffix, area”. Here, “noun, suffix, person name” is a classification for words that can follow a person name, such as “dono,” “sama,” “teacher,” and “author.” “Noun, suffix, general” is a classification for words that can follow a general noun, for example “forehead”, “for”, and “sex”. Further, “noun, suffix, area” is a classification for words that can follow an area name, such as “city”, “town”, “village”, “station”, and “branch”. The details of the processing in step S100 and step S200 are the same as those in the first embodiment, and their description is not repeated.
[0045]
Subsequently, in step S300, the part-of-speech pattern matching unit 4 performs a pattern matching process. Here, the process is executed based on the flowchart of FIG. 5 using the example of the attribute definition information shown in FIG.
[0046]
In step S310 of FIG. 5, the number of attributes N = 2 is obtained, and in step S320, i = 1 is set. Then, in step S330, attribute definition information is obtained. Here, “attribute name” = “addressee name” and “coordinates” = (20, 30, 10, 5). Next, the part-of-speech pattern matching means 4 performs a matching process using the morphological analysis result of FIG. 15 and the part-of-speech pattern of FIG. According to FIG. 15, the record 125 for “Nakagawara” has part of speech = “noun”, additional information 1 = “proper noun”, additional information 2 = “area”, and additional information 3 = “general” ([noun-proper noun-area-general]), which partly differs from [noun-proper noun-person name] in the part-of-speech pattern in FIG. Also, since “Nakagawara” does not exist in the morphological dictionary 7 as [person name], no flag is set in FIG. Therefore, even if the part-of-speech re-extraction means 3 is used, the desired part of speech cannot be obtained. In the second embodiment, therefore, even when the contents of additional information 2 and additional information 3 do not completely match the part-of-speech pattern, the part-of-speech pattern matching means 4 performs the pattern matching process using the word of the next record. Here, the record 126 for the next word “dono” has part of speech = “noun”, additional information 1 = “suffix”, and additional information 2 = “person name” ([noun-suffix-person name]), which matches the pattern description [noun-suffix-person name]. Accordingly, the part-of-speech pattern matching means 4 calculates the degree of matching in the part-of-speech collation and determines whether the character string matches the part-of-speech pattern by comparing the degree of matching with a threshold.
[0047]
The matching degree calculation calculates how much each part of speech in the part of speech pattern matches the morphological analysis result. Specifically, the following equation (2) is used.
[0048]
(Equation 2)
$$DM_i = \frac{1}{k} \sum_{j=1}^{k} \frac{S_j}{P_j}$$
[0049]
Here, k is the number of parts of speech in the part-of-speech pattern, S_j is the number of items of additional information that matched for the j-th part of speech, and P_j is the number of items of additional information of the j-th part of speech in the part-of-speech pattern. Parts of speech with P_j = 0 are ignored. i is the same as i in FIG.
[0050]
In this case, from FIG. 14, the attribute name “addressee name” is composed of a combination of two parts of speech, [noun-proper noun-person name] and [noun-suffix-person name], so k = 2. The additional information of the first part of speech is [proper noun-person name], so P₁ = 2, and of these only [proper noun] matches in the part-of-speech pattern matching, so S₁ = 1. The additional information of the second part of speech is [suffix-person name], so P₂ = 2, and both items match, so S₂ = 2. Applying these values to Equation (2) gives
DM₁(Nakagawara) = 1/2 × (1/2 + 2/2) = 0.75.
[0051]
If the threshold for DM₁ is set to 0.5, then DM₁ = 0.75 > 0.5, so the part-of-speech pattern matching succeeds, and the following processing is performed. The part-of-speech pattern matching means searches for other matching character strings. From FIG. 15, “XX Construction”, “○× Electric”, and “Kawasaki” partially match because they are proper nouns, but for none of them does the additional information of the following part of speech match. Therefore,
DM₁(XX Construction) = 1/2 × (1/2 + 0/2) = 0.25,
DM₁(○× Electric) = 1/2 × (1/2 + 0/2) = 0.25,
DM₁(Kawasaki) = 1/2 × (1/2 + 0/2) = 0.25,
all of which are less than or equal to the threshold, so these are not considered to match.
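The matching degree calculation can be sketched as follows, assuming the form DM = (1/k) Σ S_j / P_j that reproduces the values 0.75 and 0.25 above. The sketch further assumes that additional information is compared position by position, that a mismatch in the main part of speech yields S_j = 0, and that the word following “XX Construction” is tagged [noun-proper noun-area-general]. These details, like the function names, are illustrative and are not stated in the text.

```python
from typing import List, Sequence


def matched_additional(pattern: Sequence[str], actual: Sequence[str]) -> int:
    """S_j: number of additional-information tags (every tag after the
    first) that equal the pattern tag at the same position."""
    if not actual or pattern[0] != actual[0]:
        return 0  # assumption: no credit when the main part of speech differs
    return sum(1 for p, a in zip(pattern[1:], actual[1:]) if p == a)


def matching_degree(pattern: List[Sequence[str]],
                    actual: List[Sequence[str]]) -> float:
    """DM = (1/k) * sum(S_j / P_j); terms with P_j = 0 are skipped."""
    k = len(pattern)
    total = 0.0
    for pat, act in zip(pattern, actual):
        p_j = len(pat) - 1  # number of additional-information tags
        if p_j == 0:
            continue
        total += matched_additional(pat, act) / p_j
    return total / k


ADDRESSEE = [("noun", "proper noun", "person name"),
             ("noun", "suffix", "person name")]

# "Nakagawara" followed by "dono"
nakagawara = [("noun", "proper noun", "area", "general"),
              ("noun", "suffix", "person name")]
# "XX Construction" followed by a word that is not a person-name suffix
construction = [("noun", "proper noun", "organization"),
                ("noun", "proper noun", "area", "general")]

dm_nakagawara = matching_degree(ADDRESSEE, nakagawara)
dm_construction = matching_degree(ADDRESSEE, construction)
print(dm_nakagawara, dm_construction)
```

With a threshold of 0.5, only the “Nakagawara” sequence (0.75) is accepted, and the other candidate (0.25) is rejected, in line with the description.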
[0052]
Next, in step S350, assuming that the coordinates of “Nakagawara” are, for example, (8, 23, 16, 3), the distance from the coordinates in the attribute definition information is 27, which is 30 or less. The process therefore proceeds to step S370 to perform the saving process, and “Nakagawara-dono” is held in the buffer as shown in FIG.
[0053]
Next, in step S380, since i = 1 and N = 2 and i < N, i is incremented in step S390, and the process proceeds to step S330. When the destination organization name is extracted at i = 2, “XX Construction” and “○× Electric” are extracted as candidates, as for the attribute “organization name” in the first embodiment, and their coordinates are (7, 17, 16, 3) and (76, 17, 16, 3), respectively. The distances from the coordinates in the attribute definition information are 24 and 67, respectively. Since the distance of “XX Construction” is smaller, it is extracted as the candidate. Subsequently, the process proceeds through step S350 and step S370, and since i = N in step S380, the process ends. Finally, in step S400, the output means outputs the attribute extraction result shown in FIG.
[0054]
As described above, according to the second embodiment, the part-of-speech pattern matching means 4 performs matching using the degree of matching, so attributes can be extracted even when an error occurs in the morphological analysis because a desired word does not exist in the morphological dictionary.
[0055]
Further, the part-of-speech pattern matching means 4 performs fuzzy matching even if the part-of-speech information does not completely match, so attribute extraction is possible even when the morphological analysis result is erroneous because the correct part of speech or additional information is not present in the morphological dictionary.
[0056]
In addition, for words that can take a plurality of parts of speech in the morphological analysis, document attributes can be extracted from the alternative candidates. Further, by performing the determination using coordinate position information, more accurate attribute extraction becomes possible.
[0057]
In the second embodiment, the expression (2) is used for the determination of the part-of-speech pattern matching, but this is not limited to the expression (2). The same applies to the threshold for determination.
[0058]
[Effect of the Invention]
According to the present invention, a document attribute to be extracted is expressed by a pattern based on a morpheme, a morphological analysis is performed on a character field of the input document, pattern matching is performed, and a determination is made as to the appearance position of the character field in the input document. Therefore, it is possible to extract a character field whose content is appropriate as a document attribute, and to extract a document attribute even if the position of the character field is shifted.
[Brief description of the drawings]
FIG. 1 is a configuration diagram of Embodiments 1 and 2.
FIG. 2 is a diagram showing an example of the contents of a morphological dictionary.
FIG. 3 is a diagram showing an example of contents of a morphological dictionary.
FIG. 4 is a flowchart illustrating an attribute extraction process.
FIG. 5 is a flowchart illustrating a pattern matching process.
FIG. 6 is a diagram showing an example of an input document according to the first embodiment.
FIG. 7 is a diagram showing an example of attribute definition information used in the first embodiment.
FIG. 8 is a diagram showing an example of a morphological analysis result according to the first embodiment.
FIG. 9 is a diagram illustrating an example of a part of speech pattern according to the first embodiment.
FIG. 10 is a diagram showing an example of an attribute extraction result according to the first embodiment.
FIG. 11 is a diagram showing an example of a morphological analysis result after correction according to the first embodiment.
FIG. 12 is a diagram illustrating an example of an input document according to the second embodiment.
FIG. 13 is a diagram showing an example of attribute definition information used in the second embodiment.
FIG. 14 is a diagram illustrating an example of a part of speech pattern according to the second embodiment.
FIG. 15 is a diagram showing an example of a morphological analysis result according to the second embodiment.
FIG. 16 is a diagram illustrating an example of an attribute extraction result according to the second embodiment.
[Explanation of symbols]
1: input means, 2: morphological analysis means, 3: part of speech re-extraction means
4: part-of-speech pattern matching means, 5: coordinate determination means, 6: output means
7: Morphological dictionary, 8: Part-of-speech pattern DB, 9: Attribute definition file DB

Claims

An input step for inputting a document;
A morphological analysis step of outputting a morphological analysis result for each character field of the document;
A part-of-speech pattern matching step of comparing a document attribute part-of-speech pattern expressing the structure of a document attribute in morpheme units and character units with a morphological analysis result of the character field, and outputting a matching character as an attribute candidate;
A coordinate determination step of determining whether an appearance position of the attribute candidate in the document is within a predetermined range;
Outputting an attribute candidate determined to belong to the range in the coordinate determination step as a document attribute.

Input means for inputting a document;
Morphological analysis means for outputting a morphological analysis result for each character field of the document,
A part-of-speech pattern matching unit that matches a document attribute part-of-speech pattern consisting of a part of speech and a character type with a morphological analysis result of the character field and outputs a matching character as an attribute candidate;
Coordinate determination means for determining whether the appearance position of the attribute candidate in the document is within a predetermined range,
Output means for outputting, as a document attribute, an attribute candidate determined by the coordinate determination means to belong to the range.

The morphological analysis unit, when there are a plurality of parts of speech corresponding to the morpheme constituting the field, outputs alternative part of speech information as a result of the morphological analysis together with a list including the part of speech and the character type constituting the character field,
3. The attribute extracting apparatus according to claim 2, wherein the part-of-speech pattern matching unit is configured to match a list obtained by exchanging the list and the alternative part-of-speech information with the document attribute part-of-speech pattern.

4. The attribute extraction device, wherein the part-of-speech pattern matching unit is configured such that, when the part of speech of the character field following a character field whose list includes a part of speech different from the part of speech of the document attribute part-of-speech pattern is a suffix connected to the part of speech of the document attribute part-of-speech pattern, the document attribute part-of-speech pattern is compared with a list obtained by replacing the different part of speech with the part of speech of the document attribute part-of-speech pattern.

When the matching degree of the matching pattern between the character field and the document attribute part-of-speech pattern is equal to or greater than a predetermined value, the part-of-speech pattern matching means outputs the character field as an attribute candidate matching the document attribute part-of-speech pattern. The attribute extraction device according to claim 2, wherein the attribute extraction device is configured to:

An input procedure for inputting a document,
A morphological analysis procedure for outputting a morphological analysis result for each character field of the document,
A part-of-speech pattern matching procedure for comparing a document attribute part-of-speech pattern expressing the structure of a document attribute at a morpheme level with a morphological analysis result of the character field, and outputting a matching character as an attribute candidate;
A coordinate determination procedure for determining whether or not the appearance position of the attribute candidate in the document is within a predetermined range;
An output process of outputting as a document attribute an attribute candidate determined to belong to the range in the coordinate determination step.