JP4036064B2

JP4036064B2 - Morphological sequence processing apparatus and method

Info

Publication number: JP4036064B2
Application number: JP2002266227A
Authority: JP
Inventors: リーヤン李; 昌一舘野
Original assignee: Fuji Xerox Co Ltd; Fujifilm Business Innovation Corp
Current assignee: Fujifilm Business Innovation Corp
Priority date: 2002-09-12
Filing date: 2002-09-12
Publication date: 2008-01-23
Anticipated expiration: 2022-09-12
Also published as: JP2004102856A

Description

[0001]
[Technical Field of the Invention]
The present invention relates to a morpheme sequence processing technique for discriminating a person name from a morpheme sequence.
[0002]
[Prior art]
Like other proper nouns, personal names often have important meanings in the text. For example, by extracting and discriminating the names of people in the text, it is possible to determine the importance of the text, etc., and to classify the text by purpose and content. Therefore, it is desired to extract a person name in the text. However, it is necessary to perform complicated processing such as syntax analysis and semantic analysis in order to determine whether the name is a person's name. Conversely, if the organization name can be grasped before syntactic analysis or semantic analysis, syntactic analysis or the like can be performed easily and reliably.
[0003]
[Problems to be solved by the invention]
The present invention has been made in view of the above circumstances, and an object thereof is to provide a technique for discriminating a person name from a morpheme string.
[0004]
[Means for Solving the Problems]
According to this invention, in order to achieve the above-mentioned object, the configuration as described in the claims is adopted. Here, prior to describing the invention in detail, supplementary explanations of the claims will be given.
[0005]
According to one aspect of the present invention, in order to achieve the above object, a morpheme processing device is provided with: means for storing a person-name morpheme string part discrimination rule for discriminating, from a morpheme string, a morpheme string part corresponding to a person name; characteristic assigning means for assigning, to the corresponding morphemes or morpheme groups of the morpheme string, a first characteristic of a candidate for a preceding boundary that precedes a person name, a second characteristic of a candidate for a succeeding boundary that follows a person name, and a third characteristic of a candidate for a person-name part; and means for discriminating the morpheme string part corresponding to a person name by applying the discrimination rule to the first, second, and third characteristics assigned to the morphemes or morpheme groups of the morpheme string.
[0006]
In this configuration, it is determined whether each morpheme or morpheme group in the morpheme string has the characteristic of a person-name boundary candidate or of a candidate constituent of a person name, and on that basis it is determined whether a given morpheme string part corresponds to a person name. A person name can therefore be discriminated easily. A characteristic is also called a feature or an attribute.
[0007]
The target person names are mainly full names (family name plus given name), but a family name alone or a given name alone is also acceptable. The technique is applicable not only to Japanese but also to other languages such as Korean and Chinese.
[0008]
As candidates for person name boundaries, morphemes or morpheme groups having characteristics of a person's title (such as job title) can be used.
[0009]
A title characteristic can be assigned to a morpheme group based on the characteristic of a title-prefix candidate (such as “元” (former) or “前” (previous)), the characteristic of a title-suffix candidate that follows a title (such as “官” or “相”), and the characteristic of a title-part candidate.
[0010]
However, some titles, such as “被告” (defendant), rarely combine with other morphemes to form a long title, and it is preferable to assign the single-title characteristic to them.
[0011]
The characteristic of a candidate for the person-name boundary following a person name can be assigned to honorific suffixes that follow a person name, such as “さん” (san), “様” (sama), and “氏” (shi).
[0012]
The person-name boundary candidate characteristic can also be assigned to particles, punctuation marks, middle dots, hyphens, and parentheses.
[0013]
The characteristic of a person-name boundary preceding a person name can be assigned to kanji strings of two or more characters that describe a career or the like, such as “経営” (management).
[0014]
According to another aspect of the present invention, in order to achieve the above object, a morpheme string processing device is provided with: means for inputting a morpheme string; means for assigning to the morpheme string the characteristics required by at least one of the plurality of full-name discrimination rules given below; and means for applying at least one of those rules and discriminating, as a full name, the morpheme string part sandwiched between a part having the leftboundary characteristic and a part having the rightboundary characteristic.
[0015]
The full-name discrimination rules are as follows.
[0016]
[Rule 1]: |leftboundary| familyname(placename), namebody, givenname |rightboundary|
[0017]
[Rule 2]: |leftboundary| familyname(placename), givenname, namebody[grb:+] |rightboundary|
[0018]
[Rule 3]: |leftboundary| familyname(placename), givenname |rightboundary|
[0019]
[Rule 4]: |leftboundary| familyname(placename), namebody+ |rightboundary|
[0020]
[Rule 5]: |leftboundary| namebody[flb:+], namebody+, givenname |rightboundary|
[0021]
Here, leftboundary denotes the characteristic of a candidate for the boundary preceding a full name; rightboundary, the characteristic of a candidate for the boundary following a full name; familyname(placename), the characteristic of a family-name or place-name candidate; givenname, the characteristic of a given-name candidate; namebody, the characteristic of a candidate for part of a full name; namebody[grb:+], the characteristic of a full-name part candidate that follows a common given name; and namebody[flb:+], the characteristic of a full-name part candidate that precedes a common family name. A “+” after a characteristic denotes one or more repetitions.
[0022]
The present invention can be realized not only as an apparatus or a system but also as a method. Of course, a part of the invention can be configured as software. Of course, software products used to cause a computer to execute such software are also included in the technical scope of the present invention.
[0023]
The above described aspects of the invention and other aspects of the invention are set forth in the appended claims and are described in detail below with reference to examples.
[0024]
[Embodiments of the Invention]
Examples of the present invention will be described below.
[0025]
FIG. 1 shows the morpheme processing device of the embodiment as a whole. In this figure, the morpheme processing device includes a Japanese text input unit 10, a morpheme analysis unit 11, a characteristic assignment unit 12, a chunking unit 13, a person name extraction unit 14, a person name output unit 15, a morphological analysis dictionary 16, a chunking dictionary 17, and the like. The device can be realized by installing, in a computer system 100 such as a personal computer or a workstation, a computer program recorded on a recording medium 101 or sent via a communication network (not shown).
[0026]
The Japanese text input unit 10 inputs Japanese text to the morpheme processing device. The morpheme analysis unit 11 refers to the morphological analysis dictionary 16, divides the input Japanese text into morphemes, and attaches grammatical information. For example, “ChaSen”, developed at the Nara Institute of Science and Technology, can be used as the morpheme analysis unit 11. The characteristic assignment unit 12 refers to the chunking dictionary 17 and assigns the characteristics used for chunking to the corresponding morphemes.
[0027]
In this example, candidate characters that make up person names are registered in the chunking dictionary 17. Specifically, the characters that occur at the beginning of a family name (flb) and at the end of a family name (frb) are extracted from, for example, 2,437 Japanese family names, yielding 1,990 and 1,600 characters respectively, and are registered. Likewise, the characters that occur at the beginning of a given name (glb) and at the end of a given name (grb) are extracted from, for example, 59,183 Japanese given names, yielding 5,550 and 4,030 characters respectively, and are registered. The character characteristics flb, frb, glb, and grb, together with the person-name character candidate characteristic name-body that integrates them, can then be assigned to characters in the morpheme string.
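As a rough illustration of how such a character table might be built, the following Python sketch collects first and last characters from name lists and derives the four characteristics plus the merged name-body characteristic. The three-entry name lists and the function names `edge_chars` and `char_features` are assumptions made for illustration; the specification does not describe the extraction procedure or the data.

```python
def edge_chars(names):
    """Return (first characters, last characters) seen in a list of names."""
    firsts = {n[0] for n in names if n}
    lasts = {n[-1] for n in names if n}
    return firsts, lasts

# Toy lists standing in for the 2,437 family names and 59,183 given names.
FAMILY_NAMES = ["山田", "権田", "伊佐山"]
GIVEN_NAMES = ["沙知子", "晴彦", "健志"]

flb, frb = edge_chars(FAMILY_NAMES)   # family-name first / last characters
glb, grb = edge_chars(GIVEN_NAMES)    # given-name first / last characters

def char_features(ch):
    """Characteristics of one character; name-body is set if any of the four applies."""
    tables = (("flb", flb), ("frb", frb), ("glb", glb), ("grb", grb))
    feats = {name for name, table in tables if ch in table}
    if feats:
        feats.add("name-body")
    return feats
```

With these toy lists, “山” receives both flb (from 山田) and frb (from 伊佐山), which shows why name-body is treated as a union rather than a separate table.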
[0028]
Examples of family names and given names registered in the dictionary (morphological analysis dictionary 16) are shown in FIGS. 4 and 5.
[0029]
Chunking groups a run of consecutive morphemes together and assigns grammatical information and characteristics to the group. In this example, each morpheme in the morphological analysis result is held in a terminal node (leaf), so the analysis result is a sequence of terminal nodes. Each terminal node holds, as attribute values, the surface form and lexical form of the morpheme, its part of speech, and its characteristics (such as its role within the part of speech). Chunking refers to the attribute values held by adjacent nodes, checks whether a specific pattern is present, and, if so, groups the matching node sequence together. In doing so, a node is generated above the node sequence and associated with it, and new attribute values can be set on the newly generated node. When a node sequence is referenced, only the topmost nodes are considered.
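The node structure described above can be sketched in Python roughly as follows. The `Node` class, the `chunk` helper, the attribute keys, and the segmentation of the sample string are all hypothetical; the specification only states that a parent node is generated over the matched node sequence and that later matching looks at topmost nodes only.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    attrs: dict                                   # e.g. {"surface": "元"}
    children: list = field(default_factory=list)  # empty for terminal nodes

    def surface(self):
        """Concatenated surface form of the leaves under this node."""
        if not self.children:
            return self.attrs["surface"]
        return "".join(child.surface() for child in self.children)

def chunk(top, start, end, new_attrs):
    """Group top[start:end] under a new parent; return the new topmost sequence."""
    parent = Node(dict(new_attrs), top[start:end])
    return top[:start] + [parent] + top[end:]

# Assumed segmentation of the sample string; the real analysis is in FIG. 6.
leaves = [Node({"surface": s})
          for s in ["元", "法政", "大学", "教授", "駒尺", "喜美", "さん"]]
top = chunk(leaves, 0, 4, {"cat": "longtitle"})
```

After the call, `top` holds four topmost nodes, and the first one stands for the whole grouped span, so subsequent rules never see its four children individually.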
[0030]
The chunking unit 13 performs chunking on a person name, a title, and the like using the chunking dictionary 17. The personal name extraction unit 14 extracts a series of morpheme strings assigned the personal name characteristics in the chunking unit 13 as personal names. The person name output unit 15 outputs the extracted person name (morpheme string). The personal name may be used in a later syntax analysis, or may be used for text selection or the like. It can be used for various other purposes.
[0031]
FIG. 2 shows the configuration of the chunking unit 13. In this figure, the chunking unit 13 includes a boundary detection unit 21, a rule application unit 22, and the like. The boundary detection unit 21 detects the boundaries before and after a person name based on the characteristics of morphemes or morpheme groups. For example, a boundary before or after a person name is detected from the title “教授” (professor). This is described in detail later. The rule application unit 22 discriminates person names by applying the five rules described later.
[0032]
FIG. 3 shows the configuration of the boundary detection unit 21 of the chunking unit 13. In this figure, the boundary detection unit 21 includes a long title detection unit 23, a left boundary detection unit 24, a right boundary detection unit 25, and the like.
[0033]
The long title detection unit 23 discriminates long titles according to the following long title discrimination rules.
[0034]
[Long title rule 1]:
longtitle = head[title-prefix:+], ?*[title-body], noun[title]; noun[title-suffix].
[0035]
where
longtitle: long title
head: prefix
title-prefix: title prefix
title-body: title body
noun: noun
title-suffix: title suffix
[0036]
For example, “元法政大学教授駒尺喜美さん” (former Hosei University professor Komashaku Kimi, followed by the honorific “san”) is morphologically analyzed as shown in FIG. 6, and the part “元法政大学教授” (former Hosei University professor) is discriminated as a long title (longtitle).
[0037]
[Long title rule 2]:
longtitle = ?*[title-body:+], noun[title:+]; noun[title-suffix:+].
[0038]
For example, “山口敏国立科学博物館名誉研究員” (the name 山口敏 followed by “honorary researcher of the National Science Museum”) is morphologically analyzed as shown in FIG. 7, and the part “国立科学博物館名誉研究員” is discriminated as a long title (longtitle).
[0039]
Examples of title-prefix, title-suffix, and title-body are shown in FIG.
[0040]
Note that some titles, such as “被告” (defendant), rarely combine with other morphemes to form a long title. It is preferable to assign such titles the single-title (shorttitle) characteristic so that they are not mistakenly chunked into a long title.
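A minimal Python sketch of long-title detection is given here under one assumed reading of the two long-title rules: an optional title-prefix, any number of title-body items, then one item carrying title or title-suffix, with items marked shorttitle never absorbed. The function name, the token layout, the requirement of at least two items, and the segmentation and features of the sample are assumptions; the actual analysis is shown only in FIG. 6 to FIG. 8.

```python
def find_long_title(tokens):
    """Return (start, end) of the first long-title span, or None.

    tokens is a list of (surface, features) pairs; features is a set of strings.
    """
    n = len(tokens)
    for start in range(n):
        i = start
        if "title-prefix" in tokens[i][1]:        # optional prefix such as 元
            i += 1
        while i < n and "title-body" in tokens[i][1]:
            i += 1                                # zero or more title-body items
        if (start < i < n
                and tokens[i][1] & {"title", "title-suffix"}
                and "shorttitle" not in tokens[i][1]):
            return start, i + 1                   # span ends on the title item
    return None

# Assumed segmentation and features for the first long-title example.
tokens = [
    ("元", {"title-prefix"}), ("法政", {"title-body"}), ("大学", {"title-body"}),
    ("教授", {"title"}), ("駒尺", set()), ("喜美", set()), ("さん", set()),
]
span = find_long_title(tokens)
long_title = "".join(surface for surface, _ in tokens[span[0]:span[1]])
```

The shorttitle check is what keeps a title such as “被告” from being merged with whatever happens to precede it.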
[0041]
The left boundary detection unit 24 and the right boundary detection unit 25 detect the front and rear (left and right) boundaries of the person name using the characteristics of titles (including long titles) and the like. Titles (including long titles) act as left and right boundaries.
[0042]
Honorifics such as “氏” (shi) and “さん” (san), predetermined particles, punctuation marks, and the like act as right boundaries, as shown in FIG. 9.
[0043]
Particles such as “の” (no), as in “横綱の武蔵丸” (the yokozuna Musashimaru), the beginning of a sentence (BOS), and predetermined words of two or more characters (for example, “経営”, management) act as left boundaries.
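The boundary assignment described in the preceding paragraphs could be sketched in Python as follows. The marker sets are small illustrative samples assumed here (the full lists are in FIG. 9, which is not reproduced), and the `<BOS>` pseudo-token and the function name are likewise hypothetical.

```python
# Illustrative marker sets only; the specification's full lists are in FIG. 9.
RIGHT_MARKERS = {"氏", "さん", "様", "、", "。", "・", "（"}
LEFT_MARKERS = {"の", "経営"}

def mark_boundaries(tokens):
    """Add leftboundary / rightboundary to (surface, features) tokens.

    A pseudo-token for the beginning of sentence (BOS) is prepended and acts
    as a left boundary. Titles and long titles act as both kinds of boundary.
    """
    marked = [("<BOS>", {"leftboundary"})]
    for surface, feats in tokens:
        feats = set(feats)                        # copy; do not mutate the input
        if feats & {"title", "longtitle"}:
            feats |= {"leftboundary", "rightboundary"}
        if surface in RIGHT_MARKERS:
            feats.add("rightboundary")
        if surface in LEFT_MARKERS:
            feats.add("leftboundary")
        marked.append((surface, feats))
    return marked

marked = mark_boundaries([("横綱", {"title"}), ("の", set()), ("武蔵丸", set())])
```

In the sample, “武蔵丸” is left unmarked, so it remains a candidate for the name rules, while “の” immediately before it supplies the left boundary.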
[0044]
Next, the person-name discrimination rules of the rule application unit 22 are described.
[Person name discrimination rule 1]:
name = |leftboundary| familyname(placename), name-body, givenname |rightboundary|.
[0045]
where
name: person name
leftboundary: left boundary of a person name
familyname(placename): family name or place name
name-body: a word forming the body of a person name
givenname: given name
rightboundary: right boundary of a person name
[0046]
For example, “山田沙知子（関大）” (Yamada Sachiko, Kansai University) is morphologically analyzed as shown in FIG. 10, after which “山田沙知子” is discriminated as a person name.
[0047]
[Person name discrimination rule 2]:
name = |leftboundary| familyname(placename), givenname, name-body[grb:+] |rightboundary|.
[0048]
where
name-body[grb:+]: a word forming the body of a person name that has the characteristic of a candidate for the last character of a given name
[0049]
For example, “宮園佳征被告” (a name followed by the title “被告”, defendant) is morphologically analyzed as shown in FIG. 11, after which “宮園佳征” is discriminated as a person name.
[0050]
[Person name discrimination rule 3]:
name = |leftboundary| familyname(placename), givenname |rightboundary|.
[0051]
For example, “ミステリー史に詳しい権田萬治・専修大教授は” (Gonda Manji, a Senshu University professor well versed in the history of mystery fiction) is morphologically analyzed as shown in FIG. 12, after which “権田萬治” is discriminated as a person name. The word “詳しい” (well versed) has the longword characteristic, which marks the left boundary of a person name.
[0052]
[Person name discrimination rule 4]:
name = |leftboundary| familyname(placename), name-body+ |rightboundary|.
[0053]
For example, “伊佐山健志氏” (a name followed by the honorific “氏”) is morphologically analyzed as shown in FIG. 13, after which “伊佐山健志” is discriminated as a person name.
[0054]
[Person name discrimination rule 5]:
name = |leftboundary| name-body[flb:+], name-body+, givenname |rightboundary|.
[0055]
where
name-body[flb:+]: a word forming the body of a person name that has the characteristic of a candidate for the first character of a family name
[0056]
For example, “萩元晴彦（はぎもと・はるひこ）氏” (Hagimoto Haruhiko, with the reading in parentheses and the honorific “氏”) is morphologically analyzed as shown in FIG. 14, after which “萩元晴彦” is discriminated as a person name.
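The five person-name rules can be tried out with a small Python sketch such as the following. Each rule is encoded as a list of (required features, repeatable) items, where repeatable stands for the “+” notation. The encoding, the shortened feature name `familyname` for familyname(placename), the function names, and the segmentation and features of the sample are assumptions for illustration, not the implementation described in the specification.

```python
FAM, GIV, BODY = {"familyname"}, {"givenname"}, {"name-body"}

# (required features, repeatable) per item; repeatable encodes the "+" suffix.
RULES = [
    [(FAM, False), (BODY, False), (GIV, False)],              # rule 1
    [(FAM, False), (GIV, False), (BODY | {"grb"}, False)],    # rule 2
    [(FAM, False), (GIV, False)],                             # rule 3
    [(FAM, False), (BODY, True)],                             # rule 4
    [(BODY | {"flb"}, False), (BODY, True), (GIV, False)],    # rule 5
]

def match(pattern, tokens, i=0):
    """True if the pattern consumes tokens[i:] exactly."""
    if not pattern:
        return i == len(tokens)
    needed, repeat = pattern[0]
    if i >= len(tokens) or not needed <= tokens[i][1]:
        return False
    if match(pattern[1:], tokens, i + 1):         # move on to the next item
        return True
    return repeat and match(pattern, tokens, i + 1)  # or repeat this item

def find_names(tokens):
    """Return (rule number, surface) for spans enclosed by a left and a right boundary."""
    found = []
    for left, (_, left_feats) in enumerate(tokens):
        if "leftboundary" not in left_feats:
            continue
        for right in range(left + 2, len(tokens)):
            if "rightboundary" not in tokens[right][1]:
                continue
            span = tokens[left + 1:right]
            for number, rule in enumerate(RULES, start=1):
                if match(rule, span):
                    found.append((number, "".join(s for s, _ in span)))
                    break
    return found

# Assumed analysis of the rule 5 sample; the real one is shown in FIG. 14.
tokens = [
    ("<BOS>", {"leftboundary"}),
    ("萩", {"name-body", "flb"}), ("元", {"name-body"}),
    ("晴彦", {"givenname"}), ("（", {"rightboundary"}),
]
names = find_names(tokens)
```

Rules are tried in numerical order and the first match wins, which is one possible policy; the specification does not state how competing rules are prioritized.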
[0057]
According to this embodiment, even an unusual person name can be discriminated reliably. For example, the given name “紀基” in “依田紀基本名人” is relatively rare; if it is not in the dictionary, “紀” and “基” become separate parts and the name would normally be difficult to recognize, but using the name-body characteristic of both characters it can be reliably identified as a person name. Using “名人” (Meijin, a title) or the like as a person-name boundary characteristic also prevents misrecognition.
[0058]
Similarly, if “嶋山” in “嶋山ハル子” is an unusual family name, the string is split into “嶋”, “山”, and “ハル子” and would normally be difficult to recognize correctly as a person name, but using the name-body characteristic of “嶋” and “山” it can be reliably identified as a person name.
[0059]
A specific configuration will be further described.
[0060]
The present invention is not limited to the above-described embodiment, and various modifications can be made without departing from its spirit. For example, in the above description the name-body characteristic is assigned by the characteristic assignment unit 12, but it may instead be assigned after the person-name boundaries are detected. Needless to say, the processes can be reordered or performed simultaneously as long as their functions are still realized.
[0061]
[Effects of the Invention]
As described above, according to the present invention, a person name can be reliably and easily distinguished from a morpheme string.
[Brief description of the drawings]
FIG. 1 is a block diagram showing an overall morpheme string processing apparatus according to an embodiment of the present invention.
FIG. 2 is a block diagram illustrating the chunking unit in FIG. 1.
FIG. 3 is a block diagram illustrating the boundary detection unit in FIG. 2.
FIG. 4 is a diagram illustrating an example of a Japanese surname.
FIG. 5 is a diagram illustrating an example of a Japanese name.
FIG. 6 is a diagram for explaining an example of chunking of a long title.
FIG. 7 is a diagram for explaining another example of chunking of a long title.
FIG. 8 is a diagram illustrating characteristics used for long title chunking.
FIG. 9 is a diagram for explaining characteristics for discriminating the separation of personal names.
FIG. 10 is a diagram for explaining rule 1 for discriminating person names.
FIG. 11 is a diagram for explaining rule 2 for discriminating person names.
FIG. 12 is a diagram for explaining rule 3 for discriminating person names.
FIG. 13 is a diagram for explaining rule 4 for discriminating person names.
FIG. 14 is a diagram for explaining rule 5 for discriminating person names.
[Explanation of symbols]
10 Japanese text input unit
11 Morpheme analysis unit
12 Characteristic assignment unit
13 Chunking unit
14 Person name extraction unit
15 Person name output unit
16 Morphological analysis dictionary
17 Chunking dictionary
21 Boundary detection unit
22 Rule application unit
23 Long title detection unit
24 Left boundary detection unit
25 Right boundary detection unit
100 Computer system
101 Recording medium

Claims

A morpheme string processing device comprising:
means for inputting a morpheme string;
means for assigning to the morpheme string the characteristics required by at least one of the following full-name discrimination rules; and
means for applying at least one of the plurality of full-name discrimination rules and discriminating, as a full name, the morpheme string part sandwiched between a part having the leftboundary characteristic and a part having the rightboundary characteristic,
wherein the full-name discrimination rules consist of
[Rule 1]: |leftboundary| familyname(placename), namebody, givenname |rightboundary|
[Rule 2]: |leftboundary| familyname(placename), givenname, namebody[grb:+] |rightboundary|
[Rule 3]: |leftboundary| familyname(placename), givenname |rightboundary|
[Rule 4]: |leftboundary| familyname(placename), namebody+ |rightboundary|
[Rule 5]: |leftboundary| namebody[flb:+], namebody+, givenname |rightboundary|
where leftboundary denotes the characteristic of a candidate for the boundary preceding a full name, rightboundary denotes the characteristic of a candidate for the boundary following a full name, familyname(placename) denotes the characteristic of a family-name or place-name candidate, givenname denotes the characteristic of a given-name candidate, namebody denotes the characteristic of a candidate for part of a full name, namebody[grb:+] denotes the characteristic of a full-name part candidate that follows a common given name, namebody[flb:+] denotes the characteristic of a full-name part candidate that precedes a common family name, and “+” after a characteristic denotes one or more repetitions.

A computer program for morpheme string processing that causes a computer to function as:
means for inputting a morpheme string;
means for assigning to the morpheme string the characteristics required by at least one of the following full-name discrimination rules; and
means for applying at least one of the plurality of full-name discrimination rules and discriminating, as a full name, the morpheme string part sandwiched between a part having the leftboundary characteristic and a part having the rightboundary characteristic,
wherein the full-name discrimination rules consist of
[Rule 1]: |leftboundary| familyname(placename), namebody, givenname |rightboundary|
[Rule 2]: |leftboundary| familyname(placename), givenname, namebody[grb:+] |rightboundary|
[Rule 3]: |leftboundary| familyname(placename), givenname |rightboundary|
[Rule 4]: |leftboundary| familyname(placename), namebody+ |rightboundary|
[Rule 5]: |leftboundary| namebody[flb:+], namebody+, givenname |rightboundary|
where leftboundary denotes the characteristic of a candidate for the boundary preceding a full name, rightboundary denotes the characteristic of a candidate for the boundary following a full name, familyname(placename) denotes the characteristic of a family-name or place-name candidate, givenname denotes the characteristic of a given-name candidate, namebody denotes the characteristic of a candidate for part of a full name, namebody[grb:+] denotes the characteristic of a full-name part candidate that follows a common given name, namebody[flb:+] denotes the characteristic of a full-name part candidate that precedes a common family name, and “+” after a characteristic denotes one or more repetitions.