JP2004171222A - Information extracting device and method and program

Info

Publication number: JP2004171222A
Application number: JP2002335520A
Authority: JP
Inventors: Eiji Murakami; 英治村上; Masamochi Kobata; 真望木幡
Original assignee: Azbil Corp
Current assignee: Azbil Corp
Priority date: 2002-11-19
Filing date: 2002-11-19
Publication date: 2004-06-17

Abstract

PROBLEM TO BE SOLVED: To provide an information extracting device, method, and program capable of accurately extracting information on desired contents from an input character string.
SOLUTION: Rule data indicates the correspondence between a plurality of rule morphemes and information content types, each of which indicates the type of information content of a rule morpheme. An information extracting means takes one or more input morphemes out of the input morphemes 21, obtained by decomposing an input character string 20 by part of speech, to form an input morpheme sequence 22. The input morpheme sequence 22 is collated with the rule morphemes of the rule data, and an input morpheme sequence 22 of a specific information content type is retrieved on the basis of the information content type associated with the rule morphemes that match it. The input morpheme sequence 22 of the specific information content type thus obtained is extracted as the desired information.
COPYRIGHT: (C)2004,JPO

Description

【０００１】
【発明の属する技術分野】
本発明は、情報抽出装置および方法、プログラムに関し、特に入力された文字列から特定の内容に関する所望の情報を抽出する情報抽出装置および方法、プログラムに関するものである。
【０００２】
【従来の技術】
コンピュータの普及に伴い、人間とコンピュータとのマンマシンインターフェースを実現する技術が注目されている。これら技術は、人間が持つ基本的なコミュニケーション手段を利用して、負担なくコンピュータと対話できることを目指している。
このような技術では、人間が日常的に使用する自然言語をコンピュータで自動的に解析する場合、人間が話した言葉を自動的に文字列へ変換する音声認識技術が用いられるが、このような音声認識処理とともに、文字列から所望の情報を抽出する情報抽出技術も重要となる。
【０００３】
従来、このような自然言語の文字列から所望の情報を抽出する技術として、形態素解析技術を用いたものが数多く提案されている。
形態素解析とは、自然言語からなる文字列を品詞単位で複数の単語へ分解することにより、その文字列の構成要素を解析するものである（例えば、非特許技術文献１など参照）。
一方、自然言語には、品詞レベルでの構文（文表現パターン）に特徴がある。従来の情報抽出技術では、自然言語からなる文字列を形態素解析し、得られた品詞レベルの構文の特徴を抽出することにより、自然言語から所望の文を抽出するようにしている（例えば、特許技術文献１など参照）。
【０００４】
なお、出願人は、本明細書に記載した先行技術文献情報で特定される先行技術文献以外には、本発明に関連する先行技術文献を出願時までに発見するには至らなかった。
【特許文献１】
特開平８−７７１９６号公報
【非特許文献１】
松本裕治ほか、”形態素解析システム茶筌”，奈良先端科学技術大学院大学，［平成１４年１１月１１日検索］，インターネット＜ＵＲＬ：ｈｔｔｐ／／ｃｈａｓｅｎ．ａｉｓｔ−ｎａｒａ．ａｃ．ｊｐ／ｉｎｄｅｘ．ｈｔｍｌ．ｊａ＞
【０００５】
【発明が解決しようとする課題】
しかしながら、このような従来の情報抽出技術では、品詞レベルでの構文の特徴すなわち文表現パターンを用いて、入力された文字列との照合を行っているため、予め用意した文の文表現パターンに近しい文を抽出できるものの、入力された文字列に含まれる、例えば日時、相手、場所、行動などを示す特定の内容に関する情報のみを精度よく抽出できないという問題点があった。
本発明はこのような課題を解決するためのものであり、入力された文字列から所望の内容に関する情報を精度よく抽出できる情報抽出装置および方法、プログラムを提供することを目的としている。
【０００６】
【課題を解決するための手段】
このような目的を達成するために、本発明にかかる情報抽出装置は、入力文字列を品詞単位の形態素に分解し、得られた形態素に基づき文字文字列から特定の情報内容に関する所望の情報を抽出する情報抽出装置において、特定の情報内容を含む任意の文字列を予め形態素に分解して得られた複数のルール形態素と当該ルール形態素の情報内容の種別を示す情報内容種別との対応関係を示すルールデータと、入力文字列を品詞単位で分解して得られた入力形態素から１つ以上の入力形態素を取り出して入力形態素列を構成し、この入力形態素列とルールデータの各ルール形態素とを照合することにより、当該入力形態素列と一致したルール形態素に対応付けられている情報内容種別に基づいて特定の情報内容種別の入力形態素列を検索し、得られた特定の情報内容種別の入力形態素列を所望の情報として抽出する情報抽出手段とを備えるものである。
【０００７】
入力形態素列を構成する際、情報抽出手段で、入力文字列の元の並びにしたがって入力形態素から連続して取り出した複数の入力形態素から入力形態素列を構成するようにしてもよい。
形態素列を検索する際、情報抽出手段で、特定の情報内容種別に対応する入力形態素列を得られなかった場合、入力形態素列を構成する入力形態素の数を減らして短くした新たな入力形態素列を用いて再照合するようにしてもよい。
【０００８】
形態素列を照合する際、情報抽出手段で、当該入力形態素列を構成する各入力形態素とルールデータの各ルール形態素とを、それぞれの形態素の文字の並びに基づき照合するようにしてもよい。あるいは、当該入力形態素列を構成する各入力形態素とルールデータの各ルール形態素とを、それぞれの形態素の品詞の並びに基づき照合するようにしてもよい。
形態素列を照合する際、情報抽出手段で、当該入力形態素列を構成する各入力形態素とルールデータの各ルール形態素とを、それぞれの形態素の文字の並びに基づき照合し、当該入力形態素列と一致するルール形態素が存在しなかった場合、当該入力形態素列を構成する各入力形態素とルールデータの各ルール形態素とを、それぞれの形態素の品詞の並びに基づき照合するようにしてもよい。
【０００９】
ルールデータの構成について、予め用意された事例文字列を品詞単位で分解して得られた複数のルール形態素と、これらルール形態素ごとに対応付けられた、当該ルール形態素が属する情報内容の種別を示す情報内容種別とからなるルールデータを用いてもよい。
【００１０】
また、本発明にかかる情報抽出方法は、入力文字列を品詞単位の形態素に分解し、得られた形態素に基づき文字文字列から特定の情報内容に関する所望の情報を抽出する情報抽出装置で用いられる情報抽出方法において、入力文字列を品詞単位で分解して得られた入力形態素から１つ以上の入力形態素を取り出して入力形態素列を構成する第１のステップと、特定の情報内容を含む任意の文字列を予め形態素に分解して得られた複数のルール形態素と当該ルール形態素の情報内容の種別を示す情報内容種別との対応関係を示すルールデータの各ルール形態素と、第１のステップで得られた入力形態素列とを照合することにより、当該入力形態素列と一致したルール形態素に対応付けられている情報内容種別に基づいて特定の情報内容種別の入力形態素列を検索する第２のステップと、検索により得られた入力形態素列を所望の情報として抽出する第３のステップとを備えるものである。
【００１１】
入力形態素列を構成する際、第１のステップで、入力文字列の元の並びにしたがって入力形態素から連続して取り出した複数の入力形態素から入力形態素列を構成するようにしてもよい。
形態素列を検索する際、第２のステップで、特定の情報内容種別に対応する入力形態素列を得られなかった場合、入力形態素列を構成する入力形態素の数を減らして短くした新たな入力形態素列を用いて再照合するようにしてもよい。
【００１２】
形態素列を照合する際、第２のステップで、入力形態素列を構成する各入力形態素とルールデータの各ルール形態素とを、それぞれの形態素の文字の並びに基づき照合するようにしてもよい。あるいは、当該入力形態素列を構成する各入力形態素とルールデータの各ルール形態素とを、それぞれの形態素の品詞の並びに基づき照合するようにしてもよい。
形態素列を照合する際、第２のステップで、当該入力形態素列を構成する各入力形態素とルールデータの各ルール形態素とを、それぞれの形態素の文字の並びに基づき照合し、当該入力形態素列と一致するルール形態素が存在しなかった場合、当該入力形態素列を構成する各入力形態素とルールデータの各ルール形態素とを、それぞれの形態素の品詞の並びに基づき照合するようにしてもよい。
【００１３】
ルールデータの構成について、予め用意された事例文字列を品詞単位で分解して得られた複数のルール形態素と、これらルール形態素ごとに対応付けられた、当該ルール形態素が属する情報内容の種別を示す情報内容種別とからなるルールデータを用いてもよい。
【００１４】
また、本発明にかかるプログラムは、入力文字列を品詞単位の形態素に分解し、得られた各形態素に基づき文字データから特定の情報内容に関する所望の情報を抽出する情報抽出装置のコンピュータで、前述した各情報抽出方法のいずれか１つを実行させるためのプログラムである。
【００１５】
【発明の実施の形態】
次に、本発明の実施の形態について図面を参照して説明する。
図１は本発明の一実施の形態にかかる情報抽出装置の構成を示すブロック図である。
情報抽出装置１０は、入力された自然言語の入力文字列を品詞単位の形態素に分解し、得られた各形態素に対応する情報内容の種別に基づき、入力文字列から特定の内容に関する所望の情報を抽出する装置である。
この情報抽出装置１０は、全体としてサーバ装置などのコンピュータから構成されており、入出力インターフェース部（以下、入出力Ｉ／Ｆ部という）１、操作入力部２、画面表示部３、記憶部４、および制御部５が設けられている。
【００１６】
入出力Ｉ／Ｆ部１は、通信回線６を介して接続された情報処理装置（図示せず）、あるいはＣＤ−ＲＯＭやフレキシブルディスクなどの記録媒体９との間で、入力文字列やその入力文字列から抽出した情報、さらにはプログラムなどの各種データをやり取りする回路部である。
操作入力部２は、キーボードやマウスなどからなり、入力文字列などの各種データや各種処理に対する指示を操作入力するための入力装置である。
画面表示部３は、ＬＣＤやＣＲＴなどからなり、入力文字列から抽出した情報や処理の状態を画面に表示出力する画面表示装置である。
【００１７】
記憶部４は、ハードディスクやメモリからなり、制御部５での情報抽出処理に用いる多数のルールデータ４Ａや、制御部５で実行されるプログラム４Ｂなど、制御部５での処理動作に用いる各種情報を記憶する記憶装置である。
ルールデータ４Ａは、自然言語の文字列を形態素解析して得られた複数のルール形態素と、そのルール形態素の情報内容を示す情報内容種別との対応関係を示すデータである。
プログラム４Ｂは、予め記録媒体９や通信回線６から入出力Ｉ／Ｆ部１を介して取り込まれ、記憶部４に格納される。
【００１８】
制御部５は、ＣＰＵなどのマイクロプロセッサとその周辺回路からなり、記憶部４のプログラム４Ｂを読み込んで実行することにより、そのプログラム４Ｂと自装置のハードウェア資源とを協働させて、情報抽出処理を行う機能手段を実現する。
この機能手段としては、情報抽出手段５Ａ、形態素解析手段５Ｂ、ルール生成手段５Ｃがある。
【００１９】
情報抽出手段５Ａは、入出力Ｉ／Ｆ部１や操作入力部２から入力された自然言語の文字列を形態素解析手段５Ｂにより形態素に解析し、得られた１つ以上の入力形態素からなる入力形態素列を単位として、記憶部４内の各ルールデータ４Ａと照合することにより、入力文字列から所望の内容の情報を抽出する機能手段である。
形態素解析手段５Ｂは、自然言語の文字列を、名詞、助詞、動詞などの品詞単位で複数の形態素に分解する機能手段である。
【００２０】
ルール生成手段５Ｃは、予め用意された自然言語の文字列からなる事例文字列を、形態素解析手段５Ｂにより形態素に解析し、得られた複数のルール形態素に、そのルール形態素が属する情報内容の種別すなわち情報内容種別を対応付けることによりルールデータ４Ａを生成し、これらルールデータ４Ａを記憶部４へ格納する機能手段である。
【００２１】
次に、図面を参照して、本実施の形態にかかる情報抽出装置１０の動作について説明する。
まず、図２を参照して、制御部５の形態素解析手段５Ｂで行われる形態素解析について概略説明する。図２は、形態素解析手段５Ｂで行われる形態素解析処理を示す説明図である。
本実施の形態にかかる情報抽出装置１０は、図２に示すような、自然言語の入力文字列２０から、特定の内容の情報を所望の情報３０として抽出することを目的としている。ここでは、特定の内容の情報として、日時、相手、場所、および行動に関する情報を抽出する例について説明する。
【００２２】
形態素解析とは、自然言語の文字列を、名詞、助詞、動詞などの品詞を単位とする形態素に分解する処理である。品詞とは、意味を持つ文字列を、その性質で分類した場合の名称であり、この品詞単位で分解される文字列の単位すなわち形態素が、意味を持つ最も短い文字列となる。
形態素解析手段５Ｂで行われる形態素解析処理については、前述した公知の形態素解析方法を用いればよい（非特許技術文献１など参照）。一般的な形態素解析処理では、予め品詞の具体例が多数登録された辞書を用いて、文字列を各形態素に分解している。
【００２３】
例えば、図２に示すように、「１０月１８日に村上さんと藤沢で打ち合わせる」という入力文字列２０を形態素解析した場合、文字データ２０Ａとその品詞２０Ｂ、２０Ｃからなる複数の形態素（以下では、入力文字列から得られた形態素を入力形態素という）２１が生成される。
例えば、「１０」という文字データ２０Ａに対して、「名詞」および「数詞」という品詞２０Ｂ，２０Ｃが割り当てられ、これらが組として１つの入力形態素２１を構成する。
なお、品詞については、例えば「名詞」には、「数詞」、「人名」、「地名」などの詳細な分類があり、分類が深いほど照合精度が向上するものの照合所要時間が増大する。この例では、２段階の深さの分類を用いているが、照合精度と照合所要時間とを考慮して分類の深さを任意に調整すればよい。
【００２４】
ここで、「１０月１８日に村上さんと藤沢で打ち合わせる」という入力文字列２０から、日時「１０月１８日」、相手「村上さん」、場所「藤沢」、および行動「打ち合わせる」という所望の情報３０を抽出する場合、入力文字列２０を構成するどの文字列がどの情報を示すのか、すなわちその情報内容種別２０Ｍを把握する必要がある。
【００２５】
本実施の形態では、このような品詞で分解して得られた複数の入力形態素２１から１つ以上の入力形態素を取り出して構成される入力形態素列２２が、有用な情報となる文字列を構成することに着目したものである。
そして、この入力形態素列２２を単位として、情報内容種別との関係が設定された複数のルール形態素を有するルールデータ４Ａと入力文字列２０に含まれる各入力形態素列２２とを照合することにより、特定の情報内容種別の入力形態素列を検索し、得られた特定の情報内容種別２０Ｍの入力形態素列を所望の情報として抽出するようにしたものである。
【００２６】
ここで、ルール形態素とは、任意の文字列、望ましくは抽出したい種類の情報を含む例文を形態素解析することによって得られた形態素である。ルールデータ４Ａは、複数の例文すなわち事例文字列から得られたルール形態素に対して、その文字列の情報内容を関連付けたものである。
また、有用な情報には、品詞の並びに特徴があることに着目し、入力文字列とルールデータとを照合する際、形態素列を構成する文字の並びを照合する方法のほかに、品詞の並びを照合するようにしている。
【００２７】
次に、図３および図４を参照して、ルールデータについて説明する。図３はルールデータの構成例である。図４はルール生成手段５Ｃでのルール生成処理を示すフローチャートである。
記憶部４は、ルールデータ４Ａとして多数のルールデータ４１，４２…が登録されている。ルールデータ４１には、各ルール形態素４０を構成する情報すなわち、文字（文字データ）４１Ａ、品詞４１Ｂ，４１Ｃ…と、この文字データ４１Ａが有する情報の情報内容種別４１Ｍが組として格納されている。他のルールデータ４２…もルールデータ４１と同じ構成をなしている。
【００２８】
このようなルールデータ４Ａは、制御部５のルール生成手段５Ｃにより生成され、記憶部４に格納される。
ルール生成手段５Ｃは、操作入力部２から指示に応じて、図４に示すルール生成処理を実行する。
【００２９】
まず、事例用として入力された入力文字列を形態素解析手段５Ｂにより形態素解析して複数のルール形態素に分解し（ステップ２００）、これらルール形態素に対して、個々のルール形態素の文字データが属する情報内容種別を設定する（ステップ２０１）。
この情報内容種別の設定は、事例用入力文字列に対して利用者が判断して行ってもよく、あるいは事例用入力文字列として、情報内容種別の情報を示す文字位置が既知の文字列を使用してもよい。
【００３０】
そして、各ルール形態素に対して、そのルール形態素の情報内容種別をそれぞれ関連付け、前述した図４のような構成で、ルールデータ４Ａとして記憶部４へ登録し（ステップ２０２）、一連のルール生成処理を終了する。
なお、図４では、情報内容種別の有無にかかわらず、事例用入力文字列に含まれるすべての形態素をルールデータ４Ａとして登録した場合を例として説明したがこれに限定されるものではない。
例えば、情報内容種別が明確なルール形態素またはその列だけをルールデータ４Ａとして登録してもよく、ルールデータ４Ａのサイズを削減でき、照合所要時間も短縮できる。
【００３１】
次に、図５〜図７を参照して、情報抽出手段５Ａでの情報抽出処理について説明する。図５は情報抽出処理における照合モードを示す表である。図６は情報抽出手段５Ａでの情報抽出処理を示すフローチャートである。図７は入力形態素列とルールデータとの照合処理を示すフローチャートである。
制御部５の情報抽出手段５Ａは、操作入力部２からの指示または入出力Ｉ／Ｆ部１を介した外部からの指示に応じて、図６の情報抽出処理を開始する。
まず、処理対象として操作入力部２または入出力Ｉ／Ｆ部１を介して外部から入力された入力文字列について、形態素解析手段５Ｂを用いて形態素解析を行う（ステップ１００）。
【００３２】
そして、入力文字列から抽出したい所望の情報のうち、未処理の情報内容種別を選択し（ステップ１０１）、所定の優先順位に基づき照合モードを選択する（ステップ１０２）。この照合モードとは、入力形態素列とルールデータとを照合する際の規則であり、各情報内容種別ごとに設定される。
ここでは両者の形態素列を「文字データ」または「品詞」のいずれで照合するかを示すマッチングレベルと、照合する形態素列の長さすなわち形態素列を構成する形態素の数との組み合わせにより、各照合モードが構成されている。
【００３３】
各照合モードでは、マッチングレベルと形態素列長との組み合わせにより得られる照合精度や照合所要時間が異なるため、これら性能に応じてその照合モードを用いる順序が優先順位として設定されており、選択した照合モードに基づく照合により入力文字列から所望の情報が得られなかった場合には、次の順位の照合モードが選択される。
【００３４】
例えば、図５には、日時情報用の照合モード表の例が示されており、この照合モード表の優先順位によれば、マッチングレベルとして「品詞」より「文字データ」での照合が優先的に行われ、各マッチングレベルごとに形態素列長として「４」が最初に用いられ、その後「１」まで順に短い形態素列長が用いられることになる。
一般に、形態素長が短いほど、照合の成功率は高くなるが、同時にノイズも増大する。したがって、照合にあたっては、同じマッチングレベルであれば、ある程度の形態素長から開始し、照合に成功しなければその形態素を短くして照合幅を狭くしながら照合を繰り返していけばよい。
【００３５】
また、同じ形態素長であれば、マッチングレベルすなわち各形態素の分類の深さを深いところから浅くしていき、最終的には品詞のみによる照合を行うことも考えられる。
なお、照合モードの順序については、予め記憶部４に登録されているものを情報抽出処理の際に参照してもよく、情報抽出処理のプログラムに作り込んでもよい。
【００３６】
このようにして、前述したステップ１０２で照合モードを選択した後、その照合モードに基づき図７の照合処理を実行することにより、ルールデータ４Ａを参照して、入力文字列から当該情報内容種別の入力形態素列を検索する（ステップ１０３）。
ここで、当該情報内容種別の入力形態素列の検索に成功した場合には（ステップ１０４：ＹＥＳ）、当該情報内容種別に対応する所望の情報としてその入力形態素列の文字を抽出する（ステップ１０５）。
【００３７】
そして、未処理の情報内容種別がある場合には（ステップ１０６：ＹＥＳ）、前述したステップ１０１へ戻って未処理の情報内容種別に関する所望情報の検索処理を繰り返し実行する。
ここで、未処理の情報内容種別がない場合には（ステップ１０６：ＮＯ）、それまでに抽出した所望の情報を画面表示部３へ表示出力し、あるいは入出力Ｉ／Ｆ部１を介して外部へ出力し（ステップ１０７）、一連の情報抽出処理を終了する。
また、前述したステップ１０４で、当該情報内容種別の入力形態素列の検索に失敗した場合には（ステップ１０４：ＮＯ）、未処理の照合モードがあるかどうか判断する（ステップ１０８）。
【００３８】
そして、未処理の照合モードがある場合には（ステップ１０８：ＹＥＳ）、前述したステップ１０２に戻って、次の優先順位の照合モードを選択し、マッチングレベルや入力形態素列の形態素長を変えながら所望情報の抽出処理を繰り返し実行する。
なお、未処理の照合モードがない場合には（ステップ１０８：ＮＯ）、当該情報内容種別に対応する情報が入力文字列に存在しないと判断して、前述したステップ１０６へ移行し、未処理の情報内容種別に関する処理を行う。
【００３９】
次に、図７を参照して、ステップ１０３での照合処理について詳細に説明する。
情報抽出手段５Ａでは、まず、前述のステップ１０２で選択した照合モードに基づき、マッチングレベルおよび形態素列長を設定し（ステップ１１０）、入力文字列から、その形態素列長分だけ未処理の入力形態素を入力形態素列として取得する（ステップ１１１）。
この際、入力文字列から得られた各形態素のうち、例えば入力文字列の元の並びに沿ってその先頭を取り出し開始位置として形態素列長分の形態素を取り出し、その後は順に１形態素ずつ取り出し開始位置を後方に移動させて、形態素列長分の形態素を順次取り出せばよい。
【００４０】
次に、ルールデータ４Ａから未処理のルールデータを選択し（ステップ１１２）、そのルールデータ４Ａ内に、入力形態素列と一致するルール形態素の列が内在しているかどうか照合する（ステップ１１３）。
この際、照合については、当該照合モードのマッチングレベルに基づき照合される。
すなわち、マッチングレベルが「文字データ」の場合には、入力形態素列の文字データの並びとルールデータ内のルール形態素の文字データの並びとが比較される。これに対して、照合モードのマッチングレベルが「品詞」の場合には、入力形態素列の品詞の並びとルールデータ内のルール形態素の品詞の並びとが比較される。
【００４１】
このようにして、ステップ１１３で照合が行われ、当該ルールデータ内に入力形態素列と一致するルール形態素の列が見つからなかった場合（ステップ１１３：ＮＯ）、未処理のルールデータがある場合には（ステップ１１４：ＹＥＳ）、前述したステップ１１２へ戻って未処理のルールデータを選択し、そのルールデータとの照合を繰り返し行う。
【００４２】
一方、未処理のルールデータがない場合（ステップ１１４：ＮＯ）、未処理の入力形態素列がある場合には（ステップ１１５：ＹＥＳ）、前述したステップ１１１へ戻って未処理の入力形態素列を取得し、その入力形態素列に対する照合を繰り返し行う。
また、未処理の入力形態素列がない場合（ステップ１１５：ＮＯ）、当該照合モードにおける当該情報内容種別の入力形態素列の検索に失敗したと判断し（ステップ１１６）、当該照合モードにおける一連の照合処理を終了する。
【００４３】
また、前述したステップ１１３において、当該ルールデータ内に入力形態素列と同じルール形態素の列が見つかった場合（ステップ１１３：ＹＥＳ）、その見つかった各ルール形態素に関連付けられている情報内容種別をチェックする（ステップ１１７）。
【００４４】
ここで、上記各ルール形態素の情報内容種別が所望の情報内容種別と一致しない場合には（ステップ１１７：ＮＯ）、前述したステップ１１４へ移行して、未処理のルールデータに対する処理を行う。
一方、見つかった各ルール形態素に関連付けられているすべての情報内容種別が所望の情報内容種別と一致する場合には（ステップ１１７：ＹＥＳ）、検索成功と判断し（ステップ１１８）、一連の照合処理を終了する。
【００４５】
このように、特定の情報内容を含む任意の文字列を予め形態素に分解して得られた複数のルール形態素と当該ルール形態素の情報内容の種別を示す情報内容種別との対応関係を示すルールデータ４Ａを設け、情報抽出手段５Ａで、入力文字列を品詞単位で分解して得られた入力形態素から１つ以上の入力形態素を取り出して入力形態素列を構成し、この入力形態素列とルールデータの各ルール形態素とを照合することにより、当該入力形態素列と一致したルール形態素に対応付けられている情報内容種別に基づいて特定の情報内容種別の入力形態素列を検索し、得られた特定の情報内容種別の入力形態素列を所望の情報として抽出するようにしたので、入力された文字列から所望の内容に関する情報を精度よく抽出できる。
【００４６】
このとき、情報抽出手段５Ａでは、入力文字列の元の並びにしたがって入力形態素から連続して取り出した複数の入力形態素から入力形態素列を構成するようにしたので、入力文字列の並びという情報を有効に利用でき、より高い精度で所望の情報を抽出できる。
また、情報抽出手段５Ａでは、特定の情報内容種別に対応する入力形態素列を得られなかった場合、入力形態素列を構成する入力形態素の数を減らして短くした新たな入力形態素列を用いて再検索するようにしたので、照合精度を優先しながら柔軟に所望の情報を抽出できる。
【００４７】
また、情報抽出手段５Ａでは、照合の際、当該入力形態素列を構成する各入力形態素とルールデータの各ルール形態素とを、それぞれの形態素の文字の並びに基づき照合するようにしたので、入力形態素列の文字の並びと一致する形態素列がルールデータ内に存在する場合にのみ、情報が抽出されることになり、高い精度で所望の情報を抽出できる。
また、情報抽出手段５Ａでは、照合の際、当該入力形態素列を構成する各入力形態素とルールデータの各ルール形態素とを、それぞれの形態素の品詞の並びに基づき照合するようにしたので、入力形態素列の品詞の並びと一致する形態素列がルールデータ内に存在する場合には、情報が抽出されることになり、全く等しい文字の並びがルールデータに存在しない場合でも、広い範囲で柔軟に所望の情報を抽出できる。
【００４８】
また、情報抽出手段５Ａでは、照合の際、当該入力形態素列を構成する各入力形態素とルールデータの各ルール形態素とを、それぞれの形態素の文字の並びに基づき照合し、当該入力形態素列と一致するルール形態素が存在しなかった場合、当該入力形態素列を構成する各入力形態素とルールデータの各ルール形態素とを、それぞれの形態素の品詞の並びに基づき照合するようにしたので、まず最初は文字の並びでの照合により高い精度での照合が行われ、入力文字列と全く同一の文字の並びがルールデータにない場合には、自動的にその品詞の並びでの照合により広い範囲で柔軟な照合が行われることになり、高い精度を考慮しつつ広い範囲で柔軟に所望の情報を抽出でき、所望の情報を抽出できる確率を向上させることができる。
【００４９】
また、ルール生成手段５Ｃでは、予め用意された事例文字列を品詞単位で分解して得られた複数のルール形態素と、これらルール形態素ごとに対応付けられた、当該ルール形態素が属する情報内容の種別を示す情報内容種別とからなるルールデータを生成するようにしたので、このルールデータを照合に用いることにより、情報内容種別が不明な入力文字列であっても、効率よくかつ精度よく所望の情報を抽出できる。
【００５０】
次に、図８を参照して、情報抽出動作の具体例について説明する。図８は情報抽出動作の具体例である。
ここでは、所望の情報として日時情報を入力文字列から抽出するものとし、照合モードとして、マッチングレベルが「文字データ」であり、形態素列長が「４」の照合モードを用いる場合を例として説明する。
入力文字列２０として「１０月１８日に村上さんと藤沢で打ち合わせる」という文字列が入力された場合、この入力文字列が「１０」〜「打ち合わせる」の１１個の形態素に分解される。この例では、マッチングレベルが「文字データ」なので、各形態素のうち文字データ２０Ａが処理対象となる。
【００５１】
前述した図７の照合処理では、この文字データ２０Ａから、形態素列長＝４個ずつ形態素が取り出され、取り出した形態素の文字データからまず入力形態素列５１が生成される（ステップ１１１）。なお、照合不一致に応じて、１形態素ずつその取り出し開始位置を移動させて入力形態素列５２，〜，５ｎが順次生成されることになる。
【００５２】
そして、取り出した入力形態素列５１が、ルールデータ４Ａの各ルールデータの文字データ４１Ａ，４２Ａ，〜に内在する各ルール形態素と照合される（ステップ１１３）。この例では、「１０月１８日」という入力形態素列５１が、ルールデータの文字データ４２Ａに存在しており、両者の文字データの並びが一致する。
このとき、ルールデータの文字データ４２Ａ内で見つかった各ルール形態素に、「日時」という情報内容種別４２Ｍが関連付けられており、それが所望の情報内容種別と一致することから（ステップ１１７）、当該入力形態素列の文字データ列が所望の日時情報として抽出される（図６：ステップ１０５）。
【００５３】
次に、図９を参照して、情報抽出動作の他の具体例について説明する。図９は情報抽出動作の他の具体例である。
ここでは、所望の情報として相手情報を入力文字列から抽出するものとし、照合モードとして、マッチングレベルが「品詞」であり、形態素列長が「２」の照合モードを用いる場合を例として説明する。
まず、入力文字列２０が形態素に分解される。この例では、マッチングレベルが「品詞」なので、「名詞」「名詞」「名詞」「名詞」「助詞」「名詞」「接尾辞」「助詞」「名詞」「助詞」「動詞」という入力文字列の品詞２０Ｂが処理対象となる。
【００５４】
前述した図７の照合処理では、この品詞２０Ｂから、形態素列長＝２個ずつ形態素が取り出され、その取り出した形態素の品詞からまず入力形態素列６１が生成される（ステップ１１１）。なお、照合不一致に応じて、１形態素ずつその取り出し開始位置を移動させて入力形態素列６２，〜，６ｋ，〜，６ｎが順次生成されることになる。
そして、取り出した入力形態素列６１が、ルールデータ４Ａの各ルールデータの品詞４１Ｂ，４２Ｂ，〜に内在する各ルール形態素と照合される（ステップ１１３）。この例では、その後生成された「名詞」「接尾辞」という入力形態素列６ｋが、ルールデータの品詞４１Ｂに存在しており、両者の品詞の並びが一致する。
【００５５】
このとき、ルールデータの品詞４１Ｂ内で見つかった各ルール形態素に、「相手」という情報内容種別４１Ｍが関連付けられており、それが所望の情報内容種別と一致することから（ステップ１１７）、当該入力形態素列の文字データ列が所望の相手情報として抽出される（図６：ステップ１０５）。
なお、品詞レベルでの形態素列の比較については、その品詞の分類の所定の深さで比較される。この際、ルールデータ４Ａとして用意されている分類の深さのうち、最も深いレベルから比較を開始し、不一致に応じて順に浅いレベルでの比較を行うようにしてもよく、照合精度を優先しながら柔軟に所望の情報を抽出できる。
【００５６】
【発明の効果】
以上説明したように、本発明は、特定の情報内容を含む任意の文字列を予め形態素に分解して得られた複数のルール形態素と当該ルール形態素の情報内容の種別を示す情報内容種別との対応関係を示すルールデータを設け、情報抽出手段で、入力文字列を品詞単位で分解して得られた入力形態素から１つ以上の入力形態素を取り出して入力形態素列を構成し、この入力形態素列とルールデータの各ルール形態素とを照合することにより、当該入力形態素列と一致したルール形態素に対応付けられている情報内容種別に基づいて特定の情報内容種別の入力形態素列を検索し、得られた特定の情報内容種別の入力形態素列を所望の情報として抽出するようにしたので、入力された文字列から所望の内容に関する情報を精度よく抽出できる。
【図面の簡単な説明】
【図１】本発明の一実施の形態にかかる情報抽出装置の構成を示すブロック図である。
【図２】形態素解析処理を示す説明図である。
【図３】ルールデータの構成例である。
【図４】ルールデータ生成処理を示すフローチャートである。
【図５】照合モード表の一例である。
【図６】情報抽出処理を示すフローチャートである。
【図７】照合処理を示すフローチャートである。
【図８】情報抽出動作の具体例である。
【図９】情報抽出動作の他の具体例である。
【符号の説明】
１０…情報抽出装置、１…入出力Ｉ／Ｆ部、２…操作入力部、３…画面表示部、４…記憶部、４Ａ…ルールデータ、４Ｂ…プログラム、５…制御部、５Ａ…情報抽出手段、５Ｂ…形態素解析手段、５Ｃ…ルール生成手段、６…通信回線、９…記録媒体、２０…入力文字列、２０Ａ…文字データ、２０Ｂ，２０Ｃ…品詞、２０Ｍ…情報内容種別、２１…入力形態素、２２…入力形態素列、３０…所望の情報、４０…ルール形態素、４１，４２…ルールデータ、４１Ａ，４２Ａ…文字データ、４１Ｂ，４１Ｃ，４２Ｂ，４２Ｃ…品詞、４１Ｍ，４２Ｍ…情報内容種別、５１，５２，５３，５ｎ…入力形態素列（文字データ）、６１，６２，６ｋ，６ｎ…入力形態素列（品詞）。
[0001]
[Technical Field of the Invention]
The present invention relates to an information extracting apparatus, method, and program, and more particularly to an information extracting apparatus, method, and program for extracting desired information relating to specific contents from an input character string.
[0002]
[Prior art]
With the spread of computers, techniques for realizing a man-machine interface between humans and computers have attracted attention. These techniques aim to let people interact with computers without effort, using the basic means of communication that humans already have.
When a computer automatically analyzes the natural language that people use every day, such techniques rely on speech recognition, which automatically converts spoken words into a character string. Alongside speech recognition, information extraction, which extracts desired information from the character string, is equally important.
[0003]
Many techniques that use morphological analysis have been proposed for extracting desired information from such natural-language character strings.
Morphological analysis decomposes a natural-language character string into a plurality of words by part of speech and thereby analyzes the constituent elements of the string (see, for example, Non-Patent Document 1).
Natural language, meanwhile, has characteristic syntax (sentence expression patterns) at the part-of-speech level. Conventional information extraction techniques perform morphological analysis on a natural-language character string and extract the features of the resulting part-of-speech-level syntax, thereby extracting desired sentences from the natural language (see, for example, Patent Document 1).
[0004]
The applicant had not found, by the time of filing, any prior art documents related to the present invention other than those specified by the prior art document information given in this specification.
[Patent Document 1]
JP-A-8-77196
[Non-patent document 1]
Yuji Matsumoto et al., "Morphological Analysis System ChaSen", Nara Institute of Science and Technology, [retrieved November 11, 2002], Internet <URL: http://chasen.aist-nara.ac.jp/index.html.ja>
[0005]
[Problems to be solved by the invention]
However, such conventional information extraction techniques collate the input character string against part-of-speech-level syntactic features, that is, sentence expression patterns. They can therefore extract sentences close to the sentence expression patterns prepared in advance, but they cannot accurately extract only the information on specific contents contained in the input character string, such as a date and time, a partner, a place, or an action.
The present invention is intended to solve this problem, and its object is to provide an information extracting device, method, and program capable of accurately extracting information on desired contents from an input character string.
[0006]
[Means for Solving the Problems]
To achieve this object, an information extraction device according to the present invention decomposes an input character string into morphemes by part of speech and, based on the obtained morphemes, extracts desired information on specific information content from the character string. The device comprises: rule data indicating the correspondence between a plurality of rule morphemes, obtained in advance by decomposing arbitrary character strings containing specific information content into morphemes, and information content types, each indicating the type of information content of a rule morpheme; and information extraction means that takes one or more input morphemes out of the input morphemes obtained by decomposing the input character string by part of speech to form an input morpheme sequence, collates this input morpheme sequence with the rule morphemes of the rule data, searches for an input morpheme sequence of a specific information content type based on the information content type associated with the rule morphemes that match the input morpheme sequence, and extracts the input morpheme sequence of the specific information content type thus obtained as the desired information.
[0007]
When forming the input morpheme sequence, the information extraction means may form it from a plurality of input morphemes taken consecutively from the input morphemes in the original order of the input character string.
When searching for a morpheme sequence, if the information extraction means cannot obtain an input morpheme sequence corresponding to the specific information content type, it may collate again using a new, shorter input morpheme sequence formed by reducing the number of input morphemes in the sequence.
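The two options in this paragraph, taking consecutive morphemes in their original order and retrying with a shorter sequence when nothing is found, can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: the function names, the `is_match` callback, and the sets `known_dates` and `known_places` are hypothetical, and the segmentation of the example sentence is assumed to have been produced by a morphological analyzer.

```python
from typing import Callable, List, Optional, Sequence


def windows(morphemes: Sequence[str], length: int) -> List[List[str]]:
    """Every run of `length` consecutive morphemes, in the original order."""
    return [list(morphemes[i:i + length])
            for i in range(len(morphemes) - length + 1)]


def search_with_shrinking(
    morphemes: Sequence[str],
    max_length: int,
    is_match: Callable[[List[str]], bool],
) -> Optional[List[str]]:
    """Try the longest sequences first; shorten by one morpheme on failure."""
    for length in range(max_length, 0, -1):
        for candidate in windows(morphemes, length):
            if is_match(candidate):
                return candidate
    return None  # nothing matched at any length


# Input morphemes for the example sentence (segmentation assumed).
tokens = ["10", "月", "18", "日", "に", "村上", "さん", "と", "藤沢", "で", "打ち合わせる"]

# Hypothetical stand-ins for "sequences that the rule data knows about".
known_dates = {("10", "月", "18", "日")}
known_places = {("藤沢",)}

date = search_with_shrinking(tokens, 4, lambda seq: tuple(seq) in known_dates)
place = search_with_shrinking(tokens, 4, lambda seq: tuple(seq) in known_places)
print(date)
print(place)
```

Starting from the longest sequence favors the more specific match; the single-morpheme place name is only found after lengths 4, 3, and 2 have failed.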
[0008]
When collating the morpheme sequence, the information extraction means may collate each input morpheme of the input morpheme sequence with each rule morpheme of the rule data based on the character sequence of the morphemes. Alternatively, it may collate them based on the part-of-speech sequence of the morphemes.
The information extraction means may also first collate each input morpheme of the input morpheme sequence with each rule morpheme of the rule data based on the character sequence of the morphemes and, if no rule morpheme matches the input morpheme sequence, then collate them based on the part-of-speech sequence of the morphemes.
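The fallback described here, character-level collation first and part-of-speech-level collation only if that fails, might look like the sketch below. The `Morpheme` type, the function names, the English part-of-speech labels, and the rule sentence built around "田中さん" are assumptions made for illustration; a single part-of-speech level is used to keep the example short.

```python
from typing import Callable, List, NamedTuple, Optional, Sequence


class Morpheme(NamedTuple):
    text: str  # character data
    pos: str   # part of speech (one level only in this sketch)


def contains_run(rule: Sequence[Morpheme], target: List[str],
                 key: Callable[[Morpheme], str]) -> bool:
    """True if `rule` contains consecutive morphemes whose keys equal `target`."""
    keys = [key(m) for m in rule]
    n = len(target)
    return any(keys[i:i + n] == target for i in range(len(keys) - n + 1))


def match_level(seq: Sequence[Morpheme],
                rules: Sequence[Sequence[Morpheme]]) -> Optional[str]:
    """Return "text" or "pos" for the first level that matches, else None."""
    levels = (("text", lambda m: m.text), ("pos", lambda m: m.pos))
    for name, key in levels:
        target = [key(m) for m in seq]
        if any(contains_run(rule, target, key) for rule in rules):
            return name
    return None


# One hypothetical rule sentence: 田中 / さん / と
rules = [[Morpheme("田中", "noun"), Morpheme("さん", "suffix"), Morpheme("と", "particle")]]

exact = [Morpheme("田中", "noun"), Morpheme("さん", "suffix")]
similar = [Morpheme("村上", "noun"), Morpheme("さん", "suffix")]
unrelated = [Morpheme("打ち合わせる", "verb")]

print(match_level(exact, rules))      # matches on characters
print(match_level(similar, rules))    # matches only on parts of speech
print(match_level(unrelated, rules))  # matches on neither
```

The character level is stricter and is tried first; the part-of-speech level lets a name that never appeared in the rule data still match the pattern noun + suffix.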
[0009]
As the rule data, rule data may be used that consists of a plurality of rule morphemes obtained by decomposing case character strings prepared in advance by part of speech, and of information content types, each associated with a rule morpheme and indicating the type of information content to which that rule morpheme belongs.
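One possible in-memory shape for such rule data is sketched below. The `RuleMorpheme` class, the `build_rule_data` helper, the case string "田中さんと会う", its segmentation, and the English labels are all hypothetical; the morphological analysis itself is assumed to have been done beforehand by an external analyzer.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class RuleMorpheme:
    text: str                    # character data
    pos: Tuple[str, ...]         # parts of speech, coarse to fine
    content_type: Optional[str]  # information content type, None if unlabeled


def build_rule_data(
    analyzed: Sequence[Tuple[str, Tuple[str, ...]]],
    content_types: Sequence[Optional[str]],
) -> List[RuleMorpheme]:
    """Pair each analyzed morpheme with its information content type."""
    if len(analyzed) != len(content_types):
        raise ValueError("one content type (or None) is needed per morpheme")
    return [RuleMorpheme(text, pos, ctype)
            for (text, pos), ctype in zip(analyzed, content_types)]


# Hypothetical case string "田中さんと会う", already analyzed.
analyzed = [
    ("田中", ("noun", "person name")),
    ("さん", ("suffix",)),
    ("と", ("particle",)),
    ("会う", ("verb",)),
]
labels = ["partner", "partner", None, "action"]

rule_data = build_rule_data(analyzed, labels)
for rm in rule_data:
    print(rm)
```

Keeping unlabeled morphemes (here the particle "と") with a `None` type preserves the original word order of the case string, at the cost of a larger rule set.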
[0010]
An information extraction method according to the present invention is used in an information extraction device that decomposes an input character string into morphemes by part of speech and, based on the obtained morphemes, extracts desired information on specific information content from the character string. The method comprises: a first step of taking one or more input morphemes out of the input morphemes obtained by decomposing the input character string by part of speech to form an input morpheme sequence; a second step of collating the input morpheme sequence obtained in the first step with the rule morphemes of rule data, which indicates the correspondence between a plurality of rule morphemes obtained in advance by decomposing arbitrary character strings containing specific information content into morphemes and information content types each indicating the type of information content of a rule morpheme, and thereby searching for an input morpheme sequence of a specific information content type based on the information content type associated with the rule morphemes that match the input morpheme sequence; and a third step of extracting the input morpheme sequence obtained by the search as the desired information.
[0011]
When forming the input morpheme sequence, the first step may form it from a plurality of input morphemes taken consecutively from the input morphemes in the original order of the input character string.
When searching for a morpheme sequence, if the second step cannot obtain an input morpheme sequence corresponding to the specific information content type, it may collate again using a new, shorter input morpheme sequence formed by reducing the number of input morphemes in the sequence.
[0012]
When collating the morpheme sequence, the second step may collate each input morpheme of the input morpheme sequence with each rule morpheme of the rule data based on the character sequence of the morphemes. Alternatively, it may collate them based on the part-of-speech sequence of the morphemes.
The second step may also first collate each input morpheme of the input morpheme sequence with each rule morpheme of the rule data based on the character sequence of the morphemes and, if no rule morpheme matches the input morpheme sequence, then collate them based on the part-of-speech sequence of the morphemes.
[0013]
As the rule data, rule data may be used that consists of a plurality of rule morphemes obtained by decomposing case character strings prepared in advance by part of speech, and of information content types, each associated with a rule morpheme and indicating the type of information content to which that rule morpheme belongs.
[0014]
A program according to the present invention causes the computer of an information extraction device, which decomposes an input character string into morphemes by part of speech and extracts desired information on specific information content from character data based on the obtained morphemes, to execute any one of the information extraction methods described above.
[0015]
[Embodiments of the Invention]
Next, embodiments of the present invention will be described with reference to the drawings.
FIG. 1 is a block diagram showing a configuration of an information extraction device according to one embodiment of the present invention.
The information extraction device 10 decomposes an input natural-language character string into morphemes by part of speech and, based on the information content type corresponding to each obtained morpheme, extracts desired information on specific contents from the input character string.
The information extraction device 10 as a whole consists of a computer such as a server device, and is provided with an input/output interface unit (hereinafter, input/output I/F unit) 1, an operation input unit 2, a screen display unit 3, a storage unit 4, and a control unit 5.
[0016]
The input/output I/F unit 1 is a circuit unit that exchanges various data, such as input character strings, information extracted from them, and programs, with an information processing device (not shown) connected via a communication line 6 or with a recording medium 9 such as a CD-ROM or a flexible disk.
The operation input unit 2 includes a keyboard, a mouse, and the like, and is an input device for performing operation input of various data such as input character strings and instructions for various processes.
The screen display unit 3 is a screen display device that includes an LCD, a CRT, and the like, and that displays information extracted from an input character string and a processing state on a screen.
[0017]
The storage unit 4 consists of a hard disk or memory and is a storage device that stores the various kinds of information used in the processing operations of the control unit 5, such as the many pieces of rule data 4A used for information extraction processing in the control unit 5 and the program 4B executed by the control unit 5.
The rule data 4A is data indicating the correspondence between a plurality of rule morphemes, obtained by morphological analysis of natural-language character strings, and information content types, each indicating the information content of a rule morpheme.
The program 4B is loaded in advance from the recording medium 9 or the communication line 6 via the input/output I/F unit 1 and stored in the storage unit 4.
[0018]
The control unit 5 consists of a microprocessor such as a CPU and its peripheral circuits. By reading and executing the program 4B in the storage unit 4, it makes the program 4B and the hardware resources of the device cooperate to realize the functional means that perform information extraction processing.
These functional means are the information extraction means 5A, the morphological analysis means 5B, and the rule generation means 5C.
[0019]
The information extraction means 5A is a functional means that has the morphological analysis means 5B analyze a natural-language character string input from the input/output I/F unit 1 or the operation input unit 2 into morphemes and, taking as its unit an input morpheme sequence made up of one or more of the obtained input morphemes, collates it with each piece of rule data 4A in the storage unit 4, thereby extracting information with the desired content from the input character string.
The morphological analysis means 5B is a functional means that decomposes a natural-language character string into a plurality of morphemes by part of speech, such as noun, particle, or verb.
[0020]
The rule generation means 5C is a functional means that has the morphological analysis means 5B analyze case character strings, which are natural-language character strings prepared in advance, into morphemes, generates rule data 4A by associating each of the obtained rule morphemes with the type of information content to which it belongs, that is, its information content type, and stores the rule data 4A in the storage unit 4.
[0021]
Next, the operation of the information extraction device 10 according to the present exemplary embodiment will be described with reference to the drawings.
First, the morphological analysis performed by the morphological analysis means 5B of the control unit 5 will be outlined with reference to FIG. 2. FIG. 2 is an explanatory diagram showing the morphological analysis processing performed by the morphological analysis means 5B.
The information extraction device 10 according to the present embodiment aims to extract information with specific contents, as desired information 30, from a natural-language input character string 20 such as that shown in FIG. 2. Here, an example is described in which information on a date and time, a partner, a place, and an action is extracted as the information with specific contents.
[0022]
Morphological analysis is a process of decomposing a natural-language character string into morphemes, units defined by parts of speech such as nouns, particles, and verbs. A part of speech is the name of a class obtained when meaningful character strings are classified by their properties, and the unit into which a string is decomposed by part of speech, that is, the morpheme, is the shortest character string that carries meaning.
For the morphological analysis performed by the morphological analysis unit 5B, the above-described known morphological analysis method may be used (see Non-Patent Document 1, etc.). In a general morphological analysis process, a character string is decomposed into morphemes using a dictionary in which many specific examples of part of speech are registered in advance.
[0023]
For example, as shown in FIG. 2, when the input character string 20 of “meeting with Mr. Murakami on October 18 at Fujisawa” is subjected to morphological analysis, a plurality of morphemes including character data 20A and its parts of speech 20B and 20C (hereinafter referred to as “morphological”) , A morpheme obtained from the input character string is referred to as an input morpheme) 21.
For example, parts of speech 20B and 20C of “noun” and “numeral” are assigned to character data 20A of “10”, and these constitute one input morpheme 21 as a set.
As for the part of speech, for example, “noun” includes detailed classifications such as “numerical”, “personal name”, and “place name”. The deeper the classification, the higher the matching accuracy, but the longer the matching time. In this example, a two-stage depth classification is used, but the classification depth may be arbitrarily adjusted in consideration of the matching accuracy and the matching required time.
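The structure of an input morpheme described above, character data plus a part-of-speech classification of adjustable depth, can be sketched in Python as follows. This is only an illustrative sketch: the `Morpheme` class, its field names, and the `pos_at` helper are assumptions made for this example and do not appear in the specification.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Morpheme:
    """Hypothetical container for one input morpheme 21."""
    text: str              # character data (20A)
    pos: Tuple[str, ...]   # part-of-speech path, coarse to fine (20B, 20C, ...)

    def pos_at(self, depth: int) -> Tuple[str, ...]:
        """Return the part-of-speech path truncated to the given depth."""
        return self.pos[:depth]


ten = Morpheme("10", ("noun", "numeral"))
murakami = Morpheme("Murakami", ("noun", "personal name"))
fujisawa = Morpheme("Fujisawa", ("noun", "place name"))

# At depth 1 all three are simply nouns; at depth 2 they are distinguished.
same_at_depth_1 = ten.pos_at(1) == murakami.pos_at(1) == fujisawa.pos_at(1)
same_at_depth_2 = murakami.pos_at(2) == fujisawa.pos_at(2)
print(same_at_depth_1, same_at_depth_2)
```

A shallower depth makes more morphemes compare equal, which corresponds to the trade-off between matching accuracy and matching time mentioned in the text.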
[0024]
Here, in order to extract the date and time “October 18”, the partner “Murakami”, the place “Fujisawa”, and the action “meeting” as desired information 30 from the input character string 20 “meeting with Murakami on October 18 at Fujisawa”, it is necessary to know which character string within the input character string 20 indicates which information, that is, its information content type 20M.
[0025]
The present embodiment focuses on the fact that an input morpheme string 22, formed by extracting one or more input morphemes from the plurality of input morphemes 21 obtained by such part-of-speech decomposition, constitutes a character string carrying useful information.
Using the input morpheme string 22 as the unit, each input morpheme string 22 contained in the input character string 20 is compared with the rule data 4A, which holds a plurality of rule morphemes set in association with information content types. An input morpheme string of a specific information content type is thereby searched for, and the input morpheme string of the specific information content type 20M thus obtained is extracted as the desired information.
[0026]
Here, a rule morpheme is a morpheme obtained by morphological analysis of an example sentence consisting of an arbitrary character string, preferably one containing information of the type to be extracted. The rule data 4A associates the rule morphemes obtained from a plurality of example sentences, i.e., case character strings, with the information content of those character strings.
Focusing on the fact that useful information has a part-of-speech sequence, the input character string is compared with the rule data not only by matching the arrangement of the characters constituting the morpheme string but also by collating the arrangement of the parts of speech.
[0027]
Next, the rule data will be described with reference to FIGS. 3 and 4. FIG. 3 is a configuration example of the rule data. FIG. 4 is a flowchart showing the rule generation processing in the rule generation means 5C.
A large number of rule data 41, 42,... Are registered in the storage unit 4 as rule data 4A. The rule data 41 stores information constituting each rule morpheme 40, that is, a character (character data) 41A, parts of speech 41B, 41C... And an information content type 41M of information included in the character data 41A as a set. The other rule data 42 have the same configuration as the rule data 41.
[0028]
Such rule data 4A is generated by the rule generation means 5C of the control unit 5 and stored in the storage unit 4.
The rule generation means 5C executes a rule generation process shown in FIG. 4 according to an instruction from the operation input unit 2.
[0029]
First, the input character string for the case is morphologically analyzed by the morphological analysis means 5B and decomposed into a plurality of rule morphemes (step 200), and an information content type is set for each rule morpheme (step 201).
The information content type may be set by the user for the case input character string, or a character string in which the character positions of the information of each information content type are known in advance may be used as the case input character string.
[0030]
Then, each rule morpheme is associated with its information content type and registered in the storage unit 4 as the rule data 4A in the configuration shown in FIG. 3 (step 202), and the series of rule generation processing ends.
Although FIG. 4 illustrates an example in which all morphemes included in the input character string for the case are registered as the rule data 4A regardless of the presence or absence of the information content type, the present invention is not limited to this.
For example, only rule morphemes having a clear information content type, or strings of such morphemes, may be registered as the rule data 4A. This reduces the size of the rule data 4A and the time required for collation.
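The rule generation steps 200 to 202 can be sketched as follows. This is a minimal sketch under stated assumptions: the analyzer output is passed in as ready-made `(text, parts of speech)` pairs rather than produced by a real morphological analyzer, and the names `RuleMorpheme`, `generate_rule_data`, and `keep_unlabeled` are hypothetical. The sample tokens are illustrative and are not taken from FIG. 3.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class RuleMorpheme:
    text: str                     # character data (41A)
    pos: Tuple[str, ...]          # parts of speech (41B, 41C, ...)
    content_type: Optional[str]   # information content type (41M); None if unlabeled


def generate_rule_data(
    analyzed: Sequence[Tuple[str, Tuple[str, ...]]],
    labels: Sequence[Optional[str]],
    keep_unlabeled: bool = True,
) -> List[RuleMorpheme]:
    """Build rule morphemes from an already analyzed case string.

    analyzed: (text, parts of speech) per morpheme, i.e. the result of step 200.
    labels:   one information content type (or None) per morpheme, step 201.
    Returns the rule morphemes that would be registered in step 202.
    """
    if len(analyzed) != len(labels):
        raise ValueError("one label per morpheme is required")
    rule = [RuleMorpheme(t, tuple(p), c) for (t, p), c in zip(analyzed, labels)]
    if not keep_unlabeled:
        # Optional reduction: register only morphemes with a clear content type.
        rule = [m for m in rule if m.content_type is not None]
    return rule


case = [("Tanaka", ("noun", "personal name")), ("san", ("suffix",)), ("to", ("particle",))]
labels = ["partner", "partner", None]
all_rules = generate_rule_data(case, labels)
labeled_only = generate_rule_data(case, labels, keep_unlabeled=False)
print(len(all_rules), len(labeled_only))
```

Dropping unlabeled morphemes shrinks the rule data, at the cost of losing the surrounding context that a longer morpheme string could otherwise match.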
[0031]
Next, the information extraction processing in the information extraction means 5A will be described with reference to FIGS. 5 to 7. FIG. 5 is a table showing the collation modes used in the information extraction processing. FIG. 6 is a flowchart showing the information extraction processing in the information extraction means 5A. FIG. 7 is a flowchart showing the collation processing between the input morpheme string and the rule data.
The information extraction means 5A of the control unit 5 starts the information extraction processing of FIG. 6 in response to an instruction from the operation input unit 2 or an external instruction via the input/output I/F unit 1.
First, the input character string to be processed, which is input from the outside via the operation input unit 2 or the input/output I/F unit 1, is morphologically analyzed using the morphological analysis unit 5B (step 100).
[0032]
Then, an unprocessed information content type is selected from desired information to be extracted from the input character string (step 101), and a collation mode is selected based on a predetermined priority (step 102). The collation mode is a rule for collating an input morpheme string with rule data, and is set for each information content type.
Each collation mode is defined by a combination of a matching level, which indicates whether the two morpheme strings are compared by “character data” or by “part of speech”, and the length of the morpheme string to be matched, that is, the number of morphemes constituting the morpheme string.
[0033]
The matching accuracy and the required matching time differ from one collation mode to another, depending on the combination of matching level and morpheme string length. The order in which the collation modes are used is therefore set as a priority order according to this performance. If the desired information cannot be obtained from the input character string by collation in the current mode, the collation mode of the next priority is selected.
[0034]
For example, FIG. 5 shows an example of a collation mode table for date and time information. According to the priorities in this table, collation using “character data” takes priority over collation using “part of speech” as the matching level. Within each matching level, a longer morpheme string length is used first, and successively shorter lengths are then used down to “1”.
In general, the shorter the morpheme string, the higher the success rate of matching, but the noise also increases. Therefore, at a given matching level, collation need only start from a certain morpheme string length and, if it does not succeed, be repeated while shortening the morpheme string and narrowing the matching width.
[0035]
If the morpheme string lengths are the same, the matching level, that is, the classification depth of each morpheme, may be moved from deep to shallow, so that collation using only the part of speech is performed last.
Note that the order of the matching modes may be referred to at the time of the information extraction processing, or may be incorporated in the information extraction processing program.
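The priority order described above (character data before part of speech, and longer morpheme strings before shorter ones within each level) can be sketched as a simple generated list. The function name `collation_modes` and the maximum length of 4 are assumptions for illustration; the actual contents of the table in FIG. 5 are not reproduced here.

```python
from typing import List, Sequence, Tuple

Mode = Tuple[str, int]  # (matching level, morpheme string length)


def collation_modes(
    max_length: int,
    levels: Sequence[str] = ("character data", "part of speech"),
) -> List[Mode]:
    """List collation modes in priority order.

    Levels are tried in the given order; within a level the morpheme
    string length decreases from max_length down to 1.
    """
    return [(level, n) for level in levels for n in range(max_length, 0, -1)]


modes = collation_modes(4)
for priority, (level, length) in enumerate(modes, start=1):
    print(priority, level, length)
```

Keeping the order in a data table rather than in control flow matches the remark that the order may either be referred to at extraction time or built into the program.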
[0036]
After the collation mode has been selected in step 102 in this manner, the collation processing of FIG. 7 is executed based on that mode to search for an input morpheme string of the selected information content type (step 103).
If the search for an input morpheme string of the information content type succeeds (step 104: YES), the characters of that input morpheme string are extracted as the desired information corresponding to the information content type (step 105).
[0037]
If there is an unprocessed information content type (step 106: YES), the process returns to step 101 to repeatedly execute a search process for desired information on the unprocessed information content type.
If there is no unprocessed information content type (step 106: NO), the desired information extracted so far is displayed on the screen display unit 3 or output to the outside via the input/output I/F unit 1 (step 107), and the series of information extraction processing ends.
If the search of the input morpheme string of the information content type fails in step 104 described above (step 104: NO), it is determined whether or not there is an unprocessed collation mode (step 108).
[0038]
If there is an unprocessed collation mode (step 108: YES), the process returns to step 102, the collation mode of the next priority is selected, and the extraction of the desired information is repeated with the changed matching level and morpheme string length.
If there is no unprocessed collation mode (step 108: NO), it is determined that the input character string contains no information corresponding to the information content type, and the process proceeds to step 106 described above to handle any remaining unprocessed information content types.
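The control flow of FIG. 6 (steps 101 to 108) amounts to two nested loops, one over information content types and one over collation modes in priority order. The sketch below assumes the collation of step 103 is supplied as a callable; `extract_information`, `demo_search`, and `demo_modes` are hypothetical names, and the demo search is a stand-in rather than the collation processing of FIG. 7.

```python
from typing import Callable, Dict, Iterable, List, Optional, Sequence, Tuple

Mode = Tuple[str, int]  # (matching level, morpheme string length)
Search = Callable[[Sequence[str], str, Mode], Optional[List[str]]]


def extract_information(
    morphemes: Sequence[str],
    content_types: Iterable[str],
    modes_for: Callable[[str], Iterable[Mode]],
    search: Search,
) -> Dict[str, List[str]]:
    results: Dict[str, List[str]] = {}
    for content_type in content_types:           # steps 101 and 106
        for mode in modes_for(content_type):     # steps 102 and 108
            found = search(morphemes, content_type, mode)   # step 103
            if found is not None:                # step 104: YES
                results[content_type] = found    # step 105
                break
        # If no mode succeeded, the content type is treated as absent.
    return results                               # output in step 107


def demo_search(morphemes, content_type, mode):
    # Stand-in: succeeds only for "date and time" once the mode has
    # fallen back to ("character data", 2).
    if content_type == "date and time" and mode == ("character data", 2):
        return list(morphemes[:2])
    return None


def demo_modes(content_type):
    return [("character data", 3), ("character data", 2), ("part of speech", 2)]


found = extract_information(
    ["10", "18", "Fujisawa"], ["date and time", "place"], demo_modes, demo_search
)
print(found)
```

In this demo the place is never found, so it is simply missing from the result, mirroring the branch at step 108: NO.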
[0039]
Next, the collation processing in step 103 will be described in detail with reference to FIG.
The information extraction means 5A first sets the matching level and the morpheme string length based on the collation mode selected in step 102 described above (step 110), and acquires from the input character string, as an input morpheme string, unprocessed input morphemes amounting to the morpheme string length (step 111).
At this time, for example, morphemes amounting to the morpheme string length may be extracted in their original order with the head of the input character string as the extraction start position, and the extraction start position may then be moved backward to extract further morphemes of the morpheme string length one string after another.
[0040]
Next, unprocessed rule data is selected from the rule data 4A (step 112), and it is checked whether or not a rule morpheme string that matches the input morpheme string is included in the rule data 4A (step 113).
At this time, collation is performed based on the matching level of the collation mode.
That is, when the matching level is “character data”, the arrangement of the character data of the input morpheme string is compared with the arrangement of the character data of the rule morpheme in the rule data. On the other hand, when the matching level in the matching mode is “part of speech”, the arrangement of the part of speech of the input morpheme sequence is compared with the arrangement of the part of speech of the rule morpheme in the rule data.
[0041]
When collation is performed in step 113 in this way and no rule morpheme string matching the input morpheme string is found in the rule data (step 113: NO), the process returns to step 112 described above if unprocessed rule data remains (step 114: YES), selects unprocessed rule data, and repeats the collation with that rule data.
[0042]
On the other hand, if there is no unprocessed rule data (step 114: NO), and if there is an unprocessed input morpheme string (step 115: YES), the process returns to step 111 to acquire an unprocessed input morpheme string. Then, the matching for the input morpheme sequence is repeatedly performed.
If there is no unprocessed input morpheme string (step 115: NO), it is determined that the search for an input morpheme string of the information content type in this collation mode has failed (step 116), and the series of collation processing for the mode ends.
[0043]
If the same rule morpheme string as the input morpheme string is found in the rule data in step 113 described above (step 113: YES), the information content type associated with each rule morpheme found is checked (step 117).
[0044]
Here, when the information content type of each rule morpheme does not match the desired information content type (step 117: NO), the process proceeds to step 114 described above, and processing is performed on unprocessed rule data.
On the other hand, if all the information content types associated with the rule morphemes found match the desired information content type (step 117: YES), the search is judged successful (step 118), and the series of collation processing ends.
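The collation processing of FIG. 7 can be sketched as a sliding window over the input morphemes that is compared against contiguous runs of rule morphemes. The sketch makes several assumptions: morphemes are plain tuples with a single part-of-speech level, each rule data entry is a list of rule morphemes in sentence order, and the function name `collate` and the sample tokens are hypothetical (the tokens are a simplified stand-in, not the tokenization shown in the figures).

```python
from typing import List, Optional, Sequence, Tuple

Morph = Tuple[str, str]                     # (character data, part of speech)
RuleMorph = Tuple[str, str, Optional[str]]  # ... plus information content type


def collate(
    inputs: Sequence[Morph],
    rules: Sequence[Sequence[RuleMorph]],
    level: str,
    length: int,
    wanted: str,
) -> Optional[List[str]]:
    """Return the character data of the first matching input morpheme string."""
    key = 0 if level == "character data" else 1          # step 110
    for start in range(len(inputs) - length + 1):        # step 111
        window = inputs[start:start + length]
        target = [m[key] for m in window]
        for rule in rules:                               # step 112
            for r in range(len(rule) - length + 1):      # step 113
                candidate = rule[r:r + length]
                if [m[key] for m in candidate] != target:
                    continue
                if all(m[2] == wanted for m in candidate):   # step 117
                    return [m[0] for m in window]            # step 118
    return None                                              # step 116


inputs = [("10", "noun"), ("month", "noun"), ("18", "noun"), ("day", "noun"),
          ("on", "particle"), ("Murakami", "noun"), ("san", "suffix"),
          ("with", "particle"), ("meet", "verb")]
rules = [
    [("Tanaka", "noun", "partner"), ("san", "suffix", "partner"), ("to", "particle", None)],
    [("10", "noun", "date and time"), ("month", "noun", "date and time"),
     ("18", "noun", "date and time"), ("day", "noun", "date and time")],
]

date = collate(inputs, rules, "character data", 4, "date and time")
partner = collate(inputs, rules, "part of speech", 2, "partner")
print(date, partner)
```

Note that the part-of-speech search first meets “noun”, “noun” windows that match the date rule by part of speech but are rejected at the content type check, and only succeeds at the “noun”, “suffix” window.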
[0045]
As described above, the present embodiment provides the rule data 4A, which indicates the correspondence between a plurality of rule morphemes, obtained by decomposing in advance an arbitrary character string containing specific information content into morphemes, and information content types indicating the type of information content of each rule morpheme. The information extraction means 5A extracts one or more input morphemes from the input morphemes obtained by decomposing the input character string by part of speech to form an input morpheme string, and collates the input morpheme string with each rule morpheme of the rule data. An input morpheme string of a specific information content type is searched for based on the information content type associated with the rule morphemes that match the input morpheme string, and the input morpheme string thus obtained is extracted as the desired information. Information relating to the desired content can therefore be extracted accurately from the input character string.
[0046]
At this time, the information extraction means 5A forms the input morpheme string from a plurality of input morphemes taken consecutively from the input morphemes in the original order of the input character string, so desired information can be extracted with higher accuracy.
When an input morpheme string corresponding to the specific information content type cannot be obtained, the information extraction means 5A searches again using a new, shorter input morpheme string formed by reducing the number of input morphemes constituting the input morpheme string. Desired information can therefore be extracted flexibly while giving priority to collation accuracy.
[0047]
In addition, when the information extraction means 5A collates each input morpheme constituting the input morpheme string with each rule morpheme of the rule data based on the character sequence of each morpheme, information is extracted only when a morpheme string matching the character sequence of the input character string exists in the rule data, so desired information can be extracted with high accuracy.
When the information extraction means 5A instead collates each input morpheme constituting the input morpheme string with each rule morpheme of the rule data based on the part of speech of each morpheme, information is extracted whenever a morpheme string matching the part-of-speech sequence exists in the rule data, so desired information can be extracted flexibly.
[0048]
In addition, at the time of collation, the information extraction means 5A first collates each input morpheme constituting the input morpheme string with each rule morpheme of the rule data based on the character sequence of each morpheme and, if no rule morpheme matching the input morpheme string exists, then collates them based on the part of speech of each morpheme. Thus, when the rule data contains no character sequence identical to that of the input character string, flexible matching over a wide range is performed automatically using the part-of-speech sequence. As a result, desired information can be extracted flexibly over a wide range while giving priority to high accuracy, and the probability of extracting the desired information is improved.
[0049]
Further, the rule generation means 5C generates rule data consisting of a plurality of rule morphemes, obtained by decomposing case character strings prepared in advance by part of speech, and an information content type associated with each rule morpheme and indicating the type of information content to which that rule morpheme belongs. By using this rule data for collation, desired information can be extracted efficiently and accurately even from an input character string whose information content types are unknown.
[0050]
Next, a specific example of the information extraction operation will be described with reference to FIG. 8. FIG. 8 shows a specific example of the information extraction operation.
Here, as an example, it is assumed that date and time information is to be extracted from the input character string as the desired information, and that the collation mode is set to a matching level of “character data” and a morpheme string length of “4”.
When the character string “meeting with Mr. Murakami on October 18 at Fujisawa” is input as the input character string 20, it is decomposed into 11 morphemes, from “10” to “meeting”. In this example, since the matching level is “character data”, the character data 20A of each morpheme is the processing target.
[0051]
In the collation processing of FIG. 7 described above, morphemes amounting to a morpheme string length of 4 are extracted from the character data 20A, and an input morpheme string 51 is first generated from the character data of the extracted morphemes (step 111). Each time collation fails, the extraction start position is moved by one morpheme and the input morpheme strings 52, ..., 5n are generated in turn.
[0052]
Then, the extracted input morpheme string 51 is collated with each rule morpheme included in the character data 41A, 42A,... Of each rule data of the rule data 4A (step 113). In this example, the input morpheme string 51 of “October 18” exists in the character data 42A of the rule data, and the arrangement of both character data matches.
At this time, each rule morpheme found in the character data 42A of the rule data is associated with the information content type 42M of “date and time”, which matches the desired information content type (step 117), so the character data string of the input morpheme string is extracted as the desired date and time information (FIG. 6: step 105).
[0053]
Next, another specific example of the information extraction operation will be described with reference to FIG. 9. FIG. 9 shows another specific example of the information extraction operation.
Here, it is assumed that partner information is to be extracted from the input character string as the desired information, and that the collation mode is set to a matching level of “part of speech” and a morpheme string length of “2”.
First, the input character string 20 is decomposed into morphemes. In this example, since the matching level is “part of speech”, the part-of-speech sequence “noun”, “noun”, “noun”, “noun”, “particle”, “noun”, “suffix”, “noun”, “noun”, “particle”, “verb” of the input morphemes is the processing target.
[0054]
In the collation processing of FIG. 7 described above, morphemes amounting to a morpheme string length of 2 are extracted from the parts of speech 20B, and an input morpheme string 61 is first generated from the parts of speech of the extracted morphemes (step 111). Each time collation fails, the extraction start position is moved by one morpheme and the input morpheme strings 62, ..., 6k, ... are generated in turn.
Then, the extracted input morpheme string is collated with the rule morphemes contained in the parts of speech 41B, 42B, ... of each rule data of the rule data 4A (step 113). In this example, the subsequently generated input morpheme string 6k of “noun” and “suffix” is present in the parts of speech 41B of the rule data, and the two part-of-speech arrangements match.
[0055]
At this time, each rule morpheme found in the parts of speech 41B of the rule data is associated with the information content type 41M of “partner”, which matches the desired information content type (step 117), so the character data string of the input morpheme string is extracted as the desired partner information (FIG. 6: step 105).
Note that morpheme strings are compared at the part-of-speech level at a predetermined depth of the part-of-speech classification. The comparison may start from the deepest classification level prepared in the rule data 4A and move to successively shallower levels each time collation fails, so that desired information can be extracted flexibly while the most specific classification is tried first.
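The deep-to-shallow comparison of part-of-speech classifications can be sketched as follows. This is an illustrative sketch only: each part of speech is assumed to be a tuple ordered from coarse to fine, and the function name `deepest_matching_depth` is hypothetical.

```python
from typing import Optional, Sequence, Tuple

PosPath = Tuple[str, ...]  # e.g. ("noun", "personal name"), coarse to fine


def deepest_matching_depth(
    input_pos: Sequence[PosPath],
    rule_pos: Sequence[PosPath],
    max_depth: int,
) -> Optional[int]:
    """Return the deepest classification depth at which both sequences agree.

    The comparison starts at max_depth and becomes shallower on each
    mismatch; None means the sequences differ even at depth 1.
    """
    if len(input_pos) != len(rule_pos):
        return None
    for depth in range(max_depth, 0, -1):
        if all(a[:depth] == b[:depth] for a, b in zip(input_pos, rule_pos)):
            return depth
    return None


murakami_san = [("noun", "personal name"), ("suffix",)]
tanaka_san = [("noun", "personal name"), ("suffix",)]
fujisawa_san = [("noun", "place name"), ("suffix",)]
iku_san = [("verb",), ("suffix",)]

print(deepest_matching_depth(murakami_san, tanaka_san, 2))
print(deepest_matching_depth(murakami_san, fujisawa_san, 2))
print(deepest_matching_depth(murakami_san, iku_san, 2))
```

Returning the depth rather than a plain yes or no lets a caller prefer matches found at a deeper, more specific classification level.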
[0056]
[Effects of the Invention]
As described above, the present invention provides rule data indicating the correspondence between a plurality of rule morphemes, obtained by decomposing in advance an arbitrary character string containing specific information content into morphemes, and information content types indicating the type of information content of each rule morpheme. The information extraction means extracts one or more input morphemes from the input morphemes obtained by decomposing the input character string by part of speech to form an input morpheme string, and collates the input morpheme string with each rule morpheme of the rule data. An input morpheme string of a specific information content type is searched for based on the information content type associated with the rule morphemes that match the input morpheme string, and the input morpheme string thus obtained is extracted as the desired information. Information relating to the desired content can therefore be extracted accurately from the input character string.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a configuration of an information extraction device according to an embodiment of the present invention.
FIG. 2 is an explanatory diagram showing a morphological analysis process.
FIG. 3 is a configuration example of rule data.
FIG. 4 is a flowchart illustrating a rule data generation process.
FIG. 5 is an example of a collation mode table.
FIG. 6 is a flowchart illustrating information extraction processing.
FIG. 7 is a flowchart illustrating a collation process.
FIG. 8 is a specific example of an information extraction operation.
FIG. 9 is another specific example of the information extracting operation.
[Explanation of symbols]
10 ... Information extraction device, 1 ... Input/output I/F unit, 2 ... Operation input unit, 3 ... Screen display unit, 4 ... Storage unit, 4A ... Rule data, 4B ... Program, 5 ... Control unit, 5A ... Information extraction means, 5B ... Morphological analysis means, 5C ... Rule generation means, 6 ... Communication line, 9 ... Recording medium, 20 ... Input character string, 20A ... Character data, 20B, 20C ... Part of speech, 20M ... Information content type, 21 ... Input morpheme, 22 ... Input morpheme string, 30 ... Desired information, 40 ... Rule morpheme, 41, 42 ... Rule data, 41A, 42A ... Character data, 41B, 41C, 42B, 42C ... Part of speech, 41M, 42M ... Information content type, 51, 52, 53, 5n ... Input morpheme strings (character data), 61, 62, 6k, 6n ... Input morpheme strings (part of speech).

Claims

An information extraction device that decomposes an input character string into morphemes by part of speech and extracts desired information relating to specific information content from the character string based on the obtained morphemes, the device comprising:
rule data indicating a correspondence between a plurality of rule morphemes, obtained by decomposing in advance an arbitrary character string containing the specific information content into morphemes, and information content types each indicating the type of information content of a rule morpheme; and
information extraction means for extracting one or more input morphemes from the input morphemes obtained by decomposing the input character string by part of speech to form an input morpheme string, collating the input morpheme string with each rule morpheme of the rule data, searching for an input morpheme string of a specific information content type based on the information content type associated with the rule morphemes that match the input morpheme string, and extracting the input morpheme string thus obtained as the desired information.

The information extraction device according to claim 1,
wherein the information extraction means forms the input morpheme string from a plurality of input morphemes taken consecutively from the input morphemes in the original order of the input character string.

The information extraction device according to claim 1,
wherein, when an input morpheme string corresponding to the specific information content type cannot be obtained, the information extraction means performs the collation again using a new, shorter input morpheme string formed by reducing the number of input morphemes constituting the input morpheme string.

The information extraction device according to claim 1,
wherein, at the time of the collation, the information extraction means collates each input morpheme constituting the input morpheme string with each rule morpheme of the rule data based on the character sequence of each morpheme.

The information extraction device according to claim 1,
wherein, at the time of the collation, the information extraction means collates each input morpheme constituting the input morpheme string with each rule morpheme of the rule data based on the part of speech of each morpheme.

The information extraction device according to claim 1,
wherein, at the time of the collation, the information extraction means collates each input morpheme constituting the input morpheme string with each rule morpheme of the rule data based on the character sequence of each morpheme and, when no rule morpheme matching the input morpheme string exists, collates each input morpheme constituting the input morpheme string with each rule morpheme of the rule data based on the part of speech of each morpheme.

The information extraction device according to claim 1,
wherein the rule data consists of a plurality of rule morphemes obtained by decomposing case character strings prepared in advance by part of speech, and an information content type associated with each rule morpheme and indicating the type of information content to which that rule morpheme belongs.

An information extraction method used in an information extraction device that decomposes an input character string into morphemes by part of speech and extracts desired information relating to specific information content from the character string based on the obtained morphemes, the method comprising:
a first step of extracting one or more input morphemes from the input morphemes obtained by decomposing the input character string by part of speech to form an input morpheme string;
a second step of collating the input morpheme string obtained in the first step with each rule morpheme of rule data, the rule data indicating a correspondence between a plurality of rule morphemes, obtained by decomposing in advance an arbitrary character string containing the specific information content into morphemes, and information content types each indicating the type of information content of a rule morpheme, and thereby searching for an input morpheme string of a specific information content type based on the information content type associated with the rule morphemes that match the input morpheme string; and
a step of extracting the input morpheme string obtained by the search as the desired information.

The information extraction method according to claim 8,
wherein, in the first step, the input morpheme string is formed from a plurality of input morphemes taken consecutively from the input morphemes in the original order of the input character string.

The information extraction method according to claim 8,
wherein, in the second step, when an input morpheme string corresponding to the specific information content type cannot be obtained, the collation is performed again using a new, shorter input morpheme string formed by reducing the number of input morphemes constituting the input morpheme string.

The information extraction method according to claim 8,
wherein, in the second step, at the time of the collation, each input morpheme constituting the input morpheme string is collated with each rule morpheme of the rule data based on the character sequence of each morpheme.

The information extraction method according to claim 8,
wherein, in the second step, at the time of the collation, each input morpheme constituting the input morpheme string is collated with each rule morpheme of the rule data based on the part of speech of each morpheme.

The information extraction method according to claim 8,
wherein, in the second step, at the time of the collation, each input morpheme constituting the input morpheme string is collated with each rule morpheme of the rule data based on the character sequence of each morpheme and, when no rule morpheme matching the input morpheme string exists, each input morpheme constituting the input morpheme string is collated with each rule morpheme of the rule data based on the part of speech of each morpheme.

The information extraction method according to claim 8,
wherein the first step uses, as the rule data, rule data consisting of a plurality of rule morphemes obtained by decomposing case character strings prepared in advance by part of speech, and an information content type associated with each rule morpheme and indicating the type of information content to which that rule morpheme belongs.

15. A program for causing a computer of an information extraction device, which decomposes an input character string into morphemes by part of speech and extracts desired information relating to specific content from the character string based on the obtained morphemes, to execute the information extraction method according to any one of the method claims above.