JP4014130B2

JP4014130B2 - Glossary generation device, glossary generation program, and glossary search device

Info

Publication number: JP4014130B2
Application number: JP2001289477A
Authority: JP
Inventors: 一郎山田; 正啓柴田; 伸行八木
Original assignee: Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2001-09-21
Filing date: 2001-09-21
Publication date: 2007-11-28
Anticipated expiration: 2021-09-21
Also published as: JP2003099429A

Description

【０００１】
【発明の属する技術分野】
本発明は、テキストデータから、用語及びその用語を解説したデータを抽出する用語集生成装置及び用語集生成プログラム、並びに用語からその用語の解説データを検索する用語集検索装置に関する。
【０００２】
【従来の技術】
従来、自然言語のテキストデータから、用語及びその用語を定義した解説データを抽出する方法としては、文の表層的な特徴を表わした表層パターンのマッチングに基づく方法が知られている。
【０００３】
この方法では、例えば、用語をαとし、用語の定義文をβとしたとき、「αとはβである」、「αはβ」といった表層パターンに基づいて、テキストデータのマッチングを行なうことで、前記表層パターンにマッチングした用語及びその用語を定義する解説データを抽出することができる。
【０００４】
【発明が解決しようとする課題】
しかしながら、実際の放送番組等で使用されるニュース原稿を調査したところ、前記した「αとはβである」、「αはβ」といった表層パターンは、ニュース原稿全体の７．７％しか使用されておらず、前記ニュース原稿に含まれる用語及びその用語の解説データを抽出した用語集を生成するには、データ量として不充分であるという問題があった。
【０００５】
また、用語を定義するには、その用語に係る連体修飾節を用いる場合がある。この連体修飾節を用いる場合、例えば、「住民票の取得や、企業が行なう許認可などの手続きを役所に出向かなくてもインターネットでできるようにする電子政府の実現などの…」というニュース原稿において、連体修飾節である「住民票の取得や、企業が行なう許認可などの手続きを役所に出向かなくてもインターネットでできるようにする」は、用語である「電子政府」を定義している。
【０００６】
このように、連体修飾節により用語を定義している場合は、前記従来の技術における表層パターンのマッチングでは、用語の解説データを抽出することができないという問題があった。
【０００７】
本発明は、前記した技術的問題点に鑑みてなされたものであり、自然言語のテキストデータから、連体修飾節に基づいて、用語及びその用語を定義する解説データを抽出する用語集生成装置及び用語集生成プログラム、並びに用語からその用語の解説データを検索する用語集検索装置を提供することを目的とする。
【０００８】
【課題を解決するための手段】
本発明は、前記目的を達成するために創案されたものであり、まず、請求項１に記載の用語集生成装置は、以下の構成にかかるものとした。
すなわち、入力された自然言語のテキストデータを形態素解析及び構文解析を行なうことで、前記テキストデータの文節の係り受け情報を生成する係り受け解析手段と、前記テキストデータから、名詞または名詞句となる文字列を用語データとして抽出する用語データ抽出手段と、前記係り受け情報と、用語データを言い換える特定の言い換え表現とに基づいて、前記テキストデータから、前記用語データの上位概念を示す概念データを抽出する概念データ抽出手段と、予め連体修飾節が用語を定義する説明文となるときの特徴となる学習データを登録した学習データベースと、前記係り受け情報と前記学習データとに基づいて、前記用語データに係る連体修飾節が前記用語データの定義となっているかを判断し、定義と判断された連体修飾節を修飾データとして抽出する修飾データ抽出手段と、前記修飾データに前記概念データを連結することで、前記用語データを定義する解説データを生成する解説データ生成手段と、を備える構成とした。
【０００９】
かかる構成によれば、用語集生成装置は、係り受け解析手段によって、入力された自然言語のテキストデータを形態素解析及び構文解析を行なうことで、このテキストデータの文節の係り受け情報を生成し、用語データ抽出手段によって、このテキストデータから、名詞または名詞句となる文字列を用語データとして抽出する。なお、この名詞または名詞句となる文字列は、例えば構文解析により抽出する。そして、概念データ抽出手段によって、前記係り受け情報と、用語データを言い換える特定の言い換え表現とに基づいて、抽出した前記用語データに対する上位概念を示す概念データを前記テキストデータから抽出する。
【００１０】
さらに、用語集生成装置は、修飾データ抽出手段によって、前記係り受け情報と、予め連体修飾節が用語を定義する説明文となるときの特徴となる学習データを登録した学習データベースの学習データとに基づいて、前記用語データを定義する連体修飾節を修飾データとして抽出し、解説データ生成手段によって、前記修飾データに前記概念データを連結することで、前記用語データを定義する解説データを生成する。
【００１１】
これにより、用語集生成装置は、入力された自然言語のテキストデータに対して、形態素解析や構文解析により係り受け解析を行なうことで、用語データと、その用語データの上位概念を示す概念データと、用語データを定義する修飾データを前記テキストデータから抽出し、概念データと修飾データとに基づいて、用語データを定義する解説データを生成する。
【００１２】
さらに、請求項２に記載の用語集生成装置は、請求項１に記載の用語集生成装置において、用語データ及びその用語データに対応した複数の概念データを登録する概念データベースを備え、前記概念データ抽出手段が、前記複数の概念データから、その出現頻度に基づいて、前記用語データに対応した１つの概念データを確定する構成とした。
【００１３】
かかる構成によれば、用語集生成装置は、概念データ抽出手段によって、用語データ及びその用語データに対応した複数の概念データを登録した概念データベースの概念データから、その出現頻度を参照し、用語データに対応した１つの概念データを確定する。これにより、用語集生成装置は、概念データベース内の用語データに対応した複数の概念データのうちで、出現頻度の最も高い概念データを、その用語データに最も適した概念データとして確定する。
【００１４】
また、請求項３に記載の用語集生成装置は、請求項１または請求項２に記載の用語集生成装置において、学習データベースが、連体修飾節が用語を定義する説明文となるときの、その用語に直接係る動詞とその直前の助詞とを学習データとして登録する構成とした。
【００１５】
かかる構成によれば、用語集生成装置は、学習データベースによって、連体修飾節が用語を定義する説明文となるときの、その用語に直接係る動詞とその直前の助詞とを学習データとして登録する。これにより、用語集生成装置は、複数の連体修飾節の中から、用語を定義する説明文である連体修飾節を、その用語に直接係る動詞とその直前の助詞との組合せにより特定する。
【００１６】
さらに、請求項４に記載の用語集生成装置は、請求項１乃至請求項３のいずれか１項に記載の用語集生成装置において、用語データ及びその用語データに対応した解説データを蓄積する解説データ蓄積手段を備える構成とした。
【００１７】
かかる構成によれば、用語集生成装置は、生成した用語データ及びその用語データに対応した解説データを解説データ蓄積手段に蓄積し、これにより、解説データ蓄積手段に、入力された自然言語のテキストデータから抽出した用語集のデータベースを構築する。
【００１８】
また、請求項５に記載の用語集生成装置は、請求項１乃至請求項４のいずれか１項に記載の用語集生成装置において、入力されるテキストデータは、ニュース原稿のデータである構成とした。
【００１９】
かかる構成によれば、用語集生成装置は、ニュース原稿のデータを外部から入力することで、一般的な用語以外に、ニュース原稿に含まれる難解な用語、新語、造語から用語の解説データを生成する。
【００２０】
また、請求項６に記載の用語集生成プログラムは、入力された自然言語のテキストデータと、連体修飾節が用語を定義する説明文となるときの特徴となる学習データを登録した学習データベースとから、前記テキストデータ内の用語を定義する解説データを生成するために、コンピュータを、以下の手段により機能させるように構成した。
【００２１】
すなわち、テキストデータを形態素解析及び構文解析を行なうことで、前記テキストデータの文節間の係り受け情報を生成する係り受け解析手段、前記テキストデータから、名詞や名詞句との少なくとも１つを用語データとして抽出する用語データ抽出手段、前記係り受け情報と特定の言い換え表現とに基づいて、前記テキストデータから、前記用語データの上位概念を示す概念データを抽出する概念データ抽出手段、前記係り受け情報と前記学習データとに基づいて、前記用語データを定義する連体修飾節を修飾データとして抽出する修飾データ抽出手段、前記修飾データに前記概念データを連結することで、前記用語データを定義する解説データを生成する解説データ生成手段とした。
【００２２】
かかる構成によれば、用語集生成プログラムは、係り受け解析手段によって、テキストデータを形態素解析及び構文解析を行なうことで、前記テキストデータの文節間の係り受け情報を生成し、用語データ抽出手段によって、前記テキストデータから、名詞または名詞句となる文字列を用語データとして抽出し、概念データ抽出手段によって、前記係り受け情報と特定の言い換え表現とに基づいて、前記テキストデータから、前記用語データの上位概念を示す概念データを抽出し、修飾データ抽出手段によって、前記係り受け情報と前記学習データとに基づいて、前記用語データを定義する連体修飾節を修飾データとして抽出し、解説データ生成手段によって、前記修飾データに前記概念データを連結することで、前記用語データを定義する解説データを生成する。
【００２３】
これにより、用語集生成プログラムは、入力された自然言語のテキストデータに対して、形態素解析や構文解析により係り受け解析を行なうことで、用語データと、その用語データの上位概念を示す概念データと、用語データを定義する修飾データを前記テキストデータから抽出し、概念データと修飾データとに基づいて、用語データを定義する解説データを生成する。
【００２４】
さらに、請求項７に記載の用語集検索装置は、テキストデータから用語データを説明する解説データを生成する請求項４に記載の用語集生成装置と、用語データを入力する入力手段と、用語データに基づいて、解説データ蓄積手段に蓄積された解説データを検索する解説データ検索手段と、を備える構成とした。
【００２５】
かかる構成によれば、用語集検索装置は、用語集生成装置によって、入力されたテキストデータから用語データを説明する解説データを生成し、入力手段によって、用語データを入力し、解説データ検索手段によって、用語集生成装置内の解説データ蓄積手段に蓄積された解説データを検索し、出力手段によって、前記検索結果を出力する。これにより、用語集検索装置は、入力されたテキストデータから用語データを説明する解説データを生成する用語集生成装置に、ユーザからの用語データの問合せに対して、その用語データに対応する解説データを検索して、検索結果を出力するユーザインターフェース機能を有する。
【００２６】
【発明の実施の形態】
以下、本発明の実施の形態を図面に基づいて詳細に説明する。
（第一の実施形態：用語集生成装置の構成）
図１は、本発明における第一の実施形態に係る用語集生成装置の全体構成を示すブロック図である。図１に示すように、用語集生成装置１は、テキストデータであるニュース原稿を入力し、そのニュース原稿に含まれる用語データと、その用語データを定義する解説データとを生成する装置である。
【００２７】
この用語集生成装置１は、係り受け解析手段１１、用語データ抽出手段１２、概念データ抽出手段１３、概念データベース１４、修飾データ抽出手段１５、学習データベース１６、解説データ生成手段１７及び解説データ蓄積手段１８を備えて構成されている。また、ニュース原稿は、外部のニュース原稿データベース３から、テキストデータとして入力されるものとする。さらに、用語集生成装置１には、ニュース原稿データベース３から、ニュース原稿の範囲指定（例えば、２０００年のニュース原稿のみを対象等）を行なう入力装置（図示せず）が外部に接続されている。
【００２８】
係り受け解析手段１１は、入力装置（図示せず）から入力されたニュース原稿の範囲指定に基づいて入力したニュース原稿を、形態素解析と構文解析とにより文節単位に分解し、その文節の文字列と、文節の係り受け関係とを係り受け情報として生成し、用語データ抽出手段１２、概念データ抽出手段１３及び修飾データ抽出手段１５へ通知する。
【００２９】
この係り受け解析手段１１は、形態素解析によって、ニュース原稿から意味を担う最小の言語単位である形態素を同定し、構文解析によって、名詞句、動詞句などの文節及びその係り受け関係を同定する。なお、この形態素解析や構文解析は、公知の技術によって実現することができる。
【００３０】
用語データ抽出手段１２は、係り受け解析手段１１で生成された係り受け情報に基づいて、その係り受け情報に含まれる文節から、用語となる文節を抽出し、用語データとして出力する。この用語データは、概念データ抽出手段１３、修飾データ抽出手段１５、解説データ生成手段１７へ通知される。
【００３１】
ここで、用語となる文節の抽出は、その文節が名詞または名詞句である場合に、その文節を用語であると認定して行なう。もし、用語として認定する文節が存在しなければ、用語データは、ヌル文字列として出力する。
【００３２】
また、これ以外にも、用語集生成装置１に入力されるニュース原稿のテキストデータで、用語として扱う文字列を表層的な情報により抽出することも可能である。例えば、ニュース原稿のテキストデータで、鉤括弧等で囲まれた名詞句を重要部分と判断することで、用語データ抽出手段１２は、鉤括弧で囲まれた文字列を用語として抽出することができる。
【００３３】
概念データ抽出手段１３は、係り受け解析手段１１で生成された係り受け情報と、用語データ抽出手段１２で抽出された用語データとに基づいて、その用語データの上位概念を示す文節を概念データとして抽出する。
【００３４】
この概念データは、用語データとともに概念データベース１４に登録される。このとき、すでに同一の用語データが登録されている場合は、複数の概念データを登録することを可能とし、さらにその用語データに対応する概念データも同一の場合は、その概念データの出現数を更新（インクリメント）する。
【００３５】
そして、概念データ抽出手段１３は、概念データベース１４から、用語データに対応する概念データを読み込み、解説データ生成手段１７へ通知する。このとき、用語データに対応する概念データが複数存在する場合は、その概念データの中で最も出現数が多い概念データを、解説データ生成手段１７へ通知する。
【００３６】
なお、この概念データとは、「という」とか「と呼ばれる」などの特定の言い換え表現（特定表現）によって、用語データを言い換えたものである。ここで、言い換えたと判断する材料として、係り受け解析手段１１で生成された係り受け情報を使用する。
【００３７】
すなわち、「文節Ａ→文節Ｂ」の矢印（→）が係り受け関係を示し、文節Ａが係り元、文節Ｂが係り先であるとすると、入力された文字列が「（用語データ）→という→（名詞（句））」や、「（用語データ）→と呼ばれる→（名詞（句））」という係り受け関係であったとき、この名詞（句）が用語データの上位概念を示すと判断する。実際のニュース原稿から、この特定表現による概念データを抽出した結果を図３に示す。例えば、「顔文字という表現」という文字列で、用語データは「顔文字」（Ｄ１）であり、概念データは「表現」（Ｄ２）である。
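The dependency pattern described in paragraph 【００３７】 ("(term) → という / と呼ばれる → (noun or noun phrase)") can be sketched as follows. This is a minimal illustration, not the patent's implementation: the representation of the dependency information as parallel lists (`phrases`, `heads`, `noun_flags`) and the function name `extract_concept` are assumptions made for the example, and a real system would obtain these lists from a morphological analyzer and dependency parser.

```python
# Specific paraphrase expressions named in the text.
SPECIFIC_EXPRESSIONS = ("という", "と呼ばれる")


def extract_concept(phrases, heads, noun_flags, term):
    """Return the superordinate concept of `term`, or None if the pattern is absent.

    phrases    : list of phrase strings (hypothetical parser output)
    heads      : heads[i] is the index of the phrase that phrase i depends on, -1 if none
    noun_flags : noun_flags[i] is True when phrase i is a noun or noun phrase
    """
    for i, phrase in enumerate(phrases):
        if phrase != term:
            continue
        j = heads[i]                      # phrase the term depends on
        if j < 0 or phrases[j] not in SPECIFIC_EXPRESSIONS:
            continue
        k = heads[j]                      # phrase the specific expression depends on
        if k >= 0 and noun_flags[k]:
            return phrases[k]             # this noun (phrase) is the concept data
    return None


# Example from the text: 「顔文字という表現」
phrases = ["顔文字", "という", "表現"]
heads = [1, 2, -1]
noun_flags = [True, False, True]
concept = extract_concept(phrases, heads, noun_flags, "顔文字")
```

With the example from the text, the term 「顔文字」 (D1) yields the concept 「表現」 (D2).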
【００３８】
概念データベース１４は、用語データと概念データとを対応させて登録してあるデータベースで、概念データ抽出手段１３が、用語データ及び概念データの登録及び参照を行なう。この用語データと概念データとは１対多の関係を持ち、１つの用語データに対して、複数の概念データが登録される。また、個々の概念データには、その出現数が付与されており、概念データ抽出手段１３が、その出現数を更新及び参照を行なう。なお、概念データベース１４は、ハードディスク等の記憶手段によって構成されている。
【００３９】
修飾データ抽出手段１５は、係り受け解析手段１１で生成された係り受け情報と、用語データ抽出手段１２で抽出された用語データと、学習データベース１６に登録してある学習データとに基づいて、その用語データを修飾する連体修飾節を抽出し、その連体修飾節が用語データを定義しているかどうかを判定し、その判定結果に基づいて、連体修飾節を修飾データとして解説データ生成手段１７へ通知する。
【００４０】
この修飾データ抽出手段１５は、連体修飾節が用語データを定義しているかどうかの判定を、前記連体修飾節の用語データに直接係る動詞と、その直前の助詞との２項組を学習データとして登録した学習データベース１６に基づいて行なう。なお、この学習データは、その動詞に類似性を持つ動詞を同時に登録しておき、類似した単語を木構造に分類した同義語・類語素リストである公知のシソーラス（Ｔｈｅｓａｕｒｕｓ）として学習データベース１６上に構築しておく。
【００４１】
そこで、用語データに係る連体修飾節中の動詞を「ｖ」、そして、前記シソーラスとして、この動詞「ｖ」と同じグループに属する動詞集合を「ｖｇ１」、動詞「ｖ」の親ノードに属する動詞集合を「ｖｇ２」、動詞「ｖ」の直前の助詞を「ｐ」とし、また、前記動詞集合「ｖｇ１」及び「ｖｇ２」の動詞「ｖ」の類似度に対する重み付け係数をそれぞれ「ｗ_a」，「ｗ_b」、動詞集合「ｖｇ」と助詞「ｐ」が学習データ中に出現した回数を「ｎ（ｖｇ，ｐ）」、その期待値を「ｅ（ｖｇ，ｐ）」としたとき、連体修飾節が用語データを修飾するいわゆる用語定義節であるかどうかを判定する指標値「ｗｅｉｇｈｔ（ｖ，ｐ）」を、（１）式のように定義する。
【００４２】
【数１】

【００４３】
この（１）式において、ｎ（ｖｇ１，ｐ）＜ｅ（ｖｇ１，ｐ）のときは、第一項を０とし、また、ｎ（ｖｇ２，ｐ）＜ｅ（ｖｇ２，ｐ）のときは、第二項を０とする。この指標値ｗｅｉｇｈｔ（ｖ，ｐ）が、予め設定された閾値よりも大きいときに、連体修飾節が用語データを定義する説明文であると判定する。
【００４４】
実験結果として、２００１年６月のニュース原稿から用語データを抽出した結果の一部を図４に示す。なお、この実験においては、１５２９５個の学習データを与え、ｗ_a＝０．６７，ｗ_b＝０．３３、閾値を１．０として、用語定義節を抽出している。
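The thresholding logic around the index value weight(v, p) can be sketched as below. Equation (1) itself is not reproduced in this text, so the exact form of its two terms is unknown; the per-term function `excess_score` (a chi-square-style excess of the observed count over its expected value) is purely an assumption for illustration and should be replaced by the actual terms of equation (1). What does come from the text is the structure: a weighted sum of two terms with coefficients w_a and w_b, each term set to 0 when n < e, a comparison against a preset threshold, and the experimental values w_a = 0.67, w_b = 0.33, threshold = 1.0.

```python
def excess_score(n, e):
    """ASSUMED stand-in for one term of equation (1); not taken from the patent.

    n : number of times (verb group, particle) appears in the learning data
    e : expected value of that count
    """
    if n < e or e <= 0:
        return 0.0          # the text states the term is 0 when n < e
    return (n - e) ** 2 / e


def weight(n1, e1, n2, e2, w_a=0.67, w_b=0.33, score=excess_score):
    """Weighted sum over vg1 (same thesaurus group) and vg2 (parent node)."""
    return w_a * score(n1, e1) + w_b * score(n2, e2)


def is_definition_clause(n1, e1, n2, e2, threshold=1.0):
    """A clause is judged to define the term when weight exceeds the threshold."""
    return weight(n1, e1, n2, e2) > threshold
```

Passing the scoring function as a parameter keeps the assumed term separate from the parts of the procedure that the text does specify.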
【００４５】
学習データベース１６は、修飾データ抽出手段１５で説明したように、連体修飾節の用語データに直接係る動詞と、その直前の助詞との２項組を学習データとして登録したデータベースで、ハードディスク等の記憶手段で構成されている。
【００４６】
解説データ生成手段１７は、用語データ抽出手段１２で抽出された用語データと、概念データ抽出手段１３で抽出された概念データと、修飾データ抽出手段１５で抽出された修飾データとに基づいて、用語データと、その用語データを定義する解説データとを生成する。
【００４７】
前記説明した修飾データ抽出手段１５で抽出された修飾データは、図４に示すように、動詞の連体形で文が終了しており、定義文としては適切でない。そのため、解説データ生成手段１７では、「修飾データ（連体修飾節）＋概念データ（上位概念）」によって、解説データを生成する。例えば、図３において用語データが「顔文字」（Ｄ１）、概念データが「表現」（Ｄ２）、そして、図４において、前記「顔文字」に対応する修飾データが「パソコンや携帯電話でやり取りする電子メールについて記号などを使って感情を表す」（Ｄ３）であったとき、「顔文字」に対応する解説データは、前記修飾データと概念データを連結して、「パソコンや携帯電話でやり取りする電子メールについて記号などを使って感情を表す表現」となる。
【００４８】
また、解説データ生成手段１７は、用語データの修飾データは存在するが、概念データが存在しない場合は、用語データ自体を形態素解析したときの最終形態素が、概念データであるかを判定する。この判定は、例えば、最終形態素で上位概念となりやすいデータを、予め概念データベース１４に概念データとして登録しておくことで判断することができる。
【００４９】
ここで、この最終形態素が、概念データである場合は、「修飾データ（連体修飾節）＋用語データの最終形態素」によって、解説データを生成する。例えば、図４において、用語データ「司法制度改革推進法」（Ｄ４）に対する概念データが存在しない場合は、「司法制度改革推進法」（Ｄ４）の最終形態素である「法」を概念データとして、前記用語データの連体修飾節「司法制度改革を推進するための体制を定める」（Ｄ５）に付加することで、「司法制度改革を推進するための体制を定める法」という解説データを生成する。
【００５０】
また、解説データ生成手段１７は、用語データの修飾データは存在するが、概念データが存在せず、その最終形態素も概念データと判定できない場合は、「修飾データ（連体修飾節）＋もの（こと）」によって、解説データを生成する。例えば、図４において、用語データ「レッドデータブック」（Ｄ６）に対する概念データが存在しない場合は、「レッドデータブック」（Ｄ６）の連体修飾節「絶滅の恐れのある野鳥を記録した」（Ｄ７）に「もの（こと）」を付加することで、「絶滅の恐れのある野鳥を記録したもの（こと）」という解説データを生成する。
なお、この生成された解説データは、解説データ蓄積手段１８に用語データと対にして蓄積される。
【００５１】
解説データ蓄積手段１８は、解説データ生成手段１７から出力される用語データと解説データとを対にして蓄積する蓄積手段で、ハードディスク等で構成される。ここで蓄積されたデータは、用語データから解説データを参照することが可能な用語集検索用のデータベースとして使用することができる。
【００５２】
以上、一実施形態に基づいて本発明に係る用語集生成装置１の構成について説明したが、本発明はこれに限定されるものではなく、例えば、解説データ蓄積手段１８を備えず、解説データ生成手段１７が、用語データと解説データとをデータとして外部に出力し、外部に蓄積手段を備えた形態であっても構わない。
【００５３】
また、概念データベース１４を備えず、概念データ抽出手段１３が、係り受け解析手段１１から入力された係り受け情報と、用語データ抽出手段１２から入力された用語データとに基づいて、その係り受け情報毎に概念データを抽出する形態であっても構わない。
【００５４】
また、用語集生成装置１の入力データは、ニュース原稿に限ったものではなく、一般的なテキストデータであればよい。例えば、雑誌の原稿を蓄積したデータベースからその原稿をテキストデータとして入力することで、最新の雑誌の用語データに対する解説データを出力することができる。
【００５５】
（第一の実施形態：用語集生成装置の動作）
次に、図１、図５〜図８に基づいて、用語集生成装置１の動作について説明する。図５は、用語集生成装置１全体の動作を示すフローチャートである。図６は、概念データ抽出手段１３の動作を示すフローチャートである。図７は、修飾データ抽出手段１５の動作を示すフローチャートである。図８は、解説データ生成手段１７の動作を示すフローチャートである。
まず、図５のフローチャートに基づいて、用語集生成装置１全体の動作を説明する。
【００５６】
まず最初に、外部から入力されたニュース原稿を、形態素解析と構文解析とにより文節単位に分解し、その文節の文字列と、文節の係り受け関係とを係り受け情報として生成する（ステップａ１）。
【００５７】
そして、前記係り受け情報に基づいて、係り受け情報に含まれる文節から、その文節が名詞または名詞句であることを判定して、用語となる文節を用語データとして抽出する（ステップａ２）。
【００５８】
ここで、ステップａ２で抽出された用語データがあるかどうかを判定し（ステップａ３）、用語データがない場合（Ｎｏ）は、動作を終了する。一方、用語データがある場合（Ｙｅｓ）は、ステップａ４へ進む。
【００５９】
そして、前記係り受け情報と、前記用語データとに基づいて、「という」とか「と呼ばれる」などの特定表現により、その用語データの上位概念を示す文節を概念データとして抽出する（ステップａ４）。
【００６０】
また、前記係り受け情報と、前記用語データと、学習データベースに登録されている学習データとに基づいて、その用語データを修飾する連体修飾節が、用語データを定義しているかどうかを判定し、その判定結果に基づいて、連体修飾節を修飾データとして抽出する（ステップａ５）。
【００６１】
そして、前記概念データと、前記修飾データとに基づいて、前記用語データを定義する解説データを生成する（ステップａ６）。
【００６２】
次に、図６のフローチャートに基づいて、概念データ抽出手段１３の動作について説明する。なお、本フローチャートは、図５におけるステップａ４を詳細に説明したものである。
【００６３】
まず最初に、入力された係り受け情報に基づいて、この係り受け情報の各文節の中で、「という」とか「と呼ばれる」などの特定表現があるかどうかを判定し（ステップｂ１）、特定表現がない場合（Ｎｏ）は、入力された用語データが、すでに概念データベース１４にあるかどうか（登録されているかどうか）を判定し（ステップｂ２）、概念データベース１４にない（登録されていない）場合（Ｎｏ）は、この用語データに対応した概念データが存在しないものとして処理を終了する。一方、概念データベース１４に用語データに対応する概念データがある（登録されている）場合（Ｙｅｓ）は、ステップｂ５へ進む。
【００６４】
また、ステップｂ１において、特定表現がある場合（Ｙｅｓ）は、その特定表現に基づいて、その係り受け関係から、用語データの上位概念である概念データを抽出する（ステップｂ３）。
【００６５】
そして、用語データと、ステップｂ３で抽出した概念データを概念データベース１４に登録する（ステップｂ４）。このとき、すでに同一の用語データが登録されている場合は、複数の概念データを登録することを可能とし、さらにその用語データに対応する概念データも同一の場合は、その概念データの出現数を更新（インクリメント）する。
【００６６】
そして、前記用語データに対応した概念データのうちで、前記出現数に基づいて、最も出現頻度の高い概念データを、前記用語データの概念データとして出力する（ステップｂ５）。
【００６７】
なお、用語集生成装置１において、概念データベース１４が存在しない場合は、ステップｂ１の判断で、特定表現がない場合（Ｎｏ）は、用語データに対応した概念データが存在しないものとして処理を終了する。一方、特定表現がある場合（Ｙｅｓ）は、ステップｂ３において、その特定表現に基づいて、係り受け関係から、用語データの上位概念である概念データを抽出・出力して処理を終了することで、概念データの抽出動作を簡略化することも可能である。
【００６８】
次に、図７のフローチャートに基づいて、修飾データ抽出手段１５の動作について説明する。なお、本フローチャートは、図５におけるステップａ５を詳細に説明したものである。
【００６９】
まず最初に、入力された係り受け情報と、用語データとに基づいて、この係り受け情報の各文節の中で、用語データに係る連体修飾節があるかどうかを判定し（ステップｃ１）、用語データに係る連体修飾節がない場合（Ｎｏ）は、用語データに係る修飾データがないものとして処理を終了する。
【００７０】
一方、ステップｃ１において、用語データに係る連体修飾節がある場合（Ｙｅｓ）は、この連体修飾節が用語データを定義する説明文であるかどうかを判定する指標値を前記（１）式に基づいて算出する（ステップｃ２）。
【００７１】
そして、前記指標値と予め設定したしきい値との比較を行ない（ステップｃ３）、指標値がしきい値以下の場合（Ｎｏ）は、この連体修飾節は用語データを定義するものではなく、修飾データがないものとして処理を終了する。一方、指標値がしきい値よりも大きい場合（Ｙｅｓ）は、この連体修飾節は、用語データに係る修飾データであると判定して、この連体修飾節を修飾データとして出力する（ステップｃ４）。
【００７２】
次に、図８のフローチャートに基づいて、解説データ生成手段１７の動作について説明する。なお、本フローチャートは、図５におけるステップａ６を詳細に説明したものである。
【００７３】
まず最初に、用語データを定義した説明文である修飾データがあるかどうかを判定する（ステップｄ１）。ここで修飾データがない場合（Ｎｏ）は、さらに用語データの上位概念である概念データがあるかどうかを判定し（ステップｄ２）、概念データがない場合（Ｎｏ）は、解説データの変数に「なし」（ヌルデータ）を設定して（ステップｄ３）、ステップｄ１０へ進む。一方、ステップｄ２で概念データがある場合（Ｙｅｓ）は、解説データの変数に「概念データ」の文字列を設定して（ステップｄ４）、ステップｄ１０へ進む。
【００７４】
また、ステップｄ１において、用語データを定義した説明文である修飾データがある場合（Ｙｅｓ）は、さらに、用語データの上位概念である概念データがあるかどうかを判定する（ステップｄ５）。ここで概念データがある場合（Ｙｅｓ）は、解説データの変数に「修飾データ＋概念データ」の文字列を設定して（ステップｄ６）、ステップｄ１０へ進む。
【００７５】
一方、ステップｄ５で、用語データの上位概念である概念データがない場合（Ｎｏ）は、用語データの最終形態素が概念データであるかどうかを判定し（ステップｄ７）、最終形態素が概念データである場合（Ｙｅｓ）は、解説データの変数に「修飾データ＋最終形態素」の文字列を設定して（ステップｄ８）、ステップｄ１０へ進む。
【００７６】
また、ステップｄ７において、用語データの最終形態素が概念データでない場合（Ｎｏ）は、解説データの変数に「修飾データ＋もの（こと）」の文字列を設定して（ステップｄ９）、ステップｄ１０へ進む。
【００７７】
そして、用語データと、前記処理によって設定された解説データを出力する（ステップｄ１０）。
以上の動作によって、用語集生成装置１は、ニュース原稿から、用語データと、その用語データを定義する解説データを生成することができる。
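The branching of steps d1 to d10 (FIG. 8) can be summarized in a short sketch. The function name `generate_explanation` and the parameter `known_final_concepts` are hypothetical. In particular, the text decides whether the final morpheme of a term is concept data by morphological analysis plus a lookup in the concept database 14; the suffix match used here is only an assumed simplification of that check.

```python
def generate_explanation(modifier, concept, term, known_final_concepts):
    """Return the explanation string for `term`, or None when nothing is available.

    modifier             : adnominal modifier clause judged to define the term, or None
    concept              : superordinate concept of the term, or None
    known_final_concepts : final morphemes pre-registered as likely concepts (assumed set)
    """
    if modifier is None:                       # step d1: no modifier data
        return concept                         # d4: concept alone, or d3: None
    if concept is not None:                    # step d5: concept data exists
        return modifier + concept              # d6: modifier + concept
    # step d7: does the term end in a morpheme registered as concept data?
    final = max((f for f in known_final_concepts if term.endswith(f)),
                key=len, default=None)
    if final is not None:
        return modifier + final                # d8: modifier + final morpheme
    return modifier + "もの（こと）"            # d9: generic fallback


finals = {"法"}
ex1 = generate_explanation(
    "パソコンや携帯電話でやり取りする電子メールについて記号などを使って感情を表す",
    "表現", "顔文字", finals)
ex2 = generate_explanation(
    "司法制度改革を推進するための体制を定める", None, "司法制度改革推進法", finals)
ex3 = generate_explanation(
    "絶滅の恐れのある野鳥を記録した", None, "レッドデータブック", finals)
```

The three calls reproduce the examples given in the text for 「顔文字」 (D1), 「司法制度改革推進法」 (D4), and 「レッドデータブック」 (D6).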
【００７８】
なお、用語集生成装置１は、コンピュータにおいて各機能をプログラムで実現することも可能であり、各機能プログラムを結合して用語集生成プログラムとして動作させることも可能である。
【００７９】
（第二の実施形態：用語集検索装置）
図２は、本発明における第二の実施形態に係る用語集検索装置の全体構成を示すブロック図である。図２に示すように、用語集検索装置２は、テキストデータであるニュース原稿を入力し、そのニュース原稿に含まれる用語データと、その用語データを定義する解説データとを生成するとともに、操作者が問合せを行なった用語データに対する解説データを出力する装置である。
【００８０】
この用語集検索装置２は、図１に示した用語集生成装置１に、解説データ検索手段２１、入力手段２２及び出力手段２３が付加されて構成されている。解説データ検索手段２１、入力手段２２及び出力手段２３以外の構成は、図１に示したものと同一の符号を付し、説明を省略する。さらに、用語集検索装置２には、キーボード等の入力装置４と、ＣＲＴ等の出力装置５とが外部に接続されている。
【００８１】
解説データ検索手段２１は、入力手段２２から入力されたテキストデータである用語データに基づいて、解説データ蓄積手段１８に蓄積されている前記用語データに対応した解説データを検索し、検索された解説データをテキストデータとして出力手段２３へ出力する。
【００８２】
入力手段２２は、キーボード、マウス等の入力装置４から操作者が入力した入力データを入力し解析を行なう。この入力データとして、用語データが入力された場合は、この用語データを解説データ検索手段２１に出力する。
【００８３】
また、入力データが、ニュース原稿の範囲指定、例えば、「２０００年のニュース原稿」を指定した場合は、その範囲指定データを用語集生成装置１に通知することで、用語集生成装置１が、その範囲内のニュース原稿をニュース原稿データベース３から入力して、用語データを抽出し、その用語データを定義する解説データを生成し、前記用語データ及び解説データをテキストデータとして出力手段２３へ出力する。
【００８４】
出力手段２３は、用語集生成装置１で生成される用語データ及び解説データ、並びに解説データ検索手段２１で検索後に出力される解説データを、ＣＲＴ等の出力装置５へ、出力データとして出力する。
【００８５】
図９に用語集検索装置２に接続された出力装置５の表示例５０を示す。図９の画面は、操作者が用語データを入力し、それに対応した解説データを表示するニュース原稿の用語解説を行なうアプリケーションの画面例である。
【００８６】
例えば、操作者が用語入力欄５０ａに、入力装置４を介して用語データを入力し、ＲＥＴＵＲＮキーを押下する、あるいはマウスで検索ボタン５０ｂをクリックすることで、用語集検索装置２は、解説データ蓄積手段１８に蓄積されている前記用語データに対応する解説データを検索し、解説表示欄５０ｄに解説データを表示する。ここで、クリアボタン５０ｃは、用語入力欄５０ａに入力された用語データを一括して削除するためのボタンで、終了ボタン５０ｅは、ニュース原稿の用語解説を行なうアプリケーションを終了するためのボタンである。
【００８７】
このように、用語集検索装置２は、ニュース原稿から、用語データとその用語データを定義する解説データとを生成するとともに、操作者が問合せを行なった用語データに対する解説データを出力するインターフェースを有することで、常に用語の最新の解説を素早く検索することができる。
【００８８】
以上、一実施形態に基づいて本発明に係る用語集検索装置２について説明したが、本発明はこれに限定されるものではなく、例えば、ネットワークインターフェースを有し、ネットワークを介して遠隔地の操作者が、用語データの解説データを検索することも可能である。
【００８９】
【発明の効果】
以上説明したとおり、本発明に係る用語集生成装置及び用語集生成プログラム並びに用語集検索装置では、以下に示す優れた効果を奏する。
【００９０】
請求項１に記載の発明によれば、用語集生成装置は、入力された自然言語のテキストデータに対して、形態素解析や構文解析により係り受け解析を行なうことで、用語データと、その用語データの上位概念を示す概念データと、用語データを定義する修飾データを前記テキストデータから抽出し、概念データと修飾データとに基づいて、用語データを定義する解説データを生成することができる。
【００９１】
これにより、入力された自然言語のテキストデータから、用語集の元となる用語データを抽出することができ、さらに、その用語データの定義文となる解説データを同時に抽出することができるので、多量のテキストデータであっても、人手を介することなく、高速に用語集のデータを抽出することができる。
【００９２】
請求項２に記載の発明によれば、用語集生成装置は、概念データベース内の用語データに対応した複数の概念データのうちで、出現頻度の最も高い概念データを、その用語データの概念データとすることができるので、解説データを生成するときに、この概念データを末尾に付加することで、前記解説データは、上位概念が同じ用語データであれば、末尾が同じ概念データで構成されることになり、統一性のある用語集のデータを生成することができる。
【００９３】
請求項３に記載の発明によれば、用語集生成装置は、複数の連体修飾節の中から、用語を定義する説明文である連体修飾節を、用語に直接係る動詞とその直前の助詞との組合せにより特定するので、前記連体修飾節が用語を定義するものであるかどうかの判定を、文章の意味を考慮しなくてもよいため、簡易な判定基準で行なうことができ、高速に判定を行なうことができる。
【００９４】
請求項４に記載の発明によれば、用語集生成装置は、生成した用語データ及びその用語データに対応した解説データを解説データ蓄積手段に蓄積し、これにより、解説データ蓄積手段に、入力された自然言語のテキストデータの用語集を構築することができる。これにより、用語集のデータベースを容易に構築することができ、さらに、このデータベースを用いて用語データの検索システムを構築することも可能になる。
【００９５】
請求項５に記載の発明によれば、用語集生成装置は、入力されるテキストデータがニュース原稿のデータであるため、特にニュース原稿では、難解な用語や、毎年数多く創り出される新語や、既存の言葉を組み合わせた造語に、視聴者が容易に理解できるような用語の説明を伴うことが多く、この説明を利用して効果的に用語集のデータを生成することができる。
【００９６】
請求項６に記載の発明によれば、用語集生成プログラムは、入力された自然言語のテキストデータに対して、形態素解析や構文解析により係り受け解析を行なうことで、用語データと、その用語データの上位概念を示す概念データと、用語データに係る修飾データを前記テキストデータから抽出し、概念データと修飾データとに基づいて、用語データを定義する解説データを生成することができる。
【００９７】
これにより、入力された自然言語のテキストデータから、用語集の元となる用語データを抽出することができ、さらに、その用語データの定義文となる解説データを同時に抽出することができるので、多量のテキストデータであっても、人手を介することなく、高速に用語集のデータを抽出することができる。
【００９８】
請求項７に記載の発明によれば、用語集検索装置は、入力されたテキストデータから用語データを説明する解説データを生成する用語集生成装置に、ユーザからの用語データの問合せに対して、その用語データに対応する解説データを検索して、検索結果を出力するユーザインターフェースを提供することができ、常に最新の用語データの意味をユーザが容易に、検索することができる。
【図面の簡単な説明】
【図１】本発明の実施の形態に係る用語集生成装置の全体構成を示すブロック図である。
【図２】本発明の実施の形態に係る用語集検索装置の全体構成を示すブロック図である。
【図３】本発明の特定表現を利用した用語データの概念データ抽出結果の一部を示した図である。
【図４】本発明の用語データを定義する連体修飾節抽出結果の一部を示した図である。
【図５】本発明の実施の形態に係る用語集生成装置の動作を示すフローチャートである。
【図６】本発明の実施の形態に係る概念データを抽出する動作を示すフローチャートである。
【図７】本発明の実施の形態に係る修飾データを抽出する動作を示すフローチャートである。
【図８】本発明の実施の形態に係る解説データを生成する動作を示すフローチャートである。
【図９】本発明の実施の形態に係る用語集検索装置の出力画面の一例を示す図である。
【符号の説明】
１……用語集生成装置
２……用語集検索装置
１１……係り受け解析手段
１２……用語データ抽出手段
１３……概念データ抽出手段
１４……概念データベース
１５……修飾データ抽出手段
１６……学習データベース
１７……解説データ生成手段
１８……解説データ蓄積手段
２１……解説データ検索手段
２２……入力手段
２３……出力手段

[0001]
[Technical Field of the Invention]
The present invention relates to a glossary generation device and a glossary generation program that extract terms and data explaining those terms from text data, and to a glossary search device that retrieves the explanation data for a term from that term.
[0002]
[Prior art]
Conventionally, a known method for extracting a term and the explanation data that defines it from natural-language text data is based on matching surface patterns, which capture superficial features of a sentence.
[0003]
In this method, for example, with α denoting a term and β its defining sentence, the text data is matched against surface patterns such as 「αとはβである」 ("α is defined as β") and 「αはβ」 ("α is β"), so that the terms matching these patterns and the explanation data defining them can be extracted.
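As a rough illustration of this conventional approach, the two surface patterns can be expressed as regular expressions. This is a minimal sketch only: the patterns, the function name `match_surface_pattern`, and the sentence-level matching are assumptions made for the example and are not taken from the patent.

```python
import re

# Hypothetical regular expressions for the two surface patterns in the text.
SURFACE_PATTERNS = [
    re.compile(r"^(?P<term>.+?)とは(?P<definition>.+?)である。?$"),  # 「αとはβである」
    re.compile(r"^(?P<term>.+?)は(?P<definition>.+?)。?$"),          # 「αはβ」
]


def match_surface_pattern(sentence):
    """Return (term, definition) if the sentence fits a surface pattern, else None."""
    for pattern in SURFACE_PATTERNS:
        m = pattern.match(sentence)
        if m:
            return m.group("term"), m.group("definition")
    return None


hit = match_surface_pattern("顔文字とは記号で感情を表す表現である。")
miss = match_surface_pattern("インターネットでできるようにする電子政府の実現")
```

A sentence in which the term is defined only by a clause that modifies it, as in the second call, matches neither pattern.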
[0004]
[Problems to be solved by the invention]
However, a survey of news manuscripts used in actual broadcast programs and the like showed that the surface patterns 「αとはβである」 and 「αはβ」 were used in only 7.7% of all the news manuscripts. This is too little data to generate a glossary of the terms contained in the news manuscripts together with their explanation data.
[0005]
A term may also be defined by an adnominal modifier clause (連体修飾節) attached to it. For example, in a news manuscript that reads 「住民票の取得や、企業が行なう許認可などの手続きを役所に出向かなくてもインターネットでできるようにする電子政府の実現などの…」 ("... such as the realization of e-government, which allows procedures such as obtaining a resident's card or corporate licensing and approval to be carried out over the Internet without visiting a government office ..."), the adnominal modifier clause 「住民票の取得や、企業が行なう許認可などの手続きを役所に出向かなくてもインターネットでできるようにする」 defines the term 「電子政府」 ("e-government").
[0006]
When a term is defined by an adnominal modifier clause in this way, the surface-pattern matching of the conventional technique cannot extract the explanation data for the term.
[0007]
The present invention has been made in view of the above technical problems. Its object is to provide a glossary generation device and a glossary generation program that extract, from natural-language text data, a term and the explanation data defining it on the basis of adnominal modifier clauses, and a glossary search device that retrieves the explanation data for a term from that term.
[0008]
[Means for Solving the Problems]
The present invention has been devised to achieve the above object. First, the glossary generation device according to claim 1 has the following configuration.
That is, the device comprises: dependency analysis means for generating dependency information for the phrases of input natural-language text data by performing morphological analysis and syntactic analysis on the text data; term data extraction means for extracting, from the text data, character strings that are nouns or noun phrases as term data; concept data extraction means for extracting, from the text data, concept data indicating a superordinate concept of the term data, based on the dependency information and specific paraphrase expressions that paraphrase the term data; a learning database in which learning data characterizing adnominal modifier clauses that serve as explanatory sentences defining terms is registered in advance; modifier data extraction means for determining, based on the dependency information and the learning data, whether an adnominal modifier clause attached to the term data constitutes a definition of the term data, and extracting a clause judged to be a definition as modifier data; and explanation data generation means for generating explanation data that defines the term data by concatenating the concept data to the modifier data.
[0009]
With this configuration, the glossary generation device uses the dependency analysis means to perform morphological analysis and syntactic analysis on the input natural-language text data, thereby generating dependency information for the phrases of the text data, and uses the term data extraction means to extract character strings that are nouns or noun phrases from the text data as term data. These strings can be extracted by, for example, syntactic analysis. The concept data extraction means then extracts, from the text data, concept data indicating a superordinate concept of the extracted term data, based on the dependency information and specific paraphrase expressions that paraphrase the term data.
[0010]
Furthermore, the glossary generation device uses the modifier data extraction means to extract, as modifier data, an adnominal modifier clause that defines the term data, based on the dependency information and on the learning data in the learning database, in which learning data characterizing adnominal modifier clauses that serve as explanatory sentences defining terms is registered in advance. The explanation data generation means then generates explanation data defining the term data by concatenating the concept data to the modifier data.
[0011]
As a result, the glossary generation device performs dependency analysis on the input natural-language text data through morphological and syntactic analysis, extracts from the text data the term data, the concept data indicating a superordinate concept of the term data, and the modifier data defining the term data, and generates explanation data defining the term data from the concept data and the modifier data.
[0012]
Furthermore, the glossary generation device according to claim 2 is the glossary generation device according to claim 1, further comprising a concept database in which term data and a plurality of concept data items corresponding to the term data are registered, wherein the concept data extraction means determines, from the plurality of concept data items, one concept data item corresponding to the term data based on appearance frequency.
[0013]
With this configuration, the concept data extraction means of the glossary generation device refers to the appearance frequencies of the concept data in the concept database, in which term data and a plurality of concept data items corresponding to the term data are registered, and determines one concept data item corresponding to the term data. The glossary generation device thus selects, from the plurality of concept data items registered for the term data in the concept database, the one with the highest appearance frequency as the concept data best suited to that term data.
[0014]
Further, the glossary generation device according to claim 3 is the glossary generation device according to claim 1 or claim 2, wherein the learning database registers, as learning data, the verb that directly modifies a term and the particle immediately preceding that verb, for cases in which an adnominal modifier clause is an explanatory sentence defining the term.
[0015]
With this configuration, the learning database of the glossary generation device registers, as learning data, the verb that directly modifies a term and the particle immediately preceding that verb, for cases in which an adnominal modifier clause is an explanatory sentence defining the term. The glossary generation device can thus identify, among multiple adnominal modifier clauses, those that are explanatory sentences defining a term by the combination of the verb that directly modifies the term and the particle immediately preceding it.
[0016]
Furthermore, the glossary generation device according to claim 4 is the glossary generation device according to any one of claims 1 to 3, further comprising explanation data storage means for storing term data and the explanation data corresponding to the term data.
[0017]
With this configuration, the glossary generation device stores the generated term data and the corresponding explanation data in the explanation data storage means, thereby building in the explanation data storage means a glossary database extracted from the input natural-language text data.
[0018]
The glossary generation device according to claim 5 is the glossary generation device according to any one of claims 1 to 4, wherein the input text data is news manuscript data.
[0019]
With this configuration, the glossary generation device receives news manuscript data from outside and generates explanation data for terms, covering not only general terms but also the difficult terms, new words, and coined words found in news manuscripts.
[0020]
The glossary generation program according to claim 6 causes a computer to function as the following means in order to generate explanation data defining terms in input natural-language text data, using that text data and a learning database in which learning data characterizing adnominal modifier clauses that serve as explanatory sentences defining terms is registered.
[0021]
That is, the means are: dependency analysis means for generating dependency information between the phrases of the text data by performing morphological analysis and syntactic analysis on the text data; term data extraction means for extracting, from the text data, at least one of a noun and a noun phrase as term data; concept data extraction means for extracting, from the text data, concept data indicating a superordinate concept of the term data, based on the dependency information and specific paraphrase expressions; modifier data extraction means for extracting, as modifier data, an adnominal modifier clause that defines the term data, based on the dependency information and the learning data; and explanation data generation means for generating explanation data that defines the term data by concatenating the concept data to the modifier data.
[0022]
With this configuration, the glossary generation program uses the dependency analysis means to perform morphological analysis and syntactic analysis on the text data, thereby generating dependency information between the phrases of the text data; uses the term data extraction means to extract character strings that are nouns or noun phrases from the text data as term data; uses the concept data extraction means to extract, from the text data, concept data indicating a superordinate concept of the term data, based on the dependency information and specific paraphrase expressions; uses the modifier data extraction means to extract, as modifier data, an adnominal modifier clause that defines the term data, based on the dependency information and the learning data; and uses the explanation data generation means to generate explanation data defining the term data by concatenating the concept data to the modifier data.
[0023]
As a result, the glossary generation program performs dependency analysis on the input natural-language text data through morphological and syntactic analysis, extracts from the text data the term data, the concept data indicating a superordinate concept of the term data, and the modifier data defining the term data, and generates explanation data defining the term data from the concept data and the modifier data.
[0024]
Furthermore, the glossary search device according to claim 7 comprises: the glossary generation device according to claim 4, which generates explanation data explaining term data from text data; input means for inputting term data; and explanation data search means for searching, based on the term data, the explanation data stored in the explanation data storage means.
[0025]
With this configuration, the glossary search device uses the glossary generation device to generate explanation data explaining term data from the input text data, uses the input means to input term data, uses the explanation data search means to search the explanation data stored in the explanation data storage means within the glossary generation device, and uses the output means to output the search result. The glossary search device thus adds to the glossary generation device, which generates explanation data explaining term data from input text data, a user interface function that, in response to a user's query for term data, retrieves the explanation data corresponding to that term data and outputs the search result.
[0026]
[Embodiments of the Invention]
Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.
(First Embodiment: Configuration of Glossary Generation Device)
FIG. 1 is a block diagram showing the overall configuration of the glossary generation device according to the first embodiment of the present invention. As shown in FIG. 1, the glossary generation device 1 receives a news manuscript as text data and generates the term data contained in the news manuscript and explanation data defining that term data.
[0027]
This glossary generation device 1 comprises dependency analysis means 11, term data extraction means 12, concept data extraction means 13, a concept database 14, modifier data extraction means 15, a learning database 16, explanation data generation means 17, and explanation data storage means 18. News manuscripts are input as text data from an external news manuscript database 3. In addition, an input device (not shown) is externally connected to the glossary generation device 1 for specifying the range of news manuscripts to be taken from the news manuscript database 3 (for example, only the news manuscripts from the year 2000).
[0028]
The dependency analysis means 11 takes the news manuscripts input according to the range specification received from the input device (not shown), decomposes them into phrases by morphological analysis and syntactic analysis, generates the character strings of the phrases and the dependency relations between the phrases as dependency information, and passes this information to the term data extraction means 12, the concept data extraction means 13, and the modifier data extraction means 15.
[0029]
The dependency analysis means 11 uses morphological analysis to identify the morphemes in a news manuscript, a morpheme being the smallest linguistic unit that carries meaning, and uses syntactic analysis to identify phrases such as noun phrases and verb phrases together with their dependency relations. Both morphological analysis and syntactic analysis can be realized with known techniques.
[0030]
Based on the dependency information generated by the dependency analysis means 11, the term data extraction means 12 extracts, from the phrases contained in the dependency information, a phrase that constitutes a term and outputs it as term data. This term data is passed to the concept data extraction means 13, the modifier data extraction means 15, and the explanation data generation means 17.
[0031]
Here, a clause is identified as a term, and extracted, when it is a noun or a noun phrase. If no clause is recognized as a term, a null character string is output as the term data.
[0032]
In addition, a character string to be treated as a term can also be extracted from surface information in the text data of the news manuscript input to the glossary generation device 1. For example, by treating a noun phrase enclosed in square brackets or the like as an important part of the text, the term data extraction means 12 can extract the character string enclosed in the brackets as a term.
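The surface-based extraction mentioned above can be sketched with a regular expression. This is a minimal illustration under stated assumptions: the function name `extract_bracketed_terms` and the set of bracket characters are hypothetical, and the sketch does not check that the enclosed string is a noun phrase.

```python
import re
from typing import List

# Bracket pairs treated as markers of an important term (an assumed set).
BRACKET_PATTERN = re.compile(r"\[([^\[\]]+)\]|「([^「」]+)」")


def extract_bracketed_terms(text: str) -> List[str]:
    """Return the strings enclosed in brackets, in order of appearance."""
    terms = []
    for match in BRACKET_PATTERN.finditer(text):
        # Exactly one of the two groups is filled for each match.
        terms.append(match.group(1) or match.group(2))
    return terms
```

A fuller version would pass each candidate through the dependency analysis and keep only those recognized as nouns or noun phrases.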
[0033]
Based on the dependency information generated by the dependency analysis means 11 and the term data extracted by the term data extraction means 12, the concept data extraction means 13 extracts, as concept data, a clause indicating a superordinate concept of the term data.
[0034]
This concept data is registered in the concept database 14 together with the term data. If the same term data is already registered, a plurality of concept data can be registered for it. If the concept data corresponding to that term data is also already registered, the number of appearances of the concept data is updated (incremented).
[0035]
The concept data extraction means 13 reads the concept data corresponding to the term data from the concept database 14 and notifies the comment data generation means 17 of the concept data. When there are a plurality of concept data corresponding to the term data, the comment data generation means 17 is notified of the concept data having the largest number of appearances.
[0036]
The concept data is data that paraphrases the term data by means of a specific paraphrase expression (specific expression) such as “to” or “called”. The dependency information generated by the dependency analysis means 11 is used as the basis for judging whether such a paraphrase is present.
[0037]
That is, let the arrow (→) in “clause A → clause B” denote a dependency relation in which clause A is the dependency source and clause B is the dependency destination. When the input character string has the form “(term data) → to → (noun (phrase))” or “(term data) → called → (noun (phrase))”, this noun (phrase) is judged to indicate the superordinate concept of the term data. FIG. 3 shows the result of extracting concept data from actual news manuscripts based on these specific expressions. For example, in the character string “expression of emoticon”, the term data is “emoticon” (D1) and the concept data is “expression” (D2).
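The pattern check described above can be sketched as a walk along two dependency links. This is a minimal illustration, assuming a simplified representation in which each clause is a `(text, head_index)` pair and the set of noun clauses is supplied by the caller; the names `Dep`, `SPECIFIC_EXPRESSIONS`, and `extract_concept` are hypothetical.

```python
from typing import List, Optional, Set, Tuple

# (clause text, index of the clause it depends on, or None for the root)
Dep = Tuple[str, Optional[int]]

# Stand-ins for the specific paraphrase expressions named in the text.
SPECIFIC_EXPRESSIONS = {"to", "called"}


def extract_concept(clauses: List[Dep], term: str, nouns: Set[str]) -> Optional[str]:
    """Follow term -> specific expression -> noun and return that noun, if any."""
    for text, head in clauses:
        if text != term or head is None:
            continue
        middle_text, middle_head = clauses[head]
        if middle_text in SPECIFIC_EXPRESSIONS and middle_head is not None:
            target_text, _ = clauses[middle_head]
            if target_text in nouns:
                return target_text
    return None
```

With the clauses `[("emoticon", 1), ("to", 2), ("expression", None)]` and `{"expression"}` as the noun set, the function returns the clause at the end of the chain as the concept data.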
[0038]
The concept database 14 is a database in which term data and concept data are registered in association with each other, and the concept data extraction means 13 registers and refers to the term data and concept data in it. The term data and the concept data have a one-to-many relationship, so a plurality of concept data can be registered for one term data. Further, a number of appearances is attached to each concept data, and the concept data extraction means 13 updates and refers to this number. The concept database 14 is composed of storage means such as a hard disk.
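The one-to-many registration with appearance counts can be sketched with an in-memory structure. The class and method names below are hypothetical, and the behaviour on ties is an assumption, since the text does not say which concept data is chosen when counts are equal.

```python
from collections import Counter, defaultdict
from typing import Dict, Optional


class ConceptDatabase:
    """In-memory stand-in for the concept database (hypothetical API)."""

    def __init__(self) -> None:
        # term data -> Counter mapping each concept data to its number of appearances
        self._entries: Dict[str, Counter] = defaultdict(Counter)

    def register(self, term: str, concept: str) -> None:
        """Add the pair, or increment its count if it is already registered."""
        self._entries[term][concept] += 1

    def most_frequent(self, term: str) -> Optional[str]:
        """Return the concept data with the largest number of appearances."""
        counts = self._entries.get(term)
        if not counts:
            return None
        # On ties, Counter.most_common keeps first-registered order (an assumption here).
        return counts.most_common(1)[0][0]
```

A persistent implementation would replace the dictionary with storage on a hard disk, as the text describes, while keeping the same two operations.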
[0039]
Based on the dependency information generated by the dependency analysis means 11, the term data extracted by the term data extraction means 12, and the learning data registered in the learning database 16, the modification data extraction means 15 extracts a linkage modification clause that modifies the term data and determines whether or not that clause defines the term data. Based on the determination result, it notifies the comment data generation means 17 of the linkage modification clause as modification data.
[0040]
To determine whether or not the linkage modification clause defines the term data, the modification data extraction means 15 refers to the learning database 16, in which a two-element tuple consisting of the verb of the linkage modification clause that directly relates to the term data and the particle immediately preceding that verb is registered as learning data. The learning data is built on the learning database 16 by also registering verbs similar to that verb, using a known thesaurus, that is, a list of synonyms in which similar words are classified in a tree structure.
[0041]
Accordingly, let “v” be the verb in the linkage modification clause related to the term data, “vg1” the set of verbs belonging to the same group as the verb “v” in the thesaurus, “vg2” the set of verbs belonging to the parent node of the verb “v”, “p” the particle immediately before the verb “v”, “w_a” and “w_b” the weighting coefficients for the similarity to the verb “v” in the verb sets “vg1” and “vg2”, respectively, “n(vg, p)” the number of occurrences in the learning data of the pair of verb set “vg” and particle “p”, and “e(vg, p)” its expected value. The index value “weight(v, p)”, which is used to determine whether the linkage modification clause is a so-called term definition clause that modifies the term data, is then defined as in equation (1).
[0042]
[Expression 1]

[0043]
In equation (1), the first term is 0 when n(vg1, p) < e(vg1, p), and the second term is 0 when n(vg2, p) < e(vg2, p). When the index value weight(v, p) is larger than a preset threshold value, the linkage modification clause is determined to be an explanatory text defining the term data.
[0044]
FIG. 4 shows, as an experimental result, a part of the result of extracting term definition clauses for term data from news manuscripts of June 2001. In this experiment, 15,295 items of learning data were given, and the term definition clauses were extracted with w_a = 0.67, w_b = 0.33, and a threshold value of 1.0.
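The index value and threshold test can be sketched as follows. Equation (1) itself is not reproduced in this text, so the per-term form is unknown; the sketch therefore takes it as a parameter, and the default `relative_excess` is purely an assumption chosen so that the example runs. It is also an assumption that each term is multiplied by its weight and that the two terms are summed. What the text does state is that a term is 0 when n is smaller than e, and that a clause is accepted only when the index value is larger than the threshold.

```python
from typing import Callable


def relative_excess(n: float, e: float) -> float:
    """Placeholder per-term score (an assumption, not the published formula)."""
    return (n - e) / e  # requires e > 0


def index_weight(
    n1: float, e1: float,  # n(vg1, p) and e(vg1, p)
    n2: float, e2: float,  # n(vg2, p) and e(vg2, p)
    w_a: float = 0.67,
    w_b: float = 0.33,
    term_score: Callable[[float, float], float] = relative_excess,
) -> float:
    """Weighted sum of two terms; a term is 0 when n < e, as stated in the text."""
    first = 0.0 if n1 < e1 else w_a * term_score(n1, e1)
    second = 0.0 if n2 < e2 else w_b * term_score(n2, e2)
    return first + second


def is_definition_clause(weight: float, threshold: float = 1.0) -> bool:
    """The clause counts as a definition only when the weight exceeds the threshold."""
    return weight > threshold
```

The default weights and threshold are the values reported for the experiment; any other scoring function with the same two arguments can be passed as `term_score`.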
[0045]
As described for the modification data extraction means 15, the learning database 16 is a database in which two-element tuples, each consisting of the verb of a linkage modification clause that directly relates to the term data and the particle immediately preceding that verb, are registered as learning data. It is composed of storage means such as a hard disk.
[0046]
Based on the term data extracted by the term data extraction means 12, the concept data extracted by the concept data extraction means 13, and the modification data extracted by the modification data extraction means 15, the comment data generation means 17 generates the term data and the comment data that defines the term data.
[0047]
As shown in FIG. 4, the modification data extracted by the modification data extraction means 15 ends with a verb in its combination form and is therefore not appropriate as a definition sentence by itself. The comment data generation means 17 therefore generates comment data as “modification data (linkage modification clause) + concept data (superordinate concept)”. For example, when the term data in FIG. 3 is “emoticon” (D1), the concept data is “expression” (D2), and the modification data corresponding to “emoticon” is “expresses emotions using symbols etc. in e-mail exchanged by personal computer or mobile phone” (D3), the modification data and the concept data are concatenated, and the comment data corresponding to “emoticon” becomes “an expression that expresses emotions using symbols etc. in e-mail exchanged by personal computer or mobile phone”.
[0048]
Further, when modification data exists for the term data but concept data does not, the comment data generation means 17 determines whether the final morpheme obtained by morphological analysis of the term data itself can serve as concept data. This determination can be made, for example, by registering in advance, as concept data in the concept database 14, final morphemes that tend to express a superordinate concept.
[0049]
Here, when the final morpheme is concept data, comment data is generated as “modification data (linkage modification clause) + final morpheme of the term data”. For example, when there is no concept data for the term data “judicial system reform promotion law” (D4) in FIG. 4, “law”, the final morpheme of “judicial system reform promotion law” (D4), is used as concept data. Adding it to the linkage modification clause “establishes a system for promoting judicial system reform” (D5) generates the comment data “a law that establishes a system for promoting judicial system reform”.
[0050]
Further, when modification data exists for the term data but concept data does not exist and the final morpheme cannot be judged to be concept data, the comment data generation means 17 generates comment data as “modification data (linkage modification clause) + thing”. For example, when there is no concept data for the term data “Red Data Book” (D6) in FIG. 4, “thing” is added to the linkage modification clause of “Red Data Book” (D6), “records wild birds in danger of extinction” (D7), to generate the comment data “a thing that records wild birds in danger of extinction”.
The generated comment data is stored in the comment data storage means 18 as a pair with term data.
[0051]
The comment data storage means 18 is a storage means for storing the term data and comment data output from the comment data generation means 17 in pairs, and is composed of a hard disk or the like. The data accumulated here can be used as a glossary search database that can refer to explanation data from term data.
[0052]
The configuration of the glossary generation device 1 according to the present invention has been described above based on one embodiment, but the present invention is not limited to this. For example, the comment data storage means 18 may be omitted, and the comment data generation means 17 may output the term data and the comment data to the outside as data so that they are held in external storage means.
[0053]
Further, the concept database 14 may be omitted, and the concept data extraction means 13 may extract the concept data each time, based on the dependency information input from the dependency analysis means 11 and the term data input from the term data extraction means 12.
[0054]
Further, the input data of the glossary generation device 1 is not limited to the news manuscript, but may be general text data. For example, by inputting the manuscript as text data from a database storing magazine manuscripts, it is possible to output commentary data for the latest terminology data of the magazine.
[0055]
(First Embodiment: Operation of Glossary Generation Device)
Next, the operation of the glossary generation device 1 will be described with reference to FIG. 1 and the following flowcharts. FIG. 5 is a flowchart showing the overall operation of the glossary generation device 1. FIG. 6 is a flowchart showing the operation of the concept data extraction means 13. FIG. 7 is a flowchart showing the operation of the modification data extraction means 15. FIG. 8 is a flowchart showing the operation of the comment data generation means 17.
First, the overall operation of the glossary generation device 1 will be described based on the flowchart of FIG. 5.
[0056]
First, the externally input news manuscript is decomposed into clause units by morphological analysis and syntax analysis, and the character string of each clause and the dependency relations between clauses are generated as dependency information (step a1).
[0057]
Then, based on the dependency information, clauses that are nouns or noun phrases are identified among the clauses included in the dependency information, and a clause that serves as a term is extracted as term data (step a2).
[0058]
Here, it is determined whether or not term data was extracted in step a2 (step a3). When there is no term data (No), the operation is terminated. On the other hand, if there is term data (Yes), the process proceeds to step a4.
[0059]
Then, based on the dependency information and the term data, a phrase indicating a superordinate concept of the term data is extracted as concept data by a specific expression such as “to” or “called” (step a4).
[0060]
Further, based on the dependency information, the term data, and the learning data registered in the learning database 16, it is determined whether or not a linkage modification clause that modifies the term data defines the term data, and, based on the determination result, the linkage modification clause is extracted as modification data (step a5).
[0061]
Then, comment data defining the term data is generated based on the conceptual data and the modification data (step a6).
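Steps a1 to a6 can be sketched as a small driver that calls one function per step. This is an illustration only: the function `generate_glossary_entry` and its parameter names are hypothetical, and each step is passed in as a callable because the text leaves the concrete analysers to known techniques.

```python
from typing import Callable, Optional, Tuple


def generate_glossary_entry(
    text: str,
    analyze: Callable[[str], object],                                      # step a1
    extract_term: Callable[[object], str],                                 # step a2
    extract_concept: Callable[[object, str], Optional[str]],               # step a4
    extract_modification: Callable[[object, str], Optional[str]],          # step a5
    generate_comment: Callable[[str, Optional[str], Optional[str]], str],  # step a6
) -> Optional[Tuple[str, str]]:
    """Run steps a1 to a6 and return (term data, comment data), or None."""
    dependency_info = analyze(text)                              # a1
    term = extract_term(dependency_info)                         # a2
    if not term:                                                 # a3: null string, stop
        return None
    concept = extract_concept(dependency_info, term)             # a4
    modification = extract_modification(dependency_info, term)   # a5
    comment = generate_comment(term, concept, modification)      # a6
    return term, comment
```

Passing the steps as arguments keeps the control flow of FIG. 5 visible while leaving each means replaceable, for example when the concept database is omitted.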
[0062]
Next, the operation of the concept data extraction unit 13 will be described based on the flowchart of FIG. This flowchart describes step a4 in FIG. 5 in detail.
[0063]
First, based on the input dependency information, it is determined whether any clause of the dependency information contains a specific expression such as “to” or “called” (step b1). When there is no specific expression (No), it is determined whether or not the input term data is already registered in the concept database 14 (step b2). When it is not registered (No), the processing is terminated on the assumption that no concept data corresponding to the term data exists. On the other hand, when concept data corresponding to the term data is registered in the concept database 14 (Yes), the process proceeds to step b5.
[0064]
In step b1, if there is a specific expression (Yes), based on the specific expression, concept data that is a superordinate concept of the term data is extracted from the dependency relationship (step b3).
[0065]
Then, the term data and the concept data extracted in step b3 are registered in the concept database 14 (step b4). If the same term data is already registered, a plurality of concept data can be registered for it. If the concept data corresponding to that term data is also already registered, the number of appearances of the concept data is updated (incremented).
[0066]
Then, among the concept data corresponding to the term data, based on the number of appearances, the concept data having the highest appearance frequency is output as the concept data of the term data (step b5).
[0067]
When the glossary generation device 1 has no concept database 14, the concept data extraction operation can be simplified as follows. If no specific expression exists in step b1 (No), the processing is terminated on the assumption that there is no concept data corresponding to the term data. If a specific expression exists (Yes), the concept data that is a superordinate concept of the term data is extracted from the dependency relations based on the specific expression in step b3 and output, and the processing is terminated.
[0068]
Next, the operation of the modification data extraction unit 15 will be described based on the flowchart of FIG. This flowchart explains step a5 in FIG. 5 in detail.
[0069]
First, based on the input dependency information and term data, it is determined whether or not the clauses of the dependency information include a linkage modification clause related to the term data (step c1). If there is none (No), the processing is terminated on the assumption that there is no modification data related to the term data.
[0070]
On the other hand, if there is a linkage modification clause related to the term data in step c1 (Yes), an index value for determining whether or not this linkage modification clause is an explanatory text defining the term data is calculated based on equation (1) (step c2).
[0071]
Then, the index value is compared with a preset threshold value (step c3). If the index value is equal to or less than the threshold value (No), the processing is terminated on the assumption that this linkage modification clause is not modification data defining the term data. On the other hand, when the index value is larger than the threshold value (Yes), this linkage modification clause is determined to be modification data related to the term data and is output as modification data (step c4).
[0072]
Next, the operation of the comment data generation means 17 will be described based on the flowchart of FIG. This flowchart explains step a6 in FIG. 5 in detail.
[0073]
First, it is determined whether or not there is modification data, that is, an explanatory text defining the term data (step d1). When there is no modification data (No), it is further determined whether there is concept data that is a superordinate concept of the term data (step d2). When there is no concept data (No), “none” (null data) is set in the comment data variable (step d3), and the process proceeds to step d10. On the other hand, if there is concept data in step d2 (Yes), the character string of the concept data is set in the comment data variable (step d4), and the process proceeds to step d10.
[0074]
In step d1, if there is modification data that is an explanatory text defining the term data (Yes), it is further determined whether there is concept data that is a superordinate concept of the term data (step d5). If there is concept data (Yes), the character string “modification data + concept data” is set in the comment data variable (step d6), and the process proceeds to step d10.
[0075]
On the other hand, if there is no concept data that is a superordinate concept of the term data in step d5 (No), it is determined whether or not the final morpheme of the term data is concept data (step d7). When the final morpheme is concept data (Yes), the character string “modification data + final morpheme” is set in the comment data variable (step d8), and the process proceeds to step d10.
[0076]
In step d7, if the final morpheme of the term data is not concept data (No), the character string “modification data + thing” is set in the comment data variable (step d9), and the process proceeds to step d10.
[0077]
Then, the term data and the comment data set by the process are output (step d10).
With the above operation, the glossary generation device 1 can generate term data and commentary data defining the term data from the news manuscript.
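The branching of steps d1 to d10 can be sketched as a single function. This is a minimal illustration under stated assumptions: the function and parameter names are hypothetical, the final morpheme is supplied by the caller rather than obtained by morphological analysis, and the `joiner` argument stands in for whatever concatenation the target language requires.

```python
from typing import Optional, Set, Tuple


def generate_comment_data(
    term: str,
    modification: Optional[str],
    concept: Optional[str],
    final_morpheme: str,
    concept_like_morphemes: Set[str],
    joiner: str = " ",
) -> Tuple[str, Optional[str]]:
    """Return (term data, comment data) following steps d1 to d10."""
    if modification is None:                               # d1: No
        if concept is None:                                # d2: No
            comment = None                                 # d3: "none" (null data)
        else:
            comment = concept                              # d4
    elif concept is not None:                              # d5: Yes
        comment = modification + joiner + concept          # d6
    elif final_morpheme in concept_like_morphemes:         # d7: Yes
        comment = modification + joiner + final_morpheme   # d8
    else:
        comment = modification + joiner + "thing"          # d9
    return term, comment                                   # d10
```

For instance, calling it with the term "Red Data Book", a modification clause, no concept data, and a final morpheme that is not registered as concept-like falls through to the "thing" branch of step d9.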
[0078]
Note that each function of the glossary generation device 1 can also be realized as a program on a computer, and the device can be operated as a glossary generation program by combining these function programs.
[0079]
(Second embodiment: Glossary search device)
FIG. 2 is a block diagram showing the overall configuration of the glossary search device according to the second embodiment of the present invention. As shown in FIG. 2, the glossary search device 2 receives a news manuscript as text data, generates the term data contained in the news manuscript and the comment data that defines the term data, and outputs the comment data for term data about which an operator has inquired.
[0080]
This glossary search device 2 is configured by adding comment data search means 21, input means 22, and output means 23 to the glossary generation device 1 shown in FIG. 1. The components other than the comment data search means 21, the input means 22, and the output means 23 are denoted by the same reference numerals as in FIG. 1. Furthermore, the glossary search device 2 is connected to an external input device 4 such as a keyboard and an output device 5 such as a CRT.
[0081]
Based on the term data, which is text data input from the input means 22, the comment data search means 21 searches the comment data storage means 18 for the comment data corresponding to that term data, and outputs the retrieved comment data to the output means 23 as text data.
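The search described above amounts to a keyed lookup over the stored pairs of term data and comment data. The sketch below assumes the storage is an in-memory dictionary and that matching is exact after trimming surrounding whitespace; both points, and the function name `search_comment_data`, are assumptions rather than details given in the text.

```python
from typing import Dict, Optional


def search_comment_data(storage: Dict[str, str], term: str) -> Optional[str]:
    """Look up the comment data stored as a pair with the given term data."""
    # Trimming whitespace is an assumed convenience for keyboard input.
    return storage.get(term.strip())
```

A term that has not been stored yields `None`, which the output means could render as an empty comment display field.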
[0082]
The input means 22 inputs and analyzes the input data input by the operator from the input device 4 such as a keyboard and a mouse. When term data is input as this input data, this term data is output to the comment data search means 21.
[0083]
Further, when the input data designates a range of news manuscripts, for example “news manuscripts from 2000”, the input means 22 notifies the glossary generation device 1 of the range designation data. The glossary generation device 1 then receives the news manuscripts within that range from the news manuscript database 3, extracts term data, generates comment data defining the term data, and outputs the term data and comment data to the output means 23 as text data.
[0084]
The output means 23 outputs the term data and comment data generated by the glossary generation device 1, and the comment data retrieved by the comment data search means 21, to the output device 5 such as a CRT as output data.
[0085]
FIG. 9 shows a display example 50 of the output device 5 connected to the glossary search device 2. The screen of FIG. 9 is an example of an application screen for a news manuscript glossary, in which the operator inputs term data and the comment data corresponding to that term data is displayed.
[0086]
For example, when the operator inputs term data into the term input field 50a via the input device 4 and presses the RETURN key or clicks the search button 50b with the mouse, the glossary search device 2 searches the comment data storage means 18 for the comment data corresponding to the term data and displays it in the comment display field 50d. Here, the clear button 50c is a button for deleting at once the data entered in the term input field 50a, and the end button 50e is a button for ending the application for explaining the terms of news manuscripts.
[0087]
In this way, the glossary search device 2 generates term data and comment data defining the term data from news manuscripts, and has an interface that outputs comment data for the term data about which the operator has inquired. The operator can therefore always search quickly for an explanation of the latest terms.
[0088]
The glossary search device 2 according to the present invention has been described above based on one embodiment, but the present invention is not limited to this. For example, the glossary search device 2 may include a network interface so that a remote operator can search for the comment data of term data via a network.
[0089]
[Effects of the invention]
As described above, the glossary generation device, the glossary generation program, and the glossary search device according to the present invention have the following excellent effects.
[0090]
According to the first aspect of the present invention, the glossary generation device performs dependency analysis on the input natural language text data by morphological analysis and syntax analysis, thereby extracting from the text data the term data, the concept data indicating the superordinate concept of the term data, and the modification data defining the term data, and can generate comment data defining the term data based on the concept data and the modification data.
[0091]
This makes it possible to extract, from the input natural language text data, the term data on which the glossary is based and, at the same time, the comment data that serves as the definition sentence of the term data. Glossary data can thus be extracted at high speed without human intervention.
[0092]
According to the second aspect of the present invention, the glossary generation device uses, as the concept data of the term data, the concept data having the highest appearance frequency among the plurality of concept data corresponding to the term data in the concept database. Therefore, when this concept data is appended to the end of the comment data at generation time, term data sharing the same superordinate concept receive comment data ending in the same concept data, which makes it possible to generate uniform glossary data.
[0093]
According to the third aspect of the present invention, the glossary generation device determines whether a linkage modification clause is an explanatory text defining a term based on the verb directly related to the term and the particle immediately preceding that verb. Since the meaning of the sentence need not be considered, this determination can be performed at high speed.
[0094]
According to the invention described in claim 4, the glossary generation device stores the generated term data and the comment data corresponding to the term data in the comment data storage means, whereby a glossary of the input natural language text data can be constructed in the comment data storage means. As a result, a glossary database can be easily constructed, and a term data search system can be built using this database.
[0095]
According to the fifth aspect of the present invention, the input text data is news manuscript data. News manuscripts in particular often accompany difficult terms, the many new words created every year, and coined words combining such words with explanations that viewers can easily understand, so the glossary generation device can effectively generate glossary data using these explanations.
[0096]
According to the sixth aspect of the invention, the glossary generation program performs dependency analysis on the input natural language text data by morphological analysis and syntax analysis, thereby extracting from the text data the term data, the concept data indicating the superordinate concept of the term data, and the modification data related to the term data, and can generate comment data defining the term data based on the concept data and the modification data.
[0097]
This makes it possible to extract, from the input natural language text data, the term data on which the glossary is based and, at the same time, the comment data that serves as the definition sentence of the term data. Glossary data can thus be extracted at high speed without human intervention.
[0098]
According to the seventh aspect of the present invention, the glossary search device includes a glossary generation device that generates comment data explaining term data from the input text data, and can provide a user interface that accepts a term data inquiry from the user, searches for the comment data corresponding to the term data, and outputs the search result. The user can therefore always easily look up the meaning of the latest term data.
[Brief description of the drawings]
FIG. 1 is a block diagram showing an overall configuration of a glossary generation device according to an embodiment of the present invention.
FIG. 2 is a block diagram showing an overall configuration of a glossary search device according to an embodiment of the present invention.
FIG. 3 is a diagram showing a part of the result of extracting conceptual data of term data using the specific expression of the present invention.
FIG. 4 is a diagram showing a part of a result of extracting a linkage modification clause that defines term data of the present invention.
FIG. 5 is a flowchart showing the operation of the glossary generation device according to the embodiment of the present invention.
FIG. 6 is a flowchart showing an operation of extracting concept data according to the embodiment of the present invention.
FIG. 7 is a flowchart showing an operation of extracting modification data according to the embodiment of the present invention.
FIG. 8 is a flowchart showing an operation of generating comment data according to the embodiment of the present invention.
FIG. 9 is a diagram showing an example of an output screen of the glossary search device according to the embodiment of the present invention.
[Explanation of symbols]
1 …… Glossary generation device
2 …… Glossary search device
11 …… Dependency analysis means
12 …… Term data extraction means
13 …… Concept data extraction means
14 …… Concept database
15 …… Modification data extraction means
16 …… Learning database
17 …… Comment data generation means
18 …… Comment data storage means
21 …… Comment data search means
22 …… Input means
23 …… Output means

Claims

A glossary generation device that generates comment data defining term data from input natural language text data, the glossary generation device comprising:
dependency analysis means for generating dependency information of clauses of the text data by performing morphological analysis and syntax analysis on the text data;
term data extraction means for extracting a character string that is a noun or noun phrase from the text data as term data;
concept data extraction means for extracting, from the text data, concept data indicating a superordinate concept of the term data, based on the dependency information and a specific paraphrase expression for paraphrasing the term data;
a learning database in which learning data characterizing the case where a linkage modification clause is an explanatory text defining a term is registered in advance;
modification data extraction means for determining, based on the dependency information and the learning data, whether a linkage modification clause related to the term data is a definition of the term data, and extracting a linkage modification clause determined to be a definition as modification data; and
comment data generation means for generating comment data defining the term data by connecting the concept data to the modification data.

The glossary generation device according to claim 1, further comprising a concept database for registering the term data and a plurality of concept data corresponding to the term data,
wherein the concept data extraction means determines one concept data corresponding to the term data from the plurality of concept data corresponding to the term data, based on their appearance frequency.

The glossary generation device according to claim 1 or claim 2, wherein the learning database has registered therein, as the learning data, a verb directly related to the term and a particle immediately before the term when the linkage modification clause is an explanatory text defining the term.

The glossary generation device according to any one of claims 1 to 3, further comprising
comment data storage means for storing the term data and the comment data corresponding to the term data.

5. The glossary generating device according to claim 1, wherein the text data is news manuscript data.

A glossary generation program for generating comment data that defines terms in text data from input natural language text data and a learning database in which learning data characterizing the case where a linkage modification clause is an explanatory text defining a term is registered, the program causing a computer to function as:
dependency analysis means for generating dependency information between clauses of the text data by performing morphological analysis and syntax analysis on the text data;
term data extraction means for extracting a character string that is a noun or noun phrase from the text data as term data;
concept data extraction means for extracting concept data indicating a superordinate concept of the term data from the text data, based on the dependency information and a specific paraphrase expression;
modification data extraction means for determining, based on the dependency information and the learning data, whether a linkage modification clause related to the term data is a definition of the term data, and extracting a linkage modification clause determined to be a definition as modification data; and
comment data generation means for generating comment data defining the term data by connecting the concept data to the modification data.

A glossary search device that generates comment data, which is explanatory text defining term data, from input natural language text data, and that searches for the comment data corresponding to input term data, the glossary search device comprising:
the glossary generation device according to claim 4, which generates comment data explaining term data from the text data;
input means for inputting the term data;
comment data search means for searching the comment data stored in the comment data storage means based on the term data; and
output means for outputting the search result.