JP2003099429A

JP2003099429A - Glossary generation device, glossary generation program and glossary retrieval device

Info

Publication number: JP2003099429A
Application number: JP2001289477A
Authority: JP
Inventors: Ichiro Yamada; 一郎山田; Masahiro Shibata; 正啓柴田; Nobuyuki Yagi; 伸行八木
Original assignee: Nippon Hoso Kyokai NHK; Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2001-09-21
Filing date: 2001-09-21
Publication date: 2003-04-04
Anticipated expiration: 2021-09-21
Also published as: JP4014130B2

Abstract

PROBLEM TO BE SOLVED: To provide a glossary generation device and a glossary generation program that extract, from natural language text data, a term and commentary data defining the term based on an attributive modification clause, and to provide a glossary retrieval device that retrieves the commentary data of a term from the term. SOLUTION: The glossary generation device 1 includes a modification relation analytic means 11 for generating modification relation information on the clauses of the text data, a glossary data extraction means 12 for extracting term data, a concept data extraction means 13 for extracting concept data showing the high-order concept of the term data, a learning database 16 in which learning data is registered that characterizes the cases where an attributive modification clause serves as an explanatory sentence defining a term, a modification data extraction means 15 for extracting, as modification data, the attributive modification clause defining the term data, and an explanation data generating means 17 for generating the commentary data.

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention relates to a glossary generation device and a glossary generation program that extract, from text data, terms and data explaining those terms, and to a glossary retrieval device that retrieves the commentary data of a term from that term.

[0002]

[Description of the Related Art] Conventionally, a method based on matching surface patterns, which represent surface-level features of a sentence, has been known as a way of extracting terms and commentary data defining those terms from natural language text data.

[0003] In this method, for example, with a term denoted α and the defining sentence of the term denoted β, the text data is matched against surface patterns such as "αとはβである" and "αはβ" (both roughly "α is β"), so that terms matching those surface patterns and the commentary data defining the terms can be extracted.
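The surface-pattern matching described above can be illustrated with a minimal sketch. The regular expression, the function name `match_surface_pattern`, and the sample sentence are illustrative assumptions rather than anything specified in the patent, and only the "αとはβである" pattern is covered.

```python
import re

# Surface pattern "αとはβである" from the text; the regular expression itself
# is an illustrative assumption about how such matching could be done.
TOWA_PATTERN = re.compile(r"^(?P<term>.+?)とは(?P<definition>.+?)である。?$")


def match_surface_pattern(sentence):
    """Return (term, definition) if the sentence fits "αとはβである", else None."""
    m = TOWA_PATTERN.match(sentence)
    if m is None:
        return None
    return m.group("term"), m.group("definition")


# Hypothetical sample sentence (not taken from the patent).
print(match_surface_pattern("電子政府とは行政手続きをインターネットで行える仕組みである。"))
```

A sentence that defines a term through an adnominal modifier clause contains no such pattern, so a matcher of this kind returns nothing for it.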

[0004]

[Problems to be Solved by the Invention] However, a survey of news manuscripts used in actual broadcast programs showed that surface patterns such as "αとはβである" and "αはβ" are used in only 7.7% of all news manuscripts. The amount of data is therefore insufficient for generating a glossary of the terms contained in the news manuscripts and the commentary data for those terms.

[0005] A term may also be defined by an adnominal modifier clause attached to it. For example, consider a news manuscript reading "... such as the realization of an e-government that allows procedures such as obtaining a resident record, or licenses and permits applied for by companies, to be completed over the Internet without visiting a government office ...". Here the adnominal modifier clause "allows procedures such as obtaining a resident record, or licenses and permits applied for by companies, to be completed over the Internet without visiting a government office" defines the term "e-government" (電子政府).

[0006] When a term is defined by an adnominal modifier clause in this way, the surface pattern matching of the conventional technique cannot extract the commentary data of the term.

[0007] The present invention has been made in view of the above technical problems. Its object is to provide a glossary generation device and a glossary generation program that extract, from natural language text data, terms and commentary data defining those terms based on adnominal modifier clauses, and a glossary retrieval device that retrieves the commentary data of a term from that term.

[0008]

[Means for Solving the Problems] The present invention was devised to achieve the above object. First, the glossary generation device according to claim 1 is configured as follows. It comprises: dependency analysis means for generating dependency information on the phrases (bunsetsu) of input natural language text data by performing morphological analysis and syntactic analysis on the text data; term data extraction means for extracting, from the text data, a character string that is a noun or noun phrase as term data; concept data extraction means for extracting, from the text data, concept data indicating a superordinate concept of the term data, based on the dependency information and specific paraphrase expressions that paraphrase term data; a learning database in which learning data is registered in advance, the learning data characterizing the cases in which an adnominal modifier clause serves as an explanatory sentence defining a term; modification data extraction means for judging, based on the dependency information and the learning data, whether an adnominal modifier clause attached to the term data constitutes a definition of the term data, and extracting an adnominal modifier clause judged to be a definition as modification data; and commentary data generation means for generating commentary data defining the term data based on the concept data and the modification data.

[0009] With this configuration, the glossary generation device uses the dependency analysis means to perform morphological analysis and syntactic analysis on the input natural language text data, thereby generating dependency information on the phrases of the text data, and uses the term data extraction means to extract, from the text data, a character string that is a noun or noun phrase as term data. A character string that is a noun or noun phrase is extracted by, for example, syntactic analysis. The concept data extraction means then extracts, from the text data, concept data indicating a superordinate concept of the extracted term data, based on the dependency information and specific paraphrase expressions that paraphrase term data.

[0010] Further, the glossary generation device uses the modification data extraction means to extract, as modification data, an adnominal modifier clause that defines the term data. The extraction is based on the dependency information and on the learning data of the learning database, in which learning data characterizing the cases where an adnominal modifier clause serves as an explanatory sentence defining a term is registered in advance. The commentary data generation means then generates commentary data defining the term data based on the concept data and the modification data.

[0011] In this way, the glossary generation device performs dependency analysis on the input natural language text data through morphological analysis and syntactic analysis, extracts from the text data the term data, concept data indicating a superordinate concept of the term data, and modification data defining the term data, and generates commentary data defining the term data based on the concept data and the modification data.

[0012] The glossary generation device according to claim 2 is the glossary generation device according to claim 1, further comprising a concept database in which term data and a plurality of concept data corresponding to the term data are registered, wherein the concept data extraction means determines, from the plurality of concept data, one concept data corresponding to the term data based on appearance frequency.

[0013] With this configuration, the concept data extraction means refers to the appearance frequency of the concept data in the concept database, in which term data and a plurality of concept data corresponding to the term data are registered, and determines one concept data corresponding to the term data. The glossary generation device thereby selects, from among the plurality of concept data corresponding to the term data in the concept database, the concept data with the highest appearance frequency as the concept data best suited to the term data.

[0014] The glossary generation device according to claim 3 is the glossary generation device according to claim 1 or claim 2, wherein the learning database registers, as learning data, the verb that directly modifies the term and the particle immediately preceding that verb, for cases in which an adnominal modifier clause serves as an explanatory sentence defining the term.

[0015] With this configuration, the learning database registers, as learning data, the verb that directly modifies a term and the particle immediately preceding that verb, for cases in which an adnominal modifier clause serves as an explanatory sentence defining the term. The glossary generation device can thereby identify, from among a plurality of adnominal modifier clauses, those that are explanatory sentences defining a term, by the combination of the verb directly modifying the term and the particle immediately preceding it.

[0016] The glossary generation device according to claim 4 is the glossary generation device according to any one of claims 1 to 3, further comprising commentary data storage means for storing term data and the commentary data corresponding to the term data.

[0017] With this configuration, the glossary generation device stores the generated term data and the corresponding commentary data in the commentary data storage means, thereby building in the commentary data storage means a database of the glossary extracted from the input natural language text data.

[0018] The glossary generation device according to claim 5 is the glossary generation device according to any one of claims 1 to 4, wherein the input text data is news manuscript data.

[0019] With this configuration, the glossary generation device receives news manuscript data from outside and generates commentary data not only for general terms but also for the difficult terms, new words, and coined words contained in news manuscripts.

[0020] The glossary generation program according to claim 6 causes a computer to function as the following means, in order to generate commentary data defining terms in input natural language text data from that text data and a learning database in which learning data is registered, the learning data characterizing the cases in which an adnominal modifier clause serves as an explanatory sentence defining a term.

[0021] Namely: dependency analysis means for generating dependency information between the phrases of the text data by performing morphological analysis and syntactic analysis on the text data; term data extraction means for extracting at least one of a noun and a noun phrase from the text data as term data; concept data extraction means for extracting, from the text data, concept data indicating a superordinate concept of the term data, based on the dependency information and specific paraphrase expressions; modification data extraction means for extracting, as modification data, an adnominal modifier clause that defines the term data, based on the dependency information and the learning data; and commentary data generation means for generating commentary data defining the term data based on the concept data and the modification data.

[0022] With this configuration, the glossary generation program uses the dependency analysis means to generate dependency information between the phrases of the text data by performing morphological analysis and syntactic analysis on the text data; uses the term data extraction means to extract, from the text data, a character string that is a noun or noun phrase as term data; uses the concept data extraction means to extract, from the text data, concept data indicating a superordinate concept of the term data, based on the dependency information and specific paraphrase expressions; uses the modification data extraction means to extract, as modification data, an adnominal modifier clause that defines the term data, based on the dependency information and the learning data; and uses the commentary data generation means to generate commentary data defining the term data based on the concept data and the modification data.

[0023] In this way, the glossary generation program performs dependency analysis on the input natural language text data through morphological analysis and syntactic analysis, extracts from the text data the term data, concept data indicating a superordinate concept of the term data, and modification data defining the term data, and generates commentary data defining the term data based on the concept data and the modification data.

[0024] The glossary retrieval device according to claim 7 comprises: the glossary generation device according to claim 4, which generates commentary data explaining term data from text data; input means for inputting term data; and commentary data retrieval means for retrieving, based on the term data, the commentary data stored in the commentary data storage means.

[0025] With this configuration, the glossary retrieval device uses the glossary generation device to generate commentary data explaining term data from the input text data, uses the input means to input term data, uses the commentary data retrieval means to search the commentary data stored in the commentary data storage means within the glossary generation device, and uses output means to output the search result. The glossary retrieval device thus adds a user interface function to the glossary generation device: in response to a user's query for term data, it retrieves the commentary data corresponding to that term data and outputs the search result.
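The storage and retrieval roles described above can be pictured with a minimal sketch. The class name `CommentaryStore`, its methods, and the use of an in-memory dictionary are assumptions made for illustration only; the patent does not specify how the storage or retrieval means are implemented.

```python
class CommentaryStore:
    """In-memory stand-in for the commentary data storage means (assumed design)."""

    def __init__(self):
        self._entries = {}  # term data -> commentary data

    def store(self, term, commentary):
        """Accumulate one term and its commentary data."""
        self._entries[term] = commentary

    def search(self, term):
        """Play the role of the retrieval means: return the commentary or None."""
        return self._entries.get(term)


store = CommentaryStore()
# Term and commentary taken from the "顔文字" (emoticon) example in the description.
store.store("顔文字", "パソコンや携帯電話でやり取りする電子メールについて記号などを使って感情を表す表現")
print(store.search("顔文字"))
```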

[0026]

[Embodiments of the Invention] Embodiments of the present invention are described in detail below with reference to the drawings.

(First embodiment: configuration of the glossary generation device) FIG. 1 is a block diagram showing the overall configuration of the glossary generation device according to the first embodiment of the present invention. As shown in FIG. 1, the glossary generation device 1 receives a news manuscript, which is text data, and generates the term data contained in the news manuscript and commentary data defining that term data.

[0027] The glossary generation device 1 comprises dependency analysis means 11, term data extraction means 12, concept data extraction means 13, a concept database 14, modification data extraction means 15, a learning database 16, commentary data generation means 17, and commentary data storage means 18. News manuscripts are input as text data from an external news manuscript database 3. In addition, an input device (not shown) is externally connected to the glossary generation device 1 for specifying the range of news manuscripts to be taken from the news manuscript database 3 (for example, only news manuscripts from the year 2000).

[0028] The dependency analysis means 11 takes the news manuscripts input according to the range specification entered from the input device (not shown), decomposes them into phrases (bunsetsu) by morphological analysis and syntactic analysis, generates the character string of each phrase and the dependency relations between phrases as dependency information, and passes this information to the term data extraction means 12, the concept data extraction means 13, and the modification data extraction means 15.

[0029] The dependency analysis means 11 identifies morphemes, the smallest linguistic units that carry meaning, in the news manuscript by morphological analysis, and identifies phrases such as noun phrases and verb phrases and their dependency relations by syntactic analysis. The morphological analysis and syntactic analysis can be implemented with known techniques.

[0030] Based on the dependency information generated by the dependency analysis means 11, the term data extraction means 12 extracts phrases that constitute terms from the phrases contained in the dependency information and outputs them as term data. The term data is passed to the concept data extraction means 13, the modification data extraction means 15, and the commentary data generation means 17.

[0031] A phrase is extracted as a term when it is a noun or a noun phrase, in which case the phrase is recognized as a term. If no phrase qualifies as a term, the term data is output as a null string.

[0032] Alternatively, character strings to be treated as terms can be extracted from the text data of the news manuscripts input to the glossary generation device 1 using surface-level information. For example, by treating a noun phrase enclosed in corner brackets (「」) or the like in the text data of a news manuscript as an important part, the term data extraction means 12 can extract the bracketed character string as a term.
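The bracket-based extraction can be sketched as follows. This is only an illustration under stated assumptions: the function name `extract_bracketed_terms` and the sample sentence are hypothetical, nested brackets are not handled, and the sketch does not check that the bracketed string is actually a noun phrase.

```python
import re

# Text enclosed in corner brackets 「…」; nested brackets are not handled.
BRACKETED = re.compile(r"「([^「」]+)」")


def extract_bracketed_terms(text):
    """Return the strings enclosed in 「」, in order of appearance."""
    return BRACKETED.findall(text)


# Hypothetical sample sentence (not taken from the patent).
print(extract_bracketed_terms("政府は「電子政府」の実現を目指し、「顔文字」も話題になった。"))
```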

[0033] Based on the dependency information generated by the dependency analysis means 11 and the term data extracted by the term data extraction means 12, the concept data extraction means 13 extracts a phrase indicating a superordinate concept of the term data as concept data.

[0034] This concept data is registered in the concept database 14 together with the term data. If the same term data is already registered, a plurality of concept data can be registered for it; if the concept data corresponding to that term data is also the same, the appearance count of that concept data is updated (incremented).

[0035] The concept data extraction means 13 then reads the concept data corresponding to the term data from the concept database 14 and passes it to the commentary data generation means 17. If a plurality of concept data correspond to the term data, the concept data with the highest appearance count among them is passed to the commentary data generation means 17.
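The registration and selection behavior of the concept database can be sketched as below. The class and method names are hypothetical, the in-memory `Counter` stands in for whatever storage the device actually uses, and the handling of ties is an assumption, since the text does not say how equal counts are resolved.

```python
from collections import Counter, defaultdict


class ConceptDatabase:
    """Illustrative stand-in for the concept database (assumed design)."""

    def __init__(self):
        # term data -> Counter of concept data appearance counts (one-to-many)
        self._counts = defaultdict(Counter)

    def register(self, term, concept):
        """Register a concept for a term, incrementing its appearance count."""
        self._counts[term][concept] += 1

    def best_concept(self, term):
        """Return the concept with the highest appearance count, or None."""
        counts = self._counts.get(term)
        if not counts:
            return None
        # Ties fall back to Counter's ordering; the source does not specify this.
        return counts.most_common(1)[0][0]


db = ConceptDatabase()
db.register("顔文字", "表現")
db.register("顔文字", "記号")  # hypothetical second concept for illustration
db.register("顔文字", "表現")
print(db.best_concept("顔文字"))
```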

[0036] Concept data is a paraphrase of the term data by means of a specific paraphrase expression (specific expression) such as "という" ("called") or "と呼ばれる" ("referred to as"). The dependency information generated by the dependency analysis means 11 is used as the basis for judging that a paraphrase has occurred.

[0037] That is, let the arrow (→) in "phrase A → phrase B" denote a dependency relation in which phrase A is the dependent and phrase B is the head. When the input character string has the dependency relation "(term data) → という → (noun (phrase))" or "(term data) → と呼ばれる → (noun (phrase))", the noun (phrase) is judged to indicate a superordinate concept of the term data. FIG. 3 shows the result of extracting concept data by means of these specific expressions from actual news manuscripts. For example, in the character string "顔文字という表現" ("the expression called emoticon"), the term data is "顔文字" ("emoticon") (D1) and the concept data is "表現" ("expression") (D2).
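The dependency-chain check can be sketched as follows. The input format is an assumption: dependency information is represented as a list of phrase strings plus a parallel list giving, for each phrase, the index of the phrase it depends on (or -1 for none). A real dependency analyzer would produce richer output, and this sketch omits the check that the final phrase is a noun or noun phrase.

```python
# Specific paraphrase expressions named in the text.
PARAPHRASE_EXPRESSIONS = ("という", "と呼ばれる")


def extract_concepts(phrases, heads, term):
    """Return phrases reached by the chain: term -> paraphrase expression -> phrase.

    phrases: list of phrase strings (assumed format).
    heads:   heads[i] is the index of the phrase that phrases[i] depends on, or -1.
    """
    concepts = []
    for i, phrase in enumerate(phrases):
        if phrase != term:
            continue
        mid = heads[i]
        if mid < 0 or phrases[mid] not in PARAPHRASE_EXPRESSIONS:
            continue
        target = heads[mid]
        if target >= 0:
            concepts.append(phrases[target])
    return concepts


# "顔文字という表現": 顔文字 -> という -> 表現
print(extract_concepts(["顔文字", "という", "表現"], [1, 2, -1], "顔文字"))
```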

[0038] The concept database 14 is a database in which term data and concept data are registered in association with each other; the concept data extraction means 13 registers and refers to the term data and concept data. Term data and concept data have a one-to-many relationship: a plurality of concept data can be registered for one term data. Each concept data is assigned an appearance count, which the concept data extraction means 13 updates and refers to. The concept database 14 is implemented with storage means such as a hard disk.

[0039] Based on the dependency information generated by the dependency analysis means 11, the term data extracted by the term data extraction means 12, and the learning data registered in the learning database 16, the modification data extraction means 15 extracts an adnominal modifier clause that modifies the term data, judges whether that clause defines the term data, and, depending on the result, passes the clause to the commentary data generation means 17 as modification data.

[0040] The modification data extraction means 15 judges whether an adnominal modifier clause defines the term data on the basis of the learning database 16, in which pairs consisting of the verb in the adnominal modifier clause that directly modifies the term data and the particle immediately preceding that verb are registered as learning data. For this learning data, verbs similar to each verb are registered together, and the data is organized on the learning database 16 as a known thesaurus, that is, a list of synonyms and near-synonyms in which similar words are classified into a tree structure.

[0041] Let "v" be the verb in the adnominal modifier clause attached to the term data. In the thesaurus, let "vg1" be the set of verbs belonging to the same group as the verb "v", and "vg2" the set of verbs belonging to the parent node of the verb "v". Let "p" be the particle immediately preceding the verb "v". Further, let "w_a" and "w_b" be the weighting coefficients for the similarity to the verb "v" of the verb sets "vg1" and "vg2", respectively, let "n(vg, p)" be the number of times a verb set "vg" and the particle "p" appear in the learning data, and let "e(vg, p)" be the expected value of that count. The index value "weight(v, p)", which is used to judge whether an adnominal modifier clause is a so-called term definition clause that modifies the term data, is then defined by equation (1).

[0042]

[Equation 1]

[0043] In equation (1), the first term is set to 0 when n(vg1, p) < e(vg1, p), and the second term is set to 0 when n(vg2, p) < e(vg2, p). When the index value weight(v, p) is larger than a preset threshold, the adnominal modifier clause is judged to be an explanatory sentence defining the term data.
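The zeroing rule and the threshold test can be sketched in code, with one important caveat: equation (1) itself is not reproduced in this text, so the per-term score used below (the logarithm of n/e) and the way w_a and w_b multiply the two terms are purely illustrative assumptions, not the patent's formula. The score is passed in as a parameter so that a different form can be substituted.

```python
import math


def log_ratio(n, e):
    """ASSUMED per-term score; equation (1) is not available in the text.

    Requires e > 0.
    """
    return math.log(n / e)


def weight(n1, e1, n2, e2, w_a, w_b, term_score=log_ratio):
    """Index value for (v, p) under the assumed two-term weighted form.

    n1, e1: count and expected value for (vg1, p).
    n2, e2: count and expected value for (vg2, p).
    """
    # Rule stated in the text: a term is set to 0 when n < e.
    first = 0.0 if n1 < e1 else w_a * term_score(n1, e1)
    second = 0.0 if n2 < e2 else w_b * term_score(n2, e2)
    return first + second


def is_definition_clause(n1, e1, n2, e2, w_a, w_b, threshold):
    """Judge the clause to be a term definition clause when weight exceeds threshold."""
    return weight(n1, e1, n2, e2, w_a, w_b) > threshold


# Hypothetical counts, combined with w_a = 0.67, w_b = 0.33 and threshold 1.0.
print(is_definition_clause(100, 10, 100, 10, 0.67, 0.33, 1.0))
```

Keeping the score function separate from the zeroing and threshold logic means the parts that the text does state are not tied to the assumed form of the missing equation.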

[0044] As an experimental result, FIG. 4 shows part of the result of extracting term data from news manuscripts of June 2001. In this experiment, 15,295 items of learning data were provided, and term definition clauses were extracted with w_a = 0.67, w_b = 0.33, and a threshold of 1.0.

[0045] As explained for the modification data extraction means 15, the learning database 16 is a database in which pairs consisting of the verb in an adnominal modifier clause that directly modifies the term data and the particle immediately preceding that verb are registered as learning data. It is implemented with storage means such as a hard disk.

[0046] The commentary data generation means 17 generates the term data and the commentary data defining that term data, based on the term data extracted by the term data extraction means 12, the concept data extracted by the concept data extraction means 13, and the modification data extracted by the modification data extraction means 15.

[0047] As shown in FIG. 4, the modification data extracted by the modification data extraction means 15 described above ends with a verb in its adnominal form, so it is not suitable as a defining sentence on its own. The commentary data generation means 17 therefore generates the commentary data as "modification data (adnominal modifier clause) + concept data (superordinate concept)". For example, suppose that in FIG. 3 the term data is "顔文字" ("emoticon") (D1) and the concept data is "表現" ("expression") (D2), and that in FIG. 4 the modification data corresponding to "顔文字" is "パソコンや携帯電話でやり取りする電子メールについて記号などを使って感情を表す" ("expresses emotions using symbols and the like in e-mail exchanged by personal computer or mobile phone") (D3). The commentary data for "顔文字" is then obtained by concatenating the modification data and the concept data: "パソコンや携帯電話でやり取りする電子メールについて記号などを使って感情を表す表現" ("an expression that expresses emotions using symbols and the like in e-mail exchanged by personal computer or mobile phone").

[0048] When modification data exists for the term data but concept data does not, the commentary data generation means 17 determines whether the final morpheme obtained by morphological analysis of the term data itself is concept data. This determination can be made, for example, by registering in advance in the concept database 14, as concept data, final morphemes that tend to serve as superordinate concepts.

[0049] If this final morpheme is concept data, commentary data is generated in the form "modification data (adnominal modifier clause) + final morpheme of the term data". For example, if in FIG. 4 no concept data exists for the term data "司法制度改革推進法" (Judicial System Reform Promotion Law) (D4), its final morpheme "法" (law) is used as the concept data and appended to the adnominal modifier clause of the term data, "司法制度改革を推進するための体制を定める" (establishes a framework for promoting judicial system reform) (D5). This generates the commentary data "司法制度改革を推進するための体制を定める法" (a law that establishes a framework for promoting judicial system reform).

[0050] When modification data exists for the term data but no concept data exists and the final morpheme cannot be determined to be concept data either, the commentary data generation means 17 generates commentary data in the form "modification data (adnominal modifier clause) + もの（こと）" (thing). For example, if in FIG. 4 no concept data exists for the term data "レッドデータブック" (Red Data Book) (D6), appending "もの（こと）" to its adnominal modifier clause "絶滅の恐れのある野鳥を記録した" (recorded wild birds in danger of extinction) (D7) generates the commentary data "絶滅の恐れのある野鳥を記録したもの（こと）" (something that records wild birds in danger of extinction). The generated commentary data is stored in the commentary data storage means 18 paired with the term data.

[0051] The commentary data storage means 18 stores the term data and commentary data output from the commentary data generation means 17 as pairs, and is implemented on a hard disk or the like. The data stored here can be used as a database for glossary search, in which commentary data can be looked up from term data.

[0052] The configuration of the glossary generation device 1 according to the present invention has been described above based on one embodiment, but the present invention is not limited to it. For example, the commentary data storage means 18 may be omitted, with the commentary data generation means 17 outputting the term data and commentary data externally as data and the storage means provided externally.

[0053] Alternatively, the concept database 14 may be omitted, with the concept data extraction means 13 extracting concept data for each item of dependency information, based on the dependency information input from the dependency analysis means 11 and the term data input from the term data extraction means 12.

[0054] The input data of the glossary generation device 1 is not limited to news manuscripts and may be any general text data. For example, by inputting manuscripts as text data from a database that stores magazine manuscripts, commentary data for the term data of the latest magazines can be output.

[0055] (First Embodiment: Operation of the Glossary Generation Device) Next, the operation of the glossary generation device 1 is described with reference to FIG. 1 and FIGS. 5 to 8. FIG. 5 is a flowchart showing the overall operation of the glossary generation device 1. FIG. 6 is a flowchart showing the operation of the concept data extraction means 13. FIG. 7 is a flowchart showing the operation of the modification data extraction means 15. FIG. 8 is a flowchart showing the operation of the commentary data generation means 17. First, the overall operation of the glossary generation device 1 is described based on the flowchart of FIG. 5.

[0056] First, a news manuscript input from outside is decomposed into clauses (bunsetsu) by morphological analysis and syntactic analysis, and the character string of each clause and the dependency relations between clauses are generated as dependency information (step a1).

[0057] Next, based on the dependency information, each clause contained in it is checked to determine whether it is a noun or a noun phrase, and clauses that qualify as terms are extracted as term data (step a2).

[0058] It is then determined whether any term data was extracted in step a2 (step a3). If there is no term data (No), the operation ends. If there is term data (Yes), the process proceeds to step a4.

[0059] Then, based on the dependency information and the term data, a clause indicating the superordinate concept of the term data is extracted as concept data by means of specific expressions such as "という" ("called") and "と呼ばれる" ("referred to as") (step a4).

[0060] In addition, based on the dependency information, the term data, and the learning data registered in the learning database, it is determined whether an adnominal modifier clause modifying the term data defines the term data, and, based on the result, the adnominal modifier clause is extracted as modification data (step a5).

[0061] Then, commentary data defining the term data is generated based on the concept data and the modification data (step a6).

[0062] Next, the operation of the concept data extraction means 13 is described based on the flowchart of FIG. 6. This flowchart details step a4 in FIG. 5.

[0063] First, based on the input dependency information, it is determined whether any clause of the dependency information contains a specific expression such as "という" or "と呼ばれる" (step b1). If there is no specific expression (No), it is determined whether the input term data is already registered in the concept database 14 (step b2). If it is not registered (No), processing ends on the assumption that no concept data corresponding to the term data exists. If concept data corresponding to the term data is registered in the concept database 14 (Yes), the process proceeds to step b5.

[0064] If there is a specific expression in step b1 (Yes), the concept data that is the superordinate concept of the term data is extracted from the dependency relations based on that specific expression (step b3).

[0065] The term data and the concept data extracted in step b3 are then registered in the concept database 14 (step b4). If the same term data is already registered, multiple items of concept data may be registered for it. If the concept data corresponding to that term data is also the same, the occurrence count of that concept data is updated (incremented).

[0066] Then, of the concept data corresponding to the term data, the concept data with the highest occurrence frequency, based on the occurrence counts, is output as the concept data of the term data (step b5).
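The flow of steps b1 to b5 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the implementation in the specification: the names `ConceptDatabase` and `extract_concept` are hypothetical, and the extraction of a concept from a specific expression (step b3) is assumed to have been done by the caller, who passes the result in as `concept_from_expression` (or `None` when step b1 finds no specific expression).

```python
from collections import Counter, defaultdict
from typing import Optional


class ConceptDatabase:
    """Hypothetical in-memory stand-in for the concept database 14."""

    def __init__(self):
        # term -> Counter mapping each concept to its occurrence count
        self._counts = defaultdict(Counter)

    def register(self, term, concept):
        """Step b4: add the concept, or increment its count if present."""
        self._counts[term][concept] += 1

    def has_term(self, term):
        """Step b2: is any concept data registered for this term?"""
        return term in self._counts

    def most_frequent(self, term) -> Optional[str]:
        """Step b5: the concept with the highest occurrence count."""
        if term not in self._counts:
            return None
        return self._counts[term].most_common(1)[0][0]


def extract_concept(db, term, concept_from_expression=None) -> Optional[str]:
    if concept_from_expression is not None:       # step b1: Yes (b3 done by caller)
        db.register(term, concept_from_expression)  # step b4
    elif not db.has_term(term):                   # step b2: No
        return None                               # no concept data for this term
    return db.most_frequent(term)                 # step b5


db = ConceptDatabase()
extract_concept(db, "顔文字", "表現")
extract_concept(db, "顔文字", "記号")
result = extract_concept(db, "顔文字", "表現")
```

Keeping one counter per term means a term that appears with several superordinate concepts still yields a single, most frequent concept at output time.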

[0067] If the glossary generation device 1 has no concept database 14, the concept data extraction operation can be simplified as follows. If step b1 finds no specific expression (No), processing ends on the assumption that no concept data corresponding to the term data exists. If there is a specific expression (Yes), then in step b3 the concept data that is the superordinate concept of the term data is extracted from the dependency relations based on that specific expression and output, and processing ends.

[0068] Next, the operation of the modification data extraction means 15 is described based on the flowchart of FIG. 7. This flowchart details step a5 in FIG. 5.

[0069] First, based on the input dependency information and the term data, it is determined whether any clause of the dependency information is an adnominal modifier clause modifying the term data (step c1). If there is no such adnominal modifier clause (No), processing ends on the assumption that there is no modification data for the term data.

[0070] If in step c1 there is an adnominal modifier clause modifying the term data (Yes), an index value for determining whether this adnominal modifier clause is an explanatory sentence defining the term data is calculated based on equation (1) (step c2).

[0071] The index value is then compared with a preset threshold (step c3). If the index value is less than or equal to the threshold (No), processing ends on the assumption that this adnominal modifier clause is not modification data defining the term data. If the index value is greater than the threshold (Yes), the adnominal modifier clause is determined to be modification data for the term data and is output as modification data (step c4).
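Steps c1 to c4 amount to a threshold test on the index value. The sketch below illustrates only that control flow. The function name `extract_modification_data` and its parameters are hypothetical, and the index value of equation (1) is not reimplemented here: it is assumed to be supplied by the caller as a function `weight(verb, particle)`. The default threshold of 1.0 is the value reported for the experiment in this description.

```python
from typing import Callable, Optional


def extract_modification_data(
    clause: Optional[str],
    verb: str,
    particle: str,
    weight: Callable[[str, str], float],
    threshold: float = 1.0,
) -> Optional[str]:
    """Return the adnominal modifier clause if it is judged to define the term.

    clause   -- adnominal modifier clause modifying the term, or None
    verb     -- verb of the clause that directly modifies the term
    particle -- particle immediately preceding that verb
    weight   -- caller-supplied index value function (assumed to follow
                equation (1); not implemented in this sketch)
    """
    if clause is None:                 # step c1: No modifier clause
        return None
    score = weight(verb, particle)     # step c2: compute the index value
    if score <= threshold:             # step c3: not above the threshold
        return None
    return clause                      # step c4: output as modification data


# Toy index values for illustration only; these numbers are made up.
toy_weights = {("表す", "を"): 1.8, ("行く", "に"): 0.2}


def toy_weight(verb, particle):
    return toy_weights.get((verb, particle), 0.0)


accepted = extract_modification_data("記号などを使って感情を表す", "表す", "を", toy_weight)
rejected = extract_modification_data("昨日学校に行った", "行く", "に", toy_weight)
```

Note that an index value exactly equal to the threshold is rejected, matching the "less than or equal to" branch of step c3.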

[0072] Next, the operation of the commentary data generation means 17 is described based on the flowchart of FIG. 8. This flowchart details step a6 in FIG. 5.

[0073] First, it is determined whether there is modification data, that is, an explanatory sentence defining the term data (step d1). If there is no modification data (No), it is further determined whether there is concept data, the superordinate concept of the term data (step d2). If there is no concept data (No), the commentary data variable is set to "none" (null data) (step d3), and the process proceeds to step d10. If there is concept data in step d2 (Yes), the commentary data variable is set to the character string "concept data" (step d4), and the process proceeds to step d10.

[0074] If in step d1 there is modification data, that is, an explanatory sentence defining the term data (Yes), it is further determined whether there is concept data, the superordinate concept of the term data (step d5). If there is concept data (Yes), the commentary data variable is set to the character string "modification data + concept data" (step d6), and the process proceeds to step d10.

[0075] If in step d5 there is no concept data for the term data (No), it is determined whether the final morpheme of the term data is concept data (step d7). If the final morpheme is concept data (Yes), the commentary data variable is set to the character string "modification data + final morpheme" (step d8), and the process proceeds to step d10.

[0076] If in step d7 the final morpheme of the term data is not concept data (No), the commentary data variable is set to the character string "modification data + もの（こと）" (step d9), and the process proceeds to step d10.

[0077] Finally, the term data and the commentary data set by the above processing are output (step d10). Through the above operation, the glossary generation device 1 can generate, from news manuscripts, term data and commentary data defining that term data.
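The branching of steps d1 to d10 can be sketched as a single function. This is a minimal illustration under stated assumptions: the function name `generate_commentary` and its parameters are hypothetical, the final morpheme of the term is assumed to be supplied by the caller (morphological analysis is outside the sketch), and the check of whether a final morpheme counts as concept data is passed in as `is_concept`. The example strings are the ones given in this description.

```python
from typing import Callable, Optional

GENERIC_NOUN = "もの（こと）"  # fallback noun appended in step d9


def generate_commentary(
    modification: Optional[str],
    concept: Optional[str],
    final_morpheme: Optional[str] = None,
    is_concept: Callable[[str], bool] = lambda morpheme: False,
) -> Optional[str]:
    """Build commentary data for one term following steps d1 to d9."""
    if modification is None:                 # step d1: No modification data
        return concept                       # steps d2-d4: concept alone, or None
    if concept is not None:                  # step d5: Yes
        return modification + concept        # step d6
    if final_morpheme is not None and is_concept(final_morpheme):  # step d7
        return modification + final_morpheme  # step d8
    return modification + GENERIC_NOUN       # step d9


# Hypothetical set of final morphemes registered as concept data.
known_final_concepts = {"法"}


def in_known(morpheme):
    return morpheme in known_final_concepts


emoticon = generate_commentary(
    "パソコンや携帯電話でやり取りする電子メールについて記号などを使って感情を表す", "表現")
law = generate_commentary(
    "司法制度改革を推進するための体制を定める", None, "法", in_known)
red_data_book = generate_commentary(
    "絶滅の恐れのある野鳥を記録した", None, "ブック", in_known)
```

Returning `None` corresponds to the "none" (null data) case of step d3; the caller would then output the term together with whatever value was produced (step d10).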

[0078] Each function of the glossary generation device 1 can also be implemented as a program on a computer, and the individual function programs can be combined and run as a glossary generation program.

[0079] (Second Embodiment: Glossary Search Device) FIG. 2 is a block diagram showing the overall configuration of the glossary search device according to the second embodiment of the present invention. As shown in FIG. 2, the glossary search device 2 takes news manuscripts, which are text data, as input, generates the term data contained in them together with commentary data defining that term data, and outputs the commentary data for term data that an operator queries.

[0080] The glossary search device 2 is configured by adding commentary data search means 21, input means 22, and output means 23 to the glossary generation device 1 shown in FIG. 1. Components other than the commentary data search means 21, input means 22, and output means 23 are given the same reference numerals as in FIG. 1, and their description is omitted. In addition, an input device 4 such as a keyboard and an output device 5 such as a CRT are externally connected to the glossary search device 2.

[0081] Based on term data, which is text data input from the input means 22, the commentary data search means 21 searches the commentary data storage means 18 for the commentary data corresponding to that term data and outputs the retrieved commentary data to the output means 23 as text data.
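The lookup performed by the commentary data search means 21 can be pictured as a keyed search over stored (term, commentary) pairs. The sketch below is only an illustration: the class name `CommentaryStore` and its methods are hypothetical, and an in-memory dictionary stands in for the commentary data storage means 18.

```python
from typing import Optional


class CommentaryStore:
    """Hypothetical in-memory stand-in for the commentary data storage means 18."""

    def __init__(self):
        self._entries = {}  # term data -> commentary data

    def store(self, term, commentary):
        """Store a term paired with its commentary data."""
        self._entries[term] = commentary

    def search(self, term) -> Optional[str]:
        """Return the commentary for the queried term, or None if absent."""
        return self._entries.get(term)


store = CommentaryStore()
store.store("司法制度改革推進法", "司法制度改革を推進するための体制を定める法")
hit = store.search("司法制度改革推進法")
miss = store.search("レッドデータブック")
```

Storing a term again overwrites its earlier commentary in this sketch, so a query returns the most recently generated entry.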

[0082] The input means 22 receives and analyzes input data entered by the operator from the input device 4, such as a keyboard or mouse. When term data is entered as the input data, the input means 22 outputs this term data to the commentary data search means 21.

[0083] When the input data specifies a range of news manuscripts, for example "news manuscripts of 2000", the range specification data is passed to the glossary generation device 1. The glossary generation device 1 then reads the news manuscripts within that range from the news manuscript database 3, extracts term data, generates commentary data defining that term data, and outputs the term data and commentary data to the output means 23 as text data.

[0084] The output means 23 outputs, as output data to the output device 5 such as a CRT, the term data and commentary data generated by the glossary generation device 1 and the commentary data output by the commentary data search means 21 after a search.

[0085] FIG. 9 shows a display example 50 of the output device 5 connected to the glossary search device 2. The screen in FIG. 9 is an example screen of an application that explains terms in news manuscripts: the operator enters term data, and the corresponding commentary data is displayed.

[0086] For example, when the operator enters term data in the term input field 50a via the input device 4 and presses the RETURN key, or clicks the search button 50b with the mouse, the glossary search device 2 searches the commentary data storage means 18 for the commentary data corresponding to that term data and displays it in the commentary display field 50d. The clear button 50c clears the contents entered in the term input field 50a in a single operation, and the exit button 50e closes the application that explains terms in news manuscripts.

[0087] In this way, the glossary search device 2 generates term data and commentary data defining that term data from news manuscripts, and has an interface that outputs the commentary data for term data queried by the operator, so the latest explanation of a term can always be retrieved quickly.

[0088] The glossary search device 2 according to the present invention has been described above based on one embodiment, but the present invention is not limited to it. For example, the device may have a network interface so that an operator at a remote location can search for the commentary data of term data via a network.

[0089]

[Effects of the Invention] As described above, the glossary generation device, glossary generation program, and glossary search device according to the present invention provide the following advantageous effects.

[0090] According to the invention of claim 1, the glossary generation device performs dependency analysis on input natural language text data by morphological analysis and syntactic analysis, and thereby extracts from the text data the term data, concept data indicating the superordinate concept of the term data, and modification data defining the term data. It can then generate commentary data defining the term data based on the concept data and the modification data.

[0091] As a result, the term data that forms the basis of a glossary can be extracted from input natural language text data, and the commentary data that serves as the definition sentence of that term data can be extracted at the same time. Glossary data can therefore be extracted at high speed, without manual work, even from a large amount of text data.

[0092] According to the invention of claim 2, the glossary generation device can take, from the multiple items of concept data corresponding to the term data in the concept database, the concept data with the highest occurrence frequency as the concept data of that term data. When this concept data is appended to the end of the commentary data, the commentary data for term data sharing the same superordinate concept ends with the same concept data, so consistent glossary data can be generated.

[0093] According to the invention of claim 3, the glossary generation device identifies, from among multiple adnominal modifier clauses, the adnominal modifier clause that is an explanatory sentence defining the term by the combination of the verb directly modifying the term and the particle immediately preceding that verb. Because the meaning of the sentence need not be considered, whether an adnominal modifier clause defines the term can be determined with a simple criterion and at high speed.

[0094] According to the invention of claim 4, the glossary generation device stores the generated term data and the corresponding commentary data in the commentary data storage means, so that a glossary of the input natural language text data is built in the commentary data storage means. A glossary database can thus be built easily, and a term data search system can also be built on this database.

[0095] According to the invention of claim 5, the text data input to the glossary generation device is news manuscript data. News manuscripts in particular often accompany difficult terms, the many new words created each year, and coined words combining existing words with explanations that viewers can easily understand, and these explanations can be used to generate glossary data effectively.

[0096] According to the invention of claim 6, the glossary generation program performs dependency analysis on input natural language text data by morphological analysis and syntactic analysis, and thereby extracts from the text data the term data, concept data indicating the superordinate concept of the term data, and modification data modifying the term data. It can then generate commentary data defining the term data based on the concept data and the modification data.

[0097] As a result, the term data that forms the basis of a glossary can be extracted from input natural language text data, and the commentary data that serves as the definition sentence of that term data can be extracted at the same time. Glossary data can therefore be extracted at high speed, without manual work, even from a large amount of text data.

[0098] According to the invention of claim 7, the glossary search device provides a glossary generation device, which generates commentary data explaining term data from input text data, with a user interface that, in response to a user's query for term data, searches for the commentary data corresponding to that term data and outputs the search result. The user can therefore always look up the meaning of the latest term data easily.

[Brief description of drawings]

FIG. 1 is a block diagram showing the overall configuration of the glossary generation device according to an embodiment of the present invention.

FIG. 2 is a block diagram showing the overall configuration of the glossary search device according to an embodiment of the present invention.

FIG. 3 is a diagram showing part of the result of extracting concept data for term data using the specific expressions of the present invention.

FIG. 4 is a diagram showing part of the result of extracting adnominal modifier clauses that define term data according to the present invention.

FIG. 5 is a flowchart showing the operation of the glossary generation device according to an embodiment of the present invention.

FIG. 6 is a flowchart showing the operation of extracting concept data according to an embodiment of the present invention.

FIG. 7 is a flowchart showing the operation of extracting modification data according to an embodiment of the present invention.

FIG. 8 is a flowchart showing the operation of generating commentary data according to an embodiment of the present invention.

FIG. 9 is a diagram showing an example of the output screen of the glossary search device according to an embodiment of the present invention.

[Explanation of symbols]

1: glossary generation device
2: glossary search device
11: dependency analysis means
12: term data extraction means
13: concept data extraction means
14: concept database
15: modification data extraction means
16: learning database
17: commentary data generation means
18: commentary data storage means
21: commentary data search means
22: input means
23: output means

Continuation of front page

(72) Inventor: Nobuyuki Yagi, c/o Japan Broadcasting Corporation Broadcast Technology Research Laboratories, 1-10-11 Kinuta, Setagaya-ku, Tokyo

F-terms (reference): 5B009 MB03 ME25 MF02; 5B075 ND03 ND22 NK02 NR02 NR20 UU01; 5B091 AA15 CA05 CC02 CC04 CC16

Claims

[Claims]

1. A glossary generation device that generates, from input natural language text data, commentary data defining term data, the device comprising: dependency analysis means for performing morphological analysis and syntactic analysis on the text data to generate dependency information on the clauses of the text data; term data extraction means for extracting, from the text data, character strings that are nouns or noun phrases as term data; concept data extraction means for extracting, from the text data, concept data indicating a superordinate concept of the term data, based on the dependency information and a specific paraphrasing expression that paraphrases the term data; a learning database in which learning data that characterizes cases where an adnominal modifier clause is an explanatory sentence defining a term is registered in advance; modification data extraction means for determining, based on the dependency information and the learning data, whether an adnominal modifier clause modifying the term data defines the term data, and extracting an adnominal modifier clause determined to be a definition as modification data; and commentary data generation means for generating commentary data defining the term data based on the concept data and the modification data.

2. The glossary generation device according to claim 1, further comprising a concept database in which the term data and a plurality of concept data corresponding to the term data are registered, wherein the concept data extraction means determines one concept data corresponding to the term data from among the plurality of concept data, based on their frequency of occurrence.
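The frequency-based selection in claim 2 can be sketched as follows. This is only an illustration: it assumes that "based on frequency of occurrence" means choosing the most frequent candidate, and the function name `select_concept` and the sample database contents are hypothetical.

```python
from collections import Counter
from typing import Optional, Sequence


def select_concept(candidates: Sequence[str]) -> Optional[str]:
    """Return the most frequently occurring concept, or None if there are none.

    Ties are resolved in favor of the candidate encountered first,
    which is the documented ordering of Counter.most_common.
    """
    if not candidates:
        return None
    counts = Counter(candidates)
    return counts.most_common(1)[0][0]


# Hypothetical concept database: each term maps to every concept
# candidate that was extracted for it from the text data.
concept_db = {"NHK": ["broadcaster", "organization", "broadcaster"]}
best = select_concept(concept_db["NHK"])
```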

3. The glossary generation device according to claim 1, wherein the learning database registers, as learning data, the verb that directly relates to the term and the particle immediately preceding that verb, for cases in which an adnominal modifier clause is an explanatory sentence defining the term.
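One way to picture the learning data of claim 3 is as a set of (particle, verb) pairs against which each candidate modifier clause is checked. The sketch below assumes an exact-match lookup, which the claim does not prescribe; the names `ModifierClause`, `LEARNING_DATA`, and `is_definition`, and the registered pairs, are hypothetical.

```python
from typing import NamedTuple, Set, Tuple


class ModifierClause(NamedTuple):
    text: str      # surface string of the adnominal modifier clause
    particle: str  # particle immediately preceding the verb
    verb: str      # verb that directly relates to the term


# Hypothetical learning data: (particle, verb) pairs assumed to have been
# observed in clauses that define a term.
LEARNING_DATA: Set[Tuple[str, str]] = {
    ("を", "製造する"),  # "manufactures (object)"
    ("で", "開かれる"),  # "is held at (place)"
}


def is_definition(clause: ModifierClause,
                  learning_data: Set[Tuple[str, str]] = LEARNING_DATA) -> bool:
    """Judge a clause to be a definition if its pair is registered."""
    return (clause.particle, clause.verb) in learning_data


defining = ModifierClause("自動車を製造する", "を", "製造する")
incidental = ModifierClause("昨日発表した", "", "発表する")
```

Here `is_definition(defining)` is true and `is_definition(incidental)` is false, because only the first pair is registered.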

4. The glossary generation device according to any one of claims 1 to 3, further comprising commentary data storage means for storing the term data and the commentary data corresponding to the term data.

5. The glossary generation device according to claim 1, wherein the text data is data of a news manuscript.

6. A glossary generation program for generating commentary data that defines a term in input natural language text data, using the text data and a learning database in which learning data is registered, the learning data being features observed when an adnominal modifier clause is an explanatory sentence defining a term, the program causing a computer to function as: dependency analysis means for performing morphological analysis and syntactic analysis on the text data to generate dependency information between clauses of the text data; term data extraction means for extracting, from the text data, a character string that is a noun or a noun phrase as term data; concept data extraction means for extracting, from the text data, concept data indicating a superordinate concept of the term data, based on the dependency information and a specific paraphrasing expression; modification data extraction means for determining, based on the dependency information and the learning data, whether an adnominal modifier clause related to the term data defines the term data, and extracting the adnominal modifier clause determined to be a definition as modification data; and commentary data generating means for generating, based on the concept data and the modification data, commentary data that defines the term data.

7. A glossary search device for generating, from input natural language text data, commentary data that is an explanatory sentence defining term data, and for retrieving the commentary data corresponding to input term data, the device comprising: the glossary generation device according to claim 4, which generates commentary data explaining term data from the text data; input means for inputting the term data; commentary data search means for searching, based on the input term data, the commentary data stored in the commentary data storage means; and output means for outputting the search result.
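The storage, search, and output roles named in claims 4 and 7 can be sketched with an in-memory dictionary. This is a simplified illustration under the assumption of exact-match lookup by term; the class `GlossaryStore`, the function `lookup`, and the sample entry are hypothetical and do not come from the patent.

```python
from typing import Dict, List


class GlossaryStore:
    """Stand-in for the commentary data storage means (in-memory only)."""

    def __init__(self) -> None:
        self._entries: Dict[str, List[str]] = {}

    def store(self, term: str, commentary: str) -> None:
        # A term may accumulate several commentaries from different texts.
        self._entries.setdefault(term, []).append(commentary)

    def search(self, term: str) -> List[str]:
        # Exact-match search; returns a copy so callers cannot mutate storage.
        return list(self._entries.get(term, []))


def lookup(store: GlossaryStore, term: str) -> str:
    """Search by the input term and format the result for output."""
    results = store.search(term)
    if not results:
        return f"{term}: no commentary found"
    return "\n".join(f"{term}: {c}" for c in results)


store = GlossaryStore()
store.store("Company A", "a company that manufactures automobiles")
found = lookup(store, "Company A")
missing = lookup(store, "Company B")
```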