JP2000259666A

JP2000259666A - Topic extraction device

Info

Publication number: JP2000259666A
Application number: JP11065658A
Authority: JP
Inventors: Ichiro Yamada; 一郎山田; Enbai Kin; 淵培金; Masahiro Shibata; 正啓柴田; Noriyoshi Uratani; 則好浦谷
Original assignee: Nippon Hoso Kyokai NHK; Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 1999-03-11
Filing date: 1999-03-11
Publication date: 2000-09-22

Abstract

PROBLEM TO BE SOLVED: To provide a topic extraction device that can classify the articles contained in a news manuscript into clusters with high accuracy, extract topics from a global viewpoint by extracting an important noun phrase from each cluster, and extract and present topics that can be understood intuitively. SOLUTION: The topic extraction device comprises word importance calculation means 4, which obtains the appearance frequency, within a prescribed period, of words extracted from the articles of a digitized news manuscript and obtains the importance of each word from that frequency, and article classification means 5, 6, which classify the articles of the news manuscript into article groups having similar items by obtaining the similarity to the article groups having similar items on the basis of the obtained word importance and the appearance frequency of the words in those article groups.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a topic extraction device that extracts topics, which are dynamic genres that change over time, from digitized news manuscripts.

[0002] [Summary of the Invention] The present invention is a device that extracts the topics currently being talked about from digitized news manuscripts. It obtains the importance of each word at a given point in time from the news articles accumulated daily and classifies the articles of the news manuscripts on that basis, which achieves highly accurate news article classification. By extracting important noun phrases from the classification result, it extracts topics effectively as easy-to-understand noun phrases in which no subject is duplicated.

[0003]

[Prior Art] Besides static genres such as politics, economy, society, and sports, news contains dynamic genres that change over time, such as the Nagano Olympics, World Cup soccer, the recession, and the Wakayama poisoned-curry incident. These dynamic genres are topics. Because such topics are an important key when viewers select programs, topics are extracted from news manuscripts.

[0004] Conventionally, this topic extraction targets the words contained in a single article of a news broadcast manuscript or a newspaper, determines the importance of each word using one measure such as word frequency, mutual information, or the χ² value, and presents the multiple words rated as important in morpheme units (the smallest meaningful units of Japanese).

[0005]

[Problems to Be Solved by the Invention] Conventionally, however, the articles contained in a news manuscript are classified into groups of articles with similar items (hereinafter "clusters") using the frequency of the words in the articles as a cue. In natural language processing some words have an adverse effect (words that occur frequently but carry little information), so the conventional classification method, which uses every contained word as is, does not yield good classification results.

[0006] In addition, the conventional topic extraction method only performs extraction such as picking several words considered important from a single, local article, so it cannot extract topics from the global viewpoint of what is being talked about now. That is, the conventional method extracts topic words from each article separately, so a single subject yields as many topics as there are articles about it, and it is difficult to grasp what is being talked about overall.

[0007] Furthermore, because the conventional topic extraction method extracts multiple words in morpheme units, it raises the question of how many words to present, and because multiple words are output, it is difficult to grasp the subject intuitively.

[0008] In view of the above circumstances, an object of the present invention is to provide a topic extraction device that can classify the articles contained in a news manuscript into clusters with high accuracy, can extract topics from a global viewpoint by extracting one important noun phrase from each cluster, and can extract and present topics from which the subject can be grasped intuitively.

[0009]

[Means for Solving the Problems] To achieve the above object, the topic extraction device according to claim 1 comprises: word importance calculation means that obtains the appearance rate, within a predetermined period, of words extracted from the articles of digitized news manuscripts and obtains the importance of each word based on that appearance rate; and article classification means that classifies the articles of the news manuscripts into article groups having similar items by obtaining the similarity to the article groups having similar items, based on the obtained word importance and the appearance rate of the words in those article groups.

[0010] In the invention according to claim 1, the importance of each word at a given point in time, extracted from the news articles accumulated daily, is obtained using the χ² value, and the similarity to the article groups having similar items is obtained on that basis. The articles of the news manuscripts are thereby classified into article groups having similar items, so highly accurate news article classification can be achieved.

[0011] The topic extraction device according to claim 2 is the topic extraction device according to claim 1, further comprising: representative article extraction means that extracts a representative article from each article group classified by the article classification means, based on the contribution of the words in that article group; and important noun extraction means that extracts, as an important noun phrase, a noun phrase with a large contribution in the extracted representative article.

[0012] In the invention according to claim 2, a representative article is extracted from each classified article group based on contribution, and a noun phrase with a high contribution in the extracted representative article is extracted as the important noun phrase. Easy-to-understand noun phrases in which no subject is duplicated can therefore be extracted.

[0013] The topic extraction device according to claim 3 is the topic extraction device according to claim 2, further comprising label assignment means that assigns, as the label of each article group classified by the article classification means, the corresponding noun phrase extracted by the important noun extraction means.

[0014] In the invention according to claim 3, the corresponding noun phrase extracted as an important noun phrase is assigned as the label of each classified article group, so the classified article groups can be compiled into a database.

[0015] The topic extraction device according to claim 4 is the topic extraction device according to claim 3, further comprising topic extraction means that extracts in order, as topic candidates, the labels of the article groups whose representative articles have large contributions, from among the article groups labeled by the label assignment means.

[0016] In the invention according to claim 4, topics, which are dynamic genres that change over time, can be extracted and presented automatically.

[0017]

[Embodiments of the Invention] FIG. 1 shows a configuration example of a topic extraction device according to an embodiment corresponding to claims 1 to 4. The topic extraction device of this embodiment comprises a file device 1, a morphological analysis unit 2, a syntax analysis unit 3, a word importance calculation unit 4, a similarity comparison unit 5, a news article classification unit 6, a representative article extraction unit 7, an important noun extraction unit 8, a storage/label assignment processing unit 9, and a topic extraction processing unit 10.

[0018] The output of the file device 1 is connected to the input of the morphological analysis unit 2, and the output of the morphological analysis unit 2 is connected to the input of the syntax analysis unit 3. The output of the syntax analysis unit 3 is connected to the input of the word importance calculation unit 4, and the output of the word importance calculation unit 4 is connected to the input of the similarity comparison unit 5. The output of the similarity comparison unit 5 is connected to the input of the news article classification unit 6, and the output of the news article classification unit 6 is connected to the input of the representative article extraction unit 7 and to one input of the storage/label assignment processing unit 9. The output of the representative article extraction unit 7 is connected to the input of the important noun extraction unit 8, and the output of the important noun extraction unit 8 is connected to the other input of the storage/label assignment processing unit 9. The output of the storage/label assignment processing unit 9 is connected to the input of the topic extraction processing unit 10, and the extracted topics and articles are presented in sequence at the output of the topic extraction processing unit 10.

[0019] The above configuration corresponds to the claims as follows. The word importance calculation unit 4 corresponds to the word importance calculation means. The similarity comparison unit 5 and the news article classification unit 6 together correspond to the article classification means. The representative article extraction unit 7 corresponds to the representative article extraction means. The important noun extraction unit 8 corresponds to the important noun extraction means. The storage/label assignment processing unit 9 corresponds to the label assignment means. The topic extraction processing unit 10 corresponds to the topic extraction means.

[0020] Next, the operation of this embodiment is described with reference to FIGS. 1 to 5. FIG. 2 is an operation flowchart of the article classification process. FIG. 3 is an example of an article contained in a news manuscript. FIG. 4 is an example of calculated word importance. FIG. 5 is an operation flowchart of the label assignment and topic extraction process. FIG. 6 is an example of the output topics and representative articles.

[0021] One day's worth of news manuscripts actually used in news programs contains about 200 articles such as the one shown in FIG. 3, and the file device 1 stores, for example, one year's worth of such news manuscripts in digitized form. The first sentence of each news article usually describes the news content as a whole, whereas the second and subsequent sentences contain many elements that are unnecessary for topic extraction. This embodiment therefore uses only the first sentence of each news article.
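The first-sentence restriction described above can be sketched as follows. This is a minimal illustration, not the embodiment's actual implementation; the function name `first_sentence` is hypothetical, and splitting on the Japanese full stop "。" is an assumption about how sentence boundaries are detected.

```python
def first_sentence(article_text: str) -> str:
    """Return the article text up to and including the first Japanese full stop.

    Assumption: sentences end with "。". If no full stop is present,
    the whole text is returned unchanged.
    """
    head, stop, _rest = article_text.partition("。")
    return head + stop
```

Only the string returned here would then be passed on to morphological and syntax analysis.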

[0022] Each news article read from the file device 1 undergoes well-known morphological analysis and syntax analysis in the morphological analysis unit 2 and the syntax analysis unit 3, and the words extracted from the article are input to the word importance calculation unit 4. The word importance calculation unit 4, the similarity comparison unit 5, and the news article classification unit 6 classify the news articles according to the procedure shown in FIG. 2.

[0023] When the words contained in one news article are input (step ST1), the word importance calculation unit 4 uses the χ² value to calculate the importance of each word in the article for a given unit of time, for example per month (step ST2). Specifically, letting n be the frequency with which a word W appeared in a given month and e be its expected value, these are substituted into Equation (1) to calculate the word importance Weight(W). FIG. 4 shows examples of calculated importance.

[Equation 1]

\[
\mathrm{Weight}(W) =
\begin{cases}
\dfrac{(n - e)^{2}}{e} & (n \ge e) \\[6pt]
0 & (n < e)
\end{cases}
\qquad (1)
\]
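The weighting rule of Equation (1) can be sketched in a few lines. This is an illustrative sketch only: the function name `word_weight` is hypothetical, and because the text does not say how the expected value e is derived, it is simply passed in as an argument here.

```python
def word_weight(n: float, e: float) -> float:
    """Importance of a word per Equation (1).

    n: frequency of the word in the period of interest (e.g. one month).
    e: expected frequency of the word (how e is estimated is not specified
       in the text, so the caller must supply it).
    """
    if e <= 0:
        raise ValueError("expected value e must be positive")
    if n < e:
        return 0.0  # words appearing less often than expected get zero weight
    return (n - e) ** 2 / e
```

A word that appears far more often than expected in the period receives a large weight, while a word that is merely frequent all year round (n close to e) receives a weight near zero.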

[0024] Next, the similarity comparison unit 5 and the news article classification unit 6 perform clustering, which generates clusters (collections of articles classified under similar items), using word importance and appearance frequency. First, the similarity between an article and a cluster is calculated using the article vector and the cluster vector defined below (step ST3).

[0025] The article vector, which characterizes an article, takes the words contained in the article (for example, the words in its first sentence) as its elements and the importance of each word as the value of the corresponding element. The cluster vector, which characterizes a cluster, takes the words contained in the articles belonging to the cluster as its elements and the product of (the importance of each word) and (its appearance rate within the cluster) as the value of the corresponding element. The appearance rate is the number of articles in which the word appears divided by the total number of articles in the cluster. The similarity is then obtained from Equation (2).

[Equation 2]

\[
\text{similarity} =
\frac{2 \times (\text{sum of the values of the common vector elements})}
     {\text{sum of the values of the vector elements of the article and of the cluster}}
\qquad (2)
\]
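The article vector, cluster vector, and Equation (2) can be sketched as follows. All function names are hypothetical, and one point is an assumption: the text does not state whether the "common vector elements" are summed using the article-side or the cluster-side values, so this sketch uses the cluster-side values, which is consistent in magnitude with the worked example in the description.

```python
from collections import Counter
from typing import Dict, List, Sequence

def article_vector(words: Sequence[str], weight: Dict[str, float]) -> Dict[str, float]:
    """Element = word in the article, value = importance of that word."""
    return {w: weight.get(w, 0.0) for w in set(words)}

def cluster_vector(articles: List[Sequence[str]], weight: Dict[str, float]) -> Dict[str, float]:
    """Element = word in any article of the cluster,
    value = importance * (articles containing the word / articles in cluster)."""
    doc_freq = Counter(w for words in articles for w in set(words))
    total = len(articles)
    return {w: weight.get(w, 0.0) * df / total for w, df in doc_freq.items()}

def similarity(a_vec: Dict[str, float], c_vec: Dict[str, float]) -> float:
    """Equation (2). Assumption: common elements are summed with cluster-side values."""
    common_sum = sum(c_vec[w] for w in a_vec.keys() & c_vec.keys())
    denom = sum(a_vec.values()) + sum(c_vec.values())
    return 2 * common_sum / denom if denom else 0.0
```

Under this assumption, an article compared against a cluster that contains only that article scores exactly 1, and an article sharing no words with the cluster scores 0.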

[0026] For example, take the article "Takao Doi, aboard the space shuttle Columbia, succeeded in capturing the satellite during a spacewalk." If the value (importance) of each element is "space shuttle" 210.8, "Columbia" 221.5, "Doi" 150.1, "Takao-san" 286, "space" (宇宙) 213.7, "walk" (遊泳) 221.5, "satellite" 230.7, and "capture" 13.5, then the sum of the element values of this article vector is 1575.1.

[0027] Suppose, on the other hand, that the cluster contains the articles "On the space shuttle Columbia, with Japanese astronaut Takao Doi aboard, the release of the observation satellite that was postponed because of trouble will take place tomorrow, Japan time, and Mr. Doi's spacewalk will go ahead as scheduled on the 25th of this month.", "On the space shuttle Columbia, carrying Mr. Doi, who will be the first Japanese to perform a spacewalk, the release of the observation satellite scheduled for this morning, Japan time, was postponed because of trouble, and Mr. Doi has changed his schedule and is conducting experiments in orbit.", and so on. If the value of each element (the product of importance and appearance rate) is "space shuttle" 196.9, "Columbia" 199.3, "Doi" 150.1, "space" (宇宙) 213.7, "walk" (遊泳) 195.1, "satellite" 124.8, "capture" 0.9, "trouble" 10.9, "astronaut" 2.6, "release" 10.5, "NASA" (アメリカ航空宇宙局) 8.2, "Kennedy" 21.8, and "launch" 18.5, then the sum of the element values of this cluster vector is 1153.3.

[0028] In this case, the "common vector elements" are the elements "space shuttle" through "capture", which are underlined in the cluster in the original publication, so the numerator of similarity Equation (2) is 1080.2 × 2 and the denominator is 1575.1 + 1153.3. The similarity in this case is therefore 0.792.
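The arithmetic of this worked example can be checked directly from the three totals quoted in the text. The variable names are illustrative; the numbers are the ones stated in the description.

```python
# Totals quoted in the worked example
common_sum = 1080.2    # sum over the common elements ("space shuttle" to "capture")
article_sum = 1575.1   # sum of the article vector's element values
cluster_sum = 1153.3   # sum of the cluster vector's element values

# Equation (2): twice the common sum over the combined sum
sim = 2 * common_sum / (article_sum + cluster_sum)
print(round(sim, 3))  # 0.792
```

Since 0.792 exceeds the example threshold of 0.5 used in the classification step, this article would be merged into the cluster.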

[0029] The similarity between the article and a cluster is evaluated in this way (step ST4). If the similarity to a cluster is at or above a certain threshold (for example, 0.5), the article is merged into the cluster evaluated as most similar (step ST5). If the similarity to a cluster is at or below the threshold, the same evaluation is performed for all clusters (step ST4 → step ST6 → step ST3 → step ST4). If, as a result, the similarity to every cluster is at or below the threshold, a new cluster is created from that article (step ST7). Repeating this process (step ST8) yields clusters of highly similar articles. In other words, the articles contained in the news manuscripts are classified with high accuracy.
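The loop of steps ST3 to ST8 can be sketched as a single-pass, threshold-based clustering. This is a sketch under stated assumptions, not the flowchart of FIG. 2 itself: the function name `cluster_articles` and the `similarity` callback are hypothetical, articles are assumed to arrive already in processing order, and a score exactly equal to the threshold is treated as a merge (the text says both "at or above" and "at or below", so the boundary case is ambiguous).

```python
from typing import Callable, List, Sequence

Article = Sequence[str]  # an article represented by its extracted words

def cluster_articles(
    articles: List[Article],
    similarity: Callable[[Article, List[Article]], float],
    threshold: float = 0.5,
) -> List[List[Article]]:
    """Assign each article to its most similar cluster, or start a new one."""
    clusters: List[List[Article]] = []
    for article in articles:
        best_index, best_score = -1, 0.0
        for i, cluster in enumerate(clusters):   # compare with every cluster
            score = similarity(article, cluster)
            if score > best_score:
                best_index, best_score = i, score
        if best_index >= 0 and best_score >= threshold:
            clusters[best_index].append(article)  # merge into most similar cluster
        else:
            clusters.append([article])            # no cluster is similar enough
    return clusters
```

Because each article is compared only with the clusters that exist when it is processed, the result depends on the order in which articles are fed in.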

[0030] Each cluster obtained in this way is output to the representative article extraction unit 7 and is also sent to the storage/label assignment processing unit 9. The result depends on the order in which comparisons are made; in the process described above, comparisons are made in date order.

[0031] Next, the representative article extraction unit 7, the important noun extraction unit 8, the storage/label assignment processing unit 9, and the topic extraction processing unit 10 perform label assignment and topic extraction according to the procedure shown in FIG. 5.

[0032] When one cluster is input (step ST10), the representative article extraction unit 7 calculates the contribution of the words in the articles contained in that cluster (step ST11) and extracts a representative article based on the calculated contributions (step ST12).

[0033] The contribution of a word in a cluster is defined as the product of (the importance of the word) and (the appearance rate of the word within the cluster). The appearance rate of a word within a cluster is (the number of articles in the cluster in which the word appears) divided by (the total number of articles in the cluster). The representative article is the article in the cluster whose words (for example, the words in its first sentence) have the largest total contribution.
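The contribution and representative-article definitions above can be sketched as follows. The function names are hypothetical, and counting each distinct word of an article once when totalling its contributions is an assumption, since the text does not say how repeated words are handled.

```python
from collections import Counter
from typing import Dict, List, Sequence

def contributions(cluster: List[Sequence[str]], weight: Dict[str, float]) -> Dict[str, float]:
    """Contribution of each word = importance * appearance rate within the cluster."""
    doc_freq = Counter(w for words in cluster for w in set(words))
    total = len(cluster)
    return {w: weight.get(w, 0.0) * df / total for w, df in doc_freq.items()}

def representative_article(cluster: List[Sequence[str]], weight: Dict[str, float]) -> Sequence[str]:
    """Article whose words have the largest total contribution.

    Assumption: each distinct word in an article is counted once.
    On a tie, the earliest article in the cluster is returned.
    """
    contrib = contributions(cluster, weight)
    return max(cluster, key=lambda words: sum(contrib[w] for w in set(words)))
```

For instance, with importances a = 4, b = 2, c = 6 and a cluster of two articles ["a", "b"] and ["a", "c"], the contributions are a = 4, b = 1, c = 3, so the second article (total 7) is chosen over the first (total 5).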

[0034] Next, for every noun phrase contained in the representative article obtained, the important noun extraction unit 8 calculates the total contribution of the words the phrase contains, extracts the noun phrase with the largest contribution as the important noun phrase, and outputs it to the storage/label assignment processing unit 9 (step ST13). The storage/label assignment processing unit 9 assigns the noun phrase input from the important noun extraction unit 8 as a label to the cluster input from the news article classification unit 6, and stores the labeled cluster (step ST14).

[0035] For example, in the representative article "... a plan under which Japanese astronaut Mr. Doi will perform a spacewalk and retrieve ...", if the contribution of the noun phrase "Japanese astronaut Mr. Doi" (日本人宇宙飛行士の土井さん) is 154.9, that of the noun phrase "spacewalk" (宇宙遊泳) is 408.8, that of the noun phrase "retrieval" (回収) is 10.1, and that of the noun phrase "plan" (計画案) is 0, then "spacewalk" is extracted and becomes the label of the cluster containing this representative article.
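The noun-phrase selection can be sketched as follows. The function names and the English keys are illustrative; the word contributions 213.7 and 195.1 are the cluster values given earlier for the two morphemes of "spacewalk" (宇宙, 遊泳), whose total matches the 408.8 quoted for that phrase, and 10.1 is the value quoted for the single-word phrase "retrieval". Words missing from the contribution table are assumed to contribute 0.

```python
from typing import Dict, List

def phrase_contribution(phrase_words: List[str], contrib: Dict[str, float]) -> float:
    """Total contribution of the words that make up one noun phrase."""
    return sum(contrib.get(w, 0.0) for w in phrase_words)

def important_noun_phrase(phrases: Dict[str, List[str]], contrib: Dict[str, float]) -> str:
    """Name of the noun phrase with the largest total contribution."""
    return max(phrases, key=lambda name: phrase_contribution(phrases[name], contrib))

# Word contributions taken from the description's example values
contrib = {"space": 213.7, "walk": 195.1, "retrieval": 10.1}
phrases = {
    "spacewalk": ["space", "walk"],
    "retrieval": ["retrieval"],
    "plan": ["plan"],  # no contribution listed, treated as 0
}
label = important_noun_phrase(phrases, contrib)
```

Here `label` is "spacewalk", which would then be attached to the cluster as its label.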

[0036] The same processing is performed for all clusters, and the labeled clusters are stored one by one in the storage/label assignment processing unit 9 (step ST15). The topic extraction processing unit 10 then searches the labeled clusters compiled into a database in the storage/label assignment processing unit 9, in descending order of the total contribution of the words contained in each cluster's representative article, and extracts and presents the labels attached to those clusters, in that order, as topic candidates.
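The ranking step can be sketched as a sort over the stored labeled clusters. The `LabeledCluster` record, its field names, and the `top_k` parameter are hypothetical; the text only states that labels are presented in descending order of the representative article's total contribution.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class LabeledCluster:
    label: str                   # important noun phrase assigned to the cluster
    representative_score: float  # total contribution of the representative article's words

def topic_candidates(clusters: List[LabeledCluster], top_k: Optional[int] = None) -> List[str]:
    """Labels ordered by descending representative-article contribution."""
    ranked = sorted(clusters, key=lambda c: c.representative_score, reverse=True)
    labels = [c.label for c in ranked]
    return labels if top_k is None else labels[:top_k]
```

Passing a `top_k` of 8, for example, would return a fixed number of top-ranked labels for presentation.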

[0037] The clustering and topic extraction described above were tested on news manuscripts from March 1995 to August 1998. FIG. 6 shows the top eight topics and their representative articles for November 1997. For these eight items, the clustering results were compared with and evaluated against results extracted manually, giving a precision of 91.8% and a recall of 75.5%, which demonstrates that good results are obtained.

[0038]

[Effects of the Invention] As described above, the invention according to claim 1 can classify the articles of news manuscripts into article groups having similar items with high accuracy.

[0039] The invention according to claim 2 can extract, from each of the classified article groups, an easy-to-understand noun phrase in which no subject is duplicated.

[0040] The invention according to claim 3 allows the classified article groups to be compiled into a database.

[0041] The invention according to claim 4 can automatically extract and present topics, which are dynamic genres that change over time.

[0042] In short, according to the inventions of claims 1 to 4, the articles of news manuscripts are classified into clusters, which are groups of articles having similar items, and one noun phrase is then extracted from each cluster. A single subject therefore yields a single topic, which makes it possible to extract topics from a global viewpoint.

[Brief description of the drawings]

FIG. 1 is a block diagram of the configuration of a topic extraction device according to an embodiment corresponding to claims 1 to 4.

FIG. 2 is a flowchart of the article classification process.

FIG. 3 is an example of an input news manuscript.

FIG. 4 is an example of calculated word importance.

FIG. 5 is a flowchart of the label assignment and topic extraction process.

FIG. 6 is an example of the output topics and representative articles.

[Explanation of symbols]

1: File device
2: Morphological analysis unit
3: Syntax analysis unit
4: Word importance calculation unit
5: Similarity comparison unit
6: News article classification unit
7: Representative article extraction unit
8: Important noun extraction unit
9: Storage/label assignment processing unit
10: Topic extraction processing unit

Continuation of the front page
(72) Inventor: Masahiro Shibata, c/o Broadcasting Technology Research Laboratories, Japan Broadcasting Corporation, 1-10-11 Kinuta, Setagaya-ku, Tokyo
(72) Inventor: Noriyoshi Uratani, c/o Broadcasting Technology Research Laboratories, Japan Broadcasting Corporation, 1-10-11 Kinuta, Setagaya-ku, Tokyo
F-term (reference): 5B075 ND03 ND23 NK32 NR12 PQ40 PQ75 PR04 PR06 QM08 QP05

Claims

1. A topic extraction device comprising: word importance calculation means for obtaining an appearance rate, within a predetermined period, of a word extracted from articles of a digitized news manuscript and for obtaining an importance of the word based on the appearance rate; and article classification means for classifying the articles of the news manuscript into article groups having similar items by obtaining a similarity to the article groups having similar items based on the obtained importance of the word and the appearance rate of the word in the article groups having similar items.

2. The topic extraction device according to claim 1, further comprising: representative article extraction means for extracting a representative article from an article group classified by the article classification means, based on the contribution of words in the article group; and important noun extraction means for extracting, as an important noun phrase, a noun phrase having a large contribution in the extracted representative article.

3. The topic extraction device according to claim 2, further comprising label assignment means for assigning, as the label of each article group classified by the article classification means, the corresponding noun phrase extracted by the important noun extraction means.

4. The topic extraction device according to claim 3, further comprising topic extraction means for extracting in order, as topic candidates, the labels of the article groups having representative articles with large contributions, from among the article groups to which the label assignment means has assigned labels.