JP2003323424A

JP2003323424A - Subject presuming device, method and program

Info

Publication number: JP2003323424A
Application number: JP2002128080A
Authority: JP
Inventors: Ichiro Yamada; 一郎山田; Hideki Sumiyoshi; 英樹住吉; Takako Ariyasu; 香子有安; Masahiro Shibata; 正啓柴田; Nobuyuki Yagi; 伸行八木
Original assignee: Nippon Hoso Kyokai NHK; Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2002-04-30
Filing date: 2002-04-30
Publication date: 2003-11-14
Anticipated expiration: 2022-04-30
Also published as: JP3956354B2

Abstract

PROBLEM TO BE SOLVED: To provide a subject presuming device, method, and program capable of presuming the subject of a conversation from a plurality of words included in the conversation.

SOLUTION: This subject presuming device 1 comprises: a feature value operating means 30 that determines feature values of the words of a subject specifying a news manuscript, based on the frequency of appearance of the words included in the news manuscript; a presumption degree operating means 60 that extracts the conversation words included in conversation text data and determines a presumption degree of the subject presumed from the conversation words, based on the conversation words and the feature values determined by the feature value operating means 30; and a subject determination output means 70 that evaluates the presumption degree determined by the presumption degree operating means 60 and outputs the presumed subject based on the result of that evaluation.

COPYRIGHT: (C)2004, JPO

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention relates to information classification and natural language processing technology that is used in various applications to identify the content of a conversation. More specifically, it relates to a topic specifying device, a topic specifying method, and a topic specifying program that can identify the content of a conversation using natural language processing and statistical processing techniques.

[0002]

[Prior Art] Conventionally, to estimate which topic a conversation among several people is about, each single word in the conversation was assigned to a topic using, for example, a database that associates words with topics in advance, and the per-word results for all words in the conversation were averaged to estimate the topic of the conversation.
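The conventional per-word approach described above can be sketched as follows. This is a minimal illustration only: the function name `estimate_topic_by_averaging`, the dictionary layout, and all scores are hypothetical and do not come from the patent.

```python
from collections import defaultdict


def estimate_topic_by_averaging(words, word_topic_scores):
    """Average per-word topic scores and return the best topic (or None).

    word_topic_scores is a hypothetical word-to-topic database:
    {word: {topic: score}}. Words missing from it are ignored.
    """
    totals = defaultdict(float)
    known = 0
    for word in words:
        scores = word_topic_scores.get(word)
        if scores is None:
            continue  # the database knows nothing about this word
        known += 1
        for topic, score in scores.items():
            totals[topic] += score
    if known == 0:
        return None
    averages = {topic: total / known for topic, total in totals.items()}
    return max(averages, key=averages.get)


# Illustrative, made-up database and conversation words.
database = {
    "yen": {"economy": 0.9, "war": 0.1},
    "stocks": {"economy": 0.8, "war": 0.2},
    "attack": {"economy": 0.1, "war": 0.9},
}
best = estimate_topic_by_averaging(["yen", "stocks", "attack"], database)
```

Each word votes independently here, so the sketch never looks at which words occur together, which is the limitation the invention addresses.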

[0003] As a method that takes multiple words of the conversation into account, a technique has been proposed (Imai et al., "Broadcast News Topic Extraction Model," IEICE Technical Report, SP97-28, pp. 75-82, June 1997; hereinafter "Prior Art 1") that presents multiple words related to a topic using the HMM (Hidden Markov Model), a representative speech recognition algorithm that models speech data as a time-series signal (a probabilistic model) and recognizes speech by "learning" the model's parameters (the coefficients used to compute probabilities). This HMM-based technique is not intended to estimate the topic; its purpose is to output as many keywords that directly denote the topic as possible.

[0004]

[Problems to Be Solved by the Invention] However, when estimating the topic of a conversation from a combination of multiple words in it, the conventional technique decided which topic each single word belongs to. Because combinations of words were not considered, the accuracy of the estimation was low. Conversely, if the topic is estimated by considering combinations of multiple words rather than single words, the number of combinations becomes enormous, and it is difficult to build learning data that associates multiple words with topics.

[0005] Further, the technique of Prior Art 1, which uses topics that appeared in news articles to indicate what a conversation is about, merely presents multiple keywords related to a topic. The definition of the topic itself is unclear, and it cannot be clearly defined which event the topic belongs to.

[0006] The present invention has been made in view of the above problems, and its object is to provide a topic estimation device, a topic estimation method, and a topic estimation program that can estimate, from multiple words included in a conversation, the topic that the conversation is about.

[0007]

[Means for Solving the Problems] The present invention was devised to achieve the above object. The topic estimation device according to claim 1 estimates, from conversation text that is input conversation content, a topic that identifies that content, based on text manuscripts that are language data and topics that identify the content of those text manuscripts. It comprises: feature amount calculation means for obtaining the appearance frequency of the manuscript words included in a text manuscript and, based on that frequency, obtaining the word feature amounts of the topic that identifies the text manuscript; estimation degree calculation means for extracting the conversation words included in the conversation text and obtaining, based on the conversation words and the word feature amounts of the topic obtained by the feature amount calculation means, an estimation degree of the topic estimated from the conversation words; and topic determination output means for evaluating the estimation degree obtained by the estimation degree calculation means and outputting the topic estimated based on the result.

[0008] With this configuration, the topic estimation device uses the feature amount calculation means to obtain the appearance frequency of the words included in the text manuscript and, based on that frequency, obtains the word feature amounts of the topic that identifies the text manuscript. The estimation degree calculation means then extracts the words included in the conversation text and, based on those words and the word feature amounts of the topic obtained by the feature amount calculation means, obtains the estimation degree of the topic estimated from the words of the conversation text. The topic determination output means determines that the topic with the highest estimation degree is the topic of the conversation text.

[0009] The word feature amount of a topic is a measure, calculated from the appearance frequency of the individual nouns in the text manuscript, of which nouns are used more often for that topic. Based on the word feature amounts of each topic, the topic in which the multiple words (nouns) of the conversation text appear at a high rate is estimated to be the topic of the conversation text.
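The flow from word frequencies to a topic decision can be sketched as below. This is a deliberately simplified illustration: it assumes the manuscripts are already split into nouns, uses plain relative frequency as a stand-in for the word feature amount, and sums those values as a stand-in for the estimation degree. The names `word_features` and `estimate_topic` and the toy data are hypothetical.

```python
from collections import Counter


def word_features(topic_docs):
    """Relative noun frequency per topic (stand-in for the word feature amount)."""
    features = {}
    for topic, docs in topic_docs.items():
        counts = Counter(word for doc in docs for word in doc)
        total = sum(counts.values())
        features[topic] = {word: n / total for word, n in counts.items()}
    return features


def estimate_topic(conversation_words, features):
    """Score every topic and return (best topic, all estimation degrees)."""
    degrees = {
        topic: sum(feats.get(word, 0.0) for word in conversation_words)
        for topic, feats in features.items()
    }
    return max(degrees, key=degrees.get), degrees


# Toy manuscripts, already tokenized into nouns (made-up data).
topic_docs = {
    "T1": [["Afghanistan", "military", "attack"], ["Afghanistan", "military"]],
    "T2": [["economy", "yen"]],
}
features = word_features(topic_docs)
best, degrees = estimate_topic(["Afghanistan", "military", "weather"], features)
```

The last step mirrors the topic determination output means: the topic with the highest estimation degree is reported.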

[0010] In the topic estimation device according to claim 2, which depends on claim 1, the feature amount calculation means comprises: learning data generation means for generating, as learning data, at least combinations of multiple manuscript words and the appearance frequencies of those combinations, based on the appearance frequency of the manuscript words included in the text manuscript; and maximum entropy conversion means for assigning, based on the maximum entropy method, an appearance probability value of each combination of manuscript words to the learning data generated by the learning data generation means, as the word feature amount.

[0011] With this configuration, the learning data generation means generates, as learning data, at least combinations of multiple words and the appearance frequencies of those combinations from the appearance frequency of the words (nouns) in the text manuscript. The maximum entropy conversion means then estimates, from the learning data, a probabilistic language model in which probability values are distributed uniformly even over unknown data. As a result, a non-zero appearance probability value (word feature amount) is assigned even to a combination of multiple words that includes a word (noun) not held in the learning data.

[0012] Further, in the topic estimation device according to claim 3, which depends on claim 2, the estimation degree calculation means calculates, as the estimation degree of each topic, the appearance probability value with which the multiple combinations of conversation words appear in that topic, based on the appearance probability values of the learning data given by the maximum entropy conversion means.

[0013] With this configuration, the estimation degree calculation means calculates, as the estimation degree, the appearance probability value with which the multiple words of the conversation text appear in each topic, based on the appearance probability values (word feature amounts) of the learning data given by the maximum entropy conversion means. Even if the conversation text contains a word that does not exist in the learning data, the maximum entropy conversion means assigns a non-zero appearance probability value to the combination of words based on the known words, so this value can be used as the estimation degree.

[0014] Furthermore, the topic estimation device according to claim 4 is the device according to any one of claims 1 to 3, in which the text manuscript is an article of a digitized news manuscript.

[0015] With this configuration, by using articles of news manuscripts as the text manuscripts, the topic estimation device estimates topics from news manuscripts whose topics are continually updated. This makes it possible to estimate the topic appropriately even when the conversation is about a recent topic. The news manuscripts here are text data obtained by digitizing and storing news manuscripts broadcast by a broadcasting station or the like.

[0016] The topic estimation device according to claim 5 estimates, from conversation text that is input conversation content, a topic that identifies that content, based on articles of digitized news manuscripts. It comprises: topic extraction means for extracting, from an article of a news manuscript, a topic that identifies the content of the article; feature amount calculation means for obtaining the appearance frequency of the manuscript words included in the articles of the news manuscript and, based on that frequency, obtaining the word feature amounts of the topic that identifies the news manuscript; estimation degree calculation means for extracting the conversation words included in the conversation text and obtaining, based on the conversation words and the word feature amounts of the topic obtained by the feature amount calculation means, an estimation degree of the topic estimated from the conversation words; and topic determination output means for evaluating the estimation degree obtained by the estimation degree calculation means and outputting the topic estimated based on the result.

[0017] With this configuration, the topic extraction means extracts, from an article of a news manuscript, a topic that identifies the content of the article, and the feature amount calculation means obtains the appearance frequency of the words included in the news manuscript and, based on that frequency, obtains the word feature amounts of the topic that identifies the news manuscript. The estimation degree calculation means then extracts the words included in the conversation text and, based on those words and the word feature amounts of the topic obtained by the feature amount calculation means, obtains the estimation degree of the topic estimated from the words of the conversation text. The topic determination output means determines that the topic with the highest estimation degree is the topic of the conversation text.

[0018] The word feature amount of a topic is a measure, calculated from the appearance frequency of the individual nouns in the news manuscript, of which nouns are used more often for that topic. Based on the word feature amounts of each topic, the topic in which the multiple words (nouns) of the conversation text appear at a high rate is estimated to be the topic of the conversation text.

[0019] Further, the topic estimation method according to claim 6 estimates, from conversation text that is input conversation content, a topic that identifies that content, based on text manuscripts that are language data and topics that identify the content of those text manuscripts. It includes: a learning data generation step of generating, as learning data for a topic, combinations of multiple manuscript words included in the text manuscript and the appearance frequencies of those combinations; a maximum entropy conversion step of obtaining, from the learning data generated in the learning data generation step and based on the maximum entropy method, the appearance probability values of the combinations of manuscript words; an estimation degree calculation step of calculating, as the estimation degree of each topic and based on the appearance probability values obtained in the maximum entropy conversion step, the appearance probability value with which the multiple combinations of conversation words included in the conversation text appear in that topic; and a topic determination output step of evaluating the estimation degree obtained in the estimation degree calculation step and outputting the topic estimated based on the result.

[0020] According to this method, the learning data generation step generates, as learning data for each topic, the combinations of multiple words included in the text manuscript and the appearance frequencies of those combinations, and the maximum entropy conversion step estimates, from this learning data, a probabilistic language model in which probability values are distributed uniformly even over unknown data that has not been learned. The estimation degree calculation step then calculates, as the estimation degree, the appearance probability value with which the multiple words of the conversation text appear in each topic, from the appearance probability values (word feature amounts) of the learning data estimated and output in the maximum entropy conversion step. The higher this estimation degree, the more accurately the topic represents the conversation text.

[0021] The topic estimation program according to claim 7, in order to estimate, from conversation text that is input conversation content, a topic that identifies that content, based on text manuscripts that are language data and topics that identify the content of those text manuscripts, causes a computer to function as: learning data generation means for generating, as learning data for a topic, combinations of multiple manuscript words included in the text manuscript and the appearance frequencies of those combinations; maximum entropy conversion means for obtaining, from the learning data generated by the learning data generation means and based on the maximum entropy method, the appearance probability values of the combinations of manuscript words; estimation degree calculation means for calculating, as the estimation degree of each topic and based on the appearance probability values obtained by the maximum entropy conversion means, the appearance probability value with which the multiple combinations of conversation words included in the conversation text appear in that topic; and topic determination output means for evaluating the estimation degree obtained by the estimation degree calculation means and outputting the topic estimated based on the result.

[0022] With this configuration, the topic estimation program uses the learning data generation means to generate, as learning data for each topic, the combinations of multiple words included in the text manuscript and the appearance frequencies of those combinations, and uses the maximum entropy conversion means to estimate, from this learning data, a probabilistic language model in which probability values are distributed uniformly even over unknown data that has not been learned. The estimation degree calculation means then calculates, as the estimation degree, the appearance probability value with which the multiple words of the conversation text appear in each topic, from the appearance probability values (word feature amounts) of the learning data estimated and output in the maximum entropy conversion step. The higher this estimation degree, the more accurately the topic represents the conversation text.

[0023]

[Embodiments of the Invention] Embodiments of the present invention are described below with reference to the drawings.

(Configuration of the topic estimation device) FIG. 1 is a block diagram showing the configuration of the topic estimation device according to the present invention. As shown in FIG. 1, the topic estimation device 1 estimates the topic of conversation content (conversation speech data or conversation text data) based on news manuscripts used in past news programs and the like, and outputs that topic as the estimated topic.

[0024] The topic estimation device 1 comprises topic extraction means 10, topic storage means 20, feature amount calculation means 30, feature amount storage means 40, speech recognition means 50, estimation degree calculation means 60, and topic determination output means 70. News manuscripts are input as text data from external news manuscript storage means 2.

[0025] The topic extraction means 10 extracts topics from the digitized past news manuscripts stored in the news manuscript storage means 2, and stores each topic in the topic storage means 20 in association with the news manuscripts related to it. The topic extraction means 10 can be implemented using the technology disclosed by the present applicant as "Topic Extraction Device" (Japanese Patent Laid-Open No. 2000-259666).

[0026] FIG. 4 shows an example of a topic extracted by the topic extraction means 10 together with the news articles related to that topic. In FIG. 4, the topic T "The U.S. and others attack Afghanistan" and the news articles N from October 2001 related to that topic have been extracted. By sequentially receiving the past news manuscripts stored in the news manuscript storage means 2, the topic extraction means 10 stores the latest topics and their related news manuscripts in the topic storage means 20.

[0027] The topic storage means 20 stores each topic extracted by the topic extraction means 10 in association with the news manuscripts related to it, and consists of a hard disk or the like. The topic storage means 20 stores the topic T shown in FIG. 4 and the news manuscripts N related to the topic T as text data.

[0028] The feature amount calculation means 30 extracts the feature amount (word feature amount) of each topic from the topics stored in the topic storage means 20 and their related news manuscripts, and stores it in the feature amount storage means 40. Here, the feature amount is the probability value with which a specific set of multiple words appears in the news manuscripts, calculated from the appearance frequency of the words (nouns) that appear in the related news manuscripts of each topic. The feature amount calculation means 30 comprises a learning data generation unit (learning data generation means) 31 and a maximum entropy conversion unit (maximum entropy conversion means) 32.

[0029] The learning data generation unit 31 quantifies the appearance frequency of the words (nouns) in the news manuscripts, using the topics stored in the topic storage means 20 and their related news manuscripts, and generates it as the degree to which each word indicates the topic (its importance). Words are extracted from the news manuscripts using morphological analysis means (not shown), although the morphological analysis unit 61 may be shared for this purpose. The importance (appearance frequency) of a word (noun) in the news manuscripts is defined by Equation (1).

[0030]

[Equation 1]

[0031] In Equation (1), tf(w) is the number of times the word w appears in the topic (in the news articles constituting the topic), DF(w) is the number of news articles in one month in which the word w appears, N(month) is the number of news articles in one month, and N(topic) is the number of news articles constituting the topic in question. With the importance weight(w) calculated by Equation (1), each topic is characterized in the same vector space by the words that appear in it.
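The body of Equation (1) is not reproduced in this text, so the following is only a sketch of how the four quantities defined above might be combined. It assumes a conventional TF-IDF-style weight, which is an assumption of this sketch and not the patent's actual formula; the function name `weight_sketch` and the sample counts are hypothetical.

```python
import math


def weight_sketch(tf, df, n_month, n_topic):
    """TF-IDF-style stand-in for weight(w); NOT the patent's Equation (1).

    tf      : tf(w), occurrences of w in the topic's news articles
    df      : DF(w), news articles in the month that contain w
    n_month : N(month), news articles in the month
    n_topic : N(topic), news articles that make up the topic
    """
    # ASSUMPTION: term frequency normalized by topic size, multiplied by
    # the log inverse document frequency over the month.
    return (tf / n_topic) * math.log(n_month / df)


# Made-up counts: a word concentrated in the topic vs. a word common everywhere.
rare = weight_sketch(tf=10, df=100, n_month=1000, n_topic=5)
common = weight_sketch(tf=10, df=900, n_month=1000, n_topic=5)
```

Under this assumed form, a word that is frequent within the topic but rare across the month's articles receives a larger weight, which is the behavior the surrounding description relies on.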

[0032] An example of the learning data generated by the learning data generation unit 31 is described with reference to FIGS. 4 to 7. For example, for the topic T "The U.S. and others attack Afghanistan" shown in FIG. 4, applying Equation (1) to the words of the related news articles N yields the vector elements (words W1 appearing in the topic) and their values (importance E1) shown in FIG. 5. The word "Afghanistan" is given an importance of 1.44, and the word "military" an importance of 0.99.

[0033] In addition, multiple combinations of the words (appearing words W1) in the news articles N related to topic T are generated, for example combinations of three words, and the sum of the importance E1 of each word is taken as the combination importance of that word combination in topic T. As shown in FIG. 6, this gives, for topic T, the three-word combinations W2 and the combination importance E2 obtained by adding the importance E1 of the individual words. For example, the combination importance E2 of the three-word combination W2 "Afghanistan, Afghanistan, Afghanistan" is 4.32, the sum of three copies of 1.44, the importance E1 of the single word "Afghanistan".
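The summation described above can be checked with a few lines of code. The two importance values are the ones quoted in the text; the function name `combination_importance` is hypothetical.

```python
# Importance E1 of single words, as quoted in the description.
importance_e1 = {"Afghanistan": 1.44, "military": 0.99}


def combination_importance(combo, importance):
    """Combination importance E2: sum of the importance E1 of each word."""
    return sum(importance[word] for word in combo)


triple = combination_importance(
    ("Afghanistan", "Afghanistan", "Afghanistan"), importance_e1
)
mixed = combination_importance(
    ("Afghanistan", "Afghanistan", "military"), importance_e1
)
```

Rounded to two decimal places, `triple` reproduces the 4.32 given in the text, and `mixed` evaluates to 3.87 by the same rule.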

[0034] As shown in FIG. 7, combinations that contain the same three words in a different order (identical three-word combinations W3) are given the same combination importance E3. Combinations with the no-word state (NULL) are also considered, so that, for example, "Afghanistan, Afghanistan, NULL" is also regarded as a three-word combination.
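One way to treat reordered combinations as identical and to include NULL slots is to enumerate unordered combinations with repetition over the vocabulary plus a NULL marker. The sketch below is an illustration under that assumption; representing NULL as Python `None`, excluding the all-NULL combination, and the helper names `combo_key` and `all_combinations` are choices made here, not details from the patent.

```python
from itertools import combinations_with_replacement

NULL = None  # stand-in for the "no word" state


def combo_key(combo):
    """Order-independent key: words sorted alphabetically, NULL slots last."""
    return tuple(sorted(combo, key=lambda w: (w is NULL, w or "")))


def all_combinations(words, size=3):
    """Unordered combinations (with repetition) of the words plus NULL."""
    vocab = sorted(set(words)) + [NULL]
    return [
        combo
        for combo in combinations_with_replacement(vocab, size)
        if any(w is not NULL for w in combo)  # skip the all-NULL combination
    ]


combos = all_combinations(["Afghanistan", "military"])
```

Because each unordered combination is generated once, looking up a reordered triple through `combo_key` always lands on the same entry, so it naturally receives the same combination importance.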

[0035] In this way, the learning data generation unit 31 (FIG. 1) uses as learning data each multi-word combination (three-word combination W2), its combination importance E2, and the topic T having that combination importance E2. The description continues with reference to FIG. 1.

[0036] The maximum entropy conversion unit 32 outputs, as the feature amount, the probability value with which a three-word combination occurs in a given topic, computed from the learning data generated by the learning data generation unit 31 based on the maximum entropy method. This feature amount is stored in the feature amount storage means 40.

[0037] The maximum entropy method is a known algorithm that estimates the conditional probability P(t|h) from the frequency O(t, h) with which the events t and h occur together. To estimate the conditional probability P(t|h), the method uses as learning data a "feature", the "output" for that feature, and the "expected number of occurrences" of that output.
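For reference, maximum entropy models are conventionally written in the exponential form below. The feature functions f_i and weights λ_i are the standard textbook notation and are assumptions here; they are not symbols used in this description.

```latex
P(t \mid h) = \frac{\exp\left(\sum_i \lambda_i f_i(h, t)\right)}
                   {\sum_{t'} \exp\left(\sum_i \lambda_i f_i(h, t')\right)}
```

The weights λ_i are fitted so that the expected value of each feature under the model matches its observed value in the learning data.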

[0038] Accordingly, the "feature" is the three-word combination W2 (FIG. 6), which is the learning data generated by the learning data generation unit 31, and the "output" is the identifier (ID) uniquely assigned to each topic. For the "expected number of occurrences", the combination importance E2 of the multiple words (FIG. 6) is treated as an occurrence count.
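The mapping from learning data to the three inputs of the maximum entropy method might be represented as below. This is only a sketch under assumptions: the `TrainingEvent` record, its field names, and `build_events` are hypothetical and do not appear in the description.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class TrainingEvent:
    feature: Tuple[str, ...]  # three-word combination W2, stored in sorted order
    output: int               # identifier (ID) uniquely assigned to the topic
    count: float              # combination importance E2, treated as an occurrence count


def build_events(per_topic: Dict[int, Dict[Tuple[str, ...], float]]) -> List[TrainingEvent]:
    """Flatten {topic ID: {combination: E2}} into maximum entropy training events."""
    events = []
    for topic_id, combos in per_topic.items():
        for combo, importance in combos.items():
            events.append(TrainingEvent(tuple(sorted(combo)), topic_id, importance))
    return events
```

Storing each combination in sorted order keeps the feature independent of word order, in line with the treatment of reordered combinations in FIG. 7.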

[0039] With the maximum entropy method, the occurrence probability of a three-word combination such as "Afghanistan, military, 'unlearned word'" reflects the occurrence probabilities of the words other than the unlearned word, namely "Afghanistan" and "military", so the probability value does not become 0. In this way, the feature amount calculation means 30 stores in the feature amount storage means 40 the probability values that are the feature amounts of each topic, computed from the topics stored in the topic storage means 20 and the news manuscripts related to those topics.

[0040] The feature amount storage means 40 stores the feature amounts (probability values) extracted by the maximum entropy conversion unit 32 and is implemented with a hard disk or the like. The feature amounts stored in the feature amount storage means 40 are referenced by the probability calculation unit 62 of the estimation degree calculation means 60.

[0041] The speech recognition means 50 converts conversation speech data, input from a speech input device (not shown) such as a microphone, into conversation text data by speech recognition. The conversation text data resulting from speech recognition is output to the morphological analysis unit 61 of the estimation degree calculation means 60. The speech recognition performed by the speech recognition means 50 can be realized with known, general speech recognition technology.

[0042] The estimation degree calculation means 60 analyzes the conversation text data and, based on the feature amounts (probability values) stored in the feature amount storage means 40, outputs to the topic determination output means 70 the estimation degree (probability value) that each topic is the topic of the conversation. The estimation degree calculation means 60 comprises a morphological analysis unit 61 and a probability calculation unit 62.

[0043] The morphological analysis unit 61 extracts words from the input conversation text data by morphological analysis. It extracts a fixed number of words (for example, five) from the conversation text data and outputs them to the probability calculation unit 62. If the conversation text data contains fewer than the fixed number of words, words from earlier conversation text data may be included in the processing, or the missing positions may be treated as the no-word state (NULL).
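The handling of a short utterance can be sketched as below, assuming the nouns have already been extracted by a morphological analyzer (that step is out of scope here). The function name, its parameters, and the choice to keep the most recent words when more than the fixed number are available are assumptions made for illustration.

```python
NULL = "NULL"  # hypothetical marker for the no-word state


def select_conversation_words(current, previous=(), size=5, backfill=True):
    """Return exactly `size` words for the probability calculation.

    current  -- nouns extracted from the current conversation text
    previous -- nouns extracted from earlier conversation text
    backfill -- if True, borrow words from `previous` before padding with NULL
    """
    # Assumption: when more than `size` words are available, keep the most recent ones.
    words = list(current)[-size:]
    missing = size - len(words)
    if missing > 0 and backfill:
        # Borrow the most recent earlier words and place them before the current ones.
        words = list(previous)[-missing:] + words
        missing = size - len(words)
    # Any positions still unfilled are treated as the no-word state.
    return words + [NULL] * missing
```

Both fallbacks named in the text are covered: with `backfill=True` earlier words are used first, and any remaining positions are filled with NULL.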

[0044] The probability calculation unit 62 receives the fixed number of words from the morphological analysis unit 61 and calculates the probability value (estimation degree) that the combination of those words belongs to a given topic. It then outputs the unique identification number (ID) assigned to the topic, together with that probability value (estimation degree), to the topic determination output means 70.

[0045] Here, suppose that the number of words per combination in the learning data of the learning data generation unit 31 is three, that the morphological analysis unit 61 processes five words of conversation text data, and that those words are {w1, w2, w3, w4, w5}. The probability value P(T1|w1, w2, w3, w4, w5) that the conversation text data belongs to the topic T1 is then calculated by equation (2).

[0046]

[Equation 2]

[0047] In equation (2), P(T1|w1, w2, w3) denotes the probability that a conversation about the topic T1 is taking place when the words {w1, w2, w3} appear in the conversation text data. For this probability value P(T1|w1, w2, w3), the feature amounts (probability values) of the news manuscript topics, computed by the feature amount calculation means 30 and stored in the feature amount storage means 40, can be used. Thus the number of words per combination in the learning data need not equal the number of words processed as conversation content.
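One way to combine three-word probabilities into a five-word score is sketched below. Equation (2) is not reproduced in this text, so the averaging used here is purely an assumption and should not be read as the formula of the embodiment; `triple_prob` is a hypothetical lookup into the stored feature amounts.

```python
from itertools import combinations


def estimation_degree(words, triple_prob, topic_id, size=3):
    """Score a topic from every `size`-word combination of the conversation words.

    triple_prob(topic_id, combo) is a hypothetical lookup that returns the stored
    probability P(topic | combo) for a sorted tuple of words.
    """
    # Five words yield C(5, 3) = 10 three-word combinations.
    combos = list(combinations(words, size))
    total = sum(triple_prob(topic_id, tuple(sorted(combo))) for combo in combos)
    # Assumption: the combinations are averaged; equation (2) may combine them differently.
    return total / len(combos)
```

Sorting each combination before the lookup keeps the result independent of the order in which the words were spoken.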

[0048] The topic determination output means 70 receives, from the probability calculation unit 62 of the estimation degree calculation means 60, the unique identification number (ID) assigned to each topic and its estimation degree (the probability value resulting from equation (1)). It determines the topic whose estimation degree (probability value) is largest, reads that topic from the topic storage means 20 based on its identification number (ID), and outputs it as the estimated topic.
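The selection performed by the topic determination output means 70 amounts to taking the maximum over the per-topic estimation degrees. A minimal sketch, with hypothetical names and dictionaries standing in for the topic storage means 20:

```python
def estimate_topic(degrees, topics):
    """Return the stored topic whose estimation degree is largest.

    degrees -- {topic ID: estimation degree (probability value)}
    topics  -- {topic ID: topic}, standing in for the topic storage means
    """
    best_id = max(degrees, key=degrees.get)
    return topics[best_id]
```

If two topics tie, `max` returns the first one encountered; the description does not say how ties should be resolved.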

[0049] The configuration of the topic estimation device 1 has been described above based on one embodiment, but the present invention is not limited to it. For example, the topic extraction means 10 and the topic storage means 20 may be omitted from the configuration, with topics and their related news manuscripts supplied from outside. Also, although the learning data generation unit 31 combines three news manuscript words and the morphological analysis unit 61 extracts five words from the conversation text data, these numbers are not limiting; for example, they may be set from an input device (not shown). Furthermore, the input to the estimation degree calculation means 60 need not be the output of the speech recognition means 50. For example, conversation text data typed on the keyboard of a personal computer (PC) can also be used.

[0050] The topic estimation device 1 can also be realized on a computer by implementing the feature amount calculation means 30 and the estimation degree calculation means 60 as functional programs, and these functional programs can be combined to operate as a topic estimation program.

[0051] (Operation of the topic estimation device) Next, the operation of the topic estimation device 1 is described with reference to FIGS. 1 to 3. FIG. 2 is a flowchart mainly showing the operation of the feature amount calculation means 30 of the topic estimation device 1. FIG. 3 is a flowchart mainly showing the operation of the estimation degree calculation means 60 of the topic estimation device 1.

[0052] As shown in FIG. 2, the topic estimation device 1 first reads the news manuscripts stored in the news manuscript storage means 2, and the topic extraction means 10 stores each topic in the topic storage means 20 in association with its related news manuscripts (this step is not shown in the flowchart).

[0053] Next, the learning data generation unit 31 of the feature amount calculation means 30 extracts the words that appear in the news manuscripts stored in the topic storage means 20 (step S10) and calculates, based on equation (1), how important each word is to the topic corresponding to the news manuscript (step S11).

[0054] The importance values of the individual words calculated in step S11 are summed over multiple words (for example, three) to calculate the combination importance of each multi-word (three-word) combination (step S12). These combinations include combinations with the no-word state (NULL).

[0055] The maximum entropy conversion unit 32 then calculates, based on the maximum entropy method, the probability value that each combination of multiple words (three words) occurs in a given topic (step S13), and stores it in the feature amount storage means 40 (step S14).

[0056] Through these steps, topics are associated with their related news manuscripts, and the probability values that identify a topic from a combination of multiple words are extracted as feature amounts. The operations up to step S13 can be run in advance, as a stage preceding topic estimation. Running them each time the news manuscripts in the news manuscript storage means 2 are updated makes it possible to extract feature amounts for estimating the latest topics.

[0057] Next, the operation of estimating a topic from conversation speech data or conversation text data is described with reference to FIG. 3. First, the morphological analysis unit 61 of the estimation degree calculation means 60 performs morphological analysis on the conversation text data, which is either converted by the speech recognition means 50 from conversation speech data input through a speech input device such as a microphone or input directly as text data, and extracts multiple words (five words: nouns) (step S20).

[0058] The probability calculation unit 62 then reads from the feature amount storage means 40 the topic-identifying probability values for the combinations of the words extracted in step S20 (combinations of three words selected from the five) (step S21), and calculates by equation (2) the probability value with which the multiple words (five words) of the conversation text data identify each topic (step S22).

[0059] Among the per-topic probability values calculated in step S22, the topic with the maximum probability value is estimated to be the topic of the conversation text data; that topic is read from the topic storage means 20 and output as the estimated topic (step S23). Through these steps, the topic estimation device 1 can automatically estimate and output the topic of a conversation from input conversation speech data or conversation text data.

[0060] (Example of topic estimation by the topic estimation device) Next, an example of topic estimation by the topic estimation device 1 (FIG. 1) is described with reference to FIG. 8. FIG. 8 shows experimental results of estimating topics from conversations about nature. In the conversation example of FIG. 8(1), a student says, "I've heard that abnormal phenomena are caused by the El Niño phenomenon." Extracting words (nouns) from this conversation speech data or conversation text data yields four words: "abnormal weather, El Niño, phenomenon, cause". When the morphological analysis unit 61 (FIG. 1) extracts, for example, five words, "NULL" is added to these four to make five words. In the example of FIG. 8(1), the five words (conversation processing words) "abnormal weather, El Niño, phenomenon, cause, NULL" resulted in "environmental problems" being output as the estimated topic with the highest probability.

[0061] In FIG. 8(2), the student says, "I live in Hokkaido, and the cherry blossoms are in full bloom in Hokkaido right now. When did the cherry blossoms bloom in Tokyo this year?" From this conversation speech data or conversation text data, the five words "cherry blossoms, full bloom, Tokyo, cherry blossoms, blooming" are extracted as conversation processing words. As this shows, the topic can be estimated even when the words are extracted partway through a conversation (partway through a sentence). Furthermore, as in FIG. 8(3), extracting words from a conversation between a student and a teacher makes it possible to estimate the topic of their two-person conversation.

[0062]

[Effects of the Invention] As described above, the topic estimation device, topic specifying method, and topic specifying program according to the present invention provide the following excellent effects.

[0063] According to the invention described in claim 1, claim 5, claim 6, or claim 7, the topic that identifies a conversation can be estimated automatically from the conversation content. Because the topic of the conversation content can be estimated, the invention can also be used as a learning support system in educational settings. For example, while several students are engaged in group learning, the topic can be estimated from the content of their conversation and information related to that topic can be presented automatically.

[0064] Furthermore, when used in the field of speech recognition, the invention can estimate the field of the conversation content, so that candidate words and the like in speech recognition can be narrowed down to that field, which has the further effect of improving the speech recognition rate.

[0065] According to the invention described in claim 2 or claim 3, even if the conversation content (conversation text) includes a word that is not held as learning data, the topic can be estimated from the other words that are held in the learning data. The topic can therefore be estimated from any set of multiple words included in the conversation content, which raises the topic estimation rate.

[0066] According to the invention described in claim 4, the topic is estimated based on articles of news manuscripts, so by updating the news manuscripts daily, the topic can be estimated appropriately even when the conversation concerns the latest topics.

[Brief description of drawings]

FIG. 1 is a block diagram showing the configuration of a topic estimation device according to an embodiment of the present invention.

FIG. 2 is a flowchart showing the operation of extracting feature amounts in the topic estimation device according to the embodiment of the present invention.

FIG. 3 is a flowchart showing the operation of estimating a topic from conversation content in the topic estimation device according to the embodiment of the present invention.

FIG. 4 is an explanatory diagram illustrating an example of a topic and the news manuscript corresponding to that topic.

FIG. 5 is an explanatory diagram illustrating words and their importance.

FIG. 6 is an explanatory diagram illustrating three-word combinations and their combination importance.

FIG. 7 is an explanatory diagram illustrating combinations in which the order of the three words is exchanged.

FIG. 8 is an explanatory diagram illustrating examples of topic estimation results.

[Explanation of symbols]

1: topic estimation device
2: news manuscript storage means
10: topic extraction means
20: topic storage means
30: feature amount calculation means
31: learning data generation unit (learning data generation means)
32: maximum entropy conversion unit (maximum entropy means)
40: feature amount storage means
50: speech recognition means
60: estimation degree calculation means
61: morphological analysis unit
62: probability calculation unit
70: topic determination output means

Continuation of front page
(72) Inventor: Kyoko Ariyasu, Japan Broadcasting Corporation (NHK) Broadcasting Technology Research Laboratories, 1-10-11 Kinuta, Setagaya-ku, Tokyo
(72) Inventor: Masahiro Shibata, Japan Broadcasting Corporation (NHK) Broadcasting Technology Research Laboratories, 1-10-11 Kinuta, Setagaya-ku, Tokyo
(72) Inventor: Nobuyuki Yagi, Japan Broadcasting Corporation (NHK) Broadcasting Technology Research Laboratories, 1-10-11 Kinuta, Setagaya-ku, Tokyo
F-terms (reference): 5B009 QA05 QA12 VA02 VA09; 5B091 AA15 CA02 CC01 CC15

Claims

[Claims]

1. A topic estimation device that, based on a text manuscript that is language data and a topic specifying the content of the text manuscript, estimates a topic specifying conversation content from a conversation text that is the input conversation content, the device comprising: feature amount calculation means for obtaining the appearance frequency of manuscript words included in the text manuscript and obtaining, based on the appearance frequency, a word feature amount of the topic specifying the text manuscript; estimation degree calculation means for extracting conversation words included in the conversation text and obtaining, based on the conversation words and the word feature amount of the topic obtained by the feature amount calculation means, an estimation degree of the topic estimated from the conversation words; and topic determination output means for judging the estimation degree obtained by the estimation degree calculation means and outputting the topic estimated based on the judgment result.

2. The topic estimation device according to claim 1, wherein the feature amount calculation means comprises: learning data generation means for generating, as learning data, at least a combination of the manuscript words and an appearance frequency of that combination, based on the appearance frequencies of the manuscript words included in the text manuscript; and maximum entropy means for assigning to the learning data generated by the learning data generation means, based on the maximum entropy method, an appearance probability value of the combination of manuscript words as the word feature amount.

3. The topic estimation device according to claim 2, wherein the estimation degree calculation means calculates, as the estimation degree of the topic, the appearance probability value with which a plurality of combinations of the conversation words appear for each topic, based on the appearance probability values of the learning data assigned by the maximum entropy means.

4. The topic estimation device according to claim 1, wherein the text manuscript is an article of an electronic news manuscript.

5. A topic estimation device that estimates a topic specifying conversation content from a conversation text that is the input conversation content, based on articles of a digitized news manuscript, the device comprising: topic extraction means for extracting, from an article of the news manuscript, a topic specifying the content of the article; feature amount calculation means for obtaining, based on the appearance frequency of manuscript words included in the article of the news manuscript, a word feature amount of the topic specifying the news manuscript; estimation degree calculation means for extracting conversation words included in the conversation text and obtaining, based on the conversation words and the word feature amount of the topic obtained by the feature amount calculation means, an estimation degree of the topic estimated from the conversation words; and topic determination output means for judging the estimation degree obtained by the estimation degree calculation means and outputting the topic estimated based on the judgment result.

6. A topic estimation method that, based on a text manuscript that is language data and a topic specifying the content of the text manuscript, estimates a topic specifying conversation content from a conversation text that is the input conversation content, the method comprising: a learning data generation step of generating, as learning data of the topic, combinations of a plurality of manuscript words included in the text manuscript and the appearance frequencies of those combinations; a maximum entropy step of obtaining, based on the maximum entropy method, appearance probability values of the combinations of manuscript words from the learning data generated in the learning data generation step; an estimation degree calculation step of calculating, as the estimation degree of the topic, the appearance probability value with which a plurality of combinations of conversation words included in the conversation text appear for each topic, based on the appearance probability values obtained in the maximum entropy step; and a topic determination output step of judging the estimation degree obtained in the estimation degree calculation step and outputting the topic estimated based on the judgment result.

7. A topic estimation program that, in order to estimate a topic specifying conversation content from a conversation text that is the input conversation content, based on a text manuscript that is language data and a topic specifying the content of the text manuscript, causes a computer to function as: learning data generation means for generating, as learning data of the topic, combinations of a plurality of manuscript words included in the text manuscript and the appearance frequencies of those combinations; maximum entropy means for obtaining, based on the maximum entropy method, appearance probability values of the combinations of manuscript words from the learning data generated by the learning data generation means; estimation degree calculation means for calculating, as the estimation degree of the topic, the appearance probability value with which a plurality of combinations of conversation words included in the conversation text appear for each topic, based on the appearance probability values obtained by the maximum entropy means; and topic determination output means for judging the estimation degree obtained by the estimation degree calculation means and outputting the topic estimated based on the judgment result.