JP4134975B2

JP4134975B2 - Topic document presentation method, apparatus, and program

Info

Publication number: JP4134975B2
Application number: JP2004309576A
Authority: JP
Inventors: 吉秀佐藤; 努佐々木; 晴美川島; 雅且大久保
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2004-10-25
Filing date: 2004-10-25
Publication date: 2008-08-20
Anticipated expiration: 2024-10-25
Also published as: JP2006120069A

Description

The present invention relates to a topic document presentation method, apparatus, and program. In particular, to let a user efficiently browse a large number of documents whose creation times can be identified, it groups documents on the same theme into sets, calculates for each document a score representing its degree of topicality that takes the document's creation time into account, and organizes and presents the documents from the viewpoints of both similarity and topicality.

With the spread of the Internet, networks are flooded with an enormous number of documents. For documents that are updated and added many times a day, news being a typical example, browsing them one by one to grasp the information takes a great deal of effort.

Moreover, while distributed information sources are a characteristic of the Internet, documents written on the same theme from various viewpoints are published by multiple sources. Even if a user wants to sift through them and read only the necessary information, browsing becomes extremely difficult as the total volume of circulating documents grows. The growing volume also buries documents of interest, making necessary information hard to obtain.

Under these circumstances, there is a technique that organizes a document set and presents it to the user so that many documents can be browsed efficiently (see, for example, Patent Document 1).

The above technique is a method of searching for documents while narrowing down the candidates to be presented through user operations. The document group obtained from a basic search based on the user's information request is divided into sets of mutually similar documents, and a word list representing each set is presented. For the set the user selects, the classification of similar documents and the presentation of word lists are repeated, so that the search proceeds interactively.
Patent Document 1: Japanese Patent Laid-Open No. 11-213000

However, the above technique is aimed, if anything, at users who actively search for documents, and it is not a presentation technique that takes time into account. It is therefore difficult to use for monitoring information in a situation where new documents keep being added.

The present invention has been made in view of the above points. Its object is to provide a topic document presentation method, apparatus, and program that, in a situation where an enormous number of documents circulate, assist passive users in grasping the topicality of information that changes from moment to moment. By organizing and presenting the documents from two viewpoints, the similarity between documents and the rise of a topic over time, the invention provides an efficient information acquisition environment for users who want to obtain the latest topic information from a passive standpoint.

FIG. 1 is a diagram for explaining the principle of the present invention.

The present invention (Claim 1) is a topic document presentation method in a topic document presentation apparatus that automatically selects and presents highly topical documents from an enormous group of documents with creation time information. The method performs:
a similarity calculation step (step 1), in which similarity calculation means determines attribute values for the words contained in the input documents, calculates the similarity between documents using document vectors obtained by converting each input document into a vector of attribute values, and records the similarities in similarity recording means;
a clustering step (step 2), in which clustering means uses the inter-document similarities recorded in the similarity recording means to divide the input document group into at least one subset of similar documents, and stores the documents belonging to each subset in cluster recording means;
a provisional topic level calculation step, in which provisional topic level calculation means calculates, for each document, a provisional topic level by summing all the similarities between that document and the other documents in the subset, among the subsets produced in the clustering step, to which it belongs;
a freshness determination step, in which freshness determination means determines the freshness of each document by a function whose value is larger the closer the document's creation time is to the current time and smaller the farther it is from the current time;
a document topic level calculation step (step 3), in which document topic level calculation means calculates, for each document, a document topic level by multiplying the provisional topic level by the freshness, and stores it in document topic level recording means; and
a presentation data creation step (step 4), in which presentation data creation means uses the documents belonging to each subset stored in the cluster recording means and the document topic levels stored in the document topic level recording means to create and output, for each subset, data in which the documents in that subset are arranged in descending order of document topic level.
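
As a rough illustration of the document topic level calculation and presentation data creation steps, the following Python sketch multiplies each document's provisional topic level by its freshness and sorts each subset in descending order of the product. The function name `rank_clusters`, the dictionary layout, and the sample values are assumptions made for illustration and do not come from the patent.

```python
def rank_clusters(clusters, provisional, freshness):
    """Order the documents of each cluster by document topic level.

    clusters:    dict cluster_id -> list of document IDs (hypothetical layout)
    provisional: dict document ID -> provisional topic level
    freshness:   dict document ID -> freshness value
    """
    # Document topic level = provisional topic level x freshness.
    topic_level = {d: provisional[d] * freshness[d] for d in provisional}
    # Within each cluster, list the documents from highest to lowest level.
    return {
        cid: sorted(members, key=lambda d: topic_level[d], reverse=True)
        for cid, members in clusters.items()
    }


# Illustrative values only.
clusters = {"C1": ["b", "a", "c"], "C2": ["d"]}
provisional = {"a": 1.28, "b": 2.0, "c": 0.5, "d": 0.0}
freshness = {"a": 0.9, "b": 0.5, "c": 1.0, "d": 1.0}
ranked = rank_clusters(clusters, provisional, freshness)
print(ranked)
```

In this example document "b" has the largest provisional topic level, but its low freshness places it below the more recent document "a".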

Further, in the present invention (Claim 2), in the freshness determination step, the freshness determination means uses a function expressed as an exponential function as the function for determining the freshness.

Further, in the present invention (Claim 3), category classification means additionally performs a category classification step of classifying each input document into one or more of a plurality of categories according to its content, and the similarity calculation step, clustering step, provisional topic level calculation step, freshness determination step, document topic level calculation step, and presentation data creation step are each performed separately for each category.

FIG. 2 is a diagram showing the principle configuration of the present invention.

The present invention (Claim 4) is a topic document presentation apparatus that automatically selects and presents highly topical documents from an enormous group of documents with creation time information, comprising:
similarity calculation means 206 that determines attribute values for the words contained in the input documents, calculates the similarity between documents using document vectors obtained by converting each input document into a vector of attribute values, and records the similarities in similarity recording means 207;
clustering means 208 that uses the inter-document similarities recorded in the similarity recording means 207 to divide the input document group into at least one subset of similar documents, and stores the documents belonging to each subset in cluster recording means 209;
provisional topic level calculation means 210 that calculates, for each document, a provisional topic level by summing all the similarities between that document and the other documents in the subset, among the subsets produced by the clustering means 208, to which it belongs;
freshness determination means that determines the freshness of each document by a function whose value is larger the closer the document's creation time is to the current time and smaller the farther it is from the current time;
document topic level calculation means 210 that calculates, for each document, a document topic level by multiplying the provisional topic level by the freshness, and stores it in document topic level recording means 211; and
presentation data creation means 212 that uses the documents belonging to each subset stored in the cluster recording means 209 and the document topic levels stored in the document topic level recording means 211 to create and output, for each subset, data in which the documents in that subset are arranged in descending order of document topic level.

Further, in the present invention (Claim 5), the freshness determination means uses a function expressed as an exponential function as the function for determining the freshness.

Further, the present invention (Claim 6) additionally comprises category classification means for classifying each input document into one or more of a plurality of categories according to its content, and the processing of the similarity calculation means, clustering means, provisional topic level calculation means, document topic level calculation means, freshness determination means, and presentation data creation means is performed separately for each category.

The present invention (Claim 7) is a topic document presentation program for causing a computer to function as each of the means constituting the topic document presentation apparatus according to any one of Claims 4 to 6.

As described above, according to the present invention, the similarity between documents is calculated to aggregate similar documents, and the degree of topicality of each document is attached to it as a topic level. The user is thereby freed from the burden of reading multiple documents that cover the same content and can preferentially browse only the most topical documents in each group of similar documents (cluster), which provides an effective information acquisition environment.

Embodiments of the present invention are described below with reference to the drawings.

[First Embodiment]
FIG. 3 shows the configuration of the topic document presentation apparatus in the first embodiment of the present invention.

A topic output device 21 for outputting topic document information is connected to the topic document presentation apparatus 20 shown in the figure.

The topic document presentation apparatus 20 comprises a document collection unit 201, a document data recording unit 202, a document analysis unit 203, a word totaling unit 204, a document vector recording unit 205, a similarity calculation unit 206, a similarity recording unit 207, a clustering unit 208, a cluster recording unit 209, a document topic level calculation unit 210, a document topic level recording unit 211, and a presentation data creation unit 212.

The document topic level calculation unit 210 comprises a time information acquisition unit 2101, a similarity acquisition unit 2102, a similarity addition unit 2103, a freshness coefficient calculation unit 2104, and a multiplication unit 2105.

The function of each processing unit constituting the topic document presentation apparatus 20 is described below.

The document collection unit 201 collects document data from external information sources. For example, it acquires news article data published on newspaper company sites on the Internet, assigns a unique identifier (document ID) to each document, and stores the data in the document data recording unit 202. At that time it also acquires the creation time information of each document and records it together with the document.

As shown in FIG. 4, the document data recording unit 202 records each document's creation time and body text, together with the document ID uniquely assigned to the document by the document collection unit 201.

The document analysis unit 203 acquires the document data stored in the document data recording unit 202, splits the body text into words by morphological analysis, and provisionally outputs the resulting word list to the document vector recording unit 205. At the same time, the document analysis unit 203 sends the word list to the word totaling unit 204, which counts the number of documents in which each word appears.

The morphological analysis performed by the document analysis unit 203 splits a sentence into its constituent elements (morphemes), such as parts of speech like verbs and particles and symbols like "、" (comma) and "。" (period). In the processing of the topic document presentation method of the present invention, documents are represented as vectors in order to calculate the similarity between documents, so morphemes that are not suited to characterizing a document need not be handled. Accordingly, punctuation marks and, where necessary, morphemes such as particles and auxiliary verbs may be removed from the result of the morphological analysis, and only nouns, verbs, adjectives, and the like need be sent to the document vector recording unit 205 and the word totaling unit 204.

The word totaling unit 204 counts the number of documents in which each word sent from the document analysis unit 203 (in this embodiment, each morpheme remaining after punctuation marks, particles, and the like are removed) appears, and uses the result to determine a weight for each word. For example, suppose that the document analysis unit 203 splits the document with document ID "DOC-012345" into words and obtains n distinct words W1 to Wn. The method of determining the word weights is described later; suppose that, taking into account how often these words appear across all documents, the word totaling unit 204 obtains weights V1 to Vn for the respective words. The document with document ID "DOC-012345" can then be expressed in vector form, listing the weights of the words that appear in it: the attribute value for word W1 is V1, the attribute value for word W2 is V2, and so on.

The flow of processing by which the word totaling unit 204 determines the weight of each word is described in detail with reference to FIG. 5.

First, the word list that the document analysis unit 203 obtained by splitting a document into words is acquired for one document, and for each word the number of documents in which it appears at least once (hereinafter, the "appearing document count") is tallied. At the same time, the number of documents processed is counted (step 201).

When step 201 has been repeated for all the documents processed by the document analysis unit 203 (step 202), the appearing document count of each word has been tallied: for example, "5" for word W1, "8" for word W2, and so on.

Next, the weight of each word is determined (step 203).

The TF-IDF method is commonly used to determine weights representing the importance of words appearing in a document. It uses a word's TF (Term Frequency: the number of occurrences within a document) and DF (Document Frequency: the number of documents in which the word appears), and gives the weight TF-IDF(d, w) of word w in document d by equation (1). Here TF(d, w) is the number of occurrences of word w in document d, N is the total number of documents, and DF(w) is the number of documents among all N in which word w appears. The TF-IDF method regards a word as more important the more frequently it appears within a document and the fewer documents it appears in across the whole collection.

Other methods can also be used to weight words, but this embodiment assumes in particular that news articles are collected and the topic document presentation method is applied to them, and weights word w by Weight(w) of equation (2) rather than by TF-IDF(d, w) of equation (1). This is because, in news articles, nouns such as the names of people and organizations that are the subject of an article do not necessarily occur many times within a single document, so the number of occurrences is not directly tied to a word's importance. Weight(w) is the TF-IDF weight with the effect of the in-document occurrence count (TF) omitted.
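
Equations (1) and (2) are referenced above but are not reproduced in this text. A commonly used form that is consistent with the definitions of TF(d, w), N, and DF(w) given above is shown below. It is an assumption, not the patent's exact notation: the logarithm base and any smoothing terms used in the original equations are unknown.

```latex
% Assumed standard form of equation (1): TF-IDF weight of word w in document d
\mathrm{TF\text{-}IDF}(d, w) = \mathrm{TF}(d, w) \cdot \log\frac{N}{\mathrm{DF}(w)}

% Assumed form of equation (2): the same weight with the TF factor omitted
\mathrm{Weight}(w) = \log\frac{N}{\mathrm{DF}(w)}
```

Under this form, a word that appears in nearly every document receives a weight close to 0, and a word that appears in few documents receives a large weight.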

Once the weights Weight(w) of all words have been determined (step 203), the word totaling unit 204 refers to the word list of each document stored in the document vector recording unit 205, assigns the weight Weight(w) as the attribute value of each word that appears in the document, determines the document vector, and stores it in the document vector recording unit 205 (step 204).

FIG. 6 shows an example of document vectors stored in the document vector recording unit in the first embodiment of the present invention. The document with ID "DOC-012345" is expressed as a vector in which the attribute value for "this morning" is "0.0053", the attribute value for "○×" is "2.38", the attribute value for "player" is "1.02", and so on. As equation (2) shows, the weight that the word totaling unit 204 determines for each word depends only on the total number of documents N and the word's appearing document count DF(w). For example, the attribute value of the word "this morning" contained in both "DOC-012345" and "DOC-012346" is, in both cases, the weight "0.0053" determined for that word in step 203.
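
Steps 201 to 204 can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: it assumes the weight takes the form log(N / DF(w)), and the function names `word_weights` and `document_vectors`, the sample document IDs, and the sample words are all hypothetical.

```python
import math
from collections import Counter


def word_weights(docs):
    """docs: dict document ID -> list of words. Returns dict word -> weight."""
    n_docs = len(docs)
    appearing_docs = Counter()
    for words in docs.values():
        # A word counts once per document, however often it occurs in it.
        appearing_docs.update(set(words))
    # Assumed weight form: log(N / DF(w)); the patent's equation (2) is not
    # reproduced in this text.
    return {w: math.log(n_docs / df) for w, df in appearing_docs.items()}


def document_vectors(docs, weights):
    """Represent each document as a sparse vector: word -> attribute value."""
    return {
        doc_id: {w: weights[w] for w in set(words)}
        for doc_id, words in docs.items()
    }


# Hypothetical word lists, as if produced by morphological analysis.
docs = {
    "DOC-A": ["morning", "player", "win"],
    "DOC-B": ["morning", "player", "injury"],
    "DOC-C": ["morning", "election"],
}
weights = word_weights(docs)
vectors = document_vectors(docs, weights)
print(vectors["DOC-A"])
```

Because the weight depends only on N and DF(w), the same word carries the same attribute value in every document that contains it, as in the "this morning" example above.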

The similarity calculation unit 206 refers to the document vectors stored in the document vector recording unit 205 and calculates the similarity between documents. FIG. 7 is a flowchart of the processing of the similarity calculation unit in the first embodiment of the present invention.

First, the document vector recording unit 205 is referenced and the document vectors of two different documents are acquired (step 301).

Next, the similarity between the two acquired documents is calculated (step 302). The simplest way to calculate similarity from two document vectors is to use cosine similarity, which takes the cosine of the angle between the two vectors as the similarity between the two documents. The more words the two documents have in common, the smaller the angle between their vectors and hence the larger the cosine. In this embodiment, the similarity between two documents d_i and d_j is the cosine similarity S(d_i, d_j) defined by equation (3), where d_i→ and d_j→ are the vectors of the two documents, |d_i→| and |d_j→| are the magnitudes of those vectors, and θ_ij is the angle between them.
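
Equation (3) is not reproduced in this text. Based on the symbols defined above, the cosine similarity can be written as follows; the typesetting is a reconstruction and may differ from the original.

```latex
S(d_i, d_j) = \cos\theta_{ij}
            = \frac{\vec{d_i} \cdot \vec{d_j}}{\lvert \vec{d_i} \rvert \, \lvert \vec{d_j} \rvert}
```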

The similarity calculated in this way is stored in the similarity recording unit 207 (step 303), and steps 301 to 303 are performed for every combination of document vectors stored in the document vector recording unit 205 (step 304).

FIG. 8 shows an example of inter-document similarity data stored in the similarity recording unit in the first embodiment of the present invention. It shows, for example, that the similarity between documents "DOC-012345" and "DOC-012346" was "0.021". When two documents have no words in common, the inner product (d_i→, d_j→) in equation (3) is "0", so the similarity is "0", as for the pair "DOC-012345" and "DOC-012347" in FIG. 8.
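
The loop of steps 301 to 304 can be sketched in Python as below, with each document vector held as a sparse dictionary of word weights. The function names, the sample vectors, and the handling of an all-zero vector are assumptions for illustration.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors (dict word -> weight)."""
    # Only words present in both documents contribute to the inner product.
    dot = sum(value * v[word] for word, value in u.items() if word in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumption: an all-zero vector is treated as similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)


def all_pair_similarities(vectors):
    """Similarity for every unordered pair of documents."""
    return {
        (a, b): cosine_similarity(vectors[a], vectors[b])
        for a, b in combinations(sorted(vectors), 2)
    }


# Hypothetical document vectors.
vectors = {
    "DOC-1": {"player": 1.0, "win": 2.0},
    "DOC-2": {"player": 1.0, "injury": 2.0},
    "DOC-3": {"election": 1.5},
}
sims = all_pair_similarities(vectors)
print(sims)
```

"DOC-1" and "DOC-3" share no words, so their inner product and hence their similarity is 0, matching the behavior described for FIG. 8.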

The clustering unit 208 uses the inter-document similarities stored in the similarity recording unit 207 to group similar documents together. This processing assumes that mutually similar documents are likely to have been written on the same subject and gathers them into one group. Hereinafter, a group of documents gathered by this processing is called a "cluster", and the processing that divides the whole document group into a plurality of clusters is called "clustering".

Various existing methods can be used for clustering, and the clustering method used in the present invention is not limited to any particular one. This embodiment is described using the maximum distance method, one of the methods that can be used even when the number of clusters cannot be estimated in advance. The maximum distance method is implemented, for example, as described in Iwanami Shoten, Multimedia Informatics 2, "Organization of Information", pp. 192-193.

Before clustering, the distance between documents is defined as 1 minus the inter-document similarity. By equation (3), the inter-document similarity takes a value from 0 to 1 and is closer to 1 the more similar the documents are. Subtracting it from 1 therefore yields a "dissimilarity" that is closer to 0 the more similar two documents are and approaches 1 as the similarity falls. This dissimilarity is treated as the distance between documents.

The clustering process based on the maximum distance method is described below. The maximum distance method forms a new cluster around a document that lies sufficiently far from the existing clusters, and proceeds by increasing the number of clusters one at a time.

FIG. 9 is a flowchart of the clustering process in the first embodiment of the present invention.

Step 401) First, a first cluster centered on an arbitrary document is formed. At this point, that document is the only one belonging to the first cluster.

Step 402) For each document, the nearest cluster is found. Here, the distance from a document d to a cluster c means the distance between d and the document at the center of c. The first time step 402 is performed, only one cluster exists, so the nearest cluster is determined without any search.

Step 403) Step 402 yields, for each document, the distance to its nearest cluster (the shortest distance). The document for which this shortest distance is largest is then found; that is, the document d_k lying farthest from the centers of the existing clusters.

Step 404) Let m be the maximum of the shortest distances found in step 403 and MAX the maximum distance between the centers of the existing clusters. With r a constant, it is determined whether the condition "m / MAX > r" is satisfied. Only the constant r needs to be set in advance. If the condition is satisfied, the process proceeds to step 405; otherwise, the process ends.

Step 405) If the condition is satisfied, a new cluster centered on document d_k is formed, and the processing from step 402 onward is repeated until the condition is no longer satisfied.
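
Steps 401 to 405 can be sketched in Python as follows. The text does not say how MAX is defined while only one cluster exists, so this sketch assumes that the farthest document always becomes the second center; that choice, the function names, the value r = 0.5, and the sample similarities are illustrative assumptions only.

```python
def make_distance(sims):
    """Distance = 1 - similarity, looked up from a dict keyed by ID pairs."""
    def distance(a, b):
        if a == b:
            return 0.0
        return 1.0 - sims.get((a, b), sims.get((b, a), 0.0))
    return distance


def max_distance_clustering(doc_ids, distance, r=0.5):
    centers = [doc_ids[0]]  # step 401: first cluster around one document
    while True:
        # Step 402: nearest cluster center for every document.
        nearest = {d: min(centers, key=lambda c: distance(d, c)) for d in doc_ids}
        # Step 403: document whose shortest distance is largest.
        d_k = max(doc_ids, key=lambda d: distance(d, nearest[d]))
        m = distance(d_k, nearest[d_k])
        if m == 0.0:
            break  # every document coincides with some center
        if len(centers) > 1:
            # Step 404: compare m with the largest center-to-center distance.
            max_center = max(
                distance(a, b)
                for i, a in enumerate(centers)
                for b in centers[i + 1:]
            )
            if m / max_center <= r:
                break
        # Step 405 (and, by assumption, the second center): new cluster at d_k.
        centers.append(d_k)
    clusters = {c: [] for c in centers}
    for d in doc_ids:
        clusters[nearest[d]].append(d)
    return clusters


# Hypothetical similarities: A and B are alike, C and D are alike.
sims = {("A", "B"): 0.9, ("C", "D"): 0.8, ("A", "C"): 0.1,
        ("A", "D"): 0.1, ("B", "C"): 0.1, ("B", "D"): 0.1}
clusters = max_distance_clustering(["A", "B", "C", "D"], make_distance(sims))
print(clusters)
```

A smaller r lets more documents qualify as new centers and so yields more, tighter clusters; a larger r stops the process earlier.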

When the clustering unit 208 has finished the above processing, it transfers the result to the cluster recording unit 209 for storage. FIG. 10 shows an example of clustering result data in the cluster recording unit in an embodiment of the present invention. C1, C2, C3, C4, ... are cluster identifiers (cluster IDs), and the IDs of the documents belonging to each cluster are recorded. Clustering by the maximum distance method can produce clusters consisting of a single document, such as cluster "C2", but every document always belongs to exactly one cluster.

The document topic level calculation unit 210 uses the inter-document similarity data stored in the similarity recording unit 207 to quantify the degree of topicality of each document, and outputs it to the document topic level recording unit 211. The flow of processing performed by the document topic level calculation unit 210 is described below with reference to FIG. 11.

FIG. 11 is a flowchart of the processing performed by the document topic level calculation unit in the first embodiment of the present invention.

First, the time information acquisition unit 2101 acquires one pair of document ID and creation time from the document data recording unit 202; for example, from the data shown in FIG. 4, the document "DOC-012345" and its creation time "2004/9/11 23:00" (step 501).

The time information acquisition unit 2101 sends the acquired document ID "DOC-012345" to the similarity acquisition unit 2102, which then acquires the similarities between that document and the other documents from the similarity recording unit 207 and stores them in memory (step 502). In the example shown in FIG. 8, the similarities "0.021", "0", "0.300", ... between "DOC-012345" and the other documents are acquired one after another.

Next, the similarity adding unit 2103 adds up the similarities acquired by the similarity acquisition unit 2102 one by one and stores the running total in memory (step 503). For document "DOC-012345", "0.021", "0", "0.300", ... are added in turn; assume that the total finally comes to "1.28". The value "1.28" obtained in this way, by adding all the similarities between document "DOC-012345" and each of the other documents, is called the provisional topic level of document "DOC-012345". The provisional topic level is a way of evaluating topicality that regards a document as attracting more attention the more documents there are that resemble it. It is particularly effective for newspaper articles and the like, because for highly topical matters several newspaper companies publish related articles one after another, which increases the number of similar documents.

The freshness coefficient calculation unit 2104 computes the difference between the time information "2004/9/11 23:00" received from the time information acquisition unit 2101 and the current time, determines the freshness coefficient of the acquired document "DOC-012345", and holds it in memory (step 504).

The freshness coefficient is determined by a function of time whose value is larger the closer the document's creation time is to the current time, and which decreases toward 0 as the document gets older. A function with this property is stored in memory in advance. For example, the forgetting curve of FIG. 12, which shows how a person's memory of previously acquired information fades over time, may be used. The curve expresses that information acquired recently (at a time close to the present) is still largely retained, whereas information acquired in the past (at a time far from the present) is now retained only in small amounts. The vertical axis is the amount retained, and the retained amount $F(t)$ for a point $t$ before the present (time $-t$, taking the present as 0) is given by equation (4). In equation (4), $F(t)$ equals $F_0$ at $t = 0$ (the present) and approaches 0 as $t$ goes further into the past.

$$F(t) = F_0 \exp\left(-\frac{t}{T_0}\right) \tag{4}$$

$T_0$ in equation (4) is a parameter that determines how quickly the freshness coefficient decays going back in time (the rate of forgetting). A larger value gives a gentler decay, so that even documents with old creation times receive a reasonably large freshness coefficient and are more readily judged to be topic documents. Conversely, a smaller $T_0$ gives a steeper curve: documents with very recent creation times are regarded as topic documents, and older documents tend not to be. $T_0$ is a constant that can be set in various ways according to the nature of the collected documents and the usage scenario.
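As an illustration of equation (4), the following minimal sketch computes a freshness coefficient from a creation time. The function name, the choice of hours as the unit of $t$, the defaults $F_0 = 1$ and $T_0 = 24$ hours, and the "current time" used in the example are all assumptions made for this sketch. The text does not specify them.

```python
import math
from datetime import datetime

def freshness_coefficient(created, now, t0_hours=24.0, f0=1.0):
    """Equation (4): F(t) = F0 * exp(-t / T0), where t is the document age.

    Measuring t in hours and the defaults T0 = 24 h, F0 = 1 are assumptions.
    """
    age_hours = (now - created).total_seconds() / 3600.0
    age_hours = max(age_hours, 0.0)  # treat a future timestamp as "now"
    return f0 * math.exp(-age_hours / t0_hours)

# Hypothetical current time, one day after the creation time used in the text.
now = datetime(2004, 9, 12, 23, 0)
created = datetime(2004, 9, 11, 23, 0)

slow_decay = freshness_coefficient(created, now, t0_hours=72.0)  # gentle curve
fast_decay = freshness_coefficient(created, now, t0_hours=6.0)   # steep curve
```

With the larger $T_0$ the one-day-old document keeps a noticeably higher coefficient than with the smaller one, which is the behaviour described above.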

In what follows, it is assumed that in step 504 the difference between the time information "2004/9/11 23:00" of "DOC-012345" and the current time was computed and applied to equation (4), yielding a freshness coefficient of "0.7".

The multiplication unit 2105 multiplies the provisional topic level "1.28" of document "DOC-012345" by the freshness coefficient "0.7" to obtain the topic level "0.896" of "DOC-012345" (step 505). After the multiplication, the provisional topic level "1.28" held in the similarity adding unit 2103 is cleared to "0" in preparation for calculating the document topic level of the next document.

算出された文書「ＤＯＣ−０１２３４５」の文書話題度は、文書ＩＤと共に文書話題度記録部２１１に保存する（ステップ５０６）。 The calculated document topic level of the document “DOC-012345” is stored in the document topic level recording unit 211 together with the document ID (step 506).

以上ステップ５０１〜ステップ５０６の処理を、時刻情報取得部２１０１が取得した全文書について繰り返し（ステップ５０７）、処理を終える。 The processing from step 501 to step 506 is repeated for all documents acquired by the time information acquisition unit 2101 (step 507), and the processing ends.
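The loop of steps 501 to 507 can be sketched roughly as follows. This is only an illustration under stated assumptions: the dictionary-based storage, the use of `frozenset` pairs as similarity keys, the hour-based $T_0$, and all names are hypothetical stand-ins for the recording units described above.

```python
import math
from datetime import datetime

def topic_levels(created_at, similarity, now, t0_hours=24.0, f0=1.0):
    """Return {doc_id: document topic level}, loosely following steps 501-507.

    created_at: {doc_id: datetime}           (stand-in for recording unit 202)
    similarity: {frozenset({a, b}): value}   (stand-in for recording unit 207)
    """
    levels = {}
    for doc_id, created in created_at.items():              # step 501
        provisional = 0.0
        for other in created_at:                             # steps 502-503
            if other != doc_id:
                provisional += similarity.get(frozenset((doc_id, other)), 0.0)
        age_hours = max((now - created).total_seconds() / 3600.0, 0.0)
        fresh = f0 * math.exp(-age_hours / t0_hours)         # step 504, eq. (4)
        levels[doc_id] = provisional * fresh                 # step 505
    return levels                                            # steps 506-507

# Hypothetical two-document example.
now = datetime(2004, 9, 12, 23, 0)
created_at = {"A": now, "B": datetime(2004, 9, 11, 23, 0)}
similarity = {frozenset(("A", "B")): 0.5}
levels = topic_levels(created_at, similarity, now)

# The worked figures from the text: 1.28 * 0.7 is 0.896 up to rounding.
worked_example = 1.28 * 0.7
```

Both documents have the same provisional topic level here, so the newer document "A" ends up with the higher document topic level.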

図１３は、本発明の第1の実施の形態における文書話題度記録部に記録された文書話題度データの例である。同図に示す文書話題度記録部２１１には、文書ＩＤ「ＤＯＣ−０１２３４５」と文書話題度「1.28」の組が記録されている。他の文書についても同様に文書話題度が記録されている。 FIG. 13 is an example of the document topic level data recorded in the document topic level recording unit in the first embodiment of the present invention. In the document topic level recording unit 211 shown in the figure, a set of the document ID “DOC-012345” and the document topic level “1.28” is recorded. The document topic level is similarly recorded for other documents.

The presentation data creation unit 212 refers to the data recorded in the cluster recording unit 209 and the document topic level recording unit 211, and creates the data that the topic document presentation device 20 of the present invention outputs. It acquires the list of document IDs belonging to each cluster from the cluster recording unit 209, and the document topic level corresponding to each document ID from the document topic level recording unit 211. It further acquires the creation time and body text corresponding to each document ID from the document data recording unit 202, and creates document list data for each cluster. The documents within each cluster are sorted in descending order of document topic level. The data created in this way is output to the topic output device 21 and presented to the user.
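The assembly of presentation data described above might look like the following sketch. The record layout (`created`, `body`) and the function name are assumptions, and the sample values are invented for illustration only.

```python
def build_presentation(clusters, topic_level, documents):
    """Build per-cluster document lists sorted by descending topic level.

    clusters:    {cluster_id: [doc_id, ...]}
    topic_level: {doc_id: float}
    documents:   {doc_id: {"created": str, "body": str}}  (assumed layout)
    """
    presentation = {}
    for cluster_id, doc_ids in clusters.items():
        ranked = sorted(doc_ids, key=lambda d: topic_level[d], reverse=True)
        presentation[cluster_id] = [
            {
                "id": d,
                "topic_level": topic_level[d],
                "created": documents[d]["created"],
                "body": documents[d]["body"],
            }
            for d in ranked
        ]
    return presentation

# Hypothetical sample data.
clusters = {"C1": ["d1", "d2", "d3"], "C2": ["d4"]}
topic_level = {"d1": 0.2, "d2": 0.9, "d3": 0.5, "d4": 0.1}
documents = {d: {"created": "2004/9/11 23:00", "body": "..."} for d in topic_level}
view = build_presentation(clusters, topic_level, documents)
```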

FIG. 14 shows a display example of topic documents output on the screen of the topic output device 21. Clusters are arranged horizontally from the left (cluster 1, cluster 2, ...), and the documents belonging to each cluster are listed vertically. In the figure, (a) is the area displaying the documents belonging to the cluster with cluster ID "C1", listed from the top in descending order of document topic level. Within that area, (b) is the creation time of the document ranked third by document topic level in the cluster, (c) is its document topic level, and (d) is its body text.

In the example of FIG. 14, four documents concerning the retirement of a certain athlete are grouped together as "cluster 1", "cluster 2" is a cluster consisting of a single document about a museum event, and "cluster 3" is a cluster that gathers many documents concerning the merger of baseball teams. The museum event covered by the document in "cluster 2" was not covered by any other document, so the cluster consists of that one document only. In contrast, "cluster 3" relates to the single topic of the baseball team merger and has grown into a large cluster in which related documents are gathered in a chain: documents about fans' petition campaigns, documents about the strike, documents about the effects of the merger, and so on.

文書話題度は、類似文書が多ければ多いほど大きく、文書の作成時刻が新しければ新しいほど大きな値をとる。このため、「クラスタ２」に属する文書のように話題性の低い文書は、文書話題度が小さい。さらに、「クラスタ３」に属する文書に注目すると、作成時刻が新しい文書の文書話題度が比較的大きく、上位に表示される傾向がある。 The document topic level increases as the number of similar documents increases, and increases as the document creation time is new. For this reason, a document having a low topicality such as a document belonging to “Cluster 2” has a low document topic level. Further, when attention is focused on documents belonging to “cluster 3”, the document topic level of a document with a new creation time is relatively large and tends to be displayed at the top.

By skimming the top-ranked documents of each cluster, the user can grasp the topical information in a short time, and by reading the documents of any cluster of interest in detail, the user can browse documents efficiently.

[Second Embodiment]
FIG. 15 shows the configuration of a topic document presentation device according to the second embodiment of the present invention.

話題文書提示装置１３０は、話題文書情報を出力するための話題出力装置１３１４、及び、利用者からの操作を受け付ける入力装置１３１５が接続される。 The topic document presentation device 130 is connected to a topic output device 1314 for outputting topic document information and an input device 1315 for receiving an operation from a user.

The topic document presentation device 130 comprises a document collection unit 131, a document data recording unit 132, a document analysis unit 133, a document classification unit 134, a word totaling unit 135, a document vector recording unit 136, a similarity calculation unit 137, a similarity recording unit 138, a clustering unit 139, a cluster recording unit 1310, a document topic level calculation unit 1311, a document topic level recording unit 1312, and a presentation data creation unit 1313. The document topic level calculation unit 1311 in turn comprises a time information acquisition unit 13111, a similarity acquisition unit 13112, a similarity adding unit 13113, a freshness coefficient calculation unit 13114, and a multiplication unit 13115.

以下、話題文書提示装置１３０を構成する各処理部の機能を説明する。 Hereinafter, the function of each processing unit constituting the topic document presentation device 130 will be described.

The document collection unit 131 collects document data from external information sources and has the same function as the document collection unit 201 in the first embodiment. It assigns a unique identifier (document ID) to each collected document and stores the document in the document data recording unit 132. At that time, it also acquires the creation time information of each document and records it together with the document. The document collection unit 131 further sends the collected documents to the document analysis unit 133.

文書解析部１３３は、文書収集部１３１から受け取った文書を形態素解析処理によって単語毎に分割して、得られる単語のリストを文書ベクトル記録部１３６に一旦出力する。また、文書解析部１３３は、同時に単語のリストを単語集計部１３５にも送出し、単語集計部１３５が各単語が現れる文書数を集計する。 The document analysis unit 133 divides the document received from the document collection unit 131 into words by morphological analysis processing, and temporarily outputs the obtained word list to the document vector recording unit 136. In addition, the document analysis unit 133 simultaneously sends a word list to the word totaling unit 135, and the word totaling unit 135 totals the number of documents in which each word appears.

単語集計部１３５は、前述の第１の実施の形態における単語集計部２０４と同様に、各単語の集計文書数を用いて式（２）から重みを決定し、各文書の文書ベクトルを構成して文書ベクトル記録部１３６に保存する。文書ベクトル記録部１３６には、第１の実施の形態と同様に、図６のように文書ＩＤと文書ベクトルが保存される。 Similar to the word totaling unit 204 in the first embodiment described above, the word totaling unit 135 determines the weight from the formula (2) using the total number of documents for each word, and constructs a document vector for each document. And stored in the document vector recording unit 136. As in the first embodiment, the document vector recording unit 136 stores a document ID and a document vector as shown in FIG.

The document analysis unit 133 further sends the word list of each document to the document classification unit 134. Using this word list, the document classification unit 134 classifies the document into one or more of a plurality of predetermined categories. For classifying documents into categories, an existing document classification technique is used, for example: Naonori Ueda and Kazumi Saito, "Probabilistic Models for Multi-Topic Text: Parametric Mixture Models", IEICE Transactions D-II, Vol. J87-D-II, No. 3, pp. 872-883, March 2004.

上記の既存技術の概要を簡単に説明する。 An outline of the above existing technology will be briefly described.

まず、カテゴリが既知の文書集合を機械に学習させ、学習させた機械を用いて未知の文書がどのカテゴリに属するかを予測する。この予測処理が文書分類に相当する。 First, a machine learns a set of documents whose categories are known, and predicts which category an unknown document belongs to using the learned machine. This prediction process corresponds to document classification.

The document analysis unit 133 sends the word list of each document to the document classification unit 134, attaching to each word its number of occurrences in the document. If $X_{mn}$ denotes the number of occurrences of word $W_n$ in document $d_m$, the word frequency vector $\vec{X}_m$ of document $d_m$ is expressed as $\vec{X}_m = (X_{m1}, X_{m2}, \ldots, X_{mv})$.

Let $L$ be the total number of categories, and let the category vector indicating the categories to which document $d_m$ belongs be expressed as

$$\vec{Y}_m = (Y_{m1}, Y_{m2}, \ldots, Y_{mL}).$$

Here, $Y_{mk}$ takes the value 1 when document $d_m$ belongs to the $k$-th category and 0 when it does not. A document may belong to more than one category, but it must belong to at least one.
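The two vectors defined above can be built as in the following sketch. The fixed vocabulary list, the category list, and the function names are assumptions introduced for illustration. The morphological analysis that produces the word list is outside the scope of this example.

```python
from collections import Counter

def word_frequency_vector(words, vocabulary):
    """X_m: occurrence count of each vocabulary word in one document."""
    counts = Counter(words)
    return [counts[w] for w in vocabulary]

def category_vector(doc_categories, categories):
    """Y_m: 1 where the document belongs to the k-th category, else 0."""
    if not doc_categories:
        raise ValueError("a document must belong to at least one category")
    return [1 if c in doc_categories else 0 for c in categories]

# Hypothetical vocabulary and categories.
vocabulary = ["team", "merger", "museum"]
categories = ["sports", "society", "movies"]

x_m = word_frequency_vector(["team", "merger", "team"], vocabulary)
y_m = category_vector({"sports", "society"}, categories)
```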

First, the machine is trained on a set of documents with known categories, $D = \{(\vec{X}_m, \vec{Y}_m)\}$ $(m = 1, \ldots, N)$, so that the category of $\vec{X}_m$ is $\vec{Y}_m$. Next, using this machine, $\vec{Y}_*$ is estimated with the word frequency vector $\vec{X}_*$ of a document $d_*$ of unknown category as input. Each component $Y_{*k}$ of the predicted category vector $\vec{Y}_* = (Y_{*1}, Y_{*2}, \ldots, Y_{*L})$ indicates the degree of match (probability of belonging) of document $d_*$ to the $k$-th category.

With the existing document classification technique described above, the output for the word frequency vector of an input document of unknown category is a vector listing the degree of match to each category. Several ways of using it are therefore conceivable, such as classifying the input document into the several categories with the highest degrees of match, or into every category whose degree of match exceeds a certain threshold. In this embodiment, the description continues on the assumption that each document is classified only into the single category with the highest degree of match.
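The selection rules mentioned above can be sketched as follows. The functions operate on an already predicted vector of match degrees. The classifier itself is not modelled, and the names and sample values are hypothetical.

```python
def assign_single_category(match, categories):
    """Pick the one category with the highest degree of match."""
    best_index = max(range(len(categories)), key=lambda k: match[k])
    return categories[best_index]

def assign_by_threshold(match, categories, threshold):
    """Alternative rule: every category whose degree of match exceeds a threshold."""
    return [c for c, y in zip(categories, match) if y > threshold]

# Hypothetical predicted match degrees for one document.
categories = ["sports", "society", "movies"]
match = [0.15, 0.70, 0.40]

single = assign_single_category(match, categories)
several = assign_by_threshold(match, categories, threshold=0.3)
```

The embodiment described here corresponds to `assign_single_category`. The threshold variant is shown only for comparison.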

文書分類部１３４が上記の既存の文書分類技術を用いて文書を分類したカテゴリ情報は、文書データ記録部１３２に保存される。 The category information in which the document classification unit 134 classifies the document using the existing document classification technology is stored in the document data recording unit 132.

図１６は、本発明の第２の実施の形態における文書データ記録部に保存されたデータの例である。前述の第１の実施の形態の文書データ記録部２０２に保存されたデータの例（図４）に加え、所属カテゴリ名が記録されている。 FIG. 16 shows an example of data stored in the document data recording unit in the second embodiment of the present invention. In addition to the example of data stored in the document data recording unit 202 of the first embodiment (FIG. 4), the belonging category name is recorded.

The similarity calculation unit 137 calculates the similarity between documents by referring to the document vectors stored in the document vector recording unit 136. In doing so, it refers to the category information stored in the document data recording unit 132 and calculates similarities only between documents that belong to the same category.
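Restricting the pairwise comparison to documents of the same category could be sketched as below. The similarity measure of the first embodiment is not reproduced in this section, so cosine similarity is used here purely as an assumed stand-in. The data layout and names are likewise hypothetical.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Assumed stand-in for the similarity measure of the first embodiment."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def similarities_by_category(vectors, category_of, similarity=cosine):
    """Return {category: {frozenset({a, b}): similarity}} for same-category pairs."""
    members = {}
    for doc_id, category in category_of.items():
        members.setdefault(category, []).append(doc_id)
    return {
        category: {
            frozenset((a, b)): similarity(vectors[a], vectors[b])
            for a, b in combinations(sorted(doc_ids), 2)
        }
        for category, doc_ids in members.items()
    }

# Hypothetical document vectors and category assignments.
vectors = {"a": [1, 0], "b": [1, 0], "c": [0, 1]}
category_of = {"a": "sports", "b": "sports", "c": "society"}
table = similarities_by_category(vectors, category_of)
```

Because pairs that span two categories are never evaluated, the number of similarity computations drops compared with comparing all documents against each other.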

類似度算出の方法は、前述の第１の実施の形態と同様である。図１７は、本発明の第２の実施の形態における類似度記録部に保存された文書間類似度データの例である。図１６でいずれもカテゴリ名「スポーツ」が記録されている文書「ＤＯＣ−０１２３４５」「ＤＯＣ−０１２３４７」の間の類似度「0.021」が、図１７の「スポーツ」の欄に記録されている。他のカテゴリ、他の文書についても同様にして、カテゴリ別に文書間類似度が記録されている。 The method for calculating the similarity is the same as that in the first embodiment. FIG. 17 is an example of inter-document similarity data stored in the similarity recording unit in the second embodiment of the present invention. In FIG. 16, the similarity “0.021” between the documents “DOC-012345” and “DOC-012347” in which the category name “sports” is recorded is recorded in the “sports” column of FIG. Similarly for other categories and other documents, the similarity between documents is recorded for each category.

クラスタリング部１３９は、類似度記録部１３８に記録された文書間類似度を用い、文書のクラスタリング処理を行う。クラスタリング処理をカテゴリ毎にそれぞれ行う以外は第１の実施の形態におけるクラスタリング部２０８と同じ処理を行う。 The clustering unit 139 performs document clustering processing using the inter-document similarity recorded in the similarity recording unit 138. The same processing as the clustering unit 208 in the first embodiment is performed except that the clustering processing is performed for each category.

FIG. 18 is an example of clustering result data stored in the cluster recording unit according to the second embodiment of the present invention. The clustering results are recorded separately for each category, such as "sports", "society", "movies", and so on. For example, the first cluster "C11" of the "sports" category contains the three documents with document IDs "DOC-012345", "DOC-012349", and "DOC-012355". The "sports" category also contains further clusters "C12", "C13", ...; likewise, the "society" category stores clusters "C21", "C22", "C23", ..., and the "movies" category stores clusters "C31", "C32", "C33", ....

文書話題度算出部１３１１は、類似度記録部１３８に保存された各文書間の類似度データを利用し、各文書の話題性の大きさを数値化して文書話題度記録部１３１２に出力する。 The document topic level calculation unit 1311 uses similarity data between documents stored in the similarity level recording unit 138, quantifies the topic level of each document, and outputs it to the document topic level recording unit 1312.

話題度算出部１３１１が行う処理は、前述の第１の実施の形態における文書話題度算出部２１０が行う処理と同様であるが、処理をカテゴリ別に行う点のみが異なる。処理の流れを図１９を用いて説明する。 The processing performed by the topic level calculation unit 1311 is the same as the processing performed by the document topic level calculation unit 210 in the first embodiment described above, except that the processing is performed for each category. The flow of processing will be described with reference to FIG.

図１９は、本発明の第２の実施の形態における話題度算出部が行う処理のフローチャートである。 FIG. 19 is a flowchart of processing performed by the topic level calculation unit according to the second embodiment of the present invention.

First, the time information acquisition unit 13111 acquires one set of document ID, creation time, and category name from the document data recording unit 132. For example, from the data of FIG. 16 recorded in the document data recording unit 132, it acquires document "DOC-012345", its creation time "2004/9/11 23:00", and its category name "sports" (step 601).

時刻情報取得部１３１１１が取得した文書ＩＤ「ＤＯＣ−０１２３４５」とカテゴリ名「スポーツ」を類似度取得部１３１１２に送出すると、類似度取得部１３１１２は「スポーツ」カテゴリに分類された文書と該文書ＩＤ「ＤＯＣ−０１２３４５」との間の類似度を類似度記録部１３８より取得する（ステップ６０２）。図１７に示す類似度記録部１３８の例では、「スポーツ」カテゴリの項に記録された「ＤＯＣ−０１２３４５」と他の文書との類似度「0.021」「0」、…が次々と取得される。 When the document ID “DOC-012345” acquired by the time information acquisition unit 13111 and the category name “sports” are sent to the similarity acquisition unit 13112, the similarity acquisition unit 13112 and the document ID classified into the “sports” category The similarity with “DOC-012345” is acquired from the similarity recording unit 138 (step 602). In the example of the similarity recording unit 138 shown in FIG. 17, the similarity “0.021”, “0”,... Between “DOC-012345” recorded in the “sports” category and other documents is acquired one after another. .

The following steps are each identical to the corresponding step in the first embodiment: the step in which the similarity adding unit 13113 adds up these similarities to calculate the provisional topic level of document "DOC-012345" (step 603), the step in which the freshness coefficient calculation unit 13114 determines the freshness coefficient from the difference between each document's creation time and the current time (step 604), and the step in which the multiplication unit 13115 multiplies the provisional topic level by the freshness coefficient to calculate the document topic level (step 605).

The document topic level of each document calculated by the multiplication unit 13115 is stored in the document topic level recording unit 1312 together with the document ID and the category to which the document belongs (step 606).

以上ステップ６０１〜６０６の処理を、時刻情報取得部１３１１１が取得した全文書について繰り返し（ステップ６０７）、処理を終える。 The processing in steps 601 to 606 is repeated for all documents acquired by the time information acquisition unit 13111 (step 607), and the processing ends.

図２０は、本発明の第２の実施の形態における文書話題度記録部に保存された文書話題度データの例である。同図に示す文書話題度記録部１３１２には、文書ＩＤ「ＤＯＣ−０１２３４５」と文書話題度「1.28」の組が「スポーツ」カテゴリの項に記録されており、他のカテゴリに分類された文書についても同様にそれぞれ文書話題度が記録されている。 FIG. 20 is an example of document topic level data stored in the document topic level recording unit according to the second embodiment of the present invention. In the document topic level recording unit 1312 shown in the figure, a set of the document ID “DOC-012345” and the document topic level “1.28” is recorded in the “sports” category, and the documents classified into other categories are recorded. Similarly, the document topic level is recorded for each.

提示データ作成部１３１３は、クラスタ記録部１３１０及び文書話題度記録部１３１２に記録された各データを参照し、本発明の話題文書提示装置１３０の出力となるデータを作成する。クラスタ記録部１３１０からは各カテゴリの各クラスタに属する文書の文書ＩＤの一覧を取得し、文書話題度記録部１３１２からは各文書ＩＤに対応する文書話題度を取得する。さらに文書データ記録部１３２から各文書ＩＤに対応する作成時刻と本文を取得し、クラスタ毎の文書一覧データをカテゴリ別に作成する。各クラスタ内の文書は、文書話題度によって降順に並び替える。こうして作成したデータは話題出力装置１３１４に出力され、利用者に提示される。 The presentation data creation unit 1313 refers to each data recorded in the cluster recording unit 1310 and the document topic level recording unit 1312 and creates data to be output from the topic document presentation device 130 of the present invention. A list of document IDs of documents belonging to each cluster of each category is acquired from the cluster recording unit 1310, and a document topic level corresponding to each document ID is acquired from the document topic level recording unit 1312. Further, the creation time and the text corresponding to each document ID are acquired from the document data recording unit 132, and the document list data for each cluster is created for each category. The documents in each cluster are rearranged in descending order according to the document topic level. The data thus created is output to the topic output device 1314 and presented to the user.

Using the input device 1315 connected to the topic document presentation device 130 of the present invention, such as a mouse, keyboard, or touch panel, the user can browse documents by selecting buttons displayed on the screen and switching the display interactively.

話題出力装置１３１４の画面上に出力された話題文書の表示例を図２１に示す。カテゴリ毎に画面が表示されるが、図２１は「スポーツ」カテゴリを表示した状態の例である。 A display example of the topic document output on the screen of the topic output device 1314 is shown in FIG. Although a screen is displayed for each category, FIG. 21 is an example of a state in which the “sports” category is displayed.

図２１（ａ）は、カテゴリ「社会」の話題文書を表示する画面へ表示を切替えるためのボタンであり、他にも「映画」…「経済」「芸能」など、文書分類部１３４が分類したカテゴリそれぞれの話題文書を表示することができる。 FIG. 21A is a button for switching the display to a screen displaying topic documents of the category “Society”, and the document classification unit 134 classifies “movie”, “economic”, “entertainment”, and the like. Topic documents for each category can be displayed.

図２１において、画面の下部には、表示したカテゴリに含まれる文書のうち、各クラスタ毎に文書話題度が最大の文書のみが表示される。左から順に話題度（ｂ）、作成時刻（ｃ）、本文（ｄ）が並んでいる。一番上に表示されている文書は図１６からもわかるように、文書ＩＤ「ＤＯＣ−０１２３４７」の文書であり、文書話題度算出部１３１１が算出した話題度「3.51」（図２０参照）が表示されている。 In FIG. 21, at the bottom of the screen, only the documents having the highest document topic level for each cluster are displayed among the documents included in the displayed category. The topic level (b), creation time (c), and text (d) are arranged in order from the left. As can be seen from FIG. 16, the document displayed at the top is the document with the document ID “DOC-012347”, and the topic level “3.51” (see FIG. 20) calculated by the document topic level calculation unit 1311 is displayed. It is displayed.

The documents displayed here are, among the documents recorded under "sports" in FIG. 18, those with the highest document topic level in each of the clusters "C11", "C12", "C13", ...: one document selected from "C11", one from "C12", and so on. Document "DOC-012347" is the document selected from cluster "C13".

FIG. 21 therefore displays as many documents as there are clusters in the "sports" category. The selected documents are further sorted in descending order of document topic level and displayed as the representative documents of their clusters, as shown in FIG. 21. FIG. 21(e) is a button for obtaining detailed information on the cluster that contains the representative document displayed on the screen. When the user selects this button, the display changes to a screen showing all the documents in the corresponding cluster (in the example of FIG. 21, cluster "C13", to which document "DOC-012347" belongs).
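Selecting one representative per cluster and ordering the representatives can be sketched as follows. The function name and data layout are assumptions. The topic levels "3.51" and "1.28" are the ones quoted in the text, while the value for "DOC-012349" is invented for the example.

```python
def cluster_representatives(clusters, topic_level):
    """Return [(doc_id, cluster_id), ...]: the top document of each cluster,
    sorted by descending document topic level."""
    representatives = [
        (max(doc_ids, key=lambda d: topic_level[d]), cluster_id)
        for cluster_id, doc_ids in clusters.items()
        if doc_ids  # skip empty clusters defensively
    ]
    return sorted(representatives, key=lambda r: topic_level[r[0]], reverse=True)

clusters = {
    "C11": ["DOC-012345", "DOC-012349"],
    "C13": ["DOC-012347"],
}
topic_level = {
    "DOC-012345": 1.28,
    "DOC-012349": 0.40,  # invented value for illustration
    "DOC-012347": 3.51,
}
representatives = cluster_representatives(clusters, topic_level)
```

The list has one entry per cluster, matching the statement that the category screen shows as many documents as the category has clusters.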

図２２は、図２１に示す画面上に表示された第１位の文書の「詳細を表示」ボタンｅを利用者が選択した場合の遷移後の画面の表示例である。図２２のｂは、図２１にも表示されていた文書であり、選択したクラスタ「Ｃ１３」内で文書話題度が最大の文書である。以下、「Ｃ１３」に含まれる文書が文書話題度の降順に並んで表示される。 FIG. 22 is a display example of the screen after transition when the user selects the “display details” button e of the first document displayed on the screen shown in FIG. 22b is the document displayed in FIG. 21 as well, and is the document having the highest document topic level in the selected cluster “C13”. Hereinafter, the documents included in “C13” are displayed in descending order of the document topic level.

FIG. 22(a) is a button for returning to the screen of FIG. 21.

With the screen display described above, the user can browse the representative documents of any category and, by selecting the "show details" button of a document of particular interest, display all the documents in the corresponding cluster and obtain more detailed information.

上記の第１の実施の形態及び第２の実施の形態は、いずれも文書話題度算出の処理をクラスタリングの処理とは独立させ、文書話題度算出部２１０、１３１１が文書間類似度を加算する処理を全ての文書（第２の実施例ではカテゴリ内の全ての文書）を対象に行うものであった。しかし、これに代えて、各クラスタ内の文書間のみの類似度を加算して暫定話題度を算出するようにしてもよい。この場合、加算する類似度の数がクラスタによって異なるため、異なるクラスタに属する２文書の文書話題度を比較して重要性の大小を判断することはできなくなる。しかしながら、１つのクラスタに属する文書間でのみ重要性の大小を判断する場合には使用可能な方法である。また、話題文書提示に要する処理時間の削減も望める。 In both the first embodiment and the second embodiment described above, the document topic level calculation process is independent of the clustering process, and the document topic level calculation units 210 and 1311 add the inter-document similarity. Processing is performed on all documents (all documents in a category in the second embodiment). However, instead of this, the degree of provisional topic may be calculated by adding the similarity only between documents in each cluster. In this case, since the number of similarities to be added differs depending on the cluster, it is impossible to determine the magnitude of importance by comparing the document topic levels of two documents belonging to different clusters. However, this is a method that can be used when determining the importance only between documents belonging to one cluster. In addition, a reduction in processing time required for topic document presentation can be expected.
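The variant described above, in which only similarities between documents of the same cluster are added, might be sketched as follows. The storage layout and names are hypothetical, and the freshness coefficient would still be applied afterwards as in the main embodiments.

```python
def provisional_within_cluster(clusters, similarity):
    """Provisional topic level using only similarities inside each cluster.

    clusters:   {cluster_id: [doc_id, ...]}
    similarity: {frozenset({a, b}): value}
    """
    scores = {}
    for doc_ids in clusters.values():
        for doc_id in doc_ids:
            scores[doc_id] = sum(
                similarity.get(frozenset((doc_id, other)), 0.0)
                for other in doc_ids
                if other != doc_id
            )
    return scores

# Hypothetical data: "a" is similar to "c", but they are in different clusters.
clusters = {"C1": ["a", "b"], "C2": ["c"]}
similarity = {frozenset(("a", "b")): 0.4, frozenset(("a", "c")): 0.9}
scores = provisional_within_cluster(clusters, similarity)
```

The cross-cluster similarity of 0.9 is ignored, and a single-document cluster always scores 0, which illustrates why scores from different clusters are not comparable under this variant.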

上記の第１及び第２の実施の形態では、いずれも話題出力装置２１、１３１４としてモニタ装置を想定し、画面上に表示して利用者に閲覧させる形態としたが、話題出力装置２１、１３１４を、上記の実施の形態で画面上に提示した情報を保存する記憶装置として、利用者または別の装置が読み出し可能な状態にしてもよい。 In the first and second embodiments described above, the topic output devices 21 and 1314 are assumed to be monitor devices, and are displayed on the screen so that the user can browse them. As a storage device for storing the information presented on the screen in the above embodiment, a user or another device may be readable.

The processing described in the above embodiments can also be built as a program that runs on a computer, and the program can be installed on a computer that operates as the topic document presentation device or distributed via a network.

The constructed program can also be stored on a hard disk device connected to a computer used as the topic document presentation device, or on a portable storage medium such as a flexible disk or CD-ROM.

The present invention is not limited to the above embodiments, and various modifications and applications are possible within the scope of the claims.

The present invention is applicable to techniques that allow users to browse large volumes of document data on the basis of topicality.

A diagram for explaining the principle of the present invention.
A diagram of the principle configuration of the present invention.
A configuration diagram of the topic document presentation device in the first embodiment of the present invention.
An example of the data stored in the document data recording unit in the first embodiment of the present invention.
A flowchart of the processing performed by the word aggregation unit in the first embodiment of the present invention.
An example of the data stored in the document vector recording unit in the first embodiment of the present invention.
A flowchart of the processing performed by the similarity calculation unit in the first embodiment of the present invention.
An example of the inter-document similarity data stored in the similarity recording unit in the first embodiment of the present invention.
A flowchart of the clustering process in the clustering unit in the first embodiment of the present invention.
An example of the clustering result data stored in the cluster recording unit in the first embodiment of the present invention.
A flowchart of the processing performed by the topic level calculation unit in the first embodiment of the present invention.
An example of the function that determines the freshness coefficient in the first embodiment of the present invention.
An example of the document topic level data stored in the document topic level recording unit in the first embodiment of the present invention.
A display example of the topic document data output to the topic output device in the first embodiment of the present invention.
A configuration diagram of the topic document presentation device in the second embodiment of the present invention.
An example of the data stored in the document data recording unit in the second embodiment of the present invention.
An example of the inter-document similarity data stored in the similarity recording unit in the second embodiment of the present invention.
An example of the clustering result data stored in the cluster recording unit in the second embodiment of the present invention.
A flowchart of the processing performed by the topic level calculation unit in the second embodiment of the present invention.
An example of the document topic level data stored in the document topic level recording unit in the second embodiment of the present invention.
A display example of the topic document data output to the topic output device in the second embodiment of the present invention.
A display example, after a screen transition, of the topic document data output to the topic output device in the second embodiment of the present invention.

Explanation of symbols

20, 130 Topic document presentation device
21, 1314 Topic output device
201, 131 Document collection unit
202, 132 Document data recording unit
203, 133 Document analysis unit
134 Document classification unit
204, 135 Word aggregation unit
205, 136 Document vector recording unit
206, 137 Similarity calculation unit
207, 138 Similarity recording unit
208, 139 Clustering unit
209, 1310 Cluster recording unit
210, 1311 Document topic level calculation unit
211, 1312 Document topic level recording unit
212, 1313 Presentation data creation unit
2101, 13111 Time information acquisition unit
2102, 13112 Similarity acquisition unit
2103, 13113 Similarity addition unit
2104, 13114 Freshness coefficient calculation unit
2105, 13115 Multiplication unit
1315 Input device

Claims

A topic document presentation method in a topic document presentation device that automatically selects and presents highly topical documents from a huge group of documents having creation time information, the method being characterized by performing:
a similarity calculation step in which similarity calculation means determines attribute values for the words contained in the input documents, calculates the similarity between documents using document vectors obtained by converting each input document into a vector of the attribute values, and records the similarities in similarity recording means;
a clustering step in which clustering means uses the inter-document similarities recorded in the similarity recording means to divide the input document group into at least one subset of mutually similar documents, and stores the documents belonging to each subset in cluster recording means;
a provisional topic level calculation step in which provisional topic level calculation means of document topic level calculation means calculates, for each document, a provisional topic level by adding all the similarities between that document and the other documents in the subset to which it belongs, among the subsets divided in the clustering step;
a freshness determination step in which freshness determination means of the document topic level calculation means determines the freshness of each document by a function whose value is larger the closer the creation time information of the document is to the current time and smaller the farther it is from the current time;
a document topic level calculation step in which the document topic level calculation means calculates, for each document, a document topic level by multiplying the provisional topic level by the freshness, and stores the document topic level in document topic level recording means; and
a presentation data creation step in which presentation data creation means uses the documents belonging to each subset stored in the cluster recording means and the document topic levels stored in the document topic level recording means to create and output, for each subset, data in which the documents contained in that subset are arranged in descending order of document topic level.

The topic document presentation method according to claim 1, wherein, in the freshness determination step, the freshness determination means uses a function expressed as an exponential function as the function for determining the freshness.

The topic document presentation method according to claim 1, wherein category classification means further performs a category classification step of classifying each input document into one or more of a plurality of categories according to its content, and
the similarity calculation step, the clustering step, the provisional topic level calculation step, the freshness determination step, the document topic level calculation step, and the presentation data creation step are each performed for each classified category.

A topic document presentation device that automatically selects and presents highly topical documents from a huge group of documents having creation time information, the device being characterized by comprising:
similarity calculation means for determining attribute values for the words contained in the input documents, calculating the similarity between documents using document vectors obtained by converting each input document into a vector of the attribute values, and recording the similarities in similarity recording means;
clustering means for dividing the input document group into at least one subset of mutually similar documents using the inter-document similarities recorded in the similarity recording means, and storing the documents belonging to each subset in cluster recording means;
provisional topic level calculation means for calculating, for each document, a provisional topic level by adding all the similarities between that document and the other documents in the subset to which it belongs, among the subsets divided by the clustering means;
freshness determination means for determining the freshness of each document by a function whose value is larger the closer the creation time information of the document is to the current time and smaller the farther it is from the current time;
document topic level calculation means for calculating, for each document, a document topic level by multiplying the provisional topic level by the freshness, and storing the document topic level in document topic level recording means; and
presentation data creation means for creating and outputting, for each subset, data in which the documents contained in that subset are arranged in descending order of document topic level, using the documents belonging to each subset stored in the cluster recording means and the document topic levels stored in the document topic level recording means.

The topic document presentation device according to claim 4, wherein the freshness determination means uses a function expressed as an exponential function as the function for determining the freshness.

The topic document presentation device according to claim 4 or 5, further comprising category classification means for classifying each input document into one or more of a plurality of categories according to its content,
wherein the processing of each of the similarity calculation means, the clustering means, the provisional topic level calculation means, the document topic level calculation means, the freshness determination means, and the presentation data creation means is performed for each classified category.

A topic document presentation program for causing a computer to function as each of the means constituting the topic document presentation device according to any one of claims 4 to 6.