JP2005063071A - Document set classification device and its program

Info

Publication number: JP2005063071A
Application number: JP2003290929A
Authority: JP
Inventors: Shu Nobata; 周野畑; Hitoshi Isahara; 均井佐原; Satoshi Sekine; 聡関根
Original assignee: National Institute of Information and Communications Technology
Current assignee: National Institute of Information and Communications Technology
Priority date: 2003-08-08
Filing date: 2003-08-08
Publication date: 2005-03-10
Anticipated expiration: 2023-08-08
Also published as: JP3921540B2

Abstract

PROBLEM TO BE SOLVED: To assign a more suitable classification that covers document sets more exhaustively and does not depend on the characteristics of a specific language.

SOLUTION: A document set classification device for assigning a classification to a document set, i.e. a collection of a plurality of documents, comprises: a determination means 101 for determining whether the subject of the document set relates to a single named entity or to a plurality of named entities, and for determining which named entity class the named entity belongs to; and an output means 102 for outputting, on the basis of the determination made by the determination means 101, information about a classification defined by two factors, namely whether the named entity related to the subject is single or plural and which named entity class the named entity belongs to.

COPYRIGHT: (C)2005,JPO&NCIPI

Description

The present invention relates to a classification device for assigning a classification to a document set containing a plurality of documents, and to a program therefor.

Automatic summarization of multiple documents is a field that has attracted growing interest in summarization research in recent years. Both the Document Understanding Conference (DUC) in the United States and the Text Summarization Challenge (TSC) in Japan have added multi-document summarization to the tasks used to evaluate summarization systems. Multi-document summarization means summarizing a document set, which contains a plurality of documents collected on a single subject, into a single document. More specifically, it means generating a summary of a subject from a plurality of documents collected along that subject, such as a series of reports covering an incident from beginning to end, the actions and statements of a specific individual, or the damage caused by earthquakes in various places.

To improve the accuracy of summarization, it is considered necessary to grasp the subject of the document set correctly and to select a summarization method and output format suited to it. Prior research on classifying document sets from the viewpoint of multi-document summarization includes the work of McKeown et al. of Columbia University (see Non-Patent Document 1). McKeown et al. defined the following four classifications to be assigned to an article set containing a plurality of newspaper articles:
(A) Single-Event (an article set about a single event limited to a specific region and period)
(B) Person-centered (an article set describing events related to a specific person)
(C) Multi-Event (an article set about multiple events spanning different regions and periods, usually with different actors)
(D) Other (an article set of only vaguely related articles that fits none of the above three classifications)
As cues for classifying an article set, they use the time span across all articles in the set, the proportion of articles published on the same day, the frequency of words beginning with a capital letter, and the frequency of personal pronouns such as he and she.
K. R. McKeown and R. Barzilay and D. Evans and V. Hatzivassilogou and M. Yen Kan and B. Schiffman and S. Teufel, [online], “Columbia Multi-Document Summarization: Approach and Evaluation”, Online Proceedings of DUC2001 <http://www-nlpir.nist.gov/projects/duc/pubs/2001papers/columbia#redo.pdf>

The classification of McKeown et al. efficiently captures properties commonly found in article sets that are targets of summarization. However, it also has several problems:
・Many article sets end up classified as Other. More appropriate classifications may exist for these sets.
・Among the cues used by the classification algorithm, the frequency of words beginning with a capital letter and the frequency of personal pronouns such as he and she are specific to English. Classifying article sets more generally requires reconsidering which cues to use.
・The article data classified in Non-Patent Document 1 was created for use in an evaluation workshop on multi-article summarization. It is therefore either tidier than general article sets or biased relative to them.
・When actually summarizing multiple articles, the summarization system of McKeown et al. treats Multi-Event and Other identically and applies the same summarization method to both, so the distinction between Multi-Event and Other loses its significance.

The present invention has been made in view of the above. It defines a more suitable classification that does not depend on the characteristics of a specific language and that covers document sets more exhaustively, and it provides a classification device that can assign a classification based on this definition to a document set.

To solve the above problem, the present invention provides a document set classification device that assigns a classification to a document set, i.e. a collection of a plurality of documents. As shown in FIG. 1, the device comprises: a determination means 101 that determines whether the subject of the document set relates to a single named entity (Named Entity) or to a plurality of named entities, and determines which named entity class the named entity belongs to; and an output means 102 that, based on the determination made by the determination means 101, outputs information about a classification defined by two factors, namely whether the named entity related to the subject of the document set is single or plural, and the named entity class to which the named entity belongs.

In the present invention, classifications are defined on the basis of the named entity class to which the named entity related to the subject of a document set belongs, and the classification device according to the present invention assigns a classification so defined to the document set. This reduces the dependence on any specific language and makes the classification more exhaustive; in other words, a document set is less likely to be classified as Other.

Here, a named entity is an expression that serves as an element of information extraction, such as a proper noun (a person name, an organization name, and so on) or a numerical expression (a date, a monetary amount, and so on) contained in a document. A named entity is important as information, and the expression denoting its content is determined almost uniquely. Various definitions of the named entity classes are conceivable; as one aspect, a class definition comprising six classes can be adopted: person name (Person), organization name (Organization), place name (Location), facility name (Facility), product name (Product, covering product names, law names, and the like), and event name (Event). With this class definition, the classifications to be assigned to a document set are as follows:
(A) Single-Person (a document set about a single person)
(B) Single-Organization (a document set about a single organization)
(C) Single-Location (a document set about a single region)
(D) Single-Facility (a document set about a single facility)
(E) Single-Product (a document set about a single product)
(F) Single-Event (a document set about a single event)
(G) Multi-Person (a document set about multiple persons)
(H) Multi-Organization (a document set about multiple organizations)
(I) Multi-Location (a document set about multiple regions)
(J) Multi-Facility (a document set about multiple facilities)
(K) Multi-Product (a document set about multiple products)
(L) Multi-Event (a document set about multiple events)
(M) Other
However, the definition of the named entity classes, and hence the definition of the document set classifications, is not limited to this. For example, an animal name class (Single/Multi-Animal) can be added, and person names, animal names, and the like can be grouped into a single “actor name class”, that is, a class of expressions denoting the agent of an action.
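The way the 13 classifications arise from the two factors (single or multiple, and named entity class) can be sketched as follows. This is a minimal illustration, not part of the patent text; the function name `build_categories` and the constant names are assumptions introduced here.

```python
from itertools import product

# The six named entity classes of the class definition described above.
NE_CLASSES = ["Person", "Organization", "Location", "Facility", "Product", "Event"]
# The first factor: whether the subject concerns one entity or several.
CARDINALITIES = ["Single", "Multi"]


def build_categories(ne_classes=NE_CLASSES):
    """Return every Single-/Multi- category for the given classes, plus Other."""
    categories = [f"{card}-{cls}" for card, cls in product(CARDINALITIES, ne_classes)]
    categories.append("Other")
    return categories


categories = build_categories()             # (A) Single-Person ... (M) Other
extended = build_categories(NE_CLASSES + ["Animal"])  # adds Single-/Multi-Animal
```

Because the category list is derived from the class list, extending the class definition (for example with an Animal class) extends the classification without further changes.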

The determination means 101 of the document set classification device normally uses at least one of the frequency of named entities and the frequency of named entity classes appearing in the documents of the document set as the basis for determining whether the named entity related to the subject of the document set is single or plural and which named entity class it belongs to. The frequency here is not limited to the number of occurrences of a particular named entity or class across the documents in the document set; it may also be the number of documents in which the named entity or class appears (the document frequency within the document set).
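The two kinds of frequency mentioned here, occurrence frequency and document frequency, can be computed as in the sketch below. It assumes that each document has already been reduced to a list of (surface form, named entity class) pairs; that input format and the function names are assumptions made for illustration.

```python
from collections import Counter


def entity_frequencies(docs):
    """Count named entities by total occurrences and by number of documents."""
    occurrence = Counter()
    document = Counter()
    for entities in docs:
        occurrence.update(entities)       # every mention counts
        document.update(set(entities))    # each document counts at most once
    return occurrence, document


def class_frequencies(docs):
    """Same two counts, taken over named entity classes instead of entities."""
    occurrence = Counter()
    document = Counter()
    for entities in docs:
        classes = [cls for _, cls in entities]
        occurrence.update(classes)
        document.update(set(classes))
    return occurrence, document


# Hypothetical document set of three documents.
docs = [
    [("Yoshimoto", "Person"), ("Yoshimoto", "Person"), ("Tokyo", "Location")],
    [("Yoshimoto", "Person")],
    [("Osaka", "Location")],
]
ent_occ, ent_doc = entity_frequencies(docs)
cls_occ, cls_doc = class_frequencies(docs)
```

In this toy data the entity "Yoshimoto" occurs three times but appears in only two documents, which is exactly the difference between the two counts.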

In addition, the determination means 101 may determine the named entity class of the named entity related to the subject of the document set using at least one of the frequency of class terms appearing in the documents of the document set and the frequency of the named entity classes with which those class terms are associated. A class term is a noun or compound noun strongly associated with a particular named entity class. For example, a title such as “Prime Minister” is a class term of the person name class, and a noun such as “earthquake” is a class term of the event name class. Unlike a named entity itself, a class term is a common noun.
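As a rough illustration of how class terms could contribute evidence for a named entity class, the sketch below tallies the classes associated with the class terms found in each document. The tiny `CLASS_TERMS` table and the function name are hypothetical, and picking the most frequent class is only one conceivable way to use the counts.

```python
from collections import Counter

# Hypothetical class term table: common noun -> associated named entity class.
CLASS_TERMS = {
    "prime minister": "Person",
    "president": "Person",
    "earthquake": "Event",
    "company": "Organization",
}


def class_votes_from_terms(docs_nouns, class_terms=CLASS_TERMS):
    """Count how often each named entity class is evoked by class terms."""
    votes = Counter()
    for nouns in docs_nouns:
        for noun in nouns:
            cls = class_terms.get(noun)
            if cls is not None:
                votes[cls] += 1
    return votes


# Each inner list holds the nouns (and compound nouns) of one document.
docs_nouns = [
    ["earthquake", "damage", "earthquake"],
    ["prime minister", "earthquake"],
]
votes = class_votes_from_terms(docs_nouns)
dominant_class = votes.most_common(1)[0][0]
```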

In particular, when the time at which each document in the document set was created or published is known, the determination means 101 may refer to that information and, on condition that some or all of the documents were created or published within a predetermined period, determine that the named entity related to the subject of the document set is single and that its named entity class is the event name class. In that case, the classification Single-Event is assigned to the document set.
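The time-based condition can be sketched as a sliding-window check over publication dates. The window length, the required fraction of documents, and the function names are assumptions chosen for illustration; the text only states that some or all of the documents must fall within a predetermined period.

```python
from datetime import date, timedelta


def max_docs_in_window(dates, window):
    """Largest number of documents whose dates fit inside one window."""
    ordered = sorted(dates)
    best = 0
    start = 0
    for end in range(len(ordered)):
        # Shrink the window from the left until it spans at most `window`.
        while ordered[end] - ordered[start] > window:
            start += 1
        best = max(best, end - start + 1)
    return best


def looks_like_single_event(dates, window=timedelta(days=3), min_fraction=0.8):
    """True if at least `min_fraction` of the documents fall within `window`."""
    if not dates:
        return False
    return max_docs_in_window(dates, window) >= min_fraction * len(dates)


dates = [date(2003, 1, 1), date(2003, 1, 2), date(2003, 1, 3), date(2003, 3, 1)]
strict = looks_like_single_event(dates)                       # 3 of 4 is below 0.8
relaxed = looks_like_single_event(dates, min_fraction=0.75)   # 3 of 4 meets 0.75
```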

As shown in FIG. 2, the document set classification device may further comprise a document set generation means 103 that can generate at least one document set from a plurality of documents by extracting the keywords present in each given document, calculating the similarity between the keywords of one document and those of another, and assigning the documents to the same document set when the similarity exceeds a threshold. With this means, the whole process from sorting a plurality of given documents into one or more document sets to assigning a classification to each set can be executed in one batch. Such a device is useful for building a system that automatically generates one or more summaries from given documents.
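One possible reading of this grouping step is sketched below: documents whose keyword similarity exceeds the threshold are merged into the same set, using a union-find structure so that membership is transitive. The transitive merging, the use of Dice's coefficient as the similarity, and all names are assumptions; the text itself only says that sufficiently similar documents are assigned to the same document set.

```python
def group_documents(keyword_sets, threshold=0.5):
    """Group document ids whose keyword sets are similar above `threshold`."""
    ids = list(keyword_sets)
    parent = {doc_id: doc_id for doc_id in ids}

    def find(x):
        # Follow parent links to the representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def dice(a, b):
        total = len(a) + len(b)
        return 2 * len(a & b) / total if total else 0.0

    for idx, a in enumerate(ids):
        for b in ids[idx + 1:]:
            if dice(keyword_sets[a], keyword_sets[b]) > threshold:
                parent[find(a)] = find(b)   # merge the two groups

    groups = {}
    for doc_id in ids:
        groups.setdefault(find(doc_id), []).append(doc_id)
    return list(groups.values())


# Hypothetical keyword sets for three documents.
keyword_sets = {
    "d1": {"quake", "tokyo"},
    "d2": {"quake", "tokyo", "damage"},
    "d3": {"election", "vote"},
}
document_sets = group_documents(keyword_sets)
```

With transitive merging a document can join a set through an intermediate document even if it is not directly similar to every member; a stricter design would require pairwise similarity within each set.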

Furthermore, as shown in FIG. 3, a document summarization device can be configured with a summarization means 201 that refers to the classification assigned to a document set by the document set classification device, selects the summarization algorithm corresponding to that classification, and summarizes the documents in the document set. This makes it possible to summarize multiple documents more appropriately.
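Selecting a summarization algorithm by classification amounts to a table lookup, as in the sketch below. The two summarizers are deliberately trivial placeholders invented for this example, and the fallback from an exact classification to its Single/Multi prefix is likewise an assumption; the text does not specify which algorithm belongs to which classification.

```python
def lead_of_first(docs):
    """Placeholder summarizer: first sentence of the first document."""
    return docs[0].split(". ")[0] if docs else ""


def lead_of_each(docs):
    """Placeholder summarizer: first sentence of every document."""
    return " / ".join(doc.split(". ")[0] for doc in docs)


# Hypothetical mapping from classification (or its prefix) to an algorithm.
SUMMARIZERS = {
    "Single": lead_of_first,
    "Multi": lead_of_each,
}


def pick_summarizer(category, table=SUMMARIZERS, default=lead_of_each):
    """Look up the exact classification first, then its Single/Multi prefix."""
    if category in table:
        return table[category]
    prefix = category.split("-", 1)[0]
    return table.get(prefix, default)


docs = ["A quake hit the city. Damage was reported.", "Aftershocks continued. Trains stopped."]
single_summary = pick_summarizer("Single-Event")(docs)
multi_summary = pick_summarizer("Multi-Event")(docs)
```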

According to the present invention described in detail above, a more suitable classification that does not depend on the characteristics of a specific language and that is more exhaustive can be assigned to a document set.

An embodiment of the present invention is described below with reference to the drawings. First, the definition of the classification in the present invention and its validity are described. Here, article sets each containing a plurality of newspaper articles are generated experimentally as the document sets to be analyzed, and they are analyzed to define classifications based on named entity classes. The validity of the definition is then verified by applying the classification to separate Japanese newspaper article sets serving as test data and to the English newspaper article sets used in DUC2001.

To avoid bias in the article sets, one article was extracted at random from a Japanese newspaper article corpus, and articles similar to it were collected using an information retrieval system to generate an article set. The specific procedure is as follows.
(1) Select one article at random from the articles in the Japanese newspaper article corpus.
(2) Extract a keyword sequence from the selected article. The keywords were the nouns, excluding temporal nouns and adverbial nouns, that occurred two or more times in the output of a known Japanese morphological analyzer (Sadao Kurohashi and Makoto Nagao, Japanese morphological analysis system Juman version 3.61, Kyoto University, 1999).
(3) Use the extracted keyword sequence to compute the similarity between articles. A keyword sequence was extracted from each article as in (2), and the similarity between keyword sequences was computed using Dice's coefficient.
(4) Retrieve the similar articles. Every article other than the selected one whose Dice coefficient was at or above a predetermined value (for example, 0.5) was regarded as a similar article and retrieved.
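Steps (2) to (4) can be sketched as follows, assuming that morphological analysis has already produced, for each article, a list of nouns with temporal and adverbial nouns removed. Treating the keywords as a set when computing Dice's coefficient, $2|A \cap B| / (|A| + |B|)$, is an assumption, as are the function names and the toy data; step (1) would pick the seed at random (for example with `random.choice`), which is left out here so the example stays deterministic.

```python
from collections import Counter


def keywords(nouns, min_freq=2):
    """Step (2): keep the nouns that occur at least `min_freq` times."""
    counts = Counter(nouns)
    return {word for word, count in counts.items() if count >= min_freq}


def dice(a, b):
    """Step (3): Dice's coefficient between two keyword sets."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


def similar_articles(seed_id, noun_lists, threshold=0.5):
    """Step (4): ids of the other articles at or above the threshold."""
    seed_kw = keywords(noun_lists[seed_id])
    return [
        article_id
        for article_id, nouns in noun_lists.items()
        if article_id != seed_id and dice(seed_kw, keywords(nouns)) >= threshold
    ]


# Hypothetical noun lists standing in for morphologically analyzed articles.
noun_lists = {
    "a": ["quake", "quake", "tokyo", "tokyo", "rain"],
    "b": ["quake", "quake", "tokyo", "tokyo", "osaka", "osaka"],
    "c": ["election", "election", "vote", "vote"],
}
article_set = ["a"] + similar_articles("a", noun_lists)
```

Here article "a" has keywords {quake, tokyo} and article "b" has {quake, tokyo, osaka}, so their Dice coefficient is 2·2/(2+3) = 0.8 and "b" joins the article set, while "c" shares no keyword and is left out.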

Repeating this article set generation a number of times (for example, 50 times) yielded 50 article sets, of which 26 contained three or more articles. After further removing article sets whose content was judged to be nearly identical to that of another set, 19 article sets remained. The subjects of these 19 article sets are shown in FIG. 4.

In the analysis results of FIG. 4, Single indicates that most of the articles in the article set describe a single named entity, such as a single event, person name, or organization name. Multi indicates that the articles in the article set describe a plurality of different named entities, such as different events, person names, or organization names.

In the present invention, the subject of each article set (document set) was classified by selecting one of the named entity classes. The extended class definition (S. Sekine, K. Sudo and C. Nobata, “Extended named entity hierarchy”, In Proceedings of the LREC-2002 conference, 2002.) was adopted as the definition of the named entity classes. This definition is hierarchical, but the top level of the hierarchy was sufficient to classify the subjects of almost all article sets. Only one article set was classified as Other without being assigned a named entity class (article set No. 14 in FIG. 4). If a named entity class were assigned to this article set, it would be the animal name (Animal) class; it was classified as Other because few article sets are expected to fall into that class. Selectively assigning named entity classes in this way led to the 13 classifications already described. Incidentally, the analysis results in FIG. 4 contain no article set corresponding to the place name (Location) class, but article sets about a particular country or region can exist and are considered important, so the class was included in the definition of the classifications.

Next, test data was classified according to the defined classifications. The test data was created from the Japanese newspaper article corpus in the same way as described above. Two subjects independently assigned classifications to the 20 article sets created as test data. In principle a subject assigned one classification to each article set, but for some article sets a subject judged that more than one classification was possible and assigned two. The number of article sets that the two subjects assigned to each classification is shown in FIG. 5. The agreement rate between the subjects was 55% when only each subject's first-choice classification was compared, and 85% when second choices were included. No article set was classified as Other. The two subjects assigned two classifications to 6 and 5 article sets, respectively.
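The two agreement rates can be computed as sketched below. Interpreting "including second choices" as "the two subjects share at least one classification for the article set" is an assumption, and the label lists are made-up toy data, not the experimental data. For reference, with 20 article sets the reported 55% and 85% correspond to 11 and 17 agreeing sets.

```python
def agreement_rates(labels_a, labels_b):
    """Return (first-choice agreement, agreement counting any shared label).

    Each argument is a list with one entry per article set; an entry is a
    list of one or two classifications, first choice first.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    first = sum(a[0] == b[0] for a, b in zip(labels_a, labels_b))
    shared = sum(bool(set(a) & set(b)) for a, b in zip(labels_a, labels_b))
    return first / n, shared / n


# Toy annotations for four article sets (hypothetical).
subject_1 = [["Single-Event"], ["Single-Person", "Single-Event"],
             ["Multi-Event"], ["Single-Organization"]]
subject_2 = [["Single-Event"], ["Single-Event"],
             ["Multi-Person"], ["Single-Product", "Single-Organization"]]
first_rate, shared_rate = agreement_rates(subject_1, subject_2)
```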

Correct-answer data for the test data was created from the subjects' classification results. For the 17 article sets on which the subjects agreed, the agreed classification was taken as correct; for the 3 article sets on which they disagreed, the correct classification was decided through discussion between the subjects. This correct-answer data is used to evaluate the results of the automatic classification experiment with the document set classification device described later.

To show that this classification definition is also valid for other languages, classification was also attempted on the English newspaper article sets that served as training data for the multi-document summarization task of DUC2001. As before, two subjects independently assigned to each article set the classification they judged most appropriate, giving each article set one or two classifications. The number of article sets that the two subjects assigned to each classification is shown in FIG. 6. Each subject assigned two classifications to 4 article sets. Other was assigned to only one article set, as the second choice of one subject. The agreement rate between the subjects was 80% on first choices and 93.3% when second choices were included. These rates are higher than those for the Japanese test data, presumably because the DUC2001 data is tidier than the Japanese test data and the subjects of its article sets were chosen deliberately.

The document set classification device for assigning the defined classifications to article sets is described in detail below. The document set classification device of this embodiment is configured by installing a predetermined program on a computer 1. As shown in FIG. 7, for example, the computer 1 has hardware resources such as a processor 1a, a main memory 1b, and an auxiliary storage device 1c, typified by a hard disk drive, which operate in cooperation under the control of a controller 1d (a so-called system controller, I/O controller, or the like). Although not shown, the computer 1 may also be equipped with a communication device for exchanging data with the outside over a telecommunication line, input devices such as a keyboard and a pointing device for accepting user operations, a display for presenting information as images or video, and a display control device (a so-called graphics chip or the like) for sending video signals to the display.

Normally, the program to be executed by the processor 1a is stored in the auxiliary storage device 1c; when executed, it is read from the auxiliary storage device 1c into the main memory 1b and decoded by the processor 1a. The hardware resources are operated according to the program so as to provide at least the functions of the determination means 101 and the output means 102.

The determination means 101 determines whether the subject of an article set (document set) comprising a plurality of articles (documents) relates to a single named entity or to a plurality of named entities, and determines which named entity class the named entity belongs to. The data of the articles that make up the article set given as input is normally stored in advance in a designated storage area of the main memory 1b or the auxiliary storage device 1c. The processor 1a, following the program, reads the article data stored in the main memory 1b or the auxiliary storage device 1c and, based on it, determines the classification to be assigned to the article set.

Based on the determination made by the determination means 101, the output means 102 outputs information about the classification to be assigned to the article set, that is, the classification defined by two factors: whether the named entity related to the subject of the article set is single or plural, and the named entity class to which it belongs. The information may be output by, for example, displaying it on the screen of the display, printing it on a printer (not shown), transmitting it to an external computer via the communication device and a telecommunication line, or writing it to the main memory 1b or the auxiliary storage device 1c. The specific configuration of the output means 102 depends on how the classification information is output.

Automatic classification by the document set classification device uses as cues the occurrence frequency of the words and named entity classes that appear in the articles of an article set, and the article frequency (the number of articles in the article set in which a given word, named entity class, or the like appears; this differs from the idf value used in information retrieval). It is therefore desirable that the computer 1 can also function as the named entity extraction means shown in FIG. 8. The named entity extraction means extracts the named entities that appear in the articles of the article set and determines the named entity class of each extracted named entity. It can be configured with, for example, morphological analysis software 104a, which segments the text of an article into words and assigns parts of speech, and named entity extraction software 104b, which refers to the output of the morphological analysis software 104a to enumerate the named entities appearing in the article and assign them to named entity classes. FIG. 8 shows an example of morphological analysis and named entity extraction. In the illustrated example, named entities are extracted by a pattern-based system. That is, the processor 1a, following the program, morphologically analyzes an article given as input and matches the analyzed article against a named entity list 104c (a data set listing named entities and the named entity classes to which they belong, normally stored in a designated storage area of the main memory 1b or the auxiliary storage device 1c) to extract all named entities in the article. Then, when named entities are nested (for example, when the named entity “Yoshimoto Kogyo” of the organization name class contains the named entity “Yoshimoto” of the person name class), the named entity with the longer character string is given priority (that is, “Yoshimoto Kogyo” rather than “Yoshimoto” is recognized), so that the named entities are determined uniquely, and the result is output. The determination means 101 can classify the article set by referring to the output of the named entity extraction means.
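The longest-match rule for nested named entities can be sketched as a leftmost-longest scan against the named entity list. For brevity this sketch matches on the raw character string and skips the morphological analysis step that the embodiment performs first; that simplification, the function name, and the two-entry list are assumptions.

```python
def extract_named_entities(text, ne_list):
    """Scan `text` left to right, preferring the longest listed entity.

    `ne_list` maps a surface string to its named entity class.
    """
    # Try longer surface forms first so that nested entities lose.
    surfaces = sorted(ne_list, key=len, reverse=True)
    found = []
    i = 0
    while i < len(text):
        for surface in surfaces:
            if text.startswith(surface, i):
                found.append((surface, ne_list[surface]))
                i += len(surface)   # skip past the match; nothing inside it is reported
                break
        else:
            i += 1                  # no entity starts here
    return found


# Hypothetical named entity list with a nested pair.
ne_list = {"Yoshimoto Kogyo": "Organization", "Yoshimoto": "Person"}
entities = extract_named_entities("Yoshimoto Kogyo was founded by Yoshimoto.", ne_list)
```

The first mention is recognized as the organization because the longer string wins, while the standalone second mention is still recognized as the person.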

In addition to named entities, class terms appearing in the articles can also be used as clues for classification. A class term is a noun or compound noun strongly associated with a particular named entity class. For example, a title such as "prime minister" is a class term of the person name class, and a noun such as "earthquake" is a class term of the event name class. It is therefore desirable that the computer 1 can also function as the class term extraction means shown in FIG. 9. The class term extraction means extracts the class terms that appear in the articles of the article set and determines the named entity class to which each extracted class term relates. It can be configured from, for example, the morphological analysis software 104a, which segments the sentences of an article into words and assigns parts of speech, and class term extraction software 105a, which refers to the output of the morphological analysis software 104a, enumerates the class terms appearing in the article, and sorts them by named entity class. FIG. 9 shows an example of morphological analysis and class term extraction. As in the named entity extraction example above, the illustrated example extracts class terms with a pattern-based system. That is, the processor 1a, following the program, morphologically analyzes an article given as input, checks the analyzed article against a class term list 105b (a data set enumerating class terms and the named entity classes to which they relate, normally stored in a required storage area of the main memory 1b or the auxiliary storage device 1c) to extract all class terms in the article, and outputs the result. The determination means 101 can classify the article set by referring to the output of the class term extraction means. The class term list 105b can be created from an existing thesaurus and manually collected terms. The list that the inventors created and used experimentally contains about 16,000 terms.
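The two counts that the later algorithms rely on, occurrence frequency and article frequency, can be sketched for class terms (クラスターム) as follows. This is a minimal illustration under stated assumptions: each article is assumed to be already tokenized by a morphological analyzer, and the two-entry dictionary `CLASS_TERMS` is a hypothetical stand-in for list 105b.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical stand-in for class term list 105b: term -> related class.
CLASS_TERMS: Dict[str, str] = {
    "prime minister": "Person",
    "earthquake": "Event",
}


def class_term_frequencies(
    articles: List[List[str]], class_terms: Dict[str, str]
) -> Tuple[Counter, Counter]:
    """Count class terms over an article set.

    Returns (occurrence, article_freq):
    occurrence[t]   = total number of times term t appears in the set
    article_freq[t] = number of articles in which term t appears at least once
    """
    occurrence: Counter = Counter()
    article_freq: Counter = Counter()
    for tokens in articles:
        hits = [t for t in tokens if t in class_terms]
        occurrence.update(hits)          # every mention counts
        article_freq.update(set(hits))   # each article counts at most once
    return occurrence, article_freq
```

A term mentioned three times across two articles therefore has an occurrence frequency of 3 and an article frequency of 2.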

Besides named entities and class terms, the publication dates of articles and the time span between articles can also serve as clues for classifying an article set. For example, if most of the articles in an article set were published on the same day or within a predetermined short period, the article set can very likely be classified as Single-Event.

The determination algorithm executed by the determination means 101 in this embodiment is described in detail with reference to the flowcharts of FIGS. 10 to 14. In this embodiment, the determination means 101 classifies an article set using the following four algorithms. First, following the first algorithm, the determination means 101 determines whether the article set given as input is classified as Single-Event. This determination is made by referring to information on when each of the articles in the article set was created or published. If at least a certain proportion (or all) of the articles in the article set were created, released, announced, published, or the like within a predetermined period, the determination means 101 classifies the article set as Single-Event. More specifically, when the articles in the set are newspaper articles and most of them were published on the same day or within a predetermined short period, it decides to assign the Single-Event classification to the set. FIG. 10 shows the procedure. The determination means 101 refers to information on the publication date of each article in the given article set (stored, for example, in a required storage area of the main memory 1b or the auxiliary storage device 1c in association with the article data) (step S101), identifies the publication date with the highest article frequency, and counts the articles published on that date (step S102). If the number of articles published on that date, or its ratio to the total number of articles in the article set, exceeds a preset threshold T_a (step S103), the article set is classified as Single-Event (step S106). The determination means 101 also checks the time span of the publication dates of all articles in the set (step S104); if this time span is below a preset threshold T_s (step S105), the article set is classified as Single-Event (step S106). In other words, the determination rests on two parameters: the proportion of articles published on a particular date and the maximum time span between articles. If the article set cannot be classified as Single-Event, processing moves to the second algorithm. The first algorithm may also be executed by referring to date expressions that appear in the articles.
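Steps S101 to S106 can be sketched in Python as follows. This is an illustrative reading of the flowchart description, not the patented program. The default thresholds are the values reported for the experiment later in this description; treating T_s as a number of days is an assumption, since the text does not state its unit, and the function name is hypothetical.

```python
from collections import Counter
from datetime import date
from typing import List

T_A = 0.33  # threshold on the share of articles from the busiest date
T_S = 150   # threshold on the time span; the unit (days) is an assumption


def is_single_event_by_date(
    dates: List[date], t_a: float = T_A, t_s: int = T_S
) -> bool:
    """Return True if the publication dates alone justify Single-Event."""
    if not dates:
        return False
    # S102: find the publication date with the highest article frequency.
    _, top_count = Counter(dates).most_common(1)[0]
    # S103: compare its share of all articles with T_a.
    if top_count / len(dates) > t_a:
        return True
    # S104: time span between the earliest and latest publication dates.
    span_days = (max(dates) - min(dates)).days
    # S105: a short overall span also indicates a single event.
    return span_days < t_s
```

A set of four articles with three published on the same day passes the T_a test; four articles on four different days spread over more than a year pass neither test and fall through to the second algorithm.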

Next, following the second algorithm, the determination means 101 determines whether the given article set is classified as Single-class (for any named entity class). This determination is made by counting the occurrence frequency and article frequency of named entities with a high article frequency. If a particular named entity appears frequently across many of the articles in the set, the determination means 101 regards that named entity as representing the theme of the article set and takes the named entity class to which it belongs as the class element of the classification. FIG. 11 shows the procedure. The determination means 101 refers to the output of the named entity extraction performed by the named entity extraction means on the given article set (step S201) and counts the occurrence frequency of each named entity appearing in the articles and of each named entity class to which those named entities belong (step S202). It also counts the article frequency of each named entity appearing in the articles and selects the named entity with the highest article frequency in the article set (step S203). If several named entities have the same article frequency, the one with the higher occurrence frequency is selected, for example. If the article frequency of the selected named entity (or its ratio to the total number of articles in the article set) exceeds a preset threshold T_d (step S204), the named entity class to which this named entity belongs can be taken as the class element of the classification. Further determination steps may be added when classifying an article set as Single-class. That is, as shown in FIG. 11, the class of the selected named entity may be taken as the class element (step S208) on condition that the ratio of the occurrence frequency of the selected named entity to the occurrence frequency of the named entity class to which it belongs exceeds a predetermined threshold T_w (step S207). In this embodiment, however, based on experiments examining the training data, the event name class is judged with priority over the other named entity classes. Accordingly, when the selected named entity belongs to the event name class (step S205), the article set is classified as Single-Event as it is (step S206). If the article set cannot be classified as Single-class, processing moves to the third algorithm.
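A minimal Python sketch of steps S202 to S208 follows. It assumes the checks run in the order of their step numbers (T_d test, event priority, then the optional T_w test), which the text implies but does not state outright. The class labels, the `Single-<class>` output strings, and the function name are illustrative assumptions; the default thresholds are the values reported for the experiment later in this description.

```python
from collections import Counter
from typing import List, Optional, Tuple

Mention = Tuple[str, str]  # (named entity, class), e.g. ("Tokyo", "Location")


def classify_single_class(
    articles: List[List[Mention]], t_d: float = 0.90, t_w: float = 0.40
) -> Optional[str]:
    """Return a Single-<class> label, or None to fall through to the next algorithm."""
    occurrence: Counter = Counter()        # per named entity
    class_occurrence: Counter = Counter()  # per named entity class
    article_freq: Counter = Counter()      # articles containing each named entity
    for mentions in articles:              # S202
        for mention in mentions:
            occurrence[mention] += 1
            class_occurrence[mention[1]] += 1
        article_freq.update(set(mentions))
    if not article_freq:
        return None
    # S203: highest article frequency, ties broken by occurrence frequency.
    best = max(article_freq, key=lambda m: (article_freq[m], occurrence[m]))
    if article_freq[best] / len(articles) <= t_d:  # S204
        return None
    ne_class = best[1]
    if ne_class == "Event":                # S205, S206: event class has priority
        return "Single-Event"
    # S207: the entity must dominate the mentions of its own class.
    if occurrence[best] / class_occurrence[ne_class] > t_w:
        return f"Single-{ne_class}"        # S208
    return None
```

With T_d = 0.90, the selected named entity has to appear in more than nine tenths of the articles, so a single dominant bill or treaty name yields Single-Product, while a set whose entities each appear in only a few articles returns None.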

Subsequently, following the third algorithm, the determination means 101 determines whether the given article set is classified as Multi-class (for any named entity class). This determination is made by counting the occurrence frequency and article frequency of named entity classes with a high article frequency. If named entities belonging to a particular named entity class appear frequently across many of the articles in the set, the determination means 101 regards that named entity class as representing the theme of the article set and takes it as the class element of the classification. FIG. 12 shows the procedure. The determination means 101 refers to the output of the named entity extraction performed by the named entity extraction means on the given article set (step S201) and counts the occurrence frequency of each named entity appearing in the articles and of each named entity class to which those named entities belong (step S202; these steps have already been executed in the second algorithm). It also counts the article frequency of each named entity class to which a named entity appearing in the articles belongs and selects the named entity class with the highest article frequency in the article set (step S301). If several named entity classes have the same article frequency, the one with the higher occurrence frequency is selected, for example. If the article frequency of the selected named entity class (or its ratio to the total number of articles in the article set) exceeds a preset threshold T_d (step S302), this named entity class can be taken as the class element of the classification. Further determination steps may be added when classifying an article set as Multi-class. That is, as shown in FIG. 12, the selected named entity class may be taken as the class element (step S306) on condition that the ratio of the occurrence frequency of the selected named entity class to the occurrence frequency of all named entity classes (all named entities) exceeds a predetermined threshold T_C (step S305). In this embodiment, however, the event name class is judged with priority over the other named entity classes: when the selected named entity class is the event name class (step S303), the article set is classified as Multi-Event as it is (step S304). In addition, a different threshold T_C may be set depending on the named entity class. For example, for the threshold T_C that is compared with the ratio of the occurrence frequency of the selected named entity class to the occurrence frequency of all named entity classes, a stricter, that is larger, threshold T_C1 can be applied when the selected class is the place name class, the organization name class, or the person name class, and a looser, that is smaller, threshold T_C2 (T_C1 > T_C2) can be applied when it is any other class. If the article set cannot be classified as Multi-class, processing moves to the fourth algorithm.
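Steps S301 to S306 differ from the second algorithm mainly in that they count classes instead of individual entities. The sketch below is one possible reading, again assuming the checks run in step-number order; the labels in `STRICT_CLASSES`, the `Multi-<class>` output strings, and the function name are illustrative assumptions, and the default thresholds are the values reported for the experiment later in this description.

```python
from collections import Counter
from typing import List, Optional, Tuple

Mention = Tuple[str, str]  # (named entity, class)

# Classes that get the stricter threshold T_C1 (place, organization, person).
STRICT_CLASSES = {"Location", "Organization", "Person"}


def classify_multi_class(
    articles: List[List[Mention]],
    t_d: float = 0.90,
    t_c1: float = 0.80,
    t_c2: float = 0.40,
) -> Optional[str]:
    """Return a Multi-<class> label, or None to fall through to the next algorithm."""
    class_occurrence: Counter = Counter()
    class_article_freq: Counter = Counter()
    for mentions in articles:
        classes = [ne_class for _, ne_class in mentions]
        class_occurrence.update(classes)
        class_article_freq.update(set(classes))
    if not class_article_freq:
        return None
    # S301: class with the highest article frequency, ties broken by occurrences.
    best = max(
        class_article_freq,
        key=lambda c: (class_article_freq[c], class_occurrence[c]),
    )
    if class_article_freq[best] / len(articles) <= t_d:  # S302
        return None
    if best == "Event":                                  # S303, S304
        return "Multi-Event"
    t_c = t_c1 if best in STRICT_CLASSES else t_c2
    total = sum(class_occurrence.values())
    if class_occurrence[best] / total > t_c:             # S305
        return f"Multi-{best}"                           # S306
    return None
```

A set in which every article names a different person but no other entities returns Multi-Person; if place names make up only two thirds of all mentions, the stricter T_C1 = 0.80 rejects Multi-Location.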

If no appropriate classification for the article set has been found by the end of the third algorithm, the determination means 101 considers the classification to assign following the fourth algorithm. The fourth algorithm can be described as executing the second and third algorithms on class terms instead of named entities. That is, using at least one of the frequency of the class terms appearing in the articles and the frequency of the named entity classes to which those class terms relate, it determines the named entity class of the named entity underlying the theme of the given article set. Even if a particular class term appears frequently across many of the articles in an article set, the classification assigned to the set should preferably be Multi-class rather than Single-class, because a class term is a common noun, not a proper noun, and can refer to several different named entities. FIGS. 13 and 14 show the procedure. The determination means 101 refers to the output of the class term extraction performed by the class term extraction means on the given article set (step S401) and counts the occurrence frequency of each class term appearing in the articles and of each named entity class to which those class terms relate (step S402). It also counts the article frequency of each class term appearing in the articles and selects the class term with the highest article frequency in the article set (step S403). If several class terms have the same article frequency, the one with the higher occurrence frequency is selected, for example. If the article frequency of the selected class term (or its ratio to the total number of articles in the article set) exceeds a preset threshold T_d (step S404), the named entity class to which this class term relates can be taken as the class element of the classification. Further determination steps may be added when classifying an article set as Multi-class. That is, as shown in FIG. 13, the named entity class to which the selected class term relates may be taken as the class element (step S408) on condition that the ratio of the occurrence frequency of the selected class term to the occurrence frequency of the named entity class to which it relates exceeds a predetermined threshold T_w (step S407). The event name class, however, is judged with priority over the other named entity classes: when the selected class term relates to the event name class (step S405), the article set is classified as Multi-Event as it is (step S406). In addition to the above, the article frequency of each named entity class to which a class term appearing in the articles relates is counted, and the named entity class with the highest article frequency in the article set is selected (step S409). If several named entity classes have the same article frequency, the one with the higher occurrence frequency is selected, for example. If the article frequency of the selected named entity class (or its ratio to the total number of articles in the article set) exceeds a preset threshold T_d (step S410), this named entity class can be taken as the class element of the classification. Further determination steps can be added when classifying an article set as Multi-class. That is, as shown in FIG. 14, the selected named entity class can be taken as the class element (step S414) on condition that the ratio of the occurrence frequency of the selected named entity class to the occurrence frequency of all named entity classes (all class terms) exceeds a predetermined threshold T_C (step S413). The event name class, however, is judged with priority over the other named entity classes: when the selected named entity class is the event name class (step S411), the article set is classified as Multi-Event as it is (step S412). Here too, as in the third algorithm, different thresholds T_C1 and T_C2 may be set depending on the named entity class.
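The two stages of the fourth algorithm (term level, S403 to S408, then class level, S409 to S414) can be sketched as one Python function. The text does not spell out how the two stages are ordered or combined, so running the term-level stage first and falling back to the class-level stage is an assumption, as are the class labels, the output strings, and the function name. Both stages return a Multi label, as the description recommends for class terms.

```python
from collections import Counter
from typing import List, Optional, Tuple

Hit = Tuple[str, str]  # (class term, related class), e.g. ("earthquake", "Event")

STRICT_CLASSES = {"Location", "Organization", "Person"}  # use T_C1 for these


def classify_by_class_terms(
    articles: List[List[Hit]],
    t_d: float = 0.90,
    t_w: float = 0.40,
    t_c1: float = 0.80,
    t_c2: float = 0.40,
) -> Optional[str]:
    """Return a Multi-<class> label from class terms, or None if undecided."""
    n = len(articles)
    term_occ, term_af = Counter(), Counter()
    class_occ, class_af = Counter(), Counter()
    for hits in articles:  # S402
        term_occ.update(hits)
        term_af.update(set(hits))
        class_occ.update(c for _, c in hits)
        class_af.update({c for _, c in hits})
    if not term_af:
        return None
    # Term level (S403 to S408).
    term = max(term_af, key=lambda t: (term_af[t], term_occ[t]))
    if term_af[term] / n > t_d:                    # S404
        cls = term[1]
        if cls == "Event":                         # S405, S406
            return "Multi-Event"
        if term_occ[term] / class_occ[cls] > t_w:  # S407
            return f"Multi-{cls}"                  # S408
    # Class level (S409 to S414).
    cls = max(class_af, key=lambda c: (class_af[c], class_occ[c]))
    if class_af[cls] / n > t_d:                    # S410
        if cls == "Event":                         # S411, S412
            return "Multi-Event"
        t_c = t_c1 if cls in STRICT_CLASSES else t_c2
        if class_occ[cls] / sum(class_occ.values()) > t_c:  # S413
            return f"Multi-{cls}"                  # S414
    return None
```

If every article contains "earthquake", the term-level stage already returns Multi-Event; if each article uses a different title for a person, only the class-level stage fires and returns Multi-Person.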

If no classification can be assigned even after all of the above algorithms have been applied, the determination means 101 assigns a predetermined default classification to the article set (step S415). The default classification is, for example, Multi-Event or Other.
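The overall control flow, in which each algorithm either decides or hands over to the next and step S415 supplies a default, can be sketched as a small driver. This is only an illustration: the function name and the convention that an undecided algorithm returns `None` are assumptions, and the individual algorithms are passed in as callables.

```python
from typing import Callable, Optional, Sequence, TypeVar

ArticleSet = TypeVar("ArticleSet")


def classify_article_set(
    article_set: ArticleSet,
    algorithms: Sequence[Callable[[ArticleSet], Optional[str]]],
    default: str = "Other",
) -> str:
    """Run the algorithms in order and return the first classification found."""
    for algorithm in algorithms:
        label = algorithm(article_set)
        if label is not None:
            return label  # an earlier algorithm takes precedence over later ones
    return default        # S415: nothing matched, so use the default classification
```

Because the first decision wins, the order of `algorithms` encodes the priority described above: the date-based Single-Event test first, then Single-class, Multi-class, and finally the class-term stage.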

The following describes the results of an automatic classification experiment in which the document set classification device of this embodiment assigned classifications to the test data described above. The thresholds in the algorithms were set by hand here on the basis of the training data: T_a = 0.33, T_s = 150, T_d = 0.90, T_w = 0.40, T_C1 = 0.80, and T_C2 = 0.40. Needless to say, the thresholds are not limited to these values. FIG. 15 shows the evaluation of the results of the automatic classification experiment on the test data, together with the evaluation of the classifications given by the subjects and a baseline. The evaluation of the subjects compares the classification each subject gave with the correct answer. Both subjects score higher against the correct answer than the 55% agreement rate between the two subjects, because the correct answers were created by combining the classifications given by both subjects. The baseline is the number (and proportion) of article sets in the test data that have the classification most frequent in the training data (Single-Event in this experiment). For the document set classification device, the "match" value is the number (and proportion) of article sets for which the classification output by the device matched the correct answer, and the "partial match" value is the number (and proportion) for which the output matched any of the classifications given by the subjects; where a subject gave more than one classification to an article set, all of them are included. For the subjects, the "match" value is the number (and proportion) of article sets for which the first classification the subject gave matched the correct classification, and the "partial match" value is the number (and proportion) that matched the correct classification when the second classification the subject gave is also included.

The document set classification device classified nine of the 20 article sets correctly, and for a further two article sets its classification was among those given by the subjects. In three of the remaining nine incorrectly classified article sets, the device assigned Single-Event whereas the correct classification was Single-Product. The product names (Product) appearing in these article sets were specific bills, international treaties, and the like, and the articles in the sets described deliberations on the bill or statements about the treaty. These errors probably arose because the current algorithm gives priority to Single-Event. However, because the product names tied to the correct classification, such as the bill or treaty, appear throughout the articles of the set, it should be possible to configure the determination means 101 so that, after provisionally assigning Single-Event, it continues the subsequent determination process and can reassign Single-Product.

In another three article sets, a different classification was assigned although the correct classification was Single-Event. In these article sets, the description corresponding to the event name (Event) was expressed not as a named entity but as a phrase or clause. For example, the expression "former President Clinton's alleged affair with former White House intern Monica Lewinsky" is not a named entity, yet it refers to a specific event. The current named entity extraction system cannot recognize such an expression as a single named entity. In addition, the time span of the article set exceeded the threshold T_s (more than one year in the example above), so the determination means 101 could not assign the Single-Event classification to the set. This suggests that assigning the Single-Event classification reliably requires new clues in addition to those currently used.

In the course of the experiment it also became clear that some article sets can inherently be given more than one classification. One reason is that article sets that should receive an event-based classification, such as Single-Event or Multi-Event, can often also be given a classification based on another named entity class. Events are frequently tied to a particular person, organization, place, or the like, so for an article set that collects articles about an event it is possible to focus on a named entity (other than the event name) related to that event. Another reason is that Single and Multi are hard to distinguish for events. An event may contain several smaller events. For example, an article set on the "Sydney Olympics" can be given the Single-Event classification as covering one sporting event, but if the set includes articles reporting the results of several competitions, focusing on those makes the Multi-Event classification possible as well. How the unit of an event is perceived depends on the subject's point of view.

Incidentally, in the experiment described above, the output of the named entity extraction means was corrected by hand to eliminate errors from the named entity extraction task. When classifications are assigned mechanically with the document set classification device of this embodiment, it is difficult to avoid errors at the named entity extraction stage entirely, because no perfect named entity extraction software currently exists. To complement the named entity extraction process, one could introduce techniques for recognizing coreference and paraphrases of event descriptions, such as those described in the literature (Y. Shinyama, S. Sekine, K. Sudo, and R. Grishman, "Automatic paraphrase acquisition from news articles", In Proceedings of the HLT-2002 conference, 2002).

The description so far assumes that the articles (documents) given as input have already been sorted into article sets (document sets). However, a situation is also conceivable in which multiple articles are simply given as input. In such a case, it is preferable that the document set classification device can mechanically carry out the whole process, from sorting the given articles into one or more article sets to assigning a classification to each sorted article set. That is, it is preferable that the computer 1 constituting the document set classification device can also function as the document set generation means 103 shown in FIG. 2.

The document set generation means 103 is implemented mainly in software. It extracts the keywords present in each document given as input, calculates the similarity between the keywords of one document and the keywords of another, and assigns the documents to the same document set when the similarity exceeds a threshold, thereby generating at least one document set from a plurality of documents. The procedure executed by the document set generation means 103 is similar to the article set generation method already described. That is, the processor 1a, following the program, selectively reads one item of the article data given as input (usually stored in a required storage area of the main memory 1b or the auxiliary storage device 1c) and extracts a keyword sequence from that article data. Keywords can be extracted using morphological analysis software. For example, from the result of morphological analysis of the article data, nouns other than temporal nouns and adverbial nouns whose frequency is at or above a predetermined value (for example, 2) are extracted as keywords. After keywords have been extracted in this way from each item of article data given as input, the processor 1a calculates the similarity between the keyword sequence of one item of article data and that of another. Dice's coefficient, the Jaccard measure, cosine similarity, and the like can be adopted as the similarity index calculated by the processor 1a. Article sets are then generated by placing articles whose similarity index is at or above a predetermined value (for example, 0.5) in one article set as similar articles.
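
As a non-limiting illustration, the keyword extraction, similarity calculation, and grouping described above might be sketched in Python as follows. The sketch assumes that morphological analysis has already been performed and that each article is supplied as a list of candidate nouns. The function names (`extract_keywords`, `dice`, `jaccard`, `cosine`, `build_article_sets`) are hypothetical, and the transitive grouping of linked articles is one possible reading of the text, not a procedure stated in it.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def extract_keywords(nouns, min_freq=2):
    """Keep the nouns whose frequency within one article is at least min_freq."""
    counts = Counter(nouns)
    return {word: n for word, n in counts.items() if n >= min_freq}


def dice(a, b):
    """Dice's coefficient over two keyword sets (dict keys)."""
    total = len(a) + len(b)
    return 2.0 * len(set(a) & set(b)) / total if total else 0.0


def jaccard(a, b):
    """Jaccard measure over two keyword sets (dict keys)."""
    union = set(a) | set(b)
    return len(set(a) & set(b)) / len(union) if union else 0.0


def cosine(a, b):
    """Cosine similarity over keyword frequency vectors."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_article_sets(articles, similarity=dice, threshold=0.5):
    """Group article ids so that articles linked by similarity >= threshold share a set.

    Assumption: links are applied transitively (union-find); the text only
    says that similar articles are placed in one article set.
    """
    keywords = {aid: extract_keywords(nouns) for aid, nouns in articles.items()}
    parent = {aid: aid for aid in articles}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for x, y in combinations(articles, 2):
        if similarity(keywords[x], keywords[y]) >= threshold:
            parent[find(x)] = find(y)

    groups = {}
    for aid in articles:
        groups.setdefault(find(aid), []).append(aid)
    return sorted(sorted(group) for group in groups.values())
```

Passing `similarity=jaccard` or `similarity=cosine` swaps the similarity index without changing the grouping step, which mirrors the statement that any of these indices can be adopted.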

Incidentally, by referring to the information about the classification of an article set output by the document set classification device, a summarization algorithm suited to that article set can be selected. Accordingly, the accuracy of summarization can be improved by constructing a summarization device that refers to the classification information output by the document set classification device and summarizes the plurality of articles (documents) included in the article set (document set) using a summarization algorithm corresponding to that classification.

The document summarization device in this embodiment is constructed by installing a predetermined program on the computer 1 constituting the document set classification device, or on a separate computer (not shown). The program is normally stored in the auxiliary storage device 1c; at execution time it is read from the auxiliary storage device 1c into the main memory 1b and interpreted by the processor 1a. The hardware resources are then operated in accordance with the program so as to provide the function of the summarizing means 201 shown in FIG. 3.

The summarizing means 201 refers to the information, output by the document set classification device, about the classification assigned to an article set, and summarizes the plurality of articles included in the article set into a single document using a summarization algorithm corresponding to that classification. That is, the processor 1a, following the program, refers to the information about the classification assigned to the article set and selects the summarization algorithm corresponding to it. It then reads the data of the articles included in the article set given as input and generates a summary. The summarizing means 201 in this embodiment is implemented mainly in software, and known multi-article summarization software can be adapted for this purpose. Known multi-article summarization software generally extracts sentences (or passages) judged to be important from the articles in an article set and generates a summary from the extracted sentences. The importance of each sentence is calculated by adding scores for a plurality of conditions, such as the position of the sentence, its length, the frequency of the words appearing in it, and its similarity to the headline. In this calculation, the score for each condition is weighted (the weights are obtained through training on training data). The summarizing means 201 in this embodiment additionally takes into account a score based on the classification assigned by the document set classification device when calculating the importance of each sentence. For example, when the classification assigned to the target article set is Single-Person and the named entity representing its subject is “Koizumi”, a score is given to each sentence in each article that contains “Koizumi” recognizable as a person name. Alternatively, when the classification assigned to the target article set is Multi-Organization, a score is given to each sentence in each article that contains an organization name. Taking the classification-based score into account in this way can be expected to improve the accuracy of the generated summary.
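
By way of example only, the weighted importance calculation with an added classification-based score might look like the following Python sketch. The `Sentence` structure, the feature names, the weight values, and the rule that a matching sentence receives a fixed bonus of 1.0 are all assumptions made for illustration; the text states only that a score is given to sentences containing the subject named entity (Single) or any named entity of the relevant class (Multi).

```python
from dataclasses import dataclass, field


@dataclass
class Sentence:
    text: str
    features: dict  # per-condition scores, e.g. {"position": 1.0, "length": 0.5}
    entities: list = field(default_factory=list)  # (surface form, named entity class)


def classification_score(sentence, number, ne_class, subject=None):
    """Return 1.0 if the sentence mentions what the set classification points at.

    number   -- "Single" or "Multi"
    ne_class -- named entity class of the classification, e.g. "Person"
    subject  -- surface form of the subject entity (used for "Single" sets)
    """
    for surface, cls in sentence.entities:
        if cls != ne_class:
            continue
        if number == "Single" and subject is not None and surface != subject:
            continue
        return 1.0
    return 0.0


def importance(sentence, weights, class_weight, number, ne_class, subject=None):
    """Weighted sum of the condition scores plus the classification-based score."""
    base = sum(weights.get(name, 0.0) * value
               for name, value in sentence.features.items())
    bonus = class_weight * classification_score(sentence, number, ne_class, subject)
    return base + bonus
```

In this sketch `class_weight` plays the same role as the trained weights for the other conditions; how it would actually be obtained is not specified here.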

In addition, the summarizing means 201 may calculate the similarity between one sentence and another across the articles (for example, by referring to the number of words they have in common), extract groups of similar sentences, and select from each group the sentences to be used for generating the summary. It is then preferable to change the process for selecting those sentences according to the classification assigned to the target article set. For example, when the classification assigned to the target article set is of the Single class, similar sentences are likely to describe the same thing, so only some of the similar sentences are selected as representatives. That is, when similar sentences such as “Three people suffered minor injuries in the seismic intensity 4 earthquake that struck Kyoto City.” and “Two more people were hospitalized after the seismic intensity 4 earthquake that struck Kyoto City, bringing the number of injured to five.” are extracted, only one of them is selected as an element of the summary. Which one to select can be decided by heuristics such as referring to the importance score of each sentence or choosing the chronologically latest sentence. On the other hand, when the classification assigned to the target article set is of the Multi class, similar sentences are likely to describe different things, so some or all of the sentences similar to a sentence with a high importance score are selected together. That is, when similar sentences such as “Three people suffered minor injuries in the seismic intensity 4 earthquake that struck Kyoto City.” and “Five people were hospitalized and eight suffered minor injuries in the seismic intensity 5 earthquake that struck Osaka City.” are extracted, they are considered to describe different events even though they are similar in wording. All of these sentences may therefore be selected as elements of the summary.
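
The selection among similar sentences described above can be illustrated, without limitation, by the following Python sketch. It assumes whitespace-tokenized sentences, a hypothetical minimum of three shared words for two sentences to count as similar, a greedy grouping against the first sentence of each group, and, for the Single class, a tie-break that prefers the higher importance score and then the later date. For the Multi class the sketch simply keeps every sentence of a group. None of these specifics are stated in the text.

```python
def word_overlap(a, b):
    """Number of distinct whitespace-separated words shared by two sentences."""
    return len(set(a.split()) & set(b.split()))


def group_similar(sentences, min_shared=3):
    """Greedy grouping: a sentence joins the first group whose seed it resembles."""
    groups = []
    for sentence in sentences:
        for group in groups:
            if word_overlap(group[0]["text"], sentence["text"]) >= min_shared:
                group.append(sentence)
                break
        else:
            groups.append([sentence])
    return groups


def select_sentences(sentences, number, min_shared=3):
    """Pick summary sentences from groups of similar sentences.

    number -- "Single": similar sentences likely describe the same thing,
              so keep one representative per group.
              "Multi": similar sentences likely describe different things,
              so keep the whole group.
    Each sentence is a dict with "text", "score" and "date" (ISO string).
    """
    selected = []
    for group in group_similar(sentences, min_shared):
        if number == "Single":
            selected.append(max(group, key=lambda s: (s["score"], s["date"])))
        else:
            selected.extend(group)
    return selected
```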

In summary, the summarizing means 201 of the document summarization device in this embodiment can be configured by adding, to existing multi-article summarization software, a process that calculates the importance of each sentence taking into account an importance score based on the classification assigned to the article set, and/or a process that changes the method of choosing among similar sentences according to that classification. It can thus execute summarization with a different summarization algorithm depending on the classification assigned to the target article set. It is also conceivable to have the summarizing means 201 output a variety of summaries depending on the classification of the article set, for example a biographical summary for an article set classified as Single-Person, or a table of elements such as product names, functions, and prices extracted from an article set classified as Multi-Product.

According to this embodiment, the document set classification device that assigns a classification to a document set, which is a collection of a plurality of documents, comprises: determination means 101 for determining whether the subject of the document set relates to a single named entity or to a plurality of named entities, and for determining to which named entity class the named entity belongs; and output means 102 for outputting, based on the determination made by the determination means 101, information about a classification defined by two elements, namely whether the named entity relating to the subject of the document set is single or plural and the named entity class to which that named entity belongs. A classification with high coverage can therefore be assigned to document sets.

Because the determination means 101 uses at least one of the frequency of named entities and the frequency of named entity classes appearing in the plurality of documents included in the document set as material for determining whether the named entity relating to the subject of the document set is single or plural, and for determining the named entity class to which that named entity belongs, classification can be performed without relying on characteristics of a particular language, such as the frequency of words beginning with a capital letter or the frequency of personal pronouns such as “he” and “she”. That is, article sets can be classified in a more general way.

Because the determination means 101 uses at least one of the frequency of class terms appearing in the plurality of documents included in the document set and the frequency of the named entity classes with which those class terms are associated as material for determining the named entity class to which the named entity relating to the subject of the document set belongs, an appropriate classification can be assigned even to a document set that cannot be classified using named entities alone as clues.

In addition, the determination means 101 refers to information about the time at which each of the plurality of documents included in the document set was created or published and, on the condition that some or all of those documents were created or published within a predetermined period, determines that the named entity relating to the subject of the article set is single and that the named entity class to which it belongs is the event name class. At least document sets of the Single-Event class can therefore be classified quickly.
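
The time-based condition above might be checked, purely as an illustration, along the following lines in Python. The function name `looks_like_single_event`, the seven-day default for the predetermined period, and the `min_fraction` parameter (expressing “some or all” of the documents) are hypothetical; no concrete values are given in this passage.

```python
import math
from datetime import date, timedelta


def looks_like_single_event(dates, window=timedelta(days=7), min_fraction=1.0):
    """True when at least min_fraction of the dates fall inside one window.

    dates        -- creation or publication dates of the documents in the set
    window       -- the predetermined period (assumed value)
    min_fraction -- share of documents that must fit in the period (assumed)
    """
    if not dates:
        return False
    ordered = sorted(dates)
    needed = max(1, math.ceil(len(ordered) * min_fraction))
    best = 0
    end = 0
    # Sliding window over the sorted dates: for each start date, advance
    # `end` past every date that lies within `window` of it.
    for start in range(len(ordered)):
        while end < len(ordered) and ordered[end] - ordered[start] <= window:
            end += 1
        best = max(best, end - start)
    return best >= needed
```

A set for which this check succeeds would be labelled Single-Event directly, without the frequency-based determination.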

If the document set classification device further comprises the document set generation means 103, which extracts the keywords present in given documents, calculates the similarity between the keywords of one document and the keywords of another, and assigns the documents to the same document set when the similarity exceeds a threshold, thereby generating at least one document set from a plurality of documents, then the whole process from sorting the given documents into one or more document sets to assigning classifications to them can be executed in one pass. Such a device is useful for building a system that automatically generates one or more summaries from given documents.

Furthermore, a document summarization device can be configured that comprises the summarizing means 201, which refers to the classification assigned to a document set by the document set classification device described above, selects a summarization algorithm corresponding to that classification, and summarizes the plurality of documents included in the document set. This enables more appropriate summarization of multiple documents.

The present invention is not limited to the embodiment described in detail above. In particular, the classification of article sets defined in the present invention, and the document set classification device and program that perform it, may also be applied to uses other than automatic summarization. For example, in information retrieval, they can be used to re-rank results after a search using the top-ranked articles among the retrieved results, or to display search results efficiently. They may also be used for open-domain information extraction. In conventional information extraction the domain is restricted, and the subject and classification of the articles are given as premises. To extract information without restricting the domain, however, it is considered necessary to obtain information about the target domain, that is, to classify article sets, dynamically.

The specific configuration of each part and the processing procedures shown in FIGS. 10 to 14 are likewise not limited to the above embodiment, and various modifications are possible without departing from the spirit of the present invention. Naturally, the document set classification device according to the present invention can be configured by installing the program on a personal computer or other general-purpose information processing device; manufacturing a dedicated device is not essential.

Brief description of the drawings

FIG. 1 is a configuration diagram of the document set classification device according to the present invention.
FIG. 2 is a configuration diagram of the document set classification device according to the present invention.
FIG. 3 is a configuration diagram of the document set classification device according to the present invention.
FIG. 4 is a diagram illustrating an article set containing a plurality of articles.
FIG. 5 is a diagram showing the results of an article classification experiment by two subjects.
FIG. 6 is a diagram showing the results of an article classification experiment by two subjects.
FIG. 7 is a diagram showing the hardware resources of the document set classification device in one embodiment of the present invention.
FIG. 8 is a diagram explaining the named entity extraction process performed by the named entity extraction means.
FIG. 9 is a diagram explaining the class term extraction process performed by the class term extraction means.
FIG. 10 is a flowchart showing the procedure of the determination process executed by the determination means.
FIG. 11 shows the same flowchart.
FIG. 12 shows the same flowchart.
FIG. 13 shows the same flowchart.
FIG. 14 shows the same flowchart.
FIG. 15 is a diagram showing the results of an automatic classification experiment using the document set classification device.

Explanation of symbols

1 ... Computer (document set classification device, document summarization device)
101 ... Determination means
102 ... Output means
103 ... Document set generation means
201 ... Summarizing means

Claims

1. A document set classification device that assigns a classification to a document set, which is a collection of a plurality of documents, comprising:
determination means for determining whether the subject of the document set relates to a single named entity or to a plurality of named entities, and for determining to which named entity class the named entity belongs; and
output means for outputting, based on the determination made by the determination means, information about a classification defined by two elements, namely whether the named entity relating to the subject of the document set is single or plural and the named entity class to which the named entity belongs.

2. The document set classification device according to claim 1, wherein the determination means is capable of determining whether the named entity relating to the subject of the document set is single or plural, and of determining the named entity class to which the named entity belongs, using only at least one of the frequency of named entities and the frequency of named entity classes appearing in the plurality of documents included in the document set as material.

3. The document set classification device according to claim 1, wherein the determination means is capable of determining the named entity class to which the named entity relating to the subject of the document set belongs, using as material at least one of the frequency of class terms appearing in the plurality of documents included in the document set and the frequency of the named entity classes with which the class terms are associated.

4. The document set classification device according to claim 1, 2 or 3, wherein the determination means refers to information about the time at which each of the plurality of documents included in the document set was created or published and, on the condition that some or all of the plurality of documents were created or published within a predetermined period, determines that the named entity relating to the subject of the article set is single and that the named entity class to which it belongs is the event name class.

5. The document set classification device according to claim 1, further comprising document set generation means capable of generating at least one document set from a plurality of documents by extracting the keywords present in given documents, calculating the similarity between the keywords of one document and the keywords of another document, and assigning those documents to the same document set when the similarity exceeds a threshold.

6. A document summarization device used together with the document set classification device according to claim 1, 2, 3, 4 or 5, comprising summarizing means for referring to information about the classification assigned to a document set, output by the document set classification device, and summarizing the plurality of documents included in the document set into a single document using a summarization algorithm corresponding to that classification.

7. A program used to constitute the document set classification device according to claim 1, 2, 3, 4 or 5, the program causing a computer to function at least as:
determination means for determining whether the subject of a document set, which is a collection of a plurality of documents, relates to a single named entity or to a plurality of named entities, and for determining to which named entity class the named entity belongs; and
output means for outputting, based on the determination made by the determination means, information about a classification defined by two elements, namely whether the named entity relating to the subject of the document set is single or plural and the named entity class to which the named entity belongs.

8. The program according to claim 7, wherein the determination means is capable of determining whether the named entity relating to the subject of the document set is single or plural, and of determining the named entity class to which the named entity belongs, using only at least one of the frequency of named entities and the frequency of named entity classes appearing in the plurality of documents included in the document set as material.

9. The program according to claim 7 or 8, wherein the determination means is capable of determining the named entity class to which the named entity relating to the subject of the document set belongs, using as material at least one of the frequency of class terms appearing in the plurality of documents included in the document set and the frequency of the named entity classes with which the class terms are associated.

10. The program according to claim 7, 8 or 9, wherein the determination means refers to information about the time at which each of the plurality of documents included in the document set was created or published and, on the condition that some or all of the plurality of documents were created or published within a predetermined period, determines that the named entity relating to the subject of the article set is single and that the named entity class to which it belongs is the event name class.

11. The program according to claim 7, 8, 9 or 10, further causing the computer to function as document set generation means capable of generating at least one document set from a plurality of documents by extracting the keywords present in given documents, calculating the similarity between the keywords of one document and the keywords of another document, and assigning those documents to the same document set when the similarity exceeds a threshold.

12. A program used to constitute the document summarization device according to claim 6, the program causing a computer to function at least as summarizing means for referring to information about the classification assigned to a document set, output by the document set classification device, and summarizing the plurality of documents included in the document set into a single document using a summarization algorithm corresponding to that classification.