JP2013191194A

JP2013191194A - Document categorizing device, method thereof and program

Info

Publication number: JP2013191194A
Application number: JP2012136868A
Authority: JP
Inventors: Shinji Tamoto; 真詞田本; Hirokazu Masataki; 浩和政瀧; Osamu Yoshioka; 理吉岡; Satoshi Takahashi; 敏高橋
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2012-02-15
Filing date: 2012-06-18
Publication date: 2013-09-26
Anticipated expiration: 2032-06-18
Also published as: JP5758349B2

Abstract

PROBLEM TO BE SOLVED: To provide a document categorizing device capable of classifying documents in a way that is close to human perception. SOLUTION: The portion of a multi-topic input document that corresponds to a certain topic is taken as a reference document, and a document vector of the reference document is extracted. Inter-vector similarity is determined between this document vector and each topic vector, where a topic vector is the centroid of the document vectors in a topic class obtained by clustering the topics found in sample documents. A category-sample document vector correspondence table associates each category with a sample document vector determined from sample documents whose topics have been classified by category. For each topic class, inter-vector similarity is determined between the topic vector of a topic class with high inter-vector similarity and the sample document vector in the category-sample document vector correspondence table. Each similarity is multiplied by the importance of the topic, the products are accumulated to give the document similarity, and the reference document is classified into the category with the highest document similarity.

Description

The present invention relates to a document categorizing apparatus that classifies a plurality of document data, and to a method and a program therefor.

In document categorization, which divides an entire document set into arbitrary document groups, document vectors are widely used as the feature quantity for calculating similarity between documents. A document vector is a column vector generated from word occurrence frequencies or from concept vectors defined individually for each word. The similarity between documents can be expressed by the similarity between their document vectors.

Patent Document 1 discloses a method of categorizing documents to be collected (sample documents) using the similarity between document vectors. In this method, the word concept vector defined for each word is obtained for the words contained in a sample document set, and the centroid of those word vectors is extracted as a feature quantity expressed as a document vector. Next, the similarity between the feature quantity of each sample document set and the feature quantity of the reference document to be categorized is calculated, and the document is collected into the same category as the sample document set for which that value is smallest.

In this method, the weight of each word concept vector in the feature quantity of a document is determined from occurrence frequencies in the documents, such as TF-IDF, or is assigned manually. The commonality of topics is generally not taken into account at this stage.

The document data classification device 1 disclosed in Patent Document 2 is known as a method of classifying topics from the contexts contained in each document. Its operation is briefly described with reference to its functional configuration (FIG. 12). Here, a topic is a characteristic word, or a set of such words, contained in a coherent document or context, together with what those words call to mind.

The document data classification device 1 includes a memory 10, a document input unit 20, a context extraction unit 30, a document vector generation unit 40, a context combination unit 50, a cluster classification unit 60, a result display unit 70, and a control unit 80. The document input unit 20 records externally supplied input documents in the memory 10. The context extraction unit 30 reads each input document from the memory 10, determines whether each word in the input document matches a main word recorded in advance in the memory 10, and extracts, as an estimated context, a partial word string of the input document determined relative to the position of each matching word. The document vector generation unit 40 generates a first document vector, which is the document vector of each estimated context. The context combination unit 50 merges estimated contexts that partially overlap one another into a single estimated context and generates a second vector by combining their first document vectors. The cluster classification unit 60 performs a second clustering over all second vectors and determines the final classification.

The document data classification device 1 can classify topics at high speed even when a single document mixes several topics that belong to different fields.

Patent Document 1: JP 2009-277099 A
Patent Document 2: JP 2009-211277 A

As disclosed in Patent Document 1, the document vector widely used as the feature quantity for calculating similarity between documents in document categorization is generated from word occurrence frequencies or from concept vectors defined for each word, and the similarity between document vectors expresses the similarity between documents. When people judge the similarity between documents subjectively, however, they generally do so on the basis of how far the important topics in the documents are shared. The document vector method generally weights words by importance and does not consider topics. The commonality of important topics therefore cannot be reflected in the similarity, which produces a gap from subjective similarity judgments and lowers classification accuracy in document categorization. The method of Patent Document 2 merely groups topics whose feature quantities are highly similar into one category; it does not classify documents by topic.

In other words, no conventional device classifies an arbitrary document group into predetermined categories in a way that is close to human intuition. The present invention has been made in view of this problem, and its object is to provide a document categorizing apparatus, a method, and a program that can achieve document classification closer to human intuition.

The document categorizing apparatus of the present invention includes a feature quantity extraction unit, a topic/topic vector/importance correspondence table, a topic/topic vector/importance output unit, a category-sample document vector correspondence table, and a similarity comparison and classification unit. The feature quantity extraction unit takes an input document containing a plurality of topics as a reference document, extracts the document vector of the reference document, and outputs the reference document and the document vector. The topic/topic vector/importance correspondence table records, in association with one another, a topic class obtained by clustering the topics found in sample documents, a topic vector that is the centroid of the document vectors in that topic class, and the importance of the topic calculated from the co-occurrence relationship of the topics contained in all the sample documents and in the reference document. The topic/topic vector/importance output unit determines the similarity between the document vector and each topic vector, and outputs the topic class and importance corresponding to the topic vector with high similarity, together with the reference document corresponding to the document vector. The category-sample document vector correspondence table classifies the topic vectors of the topic/topic vector/importance correspondence table by category and records the centroid of the classified topic vectors as the sample document vector for that category. The similarity comparison and classification unit determines, for each topic class, the similarity between the topic vector and the sample document vector, multiplies that similarity by the importance of the topic class, accumulates the products over the topic classes to obtain the document similarity, and classifies the reference document into the category with the largest document similarity.

The importance used in the document categorizing apparatus of the present invention reflects the commonality of topics in the similarity, based on the co-occurrence relationship between a topic across all sample documents and that topic in the reference document. Classifying reference documents (input documents) by category using this importance makes it possible to classify documents in a way that is close to human perception.

FIG. 1 is a diagram showing a functional configuration example of the document categorizing apparatus 100 of the present invention.
FIG. 2 is a diagram showing the operation flow of the document categorizing apparatus 100.
FIG. 3 is a diagram showing a functional configuration example of the topic/topic vector/importance correspondence table creation apparatus.
FIG. 4 is a diagram showing a functional configuration example of the topic class extraction unit 210.
FIG. 5 is a diagram showing the operation flow of the topic class extraction unit 210.
FIG. 6 is a diagram schematically showing the process of generating topic classes.
FIG. 7 is a diagram schematically showing differences in the importance of documents.
FIG. 8 is a diagram showing a functional configuration example of the document categorizing apparatus 300 of the present invention.
FIG. 9 is a diagram showing the structure of the category-topic class comparison table 320.
FIG. 10 is a diagram showing the operation flow of the similarity-calculation-unnecessary topic class selection unit 310.
FIG. 11 is a diagram showing the structure of the document-topic class comparison table.
FIG. 12 is a diagram showing the functional configuration of the conventional document data classification device 1.

Embodiments of the present invention are described below with reference to the drawings. Identical components in different drawings carry the same reference numerals, and their description is not repeated.

FIG. 1 shows a functional configuration example of the document categorizing apparatus 100 of the present invention, and FIG. 2 shows its operation flow. The document categorizing apparatus 100 includes a feature quantity extraction unit 110, a topic/topic vector/importance output unit 120, a topic/topic vector/importance correspondence table 130, a similarity comparison and classification unit 140, a category-sample document vector correspondence table 150, and a control unit 160. The document categorizing apparatus 100 is realized, for example, by loading a predetermined program into a computer composed of a ROM, a RAM, a CPU, and the like, and having the CPU execute that program.

The feature quantity extraction unit 110 takes, as a reference document, the portion of a multi-topic input document that corresponds to a certain topic, extracts the document vector of the reference document as a topic vector, and outputs the topic, the reference document, and the topic vector (step S110).

The topic/topic vector/importance correspondence table 130 records in advance, in association with one another, a topic class obtained from the sample documents, a topic vector that is the centroid of the document vectors in that topic class, and the importance of the topic calculated from the co-occurrence relationship of the topics contained in all the sample documents and in the reference document.

The topic/topic vector/importance output unit 120 determines the similarity between the topic vectors and the document vector, and outputs the topic class, topic vector, and importance corresponding to the document vector with close similarity, together with the reference document corresponding to that document vector (step S120). The category-sample document vector correspondence table 150 records in advance each category in association with a sample document vector, which is the centroid of the topic vectors of the topic classes contained in that category.

The similarity comparison and classification unit 140 determines, for each topic class, the similarity between the topic vector and the sample document vector, multiplies that similarity by the importance of the topic class, accumulates the products over the topic classes to obtain the document similarity, and classifies the reference document into the category with the largest document similarity (step S140). Reference documents are stored by category in the categorization result storage unit 170. The control unit 160 controls the operation of each unit.
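The scoring rule of step S140 can be sketched as follows. This is a minimal illustration rather than the claimed implementation: it assumes cosine similarity as the inter-vector similarity and a single sample document vector per category, and the function names, vectors, and category labels are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def classify(matched_topics, category_vectors):
    """matched_topics: (topic_vector, importance) pairs for the topic classes
    matched to the reference document (hypothetical input format).
    category_vectors: category name -> sample document vector."""
    scores = {}
    for category, sample_vec in category_vectors.items():
        # Document similarity: importance-weighted similarities, accumulated
        # over the matched topic classes.
        scores[category] = sum(
            importance * cosine(topic_vec, sample_vec)
            for topic_vec, importance in matched_topics
        )
    best = max(scores, key=scores.get)
    return best, scores

# Illustrative values only.
matched = [([1.0, 0.0, 0.0], 0.073), ([0.0, 0.0, 1.0], 0.167)]
categories = {"news": [1.0, 1.0, 0.0], "politics": [0.0, 0.2, 1.0]}
best, scores = classify(matched, categories)
print(best, scores)
```

With these made-up inputs the second topic carries more importance and lies close to the "politics" vector, so that category obtains the larger accumulated score.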

The importance recorded in advance in the topic/topic vector/importance correspondence table 130 is calculated from the co-occurrence relationship of the topics contained in all the sample documents and in the reference document. Classifying input documents by topic using this importance yields a classification close to human perception. The importance is explained in detail later.

The operation of each functional unit is now described in more detail, beginning with the topic/topic vector/importance correspondence table 130. This table is generated by the topic/topic vector/importance correspondence table creation apparatus 200.

[Topic/topic vector/importance correspondence table creation apparatus]
FIG. 3 shows a functional configuration example of the topic/topic vector/importance correspondence table creation apparatus 200. The apparatus 200 includes a topic class extraction unit 210 and an importance acquisition unit 220. The importance acquisition unit 220 consists of a document counting means 221 and a Kullback-Leibler divergence calculation means 222.

The topic/topic vector/importance correspondence table creation apparatus 200 is likewise realized, for example, by loading a predetermined program into a computer composed of a ROM, a RAM, a CPU, and the like, and having the CPU execute that program.

The topic class extraction unit 210 takes a sample document group as input and extracts the topic classes contained in it and the document vectors of those topic classes. A sample document group is a set of documents containing the topics that serve as samples for collection, sorted into categories by human judgment. A topic class is a cluster generated by clustering the document vectors generated from the documents containing each topic. The topic vector of a cluster is the centroid of the document vectors in that topic class.

[Topic class extraction unit]
FIG. 4 shows a functional configuration example of the topic class extraction unit 210, and FIG. 5 shows its operation flow. The topic class extraction unit 210 is the same as the document data classification device 1 disclosed in Patent Document 2. It is described here in order to make clear the document vectors and topic classes, which are closely related to the present invention.

The topic class extraction unit 210 includes a context extraction means 111, a first document vector generation means 113, a context combination means 114, a topic class classification means 115, and a memory 112. The memory 112 corresponds to the RAM when the topic/topic vector/importance correspondence table creation apparatus 200 is realized, for example, on a computer.

The context extraction means 111 takes the sample document group as input and refers to one or more main words, recorded in the memory 112, that are used to extract estimated contexts. For each word in a sample document that matches a main word, it extracts as an estimated context the word string within a predetermined range of that sample document, determined relative to the position of the matching word (step S111). Each estimated context is stored in the memory 112.
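Step S111 can be sketched as a fixed-size window around each main-word hit. The text says only that the range is "predetermined", so the symmetric window, the `window` parameter, the function name, and the sample sentence are all assumptions made for illustration.

```python
def extract_contexts(words, main_words, window=2):
    """Return (start, end, word_string) for each occurrence of a main word.

    The estimated context is taken as `window` words on either side of the
    matching word, clipped at the document boundaries (an assumed choice).
    """
    contexts = []
    for pos, word in enumerate(words):
        if word in main_words:
            start = max(0, pos - window)
            end = min(len(words), pos + window + 1)
            contexts.append((start, end, words[start:end]))
    return contexts

tokens = "heavy rain is expected tomorrow while stock prices fell sharply today".split()
contexts = extract_contexts(tokens, {"rain", "stock"}, window=2)
for start, end, ctx in contexts:
    print(start, end, " ".join(ctx))
```

Keeping the start and end positions alongside each word string makes it possible to detect overlapping contexts later, which is the information a context combination step would need.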

The first document vector generation means 113 reads each estimated context from the memory 112, generates a first document vector (the document vector of that estimated context) for each, and outputs each first document vector to the memory 112, where it is stored (step S113). A document vector is expressed as a column vector whose components are weights indicating the occurrence tendency of every distinct word in the document set (see, for example, Kazuaki Kishida, "Techniques of document clustering: a review", Mita Society for Library and Information Science, No. 49 (2003), pp. 33-75). Any known method may be used to generate the document vectors. For example, a TF-IDF measure may be used to select a number of words from the document group, and the occurrence frequency (TF) of each selected word in an estimated context may be used as an element of that context's document vector. Alternatively, a document vector of reduced dimensionality may be generated from a word co-occurrence frequency matrix (see, for example, Takenobu Tokunaga, ed. Junichi Tsujii, "Information Retrieval and Language Processing", Language and Computation series, University of Tokyo Press, 1999). Each generated first document vector is given an estimated context ID that identifies its estimated context, and the first document vector and its ID are stored in the memory 112.
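The TF-based variant mentioned above can be sketched as follows. The sketch assumes the vocabulary has already been selected (for example by a TF-IDF ranking, which is not shown), and the function and variable names are hypothetical.

```python
from collections import Counter

def tf_vector(context_words, vocabulary):
    """Document vector of one estimated context: the raw occurrence count
    of each selected vocabulary word, in vocabulary order."""
    counts = Counter(context_words)
    return [counts[word] for word in vocabulary]

# Vocabulary assumed to be pre-selected from the document group.
vocabulary = ["rain", "stock", "prices"]
vec = tf_vector(["stock", "prices", "fell", "stock"], vocabulary)
print(vec)
```

Words outside the selected vocabulary, such as "fell" here, contribute nothing to the vector.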

The context combination means 114 reads the first document vectors stored in the memory 112 and performs a first clustering. For a cluster containing several first document vectors, it outputs as the document vector a composite vector obtained by combining them; for a cluster containing a single first document vector, it outputs that vector as the document vector (step S114). The document vectors are stored in the memory 112.

The topic class classification means 115 reads the document vectors from the memory 112, performs a second clustering over all of them, and outputs the result to the memory 112, where it is stored (step S115). Step S115 can be realized with various clustering methods, for example the non-hierarchical partition-optimizing method known as k-means. The result of the second clustering is a grouping in which each document vector is associated with the number of the document it was obtained from; these groups constitute the topic classes.
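As one possible reading of step S115, the following is a minimal k-means sketch in plain Python. The fixed initial centroids, the iteration count, and the toy two-dimensional vectors are assumptions chosen to keep the example deterministic; a production system would more likely use an established clustering library.

```python
def squared_distance(u, v):
    return sum((x - y) ** 2 for x, y in zip(u, v))

def kmeans(vectors, initial_centroids, iterations=10):
    """Plain k-means: alternate nearest-centroid assignment and centroid update."""
    centroids = [list(c) for c in initial_centroids]  # do not mutate the caller's list
    assignment = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: index of the nearest centroid for every vector.
        assignment = [
            min(range(len(centroids)), key=lambda k: squared_distance(v, centroids[k]))
            for v in vectors
        ]
        # Update step: each centroid becomes the mean of its members.
        for k in range(len(centroids)):
            members = [v for v, a in zip(vectors, assignment) if a == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids

vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
assignment, centroids = kmeans(vectors, initial_centroids=[[1.0, 0.0], [0.0, 1.0]])
print(assignment, centroids)
```

Each resulting cluster index plays the role of a topic class, and the final centroids correspond to what the text later calls topic vectors.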

FIG. 6 schematically shows the process of generating topic classes. The first row shows the main words used to extract estimated contexts and how those main words make up the documents D_1, D_2, and D_3. The second row shows the extracted estimated contexts containing the main words. The third row shows the clustered topic classes.

The index i of a topic class is an identifier for a topic, such as i = a (for example, weather), i = b (economy), i = c (sports), or i = d (politics). Topic class i = a (weather) contains the document vector a_1 from document D_1 and the document vector a_2 from document D_2. Topic class i = b (economy) contains the document vector b_1 from D_1, the document vector b_2 from D_2, and the document vector b_3 from D_3. Topic class i = c (sports) contains the document vector c_2 from D_2 and the document vector c_3 from D_3. Topic class i = d (politics) contains the document vector d_3 from D_3.

The vectors in each topic class i are averaged per topic class, and the resulting centroid is taken as the topic vector representing that topic class.
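The averaging step can be sketched as a component-wise mean. The three-dimensional vectors below are invented for illustration; only the class labels a and b follow the example of FIG. 6.

```python
def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

# Hypothetical document vectors grouped by topic class.
topic_classes = {
    "a": [[2, 0, 1], [4, 0, 1]],
    "b": [[0, 3, 0], [0, 1, 0], [0, 2, 3]],
}
topic_vectors = {topic: centroid(vecs) for topic, vecs in topic_classes.items()}
print(topic_vectors)
```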

[Importance acquisition unit]
The importance acquisition unit 220 consists of the document counting means 221 and the Kullback-Leibler divergence calculation means 222. The document counting means 221 takes the topic classes i as input and counts the number of documents in each topic class. In the example of FIG. 6, topic class i = a has 2 documents, i = b has 3, i = c has 2, and i = d has 1, so the total is counted as 8.

The Kullback-Leibler divergence calculation means 222 determines the appearance probability distribution P(i) over the documents containing a certain topic i, and calculates its difference from the appearance probability distribution Q(i) of topic i over all documents Q as the importance of the topic class, using equation (1).

This measure reflects, as the importance of topic class i, the difference between the co-occurrence relationship of topic class i with all other topic classes and the co-occurrence relationship of that topic over all documents (Q). A concrete example of the co-occurrence relationship is given using FIG. 6, taking topic a. In topic class i = a, the documents containing a are D_1 and D_2, so the co-occurrence count is 2. In topic class i = b, the documents containing a are D_1 and D_2, so the co-occurrence count is 2. In topic class i = c, the only document containing a is D_2, so the co-occurrence count is 1. In topic class i = d, no document contains a, so the co-occurrence count is 0. The total co-occurrence count is 2 + 2 + 1 = 5, so the appearance probability for documents containing a is P(i) = 2/5, and the appearance probability of topic i over all documents Q is Q(i) = 2/8.

Substituting these values into equation (1) gives an importance of 0.073 for topic class i = a. In the same way, the importance of topic class i = b is 0.0, that of i = c is 0.016, and that of i = d is 0.167. Table 1 shows the co-occurrence counts from which these importance values were calculated.

In this example, the order of importance is d (politics) > a (weather) > c (sports) > b (economy).
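Equation (1) is not reproduced in this text, so the following is a sketch under stated assumptions rather than the patent's own formula. It assumes the standard Kullback-Leibler divergence with a base-10 logarithm, D(P || Q) = Σ_j P(j) log10(P(j) / Q(j)), where P(j) is the share of topic i's co-occurrence counts that falls in topic class j and Q(j) is the share of all counted documents in class j. The document contents are read off the FIG. 6 example. Under these assumptions the four values quoted above are reproduced.

```python
import math

# Topic membership of each document in the FIG. 6 example.
docs = {"D1": {"a", "b"}, "D2": {"a", "b", "c"}, "D3": {"b", "c", "d"}}
topics = ["a", "b", "c", "d"]

def importance(topic):
    # Co-occurrence count with class j: documents of class j that contain `topic`.
    cooc = [sum(1 for ts in docs.values() if topic in ts and j in ts) for j in topics]
    # Class sizes over all documents (2, 3, 2, 1; total 8).
    size = [sum(1 for ts in docs.values() if j in ts) for j in topics]
    p = [c / sum(cooc) for c in cooc]
    q = [s / sum(size) for s in size]
    # KL divergence with base-10 log (assumed); zero-probability terms contribute 0.
    return sum(pj * math.log10(pj / qj) for pj, qj in zip(p, q) if pj > 0)

scores = {t: importance(t) for t in topics}
ranking = sorted(topics, key=scores.get, reverse=True)
print({t: round(s, 3) for t, s in scores.items()}, ranking)
```

Topic b appears in every document, so its co-occurrence distribution equals the overall distribution and its divergence is zero, which is why it ranks last.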

FIG. 7 schematically shows why the importance behaves this way in this embodiment. Each item labeled 70 is an estimated context; there are several, and their individual reference numerals are omitted. The dash-dot ellipse indicating topic class Pe shows that it co-occurs with the other topic classes Pf and Pg. A topic such as Pe, which co-occurs with many other topic classes, has lower importance than the independent topic classes Pf and Pg. Topics with many co-occurrence relationships tend to include formulaic content and incidental elements such as interjections, so their importance is relatively low.

The topic/topic vector/importance correspondence table 130 is created by recording, in association with one another, the topic classes, topic vectors, and importance values output by the topic/topic vector/importance correspondence table creation apparatus 200.

[Feature quantity extraction unit]
The feature quantity extraction unit 110 takes, as a reference document, the portion of a multi-topic input document that corresponds to a certain topic, extracts the document vector of that reference document, and outputs the reference document and the document vector. The feature quantity extraction unit 110 is substantially the same as the topic class extraction unit 210 of the topic/topic vector/importance correspondence table creation apparatus 200 described above. Taking a multi-topic input document as input, it performs the processing of the topic class extraction unit 210 from the context extraction means 111 through the context combination means 114. As a result, the document vector and the reference document corresponding to it are output to the topic/topic vector/importance output unit 120.

〔話題・話題ベクトル・重要度出力部〕 [Topic / Topic Vector / Importance Output Unit]
話題・話題ベクトル・重要度出力部１２０は、特徴量抽出部１１０が出力する文書ベクトルと、話題/話題ベクトル/重要度対応表１３０に記録された話題ベクトルとのベクトル間の類似度を求め、当該類似度の大きい話題ベクトルに対応する話題クラスと文書ベクトルと重要度と、上記話題ベクトルの参照文書とを出力する。 The topic / topic vector / importance output unit 120 obtains the similarity between the document vector output by the feature quantity extraction unit 110 and each topic vector recorded in the topic / topic vector / importance correspondence table 130, and outputs the topic class, document vector, and importance corresponding to a topic vector with high similarity, together with the reference document for that topic vector.

ベクトル間の類似度は、例えば次式に示すコサイン類似度を用いて評価する。 The similarity between vectors is evaluated using, for example, a cosine similarity expressed by the following equation.

ａ_ｎは文書ベクトル、Ｐ（ｉ）は話題ベクトルである。 a _n is a document vector, and P (i) is a topic vector.
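The equation itself did not survive in this text, but cosine similarity has a standard definition: the dot product of the two vectors divided by the product of their lengths. The following is a minimal Python sketch of that standard definition applied to a document vector and a set of topic vectors. The function names, the dictionary layout, and the handling of zero vectors are assumptions made for illustration and are not taken from the patent.

```python
import math

def cosine_similarity(u, v):
    """Standard cosine similarity between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: a zero vector is treated as dissimilar to everything
    return dot / (norm_u * norm_v)

def most_similar_topic(doc_vector, topic_vectors):
    """Return (topic_class, similarity) for the topic vector closest to doc_vector.

    topic_vectors is a hypothetical {topic_class: vector} mapping standing in
    for the topic vectors recorded in correspondence table 130.
    """
    return max(
        ((topic, cosine_similarity(doc_vector, vec)) for topic, vec in topic_vectors.items()),
        key=lambda pair: pair[1],
    )
```

For instance, `most_similar_topic([1.0, 0.1], {"a": [1.0, 0.0], "b": [0.0, 1.0]})` selects topic class `"a"`.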

〔カテゴリ−文書ベクトル対応表〕 [Category-document vector correspondence table]
カテゴリ−標本文書ベクトル対応表１５０は、上記した話題/話題ベクトル/重要度対応表１３０から作成する。話題/話題ベクトル/重要度対応表１３０の話題ベクトルを標本文書のカテゴリごとに分類し、当該分類された話題ベクトルの重心を、上記カテゴリに対応させた標本文書ベクトルとして記録したものがカテゴリ−標本文書ベクトル対応表１５０である。 The category-sample document vector correspondence table 150 is created from the topic / topic vector / importance correspondence table 130 described above. The topic vectors in the correspondence table 130 are classified by the category of the sample documents, and the centroid of the topic vectors classified into each category is recorded as the sample document vector for that category. The resulting table is the category-sample document vector correspondence table 150.
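The construction described here, grouping topic vectors by category and recording each group's centroid, can be sketched as follows. This is only an illustration: the input layout (a list of `(category, topic_vector)` pairs) and the function name are hypothetical, and the centroid is taken to be the component-wise arithmetic mean.

```python
from collections import defaultdict

def build_category_table(topic_rows):
    """Group topic vectors by category and record each group's centroid.

    topic_rows: iterable of (category, topic_vector) pairs (hypothetical layout).
    Returns {category: centroid}, a stand-in for correspondence table 150.
    """
    grouped = defaultdict(list)
    for category, vector in topic_rows:
        grouped[category].append(vector)

    table = {}
    for category, vectors in grouped.items():
        count = len(vectors)
        # Component-wise mean over all vectors assigned to this category.
        table[category] = [sum(column) / count for column in zip(*vectors)]
    return table
```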

〔類似度比較分類部１４０〕 [Similarity comparison and classification unit 140]
類似度比較分類部１４０は、話題クラスごとに話題・話題ベクトル・重要度出力部１２０が出力する単語ベクトルと、カテゴリ−文書ベクトル対応表１５０に記録された標本文書ベクトルとの間の類似度を求め、当該類似度に当該話題クラスの重要度を乗じた値を、話題クラスごとに累積した値を文書類似度として求め、当該文書類似度の最も大きなカテゴリに話題・話題ベクトル・重要度出力部１２０が出力する参照文書を分類する。 For each topic class, the similarity comparison and classification unit 140 obtains the similarity between the word vector output by the topic / topic vector / importance output unit 120 and the sample document vector recorded in the category-document vector correspondence table 150, multiplies that similarity by the importance of the topic class, and accumulates the products over the topic classes to obtain the document similarity. It then classifies the reference document output by the topic / topic vector / importance output unit 120 into the category with the highest document similarity.

類似度の計算方法としては種々の方法を適用できるが、この実施例では、文書ベクトル間のコサイン類似度を各々の話題の重要度で正規化した値を加算して、文書の文書類似度とする。 Various methods can be used to calculate the similarity. In this embodiment, the cosine similarities between document vectors, each normalized by the importance of its topic, are summed to give the document similarity.

２つの文書間の類似度計算において、双方に含まれる話題クラスａの話題ベクトル（ａ_１，ａ_２）間のコサイン類似度ｃｏｓ（ａ_１，ａ_２）を話題の類似度Ｓ（ａ_１，ａ_２）とし、すべての文書Ｑにおける話題の確率分布と話題クラスａを含む文書集合Ａにおける話題の確率分布とのカルバック・ライブラー情報量Ｄ_ＫＬ（Ａ‖Ｑ）を話題の重要度Ｉ（ａ）として、話題の類似度を各々の話題の重要度で重み付け加算して文書間の文書類似度とする。 In calculating the similarity between two documents, the cosine similarity cos(a_1, a_2) between the topic vectors (a_1, a_2) of a topic class a contained in both documents is taken as the topic similarity S(a_1, a_2). The Kullback-Leibler divergence D_KL(A‖Q) between the topic probability distribution over all documents Q and the topic probability distribution over the document set A containing topic class a is taken as the topic importance I(a). The topic similarities are then weighted by the importance of each topic and summed to give the document similarity between the documents.
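As an illustration of the importance I(a), the sketch below computes the standard Kullback-Leibler divergence between two discrete distributions over topic classes. The dictionary representation, the use of the natural logarithm, and the treatment of zero probabilities are assumptions; the source text does not specify them.

```python
import math

def kl_divergence(p, q):
    """Standard D_KL(p || q) for discrete distributions given as {topic_class: probability}.

    Assumptions for this sketch: natural logarithm; a term with p[i] == 0
    contributes 0; the result is infinite if p has mass where q has none.
    """
    total = 0.0
    for topic, p_i in p.items():
        if p_i == 0.0:
            continue
        q_i = q.get(topic, 0.0)
        if q_i == 0.0:
            return math.inf
        total += p_i * math.log(p_i / q_i)
    return total
```

When the distribution within the document set A matches the distribution over all documents Q, the divergence is zero, so a topic whose presence tells nothing about the other topics in a document would receive zero importance under this measure.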

例えば、１カテゴリ１文書の場合、標本文書Ｄ_１、参照文書Ｄ_２がそれぞれ話題クラスａ，ｂの話題（ａ_１，ｂ_１），（ａ_２，ｂ_２）を持つとき、その文書類似度は式（３）で表される。 For example, with one document per category, when the sample document D_1 and the reference document D_2 have topics (a_1, b_1) and (a_2, b_2) of topic classes a and b, respectively, their document similarity is given by equation (3).

また、１カテゴリ複数文書の場合、複数の標本文書Ｄ_１，Ｄ_２で構成されるカテゴリＣ_１において、カテゴリＣ_１の話題ベクトル（ａ_ｃ１，ｂ_ｃ１）を、標本文書Ｄ_１，Ｄ_２の話題ベクトルの相加平均ａ_ｃ１＝１/２・（ａ_１＋ａ_２），ｂ_ｃ１＝１/２・（ｂ_１＋ｂ_２）とし、カテゴリＣ_１と、参照文書Ｄ_３の文書類似度は、式（４）で表される。 With multiple documents per category, for a category C_1 consisting of the sample documents D_1 and D_2, the topic vectors (a_c1, b_c1) of category C_1 are taken to be the arithmetic means of the topic vectors of D_1 and D_2, that is, a_c1 = 1/2 · (a_1 + a_2) and b_c1 = 1/2 · (b_1 + b_2), and the document similarity between category C_1 and a reference document D_3 is given by equation (4).
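Equations (3) and (4) are not reproduced in this text. The sketch below follows the verbal description only: for each topic class shared by the two sides, the cosine similarity of the topic vectors is multiplied by the importance of that topic class, and the products are summed. The function names and the dictionary layout are hypothetical, and the exact form of the original equations may differ.

```python
import math

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length (assumption)."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0

def mean_vector(vectors):
    """Arithmetic mean of several topic vectors, as used for a category's topic vector."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

def document_similarity(sample_topics, reference_topics, importance):
    """Importance-weighted sum of topic similarities over shared topic classes.

    sample_topics, reference_topics: {topic_class: topic_vector}
    importance: {topic_class: importance value I(topic_class)}
    """
    shared = sample_topics.keys() & reference_topics.keys()
    return sum(importance[t] * cosine(sample_topics[t], reference_topics[t]) for t in shared)

def classify(reference_topics, category_topics, importance):
    """Pick the category whose topic vectors give the highest document similarity."""
    return max(
        category_topics,
        key=lambda c: document_similarity(category_topics[c], reference_topics, importance),
    )
```

With the importance values from the worked example (0.073 for topic class a and 0.167 for topic class d), a pair of documents that agree exactly on a and are orthogonal on d would score 0.073.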

重要度Ｉ（・）は、標本文書のすべての文書と入力文書（参照文書）に含まれる話題の共起関係から算出されるものであり、この重要度Ｉ（・）を用いて入力文書を話題ごとに分類することで、重要度の低い話題が文書の類似性判定の対象になることを防止し、従来の人の主観による文書カテゴライズの分類結果と異なってしまう課題を解決することができる。また、重要度の低い単語の影響によって低下していた類似度の信頼性を向上させることができ、重要度の低い話題を排除するために、人為的に単語や話題を選択する作業を不要にすることができる。 The importance I(·) is calculated from the co-occurrence relationship of the topics contained in all of the sample documents and the input document (reference document). Classifying the input document topic by topic using this importance I(·) keeps low-importance topics from entering the document similarity judgment, and thereby addresses the problem that the result differs from a conventional document categorization made by human subjective judgment. It also improves the reliability of the similarity, which had been degraded by the influence of low-importance words, and removes the need to select words and topics by hand in order to exclude low-importance topics.

上記した文書カテゴライズ装置１００は、標本文書のすべての文書と入力文書（参照文書）に含まれる話題の共起関係から文書類似度を算出する必要があり、計算量が多くなる場合がある。そこで、計算量を減らす工夫をした文書カテゴライズ装置３００を次に説明する。 The document categorizing apparatus 100 described above needs to calculate the document similarity from the co-occurrence relationship of topics included in all the documents of the sample document and the input document (reference document), and the calculation amount may increase. Therefore, a document categorizing apparatus 300 devised to reduce the amount of calculation will be described next.

図８に、文書カテゴライズ装置３００の機能構成例を示す。文書カテゴライズ装置３００は、文書カテゴライズ装置１００に対して類似度算出不要話題クラス選択部３１０と、カテゴリ−話題クラス対照表３２０と、を備える点で異なる。また、その構成の違いにより制御部１６０が、制御部１６０′となる点で異なる。その他の構成部は、参照符号から明らかなように文書カテゴライズ装置１００と同じである。 FIG. 8 shows a functional configuration example of the document categorizing apparatus 300. The document categorizing apparatus 300 differs from the document categorizing apparatus 100 in that it includes a similarity calculation unnecessary topic class selection unit 310 and a category-topic class comparison table 320. Because of this difference in configuration, the control unit 160 is replaced by a control unit 160′. The other components are the same as those of the document categorizing apparatus 100, as is clear from the reference numerals.

〔カテゴリ−話題クラス対照表〕 [Category-topic class comparison table]
図９に、カテゴリ−話題クラス対照表３２０の構造を例示する。例えば、行方向にカテゴリが配列され、列方向に話題クラスｉが配列されて、カテゴリ−話題クラス対照表３２０が構成される。 FIG. 9 illustrates the structure of the category-topic class comparison table 320. For example, categories are arranged in the row direction, and topic classes i are arranged in the column direction, whereby the category-topic class comparison table 320 is configured.

例えば、カテゴリＣ_１は『政治』、カテゴリＣ_２は『スポーツ』、カテゴリＣ_３は『天気』、といったものである。そのカテゴリに対する話題クラスは、例えば「選挙」、「首相」、「サッカー」、「野球」、「低気圧」、「台風」といったものである。 For example, category C_1 is "politics", category C_2 is "sports", and category C_3 is "weather". The topic classes for these categories are, for example, "election", "prime minister", "soccer", "baseball", "low-pressure system", and "typhoon".

カテゴリ−話題クラス対照表３２０は、対象とする参照文書に応じて事前に作成しておく。その作成は人手で行っても良いし、大量の文書を形態素解析した結果からコンピュータを用いて生成するようにしても良い。 The category-topic class comparison table 320 is created in advance according to the target reference document. The creation may be performed manually or may be generated using a computer from the result of morphological analysis of a large number of documents.

〔類似度算出不要話題クラス選択部〕 [Similarity calculation unnecessary topic class selection unit]
図１０に、類似度算出不要話題クラス選択部３１０の動作フローを示す。その動作フローを参照して動作を説明する。 FIG. 10 shows the operation flow of the similarity calculation unnecessary topic class selection unit 310. The operation is described with reference to this flow.

類似度算出不要話題クラス選択部３１０は、カテゴリ−話題クラス対照表３２０を参照して、ある話題クラスを含むカテゴリを横断的に探索して所定の数よりも少ない当該ある話題クラスを含むカテゴリを抽出する（ステップＳ３１０）。この処理は、例えば話題クラスｉ＝３の話題を含むカテゴリを、図９の例では縦方向に探索して行き、そのカテゴリの数が所定の数（例えば５個）より少ないカテゴリを抽出する。図９では、話題クラスｉ＝３を含むカテゴリは、カテゴリＣ_２のみである場合を例示している。所定の数は、処理対象の参照文書のカテゴリの数に応じて定められる数である。 The similarity calculation unnecessary topic class selection unit 310 refers to the category-topic class comparison table 320, searches across the categories for those that contain a given topic class, and extracts the categories containing that topic class when there are fewer of them than a predetermined number (step S310). For example, the categories containing topic class i = 3 are searched for in the vertical direction in the example of FIG. 9, and they are extracted if their number is smaller than the predetermined number (for example, 5). FIG. 9 illustrates the case where category C_2 is the only category containing topic class i = 3. The predetermined number is set according to the number of categories of the reference documents to be processed.

このカテゴリＣ_２にだけ含まれる話題クラスの集合をｇ_ｅ（２）、それ以外の集合をｇ_ｏ（２）とした時、例えば、話題クラスをｉ＝１〜４とした場合、ｇ_ｅ（２）＝｛３｝，ｇ_ｏ（２）＝｛１，２，４｝である。各々の話題クラスの話題ベクトルをＶ（ｉ）（ｉ＝１，２，…，ｋ）とする。図１１に、文書と話題クラスの対応関係を、文書−話題クラス対照表として表す。文書毎に話題クラスの話題ベクトルが順番に配列される。２行目の文書Ｄ_２は、話題クラスｉ＝２，３，４を含まないことを表している。この図１１に示す関係は参照文書でも同じである。この文書と話題クラスの対応関係は、話題・話題ベクトル・重要度出力部１２０の出力する情報に含まれている。 Let g_e(2) be the set of topic classes contained only in this category C_2, and g_o(2) the set of the remaining topic classes. For example, with topic classes i = 1 to 4, g_e(2) = {3} and g_o(2) = {1, 2, 4}. Let the topic vector of each topic class be V(i) (i = 1, 2, ..., k). FIG. 11 shows the correspondence between documents and topic classes as a document-topic class comparison table, in which the topic vectors of the topic classes are arranged in order for each document. The document D_2 in the second row does not contain topic classes i = 2, 3, 4. The relationship shown in FIG. 11 also holds for reference documents. This correspondence between documents and topic classes is included in the information output by the topic / topic vector / importance output unit 120.
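Step S310 and the split into g_e(j) and g_o(j) can be sketched as set operations over the category-topic class comparison table. The sketch assumes the table is held as a `{category: set of topic classes}` mapping, and it takes g_o(j) to be every topic class in the table that is not in g_e(j); both are assumptions made for illustration, as are the function names.

```python
def rare_topic_classes(category_topics, threshold):
    """Step S310 sketch: topic classes found in fewer than `threshold` categories.

    category_topics: {category: set of topic classes} (hypothetical table layout).
    Returns {topic_class: sorted list of the categories containing it}.
    """
    containing = {}
    for category, topics in category_topics.items():
        for topic in topics:
            containing.setdefault(topic, []).append(category)
    return {t: sorted(cats) for t, cats in containing.items() if len(cats) < threshold}

def split_topic_classes(category_topics, j):
    """Return (g_e, g_o): topic classes only in category j, and all remaining ones."""
    others = set()
    for category, topics in category_topics.items():
        if category != j:
            others |= topics
    g_e = category_topics[j] - others
    g_o = (others | category_topics[j]) - g_e
    return g_e, g_o
```

For a table such as `{"C1": {1, 2}, "C2": {2, 3}, "C3": {1, 4}}`, `split_topic_classes(table, "C2")` gives `({3}, {1, 2, 4})`, which mirrors the example values of g_e(2) and g_o(2) in the text.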

次に類似度算出不要話題クラス選択部３１０は、抽出したカテゴリｊ以外のカテゴリが含む話題クラスを持つ文書のカテゴリ類似度の最大値を求める（式（５））（ステップＳ３１１）。 Next, the similarity calculation unnecessary topic class selection unit 310 obtains the maximum category similarity of documents having topic classes included in categories other than the extracted category j (formula (5)) (step S311).

ここで、ｓ≠ｊは抽出したカテゴリ以外のカテゴリを意味する。Ｗ_ｉは話題クラスｉの重要度（重み）、１は話題ベクトルの類似度の最大値である。よって、抽出したカテゴリ以外のカテゴリの類似度の最大値を求めることが出来る。 Here, s ≠ j denotes the categories other than the extracted category. W_i is the importance (weight) of topic class i, and 1 is the maximum value of the similarity between topic vectors. This gives the maximum similarity for the categories other than the extracted category.

次に、抽出したカテゴリに含まれる話題クラスを持つ文書のカテゴリ類似度の最小値を求める（式（６））（ステップＳ３１２）。 Next, the minimum value of the category similarity of the document having the topic class included in the extracted category is obtained (formula (6)) (step S312).

ここで、εは、例えば分割最適化クラスタリングによって話題クラスを分割する場合の、クラス内の任意の２つの話題ベクトルの類似度の最小値である。εは、１よりもかなり小さな値である。 Here, ε is the minimum value of the similarity between any two topic vectors in the class when the topic class is divided by, for example, division optimization clustering. ε is a value considerably smaller than 1.

次に、上記最小値が上記最大値よりも大きなカテゴリに含まれる話題クラスを類似度算出不要話題クラスとして当該カテゴリに対応付ける（ステップＳ３１３）。ここで、ε≪１の関係から明らかなように、この関係が成り立つ類似度を持つ話題クラスは、そのカテゴリに強く関連する話題クラスであり、文書類似度を計算するまでもなくそのカテゴリに分類することが出来る。その話題クラスｉは、「話題クラスｉ」∈「あるカテゴリにだけ含まれる話題クラスの集合」として対応付けられる。 Next, a topic class contained in a category for which the minimum value is larger than the maximum value is associated with that category as a similarity calculation unnecessary topic class (step S313). As is clear from the relationship ε ≪ 1, a topic class whose similarity satisfies this relationship is strongly related to the category, so a document can be classified into that category without calculating the document similarity. The topic class i is associated as "topic class i" ∈ "set of topic classes contained only in a certain category".

次に類似度算出不要話題クラス選択部３１０は、話題・話題ベクトル・重要度出力部１２０が出力する参照文書を入力として、当該参照文書に含まれる話題クラスが上記類似度算出不要話題クラスと一致すると、当該参照文書を上記当該カテゴリに分類して外部（カテゴライズ結果蓄積部１７０）に出力すると共に、上記類似度算出不要話題クラスを含まない参照文書を類似度比較分類部１４０に出力する（ステップＳ３１４）。 Next, the similarity calculation unnecessary topic class selection unit 310 receives the reference document output from the topic / topic vector / importance output unit 120, and the topic class included in the reference document matches the similarity calculation unnecessary topic class. Then, the reference document is classified into the category and output to the outside (categorization result accumulation unit 170), and the reference document that does not include the similarity calculation unnecessary topic class is output to the similarity comparison classification unit 140 (step) S314).
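The routing in step S314 can be sketched as a lookup: if the reference document contains a topic class that has been registered as similarity calculation unnecessary for some category, the document goes straight to that category; otherwise it is passed on for similarity calculation. The data layout, the function name, and the choice to return the first matching category are assumptions for illustration; the text does not say how a document matching several categories would be handled.

```python
def route_reference_document(doc_topics, unnecessary_by_category):
    """Step S314 sketch: decide whether similarity calculation can be skipped.

    doc_topics: set of topic classes found in the reference document.
    unnecessary_by_category: {category: set of similarity calculation
        unnecessary topic classes} (hypothetical layout).
    Returns ("categorized", category) or ("needs_similarity", None).
    """
    for category, topics in unnecessary_by_category.items():
        if doc_topics & topics:
            # Assumption: the first matching category wins.
            return ("categorized", category)
    return ("needs_similarity", None)
```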

以上説明したように、類似度算出不要話題クラス選択部３１０と、カテゴリ−話題クラス対照表３２０と、を備えることで、参照文書の話題クラスから、文書類似度計算の要不要を判定することができるので、計算量を削減することが出来る。 As described above, providing the similarity calculation unnecessary topic class selection unit 310 and the category-topic class comparison table 320 makes it possible to determine, from the topic classes of a reference document, whether the document similarity needs to be calculated, which reduces the amount of calculation.

なお、εで求めた類似度の最大値が、εの値が小さい故に得られない場合が想定される。その場合は、参照文書が入力された時に類似度の最小値を式（７）で求めれば良い。その時の最大値は、式（８）で求めた最大値を用いる。 It is assumed that the maximum value of the similarity obtained with ε cannot be obtained because the value of ε is small. In that case, the minimum value of the similarity may be obtained by Expression (7) when the reference document is input. As the maximum value at that time, the maximum value obtained by the equation (8) is used.

ここで｜ｇ_ｅ（ｊ）｜は、あるカテゴリにだけ含まれる話題クラスの集合ｇ_ｅ（ｊ）が含む話題クラスの個数、Ｖ_ｒ（ｉ）は、ある参照文書の話題ベクトルである。 Here, |g_e(j)| is the number of topic classes in the set g_e(j) of topic classes contained only in a certain category, and V_r(i) is a topic vector of a reference document.

また、類似度比較分類部１４０における類似度計算は、上記した式（３）及び式（４）で計算しても良いし、式（９）に示す類似スコアＲＳを計算して類似度を判定しても良い。 Further, the similarity calculation in the similarity comparison and classification unit 140 may be calculated by the above formulas (3) and (4), or the similarity score RS shown in the formula (9) is calculated to determine the similarity. You may do it.

文書カテゴライズ装置３００よりも更に計算量を減らすことの可能な文書カテゴライズ装置４００を次に説明する。文書カテゴライズ装置４００の図示は省略する。文書カテゴライズ装置４００は、文書カテゴライズ装置３００の類似度算出不要話題クラス選択部３１０が、類似度算出不要話題クラス選択部４１０に置き代わったものである。 Next, a document categorizing apparatus 400 capable of reducing the amount of calculation further than the document categorizing apparatus 300 will be described. The illustration of the document categorizing apparatus 400 is omitted. The document categorizing apparatus 400 is obtained by replacing the similarity calculation unnecessary topic class selecting unit 310 of the document categorizing apparatus 300 with a similarity calculation unnecessary topic class selecting unit 410.

類似度算出不要話題クラス選択部４１０は、参照文書が入力された時に、当該参照文書に含まれる話題クラスを含むカテゴリの範囲を特定する処理を行う点でのみ異なる。カテゴリの範囲を特定した後は、カテゴリ−話題クラス対照表のその特定範囲のみを処理の対象とする。 Similarity calculation unnecessary topic class selection unit 410 differs only in that, when a reference document is input, a process of specifying a category range including a topic class included in the reference document is performed. After the category range is specified, only the specified range of the category-topic class comparison table is the processing target.

参照文書に含まれる話題クラスの集合をｒとするとき、ｒの一部ないし全部を話題クラスに含むカテゴリの集合Ｃ_ｒ（式（１０））に絞り込むことで、計算量を削減することが出来る。 Letting r be the set of topic classes contained in the reference document, the amount of calculation can be reduced by narrowing the categories down to the set C_r (equation (10)) of categories whose topic classes include some or all of r.
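Equation (10) is not reproduced in this text. Going by the verbal description alone, the narrowing keeps every category that shares at least one topic class with the reference document. A minimal sketch under that reading, with a hypothetical function name and a `{category: set of topic classes}` table layout:

```python
def narrow_categories(reference_topics, category_topics):
    """Keep only the categories whose topic classes overlap the reference document's.

    reference_topics: set r of topic classes in the reference document.
    category_topics: {category: set of topic classes} (hypothetical table layout).
    Returns the restricted table, a stand-in for the category set C_r.
    """
    return {
        category: topics
        for category, topics in category_topics.items()
        if topics & reference_topics
    }
```

Subsequent processing would then run only over the returned sub-table instead of the full comparison table.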

カテゴリ−話題クラス対照表３２０の特定範囲を絞り込んだ後の処理は、類似度算出不要話題クラス選択部３１０と同じである。 After the category-topic class comparison table 320 has been narrowed to the specific range, the processing is the same as that of the similarity calculation unnecessary topic class selection unit 310.

文書カテゴライズ装置４００によれば、計算対象のカテゴリの範囲が、参照文書の入力された時点で絞り込まれるので、更に計算量を削減することが可能になる。なお、ある一定量の参照文書が入力された後は、式（１０）の計算を行わずに、参照文書の話題クラスから、直近の特定範囲のカテゴリ−話題クラス対照表を用いて良いかの判断を行うことも可能である。 According to the document categorizing apparatus 400, since the range of the category to be calculated is narrowed down when the reference document is input, the amount of calculation can be further reduced. After a certain amount of reference document is input, whether the category-topic class comparison table in the latest specific range can be used from the topic class of the reference document without performing the calculation of equation (10). It is also possible to make a judgment.

また、実施例２に述べたεで求めた類似度の最大値が求められない場合に、参照文書が入力された時に類似度の最小値を式（７）で求める例を説明したが、その最小値を計算する前に、式（１０）で対象とするカテゴリ−話題クラス対照表の範囲を絞り込んでから、式（７）で最小値を求めるようにしても良い。そのようにすることで、実施例２で述べた方法よりも計算量を減らすことが出来る。 Further, the example in which the minimum value of the similarity is obtained by the equation (7) when the reference document is input when the maximum value of the similarity obtained by ε described in the second embodiment is not obtained has been described. Before calculating the minimum value, the range of the target category-topic class comparison table may be narrowed down using Expression (10), and then the minimum value may be calculated using Expression (7). By doing so, the amount of calculation can be reduced as compared with the method described in the second embodiment.

上記各装置及び方法において説明した処理は、記載の順に従って時系列に実行されるのみならず、処理を実行する装置の処理能力あるいは必要に応じて並列的にあるいは個別に実行されるとしてもよい。 The processes described for the above apparatuses and methods need not be executed in time series in the order described; they may also be executed in parallel or individually, depending on the processing capability of the apparatus executing them or as needed.

また、上記装置における処理手段をコンピュータによって実現する場合、各装置が有すべき機能の処理内容はプログラムによって記述される。そして、このプログラムをコンピュータで実行することにより、各装置における処理手段がコンピュータ上で実現される。 Further, when the processing means in the above apparatus is realized by a computer, the processing contents of functions that each apparatus should have are described by a program. Then, by executing this program on the computer, the processing means in each apparatus is realized on the computer.

この処理内容を記述したプログラムは、コンピュータで読み取り可能な記録媒体に記録しておくことができる。コンピュータで読み取り可能な記録媒体としては、例えば、磁気記録装置、光ディスク、光磁気記録媒体、半導体メモリ等どのようなものでもよい。具体的には、例えば、磁気記録装置として、ハードディスク装置、フレキシブルディスク、磁気テープ等を、光ディスクとして、ＤＶＤ（Digital Versatile Disc）、ＤＶＤ−ＲＡＭ（Random Access Memory）、ＣＤ−ＲＯＭ（Compact Disc Read Only Memory）、ＣＤ−Ｒ（Recordable）/ＲＷ（ReWritable）等を、光磁気記録媒体として、ＭＯ（Magneto Optical disc）等を、半導体メモリとしてＥＥＰ−ＲＯＭ（Electronically Erasable and Programmable-Read Only Memory）等を用いることができる。 The program describing the processing contents can be recorded on a computer-readable recording medium. As the computer-readable recording medium, any recording medium such as a magnetic recording device, an optical disk, a magneto-optical recording medium, and a semiconductor memory may be used. Specifically, for example, as a magnetic recording device, a hard disk device, a flexible disk, a magnetic tape or the like, and as an optical disk, a DVD (Digital Versatile Disc), a DVD-RAM (Random Access Memory), a CD-ROM (Compact Disc Read Only). Memory), CD-R (Recordable) / RW (ReWritable), etc., magneto-optical recording medium, MO (Magneto Optical disc), etc., semiconductor memory, EEP-ROM (Electronically Erasable and Programmable-Read Only Memory), etc. Can be used.

また、このプログラムの流通は、例えば、そのプログラムを記録したＤＶＤ、ＣＤ−ＲＯＭ等の可搬型記録媒体を販売、譲渡、貸与等することによって行う。さらに、このプログラムをサーバコンピュータの記録装置に格納しておき、ネットワークを介して、サーバコンピュータから他のコンピュータにそのプログラムを転送することにより、このプログラムを流通させる構成としてもよい。 The program is distributed by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM in which the program is recorded. Further, the program may be distributed by storing the program in a recording device of a server computer and transferring the program from the server computer to another computer via a network.

また、各手段は、コンピュータ上で所定のプログラムを実行させることにより構成することにしてもよいし、これらの処理内容の少なくとも一部をハードウェア的に実現することとしてもよい。 Each means may be configured by executing a predetermined program on a computer, or at least a part of these processing contents may be realized by hardware.

Claims

A feature quantity extraction unit that takes an input document including a plurality of topics as a reference document, extracts a document vector of the reference document, and outputs the reference document and the document vector;
A topic / topic vector / importance correspondence table in which a topic class obtained by cluster-classifying topics obtained from sample documents, a topic vector that is the centroid of the document vectors included in the topic class, and an importance of the topic calculated from the co-occurrence relationship of topics included in all of the sample documents and the reference document are recorded in association with one another;
A topic / topic vector / importance output unit that obtains the similarity between the document vector and the topic vector, and outputs the topic class and importance corresponding to a topic vector with high similarity and the reference document corresponding to the word vector;
A category-sample document vector correspondence table in which the topic vectors of the topic / topic vector / importance correspondence table are classified for each category and the centroid of the classified topic vectors is recorded as a sample document vector corresponding to the category; and
A similarity comparison and classification unit that, for each topic class, obtains the similarity between the topic vector and the sample document vector, obtains as the document similarity the value accumulated over the topic classes of that similarity multiplied by the importance of the topic class, and classifies the reference document into the category having the highest document similarity,
A document categorizing apparatus comprising:

The document categorizing apparatus according to claim 1,
The above importance is
A document categorizing apparatus, characterized in that it is based on a difference between an appearance probability distribution of a certain topic in the reference document and an appearance probability distribution of the topic in all documents of the sample document.

The document categorizing apparatus according to claim 1,
The above importance is
It is given by the Kullback-Leibler divergence, which measures the difference between two probability distributions, and is defined by the following equation:

A document categorizing apparatus characterized in that the importance is given by the difference between the appearance probability distribution P(i) of topic class i in the set P of documents included in a certain topic class p and the appearance probability distribution Q(i) of topic class i in all documents Q.

The document categorizing apparatus according to any one of claims 1 to 3,
The above document similarity is
A document categorizing apparatus characterized in that it is a value obtained by multiplying, for each topic, the cosine similarity between the topic vector and the sample document vector by the Kullback-Leibler divergence representing the difference in probability distribution between the topic vector and the sample document vector for that topic, and accumulating the products over all topics.

The document categorizing apparatus according to any one of claims 1 to 4,
Furthermore,
A category-topic class comparison table that records the correspondence between categories and topic classes included in the category;
With reference to the category-topic class comparison table, a category including a certain topic class is traversed to extract a category including a certain topic class less than a predetermined number,
Find the maximum category similarity of documents with topic classes included in categories other than the extracted categories,
Find the minimum category similarity of documents with topic classes included in the extracted category,
A topic class included in a category in which the minimum value is larger than the maximum value is associated with the category as a similarity calculation unnecessary topic class,
When the reference document is input and the topic class included in the reference document matches the similarity calculation unnecessary topic class, the reference document is classified into the category and output to the outside, and the similarity calculation unnecessary topic is output. A similarity calculation unnecessary topic class selection unit that outputs a reference document not including a class to the similarity comparison and classification unit;
A document categorizing apparatus comprising:

The document categorizing apparatus according to any one of claims 1 to 4,
Furthermore,
A category-topic class comparison table that records the correspondence between categories and topic classes included in the category;
When the reference document is input, the range to be referred to in the category-topic class comparison table is specified according to the topic class included in the reference document.
With reference to the category-topic class comparison table in the specific range, a category including a certain topic class is traversed to extract a category including a certain topic class less than a predetermined number,
Find the maximum category similarity of documents with topic classes included in categories other than the extracted categories,
Find the minimum category similarity of documents with topic classes included in the extracted category,
A topic class included in a category in which the minimum value is larger than the maximum value is associated with the category as a similarity calculation unnecessary topic class,
When the reference document is input and the topic class included in the reference document matches the similarity calculation unnecessary topic class, the reference document is classified into the category and output to the outside, and the similarity calculation unnecessary topic is output. A similarity calculation unnecessary topic class selection unit that outputs a reference document not including a class to the similarity comparison and classification unit;
A document categorizing apparatus comprising:

A feature amount extraction process in which an input document including a plurality of topics is used as a reference document, a document vector of the reference document is extracted, and the reference document and the document vector are output;
A topic / topic vector / importance output process in which the similarity is obtained between the document vector and a topic vector recorded in a topic / topic vector / importance correspondence table, the topic vector being the centroid of the document vectors included in a topic class obtained by cluster-classifying topics obtained from sample documents, and in which the topic class corresponding to a topic vector with high similarity, the importance of the topic recorded in the topic / topic vector / importance correspondence table and calculated from the co-occurrence relationship of topics included in all of the sample documents and the reference document, and the reference document corresponding to the word vector are output;
A similarity comparison and classification process in which, for each topic class, the similarity is obtained between the topic vector and a sample document vector recorded in a category-sample document vector correspondence table, the sample document vector being the centroid of the topic vectors of the topic / topic vector / importance correspondence table classified for each category, a value obtained by accumulating over the topic classes the similarity multiplied by the importance of the topic class is obtained as the document similarity, and the reference document is classified into the category having the highest document similarity;
A document categorizing method comprising:

A program for causing a computer to function as the document categorizing apparatus according to any one of claims 1 to 6.