JP2002108894A

JP2002108894A - Device and method for sorting document and recording medium for executing the method

Info

Publication number: JP2002108894A
Application number: JP2000293597A
Authority: JP
Inventors: Eiji Kenmochi; 栄治剣持
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 2000-09-27
Filing date: 2000-09-27
Publication date: 2002-04-12

Abstract

PROBLEM TO BE SOLVED: To provide a document sorting device that can supply information effective for analyzing partial document sets and can extract more analysis information from a document set. SOLUTION: A document analysis part 102 extracts word information from the documents input from a document input part 101, and a document sorting part 103 sorts the documents into partial document sets on the basis of the word information. A keyword extraction part 104 extracts a keyword set from each partial document set, and a related word extraction part 105 extracts a related word set for each partial document set by using a related word dictionary. A partial document set information generation part 106 generates information on each partial document set and on the relationships among the partial document sets on the basis of the related word sets, the keyword sets, and information on the documents belonging to each partial document set. A sorting result preservation part 107 then stores the sorting result of the part 103 together with the information generated by the part 106.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

TECHNICAL FIELD OF THE INVENTION

The present invention relates to a document classification device, a document classification method, and a recording medium for executing the method, and more particularly to a document classification technique applicable to information classification, information analysis, information retrieval, and the like.

[0002]

PRIOR ART

With the spread of the Internet and the like, access to large amounts of document information has become possible, and intellectual work such as classifying a large collection of document information into meaningful groups and grasping the structure of the document set has begun to be carried out. When a large document set is analyzed, the analysis can be made efficient by first classifying the document set into several topics and then performing the various tasks on each resulting partial document set (a plurality of documents gathered according to some criterion). Because manual classification of a large amount of document information by a user incurs enormous costs in labor and time, a device that can automatically classify a document set according to the content of the documents is desired.

[0003] Many inventions have been made for obtaining high-quality classification results from a huge document set. For example, the invention described in Japanese Patent Application Laid-Open No. 7-36897 uses document feature vectors whose feature quantities are the words contained in the document set to be classified, and performs classification by applying a clustering technique to those document feature vectors. That invention also suggests that the user may specify the initial centroid vectors for clustering so that the classification reflects the user's intent.

[0004] The invention described in Japanese Patent Application Laid-Open No. 11-296552 takes the polysemy and synonymy of words into account by applying singular value decomposition to the inner-product matrix between documents, thereby generating a latent semantic space based on the co-occurrence of words among documents. Documents and words are projected into this latent semantic space, and document classification is performed there using a clustering technique or the like. Various inventions have thus been proposed for obtaining high-quality classification results from a huge document set. To analyze a document set, however, merely classifying it is not enough. How to extract useful information from the generated partial document sets is also an important problem, yet few inventions address this point.

[0005] By extracting the words that make up each document using natural language processing such as morphological analysis, a document can be represented spatially as a vector of word frequencies (a document feature vector). This representation is called the document vector space model and is widely used. The invention of Japanese Patent Application Laid-Open No. 7-36897 mentioned above classifies documents by applying a clustering technique in such a document vector space.
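As a minimal sketch of the document vector space model described above, the following Python code turns already tokenized documents into word-frequency vectors. The function names and the sample tokens are illustrative assumptions, and the morphological analysis step itself is assumed to have been done elsewhere.

```python
from collections import Counter


def build_vocabulary(tokenized_docs):
    """Return the sorted list of distinct words; each word becomes one axis."""
    return sorted({word for doc in tokenized_docs for word in doc})


def to_frequency_vector(tokens, vocabulary):
    """Map one tokenized document to its word-frequency vector."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


# Hypothetical sample input: two documents that are already split into words.
docs = [
    ["document", "classification", "document"],
    ["document", "retrieval"],
]
vocab = build_vocabulary(docs)
vectors = [to_frequency_vector(doc, vocab) for doc in docs]
```

Sorting the vocabulary is only a convenience that makes the axis order reproducible; any fixed order of the words would define an equivalent space.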

[0006] When document classification, document retrieval, or similar processing is performed with statistical techniques in a document vector space, the quality of the results can be expected to change if the document vector space differs. How to generate a good document vector space is therefore an important problem for obtaining high-quality processing results.

[0007] As described above, each axis of a document vector space is normally built from the words extracted by applying morphological analysis to the document data to be classified. For this reason, inventions typified by Japanese Patent Application Laid-Open Nos. 11-110408 and 11-259487 apply morphological analysis to search query terms and to the documents to be searched, generate compound words under appropriate conditions from the extracted words, and also use the information of these compound words in generating the document vector space, with the aim of improving the accuracy of document retrieval performed in that space. Accordingly, when document classification is performed in a document vector space, generating the space with compound words taken into account can likewise be expected to yield high-quality classification results.

[0008] When compound words are considered, including in the prior applications above, the words targeted are usually those whose part of speech is a noun or something similar. However, appropriately combining not only nouns but also other combinable parts of speech should make it possible to construct a higher-quality document vector space. Specifically, for words whose part of speech is a prefix, a suffix, a classifier (counter word), or something similar, which the prior inventions have seldom handled, the idea is to replace each such word with a word generated by combining it with the preceding or following word according to an appropriate criterion, and also to replace its part of speech with an appropriate one.

[0009] For example, suppose that applying morphological analysis to the character string "イタリア製の車" ("Italian-made car") yields "Italy (イタリア) [common noun], made (製) [prefix], of (の) [case particle], car (車)". Focusing on the word "made", which is a prefix, it is combined with the common noun "Italy" extracted immediately before it to generate the word "Italian-made" (イタリア製). This new word is given the part of speech of a common noun and replaces the word "made". Now consider generating a vector space from this character string together with the character strings "イタリアの特色" ("features of Italy") and "イタリア製の皿" ("Italian-made plate").

[0010] Consider generating the space from nouns only. If the combination and replacement processing is not performed, the words constituting the vector space are "Italy, car, feature, plate", and, taking the occurrence frequency of each word as a coordinate value, the character strings become (1, 0, 1, 0), (1, 0, 0, 0), and (1, 0, 0, 1). If the mutual similarity of the three character strings is then computed as the inner product between vectors, every pair has the same similarity. If, on the other hand, the space is generated from nouns after the combination and replacement processing, the words constituting the vector space are "Italy, Italian-made, car, feature, plate", and the vectors of the three documents in this space are (1, 1, 0, 1, 0), (1, 0, 0, 0, 0), and (1, 1, 0, 0, 1). In this case the similarities differ: the first and last character strings are more similar to each other than either is to the second. In other words, the combination and replacement processing adds to the vector space a feature dimension that measures a more specific meaning, which should also improve the quality of document classification and other processing performed in this vector space.
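The effect described above can be checked with a short Python sketch. It rebuilds the frequency vectors directly from assumed English token lists for the three character strings, so the coordinate order need not match the tuples printed in the text; only the pattern of pairwise inner products is being illustrated. All names are hypothetical.

```python
def frequency_vectors(tokenized_docs, vocabulary):
    """Build one word-frequency vector per document over the given vocabulary."""
    return [[doc.count(word) for word in vocabulary] for doc in tokenized_docs]


def inner_product(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def pairwise_similarities(vectors):
    """Inner product for every pair of documents, keyed by (i, j) with i < j."""
    return {
        (i, j): inner_product(vectors[i], vectors[j])
        for i in range(len(vectors))
        for j in range(i + 1, len(vectors))
    }


# Nouns only, without the combination and replacement processing.
nouns_only = [["Italy", "car"], ["Italy", "feature"], ["Italy", "plate"]]
vocab_before = ["Italy", "car", "feature", "plate"]

# Nouns after the processing: "Italian-made" appears as an extra noun.
combined = [
    ["Italy", "Italian-made", "car"],
    ["Italy", "feature"],
    ["Italy", "Italian-made", "plate"],
]
vocab_after = ["Italy", "Italian-made", "car", "feature", "plate"]

before = pairwise_similarities(frequency_vectors(nouns_only, vocab_before))
after = pairwise_similarities(frequency_vectors(combined, vocab_after))
```

In `before` every pair scores the same, while in `after` the pair made up of the first and third strings scores higher than the pairs involving the second string.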

[0011] Similarly, suppose that applying morphological analysis to the character string "２０００年の目標" ("goal for the year 2000") yields "2000 [numeral], year (年) [classifier], of (の) [case particle], goal (目標) [common noun]". Focusing on the classifier "year", it is combined with the numeral "2000" extracted immediately before it to generate the word "year 2000" (２０００年). This word is given the part of speech of a common noun and replaces the word "year", and the numeral "2000" is deleted. As a result, the document vector space can be expected to be built on "year 2000", a word that is semantically more specific and therefore more important as a variable, instead of on the words "year" and "2000", which carry only a very vague meaning.
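The two worked examples above can be sketched in Python as follows. This is only an illustration under stated assumptions: tokens arrive as (word, part of speech) pairs from a morphological analyzer, the part-of-speech labels are written in English, only combination with the immediately preceding word is handled, and `combine_tokens` and `COMBINING_POS` are hypothetical names.

```python
# Parts of speech that trigger combination (assumed English labels).
COMBINING_POS = {"prefix", "suffix", "classifier"}


def combine_tokens(tokens):
    """Merge affix-like words with the preceding word.

    tokens: list of (word, part_of_speech) pairs in document order.
    The merged word is given the part of speech "common noun".
    """
    result = []
    for word, pos in tokens:
        if pos in COMBINING_POS and result:
            prev_word, prev_pos = result[-1]
            merged = (prev_word + word, "common noun")
            if pos == "classifier" and prev_pos == "numeral":
                # The numeral is deleted: the merged word takes its place.
                result[-1] = merged
            else:
                # The preceding word is kept; the affix itself is replaced.
                result.append(merged)
        else:
            result.append((word, pos))
    return result


italian_car = combine_tokens([
    ("イタリア", "common noun"), ("製", "prefix"),
    ("の", "case particle"), ("車", "common noun"),
])
year_goal = combine_tokens([
    ("２０００", "numeral"), ("年", "classifier"),
    ("の", "case particle"), ("目標", "common noun"),
])
```

Keeping "イタリア" while dropping "２０００" mirrors the two examples in the text, where the noun stays available as an axis of its own but the bare numeral does not.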

[0012] As noted above, the spread of the Internet and the like has made it possible to access large amounts of document data, so that document data describing information of interest can now be collected easily and in large quantities. At the same time, precisely because the collected document data is so large, reading useful information out of it has become very difficult. For this reason, research and development on document retrieval and automatic document classification is being actively pursued, with the aim of extracting useful information from large amounts of document data easily, either automatically or semi-automatically. In particular, if each of the generated partial document data sets is regarded as representing one of the topics contained in the document data, document classification is a very effective technique for grasping the structure of the document data as a whole.

[0013] A representative technique developed for this purpose is the Scatter/Gather method (D. Cutting et al., Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections, Proc. ACM SIGIR '92). In the Scatter/Gather method, the topic of a document data set is expressed by representative documents and representative words, and clustering is applied successively to document sets whose topic is unclear, dividing them into multiple partial document data sets, so that the user comes to understand the various topics contained in the document set. To understand the structure of a document set, it is of course necessary to understand each partial document set it contains, but information about the relationships between partial document sets is also considered necessary. The Scatter/Gather method, however, presents only information about individual partial document sets, so grasping the structure of a document set with the Scatter/Gather method alone is considered difficult.

[0014] In general, document classification techniques require the number of partial document sets to be generated to be given before execution, but predicting the optimal number is extremely difficult. Moreover, if the number of partial document sets differs, the structure of the generated partial document sets also changes. Consequently, to obtain the necessary information, document classification must be repeated while varying the number of partial document sets to be generated. The Scatter/Gather method offers one solution on this point as well: clustering is applied successively only to those partial document sets whose structure the user wants to know in more detail, new partial document sets are generated, and analyzing them in detail yields the desired information. This activity is also thought to make the structure of the whole document set easier to understand.

[0015] That is, what the user wants to do is grasp the structure of the document set, and the act of generating partial document sets is not something the user inherently needs to perform. If various numbers of partial document sets are generated from the document set in advance and the relationships among the many generated partial document sets are calculated beforehand, the user can concentrate on grasping the structure from the outset. As noted above, however, the Scatter/Gather method does not take the relationships between partial document sets into account.

[0016]

PROBLEMS TO BE SOLVED BY THE INVENTION

The inventions of claims 1 to 4, 14 to 17, and 27 to 30 aim to provide information useful for analyzing partial document sets by performing document classification and, in addition, generating information on each generated partial document set and on the relationships among them on the basis of related-word information for words. A further aim is to provide a document classification device that can extract more analysis information from a document set by focusing on antonyms as related words: when none of the generated partial document sets contains the antonyms of the representative word set of a partial document set, a new partial document set containing those antonyms is generated.

[0017] Accordingly, the inventions of claims 1, 14, and 27 aim to provide a document classification device, method, or recording medium that provides information useful for analyzing partial document sets by extracting a representative word set for each generated partial document set, obtaining related words for each of those representative words, and generating from this information the information on each partial document set and on the relationships between partial document sets.

[0018] The inventions of claims 2, 15, and 28 aim to provide a document classification device, method, or recording medium that mainly provides information on similarity by using, as related words, a combination of at least one of synonyms, near-synonyms, and antonyms.

[0019] The inventions of claims 3, 4, 16, 17, 29, and 30 aim to provide a document classification device, method, or recording medium that extracts more analysis information from a document set by using antonyms as the related words of the representative word set of each partial document set and, when an antonym does not match the representative word set of any partial document set, including the set's own, extracting the documents containing that antonym from the document set and making them a new partial document set.

[0020] The inventions of claims 5, 18, and 31 aim to provide a document classification device, method, or recording medium that applies morphological analysis to the documents to be classified and classifies them into several document sets on the basis of the analysis results. Each word having a designated part of speech among the words obtained by morphological analysis is replaced with a word formed by appropriately combining it with the preceding or following words, and its part of speech is likewise replaced with an appropriate one, so that a high-quality document vector space is constructed. Performing document classification with statistical processing in this document vector space yields high-quality classification results.

[0021] The inventions of claims 6, 19, and 32 aim to provide a document classification device, method, or recording medium that can easily obtain high-quality document classification results by using a clustering technique as the statistical technique for document classification.

[0022] The inventions of claims 7, 20, and 33 aim to provide a document classification device, method, or recording medium that can obtain a high-quality document vector space by applying appropriate combination processing to those words, among the words extracted by applying morphological analysis to the documents to be classified, whose part of speech is a prefix, a suffix, a classifier, or something similar.

[0023] The inventions of claims 8, 21, and 34 aim to provide a document classification device, method, or recording medium that can obtain a high-quality document vector space by generating a new word in the word combination processing, where words continue to be combined until a word of a specific part of speech appears.

[0024] The inventions of claims 9, 22, and 35 aim to provide a document classification device, method, or recording medium that can obtain a high-quality document vector space by, in the word combination processing for a word whose part of speech is a numeral suffix or a classifier, deleting the plurality of words that are combined with it and not using the information of those words when generating the document vector space.

[0025] The inventions of claims 10 to 13, 23 to 26, and 36 to 39 aim to provide information that lets the user concentrate from the outset on grasping the structure of the document set, by generating various numbers of partial document sets from the document set in advance and calculating the relationships among the many generated partial document sets.

[0026] Accordingly, the inventions of claims 10, 23, and 36 aim to provide a document classification device, method, or recording medium that generates information capable of supporting the grasp of the structure of a document set. Using the vector space model of documents, the document classification processing is repeated with the number of partial document sets to be generated as a parameter, so that a large number of partial document sets is generated, and the mutual relationships among the many generated document sets are then calculated.

[0027] The inventions of claims 11, 24, and 37 aim to provide a document classification device, method, or recording medium that easily generates a large number of partial document sets by using a non-hierarchical clustering technique as the statistical technique for document classification.
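One way to picture repeated classification with the number of partial document sets as a parameter is the sketch below. It uses a small k-means routine as a stand-in for the non-hierarchical clustering technique; the source does not name a specific algorithm here, so the choice of k-means, the deterministic initialization from the first k vectors, and all function names are assumptions.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def k_means(vectors, k, iterations=20):
    """Tiny k-means; the first k vectors serve as the initial centroids."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assign every vector to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Move each centroid to the mean of its current members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def classify_for_each_k(vectors, k_values):
    """Repeat the classification once per requested number of clusters."""
    return {k: k_means(vectors, k) for k in k_values}


# Hypothetical document vectors forming two obvious groups.
sample_vectors = [[0, 0], [0, 1], [10, 10], [10, 11]]
results = classify_for_each_k(sample_vectors, [1, 2])
```

Each entry of `results` maps a cluster count to the cluster label of every document, which is the raw material from which relationships between the generated partial document sets could later be computed.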

[0028] The inventions of claims 12, 25, and 38 aim to provide a document classification device, method, or recording medium that provides information from which the structure of a document set can easily be grasped, by calculating similarity relations and inclusion relations as the mutual relationships among the many generated document sets.

[0029] The inventions of claims 13, 26, and 39 aim to provide a document classification device, method, or recording medium that calculates relationship information with high versatility and reusability by calculating the mutual relationships using only the word-related information among the information held by the many generated document sets.
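The passage above does not define how the similarity and inclusion relations are measured, so the following Python sketch uses two common set-based measures purely as assumed stand-ins: the Jaccard coefficient for similarity and a containment ratio for inclusion, both computed only from the words associated with each partial document set. The function names and sample word sets are hypothetical.

```python
def similarity(words_a, words_b):
    """Jaccard coefficient of two word sets (assumed similarity measure)."""
    a, b = set(words_a), set(words_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def inclusion(words_a, words_b):
    """Fraction of the words of A that also occur in B (assumed measure).

    A value of 1.0 means the words of A are entirely contained in B.
    """
    a, b = set(words_a), set(words_b)
    return len(a & b) / len(a) if a else 0.0


# Hypothetical word sets of two partial document sets.
set_x = {"engine", "car", "tire"}
set_y = {"car", "engine"}

sim_xy = similarity(set_x, set_y)
y_in_x = inclusion(set_y, set_x)
x_in_y = inclusion(set_x, set_y)
```

Because both functions take only word sets as input, the resulting relationship values do not depend on how the partial document sets were produced, which is in the spirit of the reusability aim stated above.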

[0030]

MEANS FOR SOLVING THE PROBLEMS

The invention of claim 1 is a document classification device that classifies a document set according to its content, comprising: a document input unit that inputs a plurality of documents; a document analysis unit that extracts, from each document input by the document input unit, the word information constituting that document; a document classification unit that classifies the document set formed by the plurality of documents into several partial document sets on the basis of the word information of each document extracted by the document analysis unit; a representative word extraction unit that extracts a representative word set from each partial document set classified by the document classification unit; a related word extraction unit that, using a related word dictionary in which related words are described for arbitrary words, extracts a related word set for each of the representative word sets of the partial document sets extracted by the representative word extraction unit; a partial document set information generation unit that generates information on the individual partial document sets and on the relationships between partial document sets on the basis of the related word sets extracted by the related word extraction unit, the representative word sets extracted by the representative word extraction unit, and information on the documents belonging to each partial document set; and a classification result storage unit that stores the classification result of the document classification unit together with the information generated by the partial document set information generation unit.

[0031] The invention of claim 2 is characterized in that, in the invention of claim 1, the related word set extracted by the related word extraction unit is a combination of at least one of synonyms, near-synonyms, and antonyms.

[0032] The invention of claim 3 is characterized in that, in the invention of claim 1, the related word set extracted by the related word extraction unit includes at least antonyms, and the device further comprises an antonymous partial document set generation unit. When the antonym set extracted from the representative word set of a partial document set does not match the representative word set of any partial document set, including that set itself, this unit extracts from the document set the documents containing the unmatched antonym set to generate a new partial document set, and repeats this processing recursively for all partial document sets.
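A single pass of the antonym-based processing described above might look like the Python sketch below. Several points are assumptions rather than the claimed method: matching is done word by word against the union of all representative word sets, a document qualifies if it contains at least one unmatched antonym, the recursive repetition is omitted, and the dictionary, data, and function names are hypothetical.

```python
def find_unmatched_antonyms(representative_sets, antonym_dict):
    """Collect antonyms of representative words that no representative set contains."""
    all_representatives = set().union(*representative_sets)
    unmatched = set()
    for rep_set in representative_sets:
        for word in rep_set:
            for antonym in antonym_dict.get(word, ()):
                if antonym not in all_representatives:
                    unmatched.add(antonym)
    return unmatched


def extract_antonym_subset(documents, antonyms):
    """Return the sorted ids of documents containing at least one given antonym.

    documents: dict mapping a document id to its list of words.
    """
    return sorted(
        doc_id for doc_id, words in documents.items() if antonyms & set(words)
    )


# Hypothetical representative word sets of two partial document sets.
representative_sets = [{"increase", "sales"}, {"expensive", "cheap"}]
# Hypothetical related word dictionary restricted to antonyms.
antonym_dict = {"increase": ["decrease"], "expensive": ["cheap"]}
documents = {
    1: ["sales", "increase"],
    2: ["sales", "decrease"],
    3: ["cheap", "car"],
}

unmatched = find_unmatched_antonyms(representative_sets, antonym_dict)
new_partial_set = extract_antonym_subset(documents, unmatched)
```

Here "cheap" is already a representative word, so only "decrease" is unmatched, and the documents containing it form the new partial document set.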

[0033] The invention of claim 4 is characterized in that, in the invention of claim 1, the related words extracted by the related word extraction unit include at least antonyms, and the device further comprises an antonymous partial document set generation unit. When the antonym set extracted from the representative word set of a partial document set does not match the representative word set of any partial document set, including that set itself, this unit extracts from the document set the documents containing both the unmatched antonym set and the word set obtained by removing from the representative word set the representative words corresponding to the antonym set, generates a new partial document set from them, and repeats this processing recursively for all partial document sets.

[0034] The invention of claim 5 is a document classification device that classifies documents according to their content, comprising: a document input unit that inputs document data; a document analysis unit that applies morphological analysis to the document data and extracts the words constituting the document data together with their part-of-speech information and the like; a document vector space generation unit that generates, from the analysis information of the document data extracted by the document analysis unit, a document vector space for representing the document data in a multidimensional vector space; and a document classification unit that classifies the document data by using a statistical technique in the document vector space generated by the document vector space generation unit, wherein a word having a specific part of speech extracted by the document analysis unit is replaced, on the basis of the part-of-speech information of that specific part of speech, with a word generated by combining it with one or more words extracted before or after it, and the part-of-speech information of that word is also replaced appropriately.

【００３５】According to a sixth aspect of the present invention, in the fifth aspect, the document classification unit classifies the document data using a clustering method as the statistical method.

【００３６】According to a seventh aspect of the present invention, in the fifth or sixth aspect, the document analysis unit replaces the word and its part of speech for words whose part of speech is a prefix, a suffix, a counter (classifier), or a similar part of speech.

【００３７】According to an eighth aspect of the present invention, in any one of the fifth to seventh aspects, the document analysis unit continues combining words until a word of a specific part of speech appears.

【００３８】According to a ninth aspect of the present invention, in any one of the fifth to eighth aspects, for a word whose part of speech is a numeral suffix or a counter, the document analysis unit deletes the words that are combined with that numeral suffix or counter, and the document classification unit does not use information on the deleted words.
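As a rough illustration of the ninth aspect, the sketch below drops numeric expressions before classification. It assumes that the words combined with a counter are the run of numeral tokens immediately before it, and that the counter itself is dropped as well. Both points, along with the tag names, are assumptions made for illustration.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (surface form, part-of-speech tag)

# Hypothetical tag names for this sketch.
NUMERAL, COUNTER = "numeral", "counter"


def drop_counted_numbers(tokens: List[Token]) -> List[Token]:
    """Remove each counter together with the numerals directly preceding it."""
    kept: List[Token] = []
    for surface, pos in tokens:
        if pos == COUNTER:
            # Discard the numerals that would have been merged with the counter.
            while kept and kept[-1][1] == NUMERAL:
                kept.pop()
            continue  # the counter itself is not kept either (assumption)
        kept.append((surface, pos))
    return kept
```

Expressions such as 100件 rarely help separate topics, which is presumably why the classification unit ignores them.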

【００３９】According to a tenth aspect of the present invention, there is provided a document classification apparatus that classifies a document data set according to the content of the documents, comprising: a document input unit that inputs the document data set; a document analysis unit that applies morphological analysis to all of the document data and extracts the words constituting the document data together with their part-of-speech information and the like; a document analysis result storage unit that stores the analysis results of the document data extracted by the document analysis unit; a document vector space generation unit that generates, from the analysis information of the document data extracted by the document analysis unit, a vector space for representing the document data in a multidimensional vector space; a document vector data storage unit that stores the vector data of each piece of document data in the document vector space generated by the document vector space generation unit; a classification number determination unit that determines the number of classes for the document data set from specified conditions; a document classification unit that classifies the document data into the specified number of partial document sets by applying a statistical method in the document vector space generated by the document vector space generation unit; a classification result storage unit that stores the classification results generated by the document classification unit; a repetition determination unit that determines whether to repeat the processing from the classification number determination unit through the classification result storage unit; a partial document set relation calculation unit that calculates relation information among all of the generated partial document sets, using the information stored in the document vector data storage unit and the classification result storage unit; and a partial document set relation storage unit that stores the relation information between partial document sets generated by the partial document set relation calculation unit.

【００４０】According to an eleventh aspect of the present invention, in the tenth aspect, the statistical method used by the document classification unit is a non-hierarchical clustering method.
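The loop formed by the classification number determination unit, the document classification unit, and the repetition determination unit in the tenth and eleventh aspects might look like the following sketch. The publication only specifies a non-hierarchical clustering method; k-means is substituted here as one common example. Seeding the centroids with the first k vectors, the fixed iteration count, and the function names are all simplifying assumptions.

```python
from typing import Dict, List, Sequence

Vector = List[float]


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors: List[Vector], k: int, iterations: int = 20) -> List[int]:
    """Plain k-means; the first k vectors seed the centroids (a simplification)."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each document goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def classify_repeatedly(
    vectors: List[Vector], cluster_counts: Sequence[int]
) -> Dict[int, List[int]]:
    """Run the classification once per requested number of classes and
    keep every result, mirroring the repeat-and-store loop."""
    return {k: kmeans(vectors, k) for k in cluster_counts}
```

Keeping the result for every number of classes is what later makes it possible to compare partial document sets produced at different granularities.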

【００４１】According to a twelfth aspect of the present invention, in the tenth or eleventh aspect, the relations calculated by the partial document set relation calculation unit are a similarity relation and an inclusion relation.

【００４２】According to a thirteenth aspect of the present invention, in the twelfth aspect, the relations between partial document sets are calculated using only the word information extracted from each partial document set.
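A similarity relation and an inclusion relation computed from word information alone could be sketched as below. The claims in this passage do not define the measures, so the Jaccard coefficient and a one-sided containment ratio are used purely as stand-ins, and both function names are hypothetical.

```python
from typing import Set


def similarity(words_a: Set[str], words_b: Set[str]) -> float:
    """Jaccard coefficient of two word sets (stand-in similarity measure)."""
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def inclusion(words_a: Set[str], words_b: Set[str]) -> float:
    """Fraction of A's words that also occur in B.

    A value of 1.0 means the vocabulary of A is entirely contained in B.
    """
    return len(words_a & words_b) / len(words_a) if words_a else 0.0
```

Unlike similarity, inclusion is asymmetric, so it can suggest that one partial document set is a narrower topic within another.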

【００４３】According to a fourteenth aspect of the present invention, there is provided a document classification method for classifying a document set according to its content, comprising: a document input step of inputting a plurality of documents; a document analysis step of extracting, from each document input in the document input step, the word information constituting that document; a document classification step of classifying the document set formed by the plurality of documents into several partial document sets based on the word information of each document extracted in the document analysis step; a representative word extraction step of extracting a representative word set from each partial document set classified in the document classification step; a related word extraction step of extracting a related word set for each of the representative word sets extracted in the representative word extraction step, using a related word dictionary in which the related words of arbitrary words are described; a partial document set information generation step of generating information on the individual partial document sets and relation information between partial document sets, based on the related word sets extracted in the related word extraction step, the representative word sets extracted in the representative word extraction step, and information on the documents belonging to each partial document set; and a classification result saving step of saving the classification result of the document classification step together with the information generated by the partial document set information generation unit.

【００４４】According to a fifteenth aspect of the present invention, in the fourteenth aspect, the related word set extracted in the related word extraction step is a combination of at least one of synonyms, near-synonyms, and antonyms.

【００４５】According to a sixteenth aspect of the present invention, in the fourteenth aspect, the related word set extracted in the related word extraction step includes at least antonyms, and the method further comprises an antonymous partial document set generation step. In this step, when the antonym set extracted from the representative word set of a partial document set matches the representative word set of no partial document set, itself included, the documents containing the unmatched antonym set are extracted from the document set to generate a new partial document set, and this processing is repeated recursively over all partial document sets.
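One possible reading of the antonymous partial document set generation step is sketched below. It assumes that an antonym set "matches" a representative word set when it is a subset of it, that each word has at most one antonym in the dictionary, and that documents are given as sets of words. These interpretations and all names are hypothetical.

```python
from typing import Dict, FrozenSet, List, Set, Tuple


def antonym_partial_sets(
    rep_sets: List[FrozenSet[str]],
    antonyms: Dict[str, str],
    documents: Dict[int, Set[str]],
) -> List[Tuple[FrozenSet[str], List[int]]]:
    """Generate new partial document sets from unmatched antonym sets.

    rep_sets:  representative word set of each existing partial document set
    antonyms:  toy antonym dictionary, word -> antonym
    documents: document id -> set of words in that document
    """
    known = list(rep_sets)   # every representative set seen so far
    queue = list(rep_sets)   # sets still to be examined
    new_sets: List[Tuple[FrozenSet[str], List[int]]] = []
    while queue:
        reps = queue.pop(0)
        opposite = frozenset(antonyms[w] for w in reps if w in antonyms)
        # Skip if there are no antonyms or some known set already covers them.
        if not opposite or any(opposite <= other for other in known):
            continue
        members = sorted(d for d, words in documents.items() if opposite <= words)
        if members:
            new_sets.append((opposite, members))
            known.append(opposite)
            queue.append(opposite)  # the new set is examined in turn
    return new_sets
```

Queueing each newly created set is how this sketch approximates the recursive repetition over all partial document sets. The loop terminates because a given antonym set can be added only once.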

【００４６】According to a seventeenth aspect of the present invention, in the fourteenth aspect, the related words extracted in the related word extraction step include at least antonyms, and the method further comprises an antonymous partial document set generation step. In this step, when the antonym set extracted from the representative word set of a partial document set matches the representative word set of no partial document set, itself included, the documents that contain both the unmatched antonym set and the word set obtained by removing from the representative word set the representative words corresponding to that antonym set are extracted from the document set to generate a new partial document set, and this processing is repeated recursively over all partial document sets.

【００４７】According to an eighteenth aspect of the present invention, there is provided a document classification method for classifying documents according to their content, comprising: a document input step of inputting document data; a document analysis step of applying morphological analysis to the document data and extracting the words constituting the document data together with their part-of-speech information and the like; a document vector space generation step of generating, from the analysis information of the document data extracted in the document analysis step, a document vector space for representing the document data in a multidimensional vector space; and a document classification step of classifying the document data by applying a statistical method in the document vector space generated in the document vector space generation step. A word having a specific part of speech that is extracted in the document analysis step is replaced, based on the part-of-speech information of that specific part of speech, with a word generated by combining it with one or more words extracted before or after it, and the part-of-speech information of the specific part of speech is also replaced appropriately.

【００４８】According to a nineteenth aspect of the present invention, in the eighteenth aspect, the document data is classified in the document classification step using a clustering method as the statistical method.

【００４９】According to a twentieth aspect of the present invention, in the eighteenth or nineteenth aspect, the word and its part of speech are replaced in the document analysis step for words whose part of speech is a prefix, a suffix, a counter (classifier), or a similar part of speech.

【００５０】According to a twenty-first aspect of the present invention, in any one of the eighteenth to twentieth aspects, the combining of words is continued in the document analysis step until a word of a specific part of speech appears.

【００５１】According to a twenty-second aspect of the present invention, in any one of the eighteenth to twenty-first aspects, for a word whose part of speech is a numeral suffix or a counter, the words that are combined with that numeral suffix or counter are deleted in the document analysis step, and information on the deleted words is not used in the document classification step.

【００５２】According to a twenty-third aspect of the present invention, there is provided a document classification method for classifying a document data set according to the content of the documents, comprising: a document input step of inputting the document data set; a document analysis step of applying morphological analysis to all of the document data and extracting the words constituting the document data together with their part-of-speech information and the like; a document analysis result storage step of storing the analysis results of the document data extracted in the document analysis step; a document vector space generation step of generating, from the analysis information of the document data extracted in the document analysis step, a vector space for representing the document data in a multidimensional vector space; a document vector data storage step of storing the vector data of each piece of document data in the document vector space generated in the document vector space generation step; a classification number determination step of determining the number of classes for the document data set from specified conditions; a document classification step of classifying the document data into the specified number of partial document sets by applying a statistical method in the document vector space generated in the document vector space generation step; a classification result storage step of storing the classification results generated in the document classification step; a repetition determination step of determining whether to repeat the processing from the classification number determination step through the classification result storage step; a partial document set relation calculation step of calculating relation information among all of the generated partial document sets, using the information stored in the document vector data storage step and the classification result storage step; and a partial document set relation storage step of storing the relation information between partial document sets generated in the partial document set relation calculation step.

【００５３】According to a twenty-fourth aspect of the present invention, in the twenty-third aspect, the statistical method used in the document classification step is a non-hierarchical clustering method.

【００５４】According to a twenty-fifth aspect of the present invention, in the twenty-third or twenty-fourth aspect, the relations calculated in the partial document set relation calculation step are a similarity relation and an inclusion relation.

【００５５】According to a twenty-sixth aspect of the present invention, in the twenty-fifth aspect, the relations between partial document sets are calculated using only the word information extracted from each partial document set.

【００５６】According to a twenty-seventh aspect of the present invention, there is provided a computer-readable recording medium storing a program for executing a document classification method that classifies a document set according to its content, the method comprising: a document input step of inputting a plurality of documents; a document analysis step of extracting, from each document input in the document input step, the word information constituting that document; a document classification step of classifying the document set formed by the plurality of documents into several partial document sets based on the word information of each document extracted in the document analysis step; a representative word extraction step of extracting a representative word set from each partial document set classified in the document classification step; a related word extraction step of extracting a related word set for each of the representative word sets extracted in the representative word extraction step, using a related word dictionary in which the related words of arbitrary words are described; a partial document set information generation step of generating information on the individual partial document sets and relation information between partial document sets, based on the related word sets extracted in the related word extraction step, the representative word sets extracted in the representative word extraction step, and information on the documents belonging to each partial document set; and a classification result saving step of saving the classification result of the document classification step together with the information generated by the partial document set information generation unit.

【００５７】According to a twenty-eighth aspect of the present invention, in the computer-readable recording medium storing a program for executing the document classification method according to the twenty-seventh aspect, the related word set extracted in the related word extraction step is a combination of at least one of synonyms, near-synonyms, and antonyms.

【００５８】According to a twenty-ninth aspect of the present invention, in the computer-readable recording medium storing a program for executing the document classification method according to the twenty-seventh aspect, the related word set extracted in the related word extraction step includes at least antonyms, and the method further comprises an antonymous partial document set generation step. In this step, when the antonym set extracted from the representative word set of a partial document set matches the representative word set of no partial document set, itself included, the documents containing the unmatched antonym set are extracted from the document set to generate a new partial document set, and this processing is repeated recursively over all partial document sets.

【００５９】According to a thirtieth aspect of the present invention, in the computer-readable recording medium storing a program for executing the document classification method according to the twenty-seventh aspect, the related words extracted in the related word extraction step include at least antonyms, and the method further comprises an antonymous partial document set generation step. In this step, when the antonym set extracted from the representative word set of a partial document set matches the representative word set of no partial document set, itself included, the documents that contain both the unmatched antonym set and the word set obtained by removing from the representative word set the representative words corresponding to that antonym set are extracted from the document set to generate a new partial document set, and this processing is repeated recursively over all partial document sets.

【００６０】According to a thirty-first aspect of the present invention, there is provided a computer-readable recording medium storing a program for executing a document classification method that classifies documents according to their content, the method comprising: a document input step of inputting document data; a document analysis step of applying morphological analysis to the document data and extracting the words constituting the document data together with their part-of-speech information and the like; a document vector space generation step of generating, from the analysis information of the document data extracted in the document analysis step, a document vector space for representing the document data in a multidimensional vector space; and a document classification step of classifying the document data by applying a statistical method in the document vector space generated in the document vector space generation step. A word having a specific part of speech that is extracted in the document analysis step is replaced, based on the part-of-speech information of that specific part of speech, with a word generated by combining it with one or more words extracted before or after it, and the part-of-speech information of the specific part of speech is also replaced appropriately.

【００６１】According to a thirty-second aspect of the present invention, in the computer-readable recording medium storing a program for executing the document classification method according to the thirty-first aspect, the document data is classified in the document classification step using a clustering method as the statistical method.

【００６２】According to a thirty-third aspect of the present invention, in the computer-readable recording medium storing a program for executing the document classification method according to the thirty-first or thirty-second aspect, the word and its part of speech are replaced in the document analysis step for words whose part of speech is a prefix, a suffix, a counter (classifier), or a similar part of speech.

【００６３】According to a thirty-fourth aspect of the present invention, in the computer-readable recording medium storing a program for executing the document classification method according to any one of the thirty-first to thirty-third aspects, the combining of words is continued in the document analysis step until a word of a specific part of speech appears.

【００６４】According to a thirty-fifth aspect of the present invention, in the computer-readable recording medium storing a program for executing the document classification method according to any one of the thirty-first to thirty-fourth aspects, for a word whose part of speech is a numeral suffix or a counter, the words that are combined with that numeral suffix or counter are deleted in the document analysis step, and information on the deleted words is not used in the document classification step.

【００６５】According to a thirty-sixth aspect of the present invention, there is provided a computer-readable recording medium storing a program for executing a document classification method that classifies a document data set according to the content of the documents, the method comprising: a document input step of inputting the document data set; a document analysis step of applying morphological analysis to all of the document data and extracting the words constituting the document data together with their part-of-speech information and the like; a document analysis result storage step of storing the analysis results of the document data extracted in the document analysis step; a document vector space generation step of generating, from the analysis information of the document data extracted in the document analysis step, a vector space for representing the document data in a multidimensional vector space; a document vector data storage step of storing the vector data of each piece of document data in the document vector space generated in the document vector space generation step; a classification number determination step of determining the number of classes for the document data set from specified conditions; a document classification step of classifying the document data into the specified number of partial document sets by applying a statistical method in the document vector space generated in the document vector space generation step; a classification result storage step of storing the classification results generated in the document classification step; a repetition determination step of determining whether to repeat the processing from the classification number determination step through the classification result storage step; a partial document set relation calculation step of calculating relation information among all of the generated partial document sets, using the information stored in the document vector data storage step and the classification result storage step; and a partial document set relation storage step of storing the relation information between partial document sets generated in the partial document set relation calculation step.

【００６６】According to a thirty-seventh aspect of the present invention, in the computer-readable recording medium storing a program for executing the document classification method according to the thirty-sixth aspect, the statistical method used in the document classification step is a non-hierarchical clustering method.

【００６７】According to a thirty-eighth aspect of the present invention, in the computer-readable recording medium storing a program for executing the document classification method according to the thirty-sixth or thirty-seventh aspect, the relations calculated in the partial document set relation calculation step are a similarity relation and an inclusion relation.

【００６８】According to a thirty-ninth aspect of the present invention, in the computer-readable recording medium storing a program for executing the document classification method according to the thirty-eighth aspect, the relations between partial document sets are calculated using only the word information extracted from each partial document set.

【００６９】[0069]

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS In the description of the embodiments of the present invention, a collection of one or more sentences written in natural language is called a document when it is the object of classification. Each document is assumed to end with a document end symbol by which its end can be identified. For example, a published patent application or a particular newspaper article is a document, and a claim or a single sentence extracted from one of them is also regarded as a document.

【００７０】図１は本発明の請求項１，２，１４，１
５，２７及び２８の発明に対応する実施例を説明するた
めの文書分類装置のブロック構成図である。文書入力部
１０１は、キーボード、ＯＣＲ装置、ハードディスク等
の補助記憶装置等の入力手段が文書分類装置１００に直
接に、または、ネットワーク経由で接続され、このよう
な入力手段から文書や文書群を獲得し、文書データを入
力するインターフェースである。図２は、文書データを
入力する処理の一例を示すフローチャートである。FIG. 1 is a block diagram of a document classification device for describing an embodiment corresponding to claims 1, 2, 14, 15, 27 and 28 of the present invention. The document input unit 101 is an interface to which input means such as a keyboard, an OCR device, or an auxiliary storage device such as a hard disk are connected, either directly to the document classification device 100 or via a network; it acquires a document or a group of documents from such input means and inputs the document data. FIG. 2 is a flowchart illustrating an example of the process of inputting document data.

【００７１】図１における文書解析部１０２では、入力
された文書それぞれに対し、自然言語解析を行い、単語
やその品詞などを抽出する。さらに、文書内での単語の
出現順序や、文書の作成者や作成日などの文書のメタ情
報なども含めることができる。その後、文書群で出現し
た単語に対しユニークな単語ＩＤを付与し、文書内での
単語出現回数を計数する。一例として、文書に対し形態
素解析を適用することで、文書内の単語表記と品詞を抽
出し、その結果をもとに文書群で出現したユニークな単
語の表記、品詞、識別番号を抽出し、また各文書を抽出
されたユニークな単語識別番号とその頻度で表現する例
を示すこととし、そのフローチャートを図３に示す。The document analysis unit 102 in FIG. 1 performs a natural language analysis on each of the input documents, and extracts words and their parts of speech. Furthermore, the order in which words appear in the document, and meta information of the document such as the creator and date of creation of the document can also be included. After that, a unique word ID is assigned to the word that has appeared in the document group, and the number of occurrences of the word in the document is counted. As an example, by applying morphological analysis to a document, word expressions and parts of speech in the document are extracted, and based on the results, unique word expressions, parts of speech, and identification numbers that appear in the document group are extracted. Further, an example in which each document is represented by the extracted unique word identification number and its frequency is shown, and a flowchart thereof is shown in FIG.
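The indexing step described in this paragraph can be sketched as follows. This is a minimal illustration, not the flowchart of FIG. 3: it assumes the morphological analyzer has already produced (notation, part of speech) pairs, and the function name `index_documents` and the `keep_pos` filter are hypothetical.

```python
from collections import Counter

def index_documents(tokenized_docs, keep_pos=None):
    """Assign a unique ID to each distinct (notation, part of speech) pair
    and express every document as {word ID: frequency}.

    tokenized_docs: list of documents, each a list of (notation, pos) pairs
    (assumed output of a morphological analyzer, which is not shown here).
    keep_pos: optional set of parts of speech to keep (hypothetical filter).
    """
    vocab = {}       # (notation, pos) -> word ID, in order of first appearance
    doc_counts = []  # one Counter {word ID: frequency} per document
    for tokens in tokenized_docs:
        counts = Counter()
        for notation, pos in tokens:
            if keep_pos is not None and pos not in keep_pos:
                continue
            # len(vocab) is evaluated before insertion, so IDs run 0, 1, 2, ...
            word_id = vocab.setdefault((notation, pos), len(vocab))
            counts[word_id] += 1
        doc_counts.append(counts)
    return vocab, doc_counts
```

Passing `keep_pos={"noun", "unregistered"}` would correspond to the simplification used in the example of FIG. 5, where only nouns and unregistered words are adopted.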

【００７２】例えば、図４（Ａ）に示す文書１と文書２
に対し、形態素解析を適用すると図４（Ｂ）のような結
果が得られる。図４（Ｂ）において各切り出された単語
の下の数値はそれらの品詞を示しており、その対応表は
図４（Ｃ）に示す。文書群が図４（Ａ）に示す２つの文
書のみで構成されているとすると、文書群で出現したユ
ニークな単語の表記、品詞、識別番号と各文書を単語識
別番号とその頻度で表現した結果は図５（Ａ）〜図５
（Ｃ）のようになる。ただし、簡単のため品詞としては
名詞と未登録語のみを採用する。For example, applying morphological analysis to document 1 and document 2 shown in FIG. 4(A) yields the result shown in FIG. 4(B). In FIG. 4(B), the number below each segmented word indicates its part of speech; the correspondence table is shown in FIG. 4(C). Assuming the document group consists only of the two documents shown in FIG. 4(A), FIGS. 5(A) to 5(C) show the notation, part of speech, and identification number of each unique word appearing in the document group, together with each document expressed by word identification numbers and their frequencies. For simplicity, only nouns and unregistered words are adopted as parts of speech.

【００７３】文書分類部１０３では、文書解析部１０２
で生成された情報をもとに文書群の分類をおこなう。本
発明では、分類手法は特に限定しないが、ここでは一例
として、上記文書解析部１０２における実施例を継承し
て、各文書を文書群でユニークな単語の出現頻度のベク
トルで表現し、これらのベクトルをもとにクラスタリン
グ手法の１つであるｋｍｅａｎｓ法を用いて文書分類
を行う例を示すこととし、そのフローチャートを図６に
示す。ここで、ベクトル間の類似度は０〜１の間の実
数、かつ最大類似度は１であるとする。The document classification unit 103 classifies the document group based on the information generated by the document analysis unit 102. The present invention does not limit the classification method; as an example here, carrying over the embodiment of the document analysis unit 102, each document is represented by a vector of the occurrence frequencies of the words unique in the document group, and the documents are classified from these vectors using the k-means method, one of the clustering methods. The flowchart is shown in FIG. 6. Here, the similarity between vectors is assumed to be a real number between 0 and 1, with a maximum similarity of 1.
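A k-means pass over frequency vectors with a similarity in the range 0 to 1 might look like the sketch below. It is not the procedure of FIG. 6: seeding the centroids with the first k documents, the fixed iteration count, and the names `cosine` and `kmeans_cosine` are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity; lies in [0, 1] for non-negative frequency vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def kmeans_cosine(vectors, k, iterations=5):
    """Assign each vector to the most similar centroid, then recompute the
    centroids as member means, for a fixed number of iterations."""
    centroids = [list(v) for v in vectors[:k]]  # assumption: first k docs as seeds
    labels = [0] * len(vectors)
    for _ in range(iterations):
        labels = [max(range(k), key=lambda c: cosine(v, centroids[c]))
                  for v in vectors]
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels
```

The returned list gives, for each document, the index of the partial document set it was assigned to.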

【００７４】図７（Ａ）に示す１５個の文書を図３及び
図５に示すアルゴリズムを基に３つの部分文書集合に分
類した結果を図７（Ｂ）に示す。ここで、品詞としては
名詞と未登録語のみを採用し、またｋｍｅａｎｓ法に
おける類似測度は余弦測度であり、反復停止条件は繰返
し回数５回としている。代表語抽出部１０４では、文書
解析部１０２で生成した各文書の単語情報及び文書分類
部で生成した部分文書グループに関する情報をもとに各
部分文書集合における代表語セットを抽出する。FIG. 7(B) shows the result of classifying the 15 documents shown in FIG. 7(A) into three partial document sets based on the algorithms shown in FIGS. 3 and 5. Here, only nouns and unregistered words are adopted as parts of speech, the similarity measure in the k-means method is the cosine measure, and the stopping condition is five iterations. The representative word extraction unit 104 extracts the representative word set of each partial document set, based on the word information of each document generated by the document analysis unit 102 and the information on the partial document groups generated by the document classification unit.

【００７５】本発明では、代表語の抽出方法を特に限定
しないが、ここでは一例として、上記文書分類部におけ
る実施例を継承し、各部分文書集合においてそれらに所
属する文書をひとつの仮想的な文書とみなした時の、文
書群でユニークな単語の出現頻度が指定されたしきい値
以上の単語をそれらの部分文書集合の代表語セットとす
る例を示すこととし、そのフローチャートを図８に示
す。The present invention does not limit the method of extracting representative words. As an example here, carrying over the embodiment of the document classification unit, the documents belonging to each partial document set are regarded as a single virtual document, and the words unique in the document group whose occurrence frequency in that virtual document is at or above a specified threshold are taken as the representative word set of the partial document set. The flowchart is shown in FIG. 8.
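The threshold rule above can be sketched as follows, assuming each document is already available as a word-to-frequency mapping and each document carries a cluster label. The name `representative_words` is hypothetical and the sketch does not reproduce FIG. 8.

```python
from collections import Counter

def representative_words(doc_counts, labels, threshold=2):
    """Merge the documents of each partial document set into one virtual
    document and keep the words whose merged frequency reaches the threshold."""
    merged = {}
    for counts, label in zip(doc_counts, labels):
        merged.setdefault(label, Counter()).update(counts)  # update() adds counts
    return {label: {word for word, n in total.items() if n >= threshold}
            for label, total in merged.items()}
```

With `threshold=2`, a word that appears once in each of two documents of the same set qualifies, because the frequencies are summed over the virtual document.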

【００７６】上記文書分類部における実施例の各部分文
書集合について上記のフローチャートに従って求めた代
表語セットを図９に示す。ここで、出現頻度のしきい値
は２としている。FIG. 9 shows a representative word set obtained according to the above-described flowchart for each partial document set of the embodiment in the document classification unit. Here, the threshold value of the appearance frequency is 2.

【００７７】関連語抽出部１０５では、代表語抽出部１
０４にて抽出した各部分文書集合の代表語それぞれにつ
いて、関連語辞書を用いて関連語を抽出し、それらを各
部分文書集合の関連語セットとする。関連語辞書として
は、同義語辞書、広義語辞書、狭義語辞書、類義語辞
書、反対語辞書、兄弟語辞書、上位概念語辞書、下位概
念語辞書等を用いることができるが、ここでは一例とし
て、上記代表語抽出部における実施例を継承し、任意の
一つの関連語辞書を用いて各部分文書集合の関連語セッ
トを求める例を示すこととし、そのフローチャートを図
１０に示す。なお、複数の辞書を用いる場合には、前記
処理を各辞書について繰り返し行えばよい。簡単のため
関連語として同義語のみを扱うとして、前記代表語抽出
部の実施例で求めた各代表語の同義語が図１１（Ａ）に
示されるような場合、各部分文書集合の関連語セットは
図１１（Ｂ）のように示される。For each representative word of each partial document set extracted by the representative word extraction unit 104, the related word extraction unit 105 extracts related words using a related word dictionary and takes them as the related word set of the partial document set. As the related word dictionary, a synonym dictionary, a broader-term dictionary, a narrower-term dictionary, a near-synonym dictionary, an antonym dictionary, a sibling-term dictionary, a superordinate-concept dictionary, a subordinate-concept dictionary, and the like can be used. As an example here, carrying over the embodiment of the representative word extraction unit, the related word set of each partial document set is obtained using any one related word dictionary; the flowchart is shown in FIG. 10. When multiple dictionaries are used, the above process may simply be repeated for each dictionary. For simplicity, only synonyms are treated as related words: if the synonyms of the representative words obtained in the embodiment of the representative word extraction unit are as shown in FIG. 11(A), the related word set of each partial document set is as shown in FIG. 11(B).
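The dictionary lookup described above can be illustrated with the sketch below. It assumes each related word dictionary is a plain mapping from a word to a list of related words; that representation and the name `related_word_sets` are assumptions, and FIG. 10 is not reproduced.

```python
def related_word_sets(representative, *dictionaries):
    """Collect, for each partial document set, the related words of all its
    representative words, repeating the lookup once per dictionary."""
    related = {label: set() for label in representative}
    for dictionary in dictionaries:
        for label, words in representative.items():
            for word in words:
                related[label].update(dictionary.get(word, ()))
    return related
```

Passing a single synonym dictionary corresponds to the simplified case of FIG. 11; passing several dictionaries repeats the same lookup for each one.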

【００７８】部分文書集合情報生成部１０６では、文書
解析部１０２で生成した各文書の単語情報、文書分類部
１０３で生成した部分文書グループに関する情報、代表
語抽出部１０４で抽出した各部分文書集合の代表語セッ
ト、及び関連語抽出部１０５で生成した各部分文書集合
の関連語セットを基に個々の部分文書集合及び部分文書
集合間の関連情報を生成する。The partial document set information generation unit 106 generates information on the individual partial document sets and relational information between partial document sets, based on the word information of each document generated by the document analysis unit 102, the information on the partial document groups generated by the document classification unit 103, the representative word set of each partial document set extracted by the representative word extraction unit 104, and the related word set of each partial document set generated by the related word extraction unit 105.

【００７９】各部分文書集合固有の情報としては、代表
語セットの集合、関連語セットの集合、及び各部分文書
集合が多重分類を許す分類手法により生成されている場
合は、代表語及び／または関連語を指定されるしきい値
個数以上含む部分文書集合に所属する文書の部分集合等
の情報を用いることができる。また、部分文書集合間の
関連情報としては、部分文書集合間の代表語セット集合
の積集合や和集合や差集合、関連語セットの集合の積集
合や和集合や差集合、及び各部分文書集合が多重分類を
許す分類手法により生成されている場合は、部分文書集
合に所属する文書の積集合や和集合や差集合、代表語及
び／または関連語を多く含む部分文書集合に所属する文
書の部分集合間の積集合や和集合や差集合等の情報を用
いることができる。Information specific to each partial document set may include the collection of representative word sets, the collection of related word sets, and, when the partial document sets are generated by a classification method that allows multiple classification, information such as the subset of documents belonging to the partial document set that contain at least a specified threshold number of representative words and/or related words. Relational information between partial document sets may include the intersection, union, and difference of the representative word sets of the partial document sets; the intersection, union, and difference of their related word sets; and, when the partial document sets are generated by a classification method that allows multiple classification, the intersection, union, and difference of the documents belonging to the partial document sets, and the intersection, union, and difference of the subsets of documents that contain many representative words and/or related words.

【００８０】ここでは一例として、上記関連語抽出部に
おける実施例を継承し、文書部分集合が多重分類を許す
分類手法により生成されているとしたときに、部分文書
集合情報として、代表語セットの集合、関連語セットの
集合、部分文書集合間の代表語セット集合の積集合と和
集合と差集合、部分文書集合間の関連語セット集合の積
集合と和集合と差集合を生成する例を示すこととし、そ
のフローチャートを図１２に示す。これらの情報によ
り、特に部分文書集合間の類似性、関連性、及び包含関
係などを把握することが可能になる。Here, as an example carrying over the embodiment of the related word extraction unit, and assuming that the document subsets are generated by a classification method that allows multiple classification, the following are generated as partial document set information: the collection of representative word sets, the collection of related word sets, the intersection, union, and difference of the representative word sets between partial document sets, and the intersection, union, and difference of the related word sets between partial document sets. The flowchart is shown in FIG. 12. This information makes it possible to grasp, in particular, the similarity, relevance, and inclusion relations among partial document sets.
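The set operations named above map directly onto set arithmetic. The sketch below computes them for every pair of partial document sets; the function name and the result keys (`only_first`, `only_second`) are hypothetical, and the same function can be applied to representative word sets or to related word sets.

```python
from itertools import combinations

def pairwise_set_relations(word_sets):
    """For every pair of partial document sets, compute the intersection,
    union, and both differences of their word sets."""
    relations = {}
    for a, b in combinations(sorted(word_sets), 2):
        first, second = word_sets[a], word_sets[b]
        relations[(a, b)] = {
            "intersection": first & second,
            "union": first | second,
            "only_first": first - second,   # words of a that b lacks
            "only_second": second - first,  # words of b that a lacks
        }
    return relations
```

An empty `only_second` means the word set of the second partial document set is contained in that of the first, which is one way an inclusion relation could be read off; a large intersection relative to the union would point to similarity.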

【００８１】分類結果保存部１０７では、文書解析部１
０２で生成した各文書の単語情報、文書分類部１０３で
生成した部分文書グループに関する情報、代表語抽出部
１０４で抽出した各部分文書集合の代表語セット、関連
語抽出部１０５で生成した各部分文書集合の関連語セッ
ト、及び部分文書集合情報生成部１０６で生成した個々
の部分文書集合及び部分文書集合間の関連情報を適切な
形式で保存する。保存された関連情報は、出力部１０８
からユーザの要求に応じて、または予め定められた条件
に従って所定の出力手段に適宜出力される。The classification result storage unit 107 stores, in an appropriate format, the word information of each document generated by the document analysis unit 102, the information on the partial document groups generated by the document classification unit 103, the representative word set of each partial document set extracted by the representative word extraction unit 104, the related word set of each partial document set generated by the related word extraction unit 105, and the information on the individual partial document sets and the relational information between partial document sets generated by the partial document set information generation unit 106. The stored information is output from the output unit 108 to a predetermined output means as appropriate, in response to a user request or according to predetermined conditions.

【００８２】図１３は本発明の請求項３，４，１６，１
７，２９及び３０に対応する実施例を説明するための文
書分類装置２００のブロック構成図である。なお、図１
と同様の機能を有する部分には図１と同一の番号を付し
ている。反意部分文書集合生成部２０１では、関連語抽
出部１０５にて生成される関連語としてすくなくとも反
対語が抽出されるとき、任意の文書部分集合が有する反
対語が、自分を含む他のどの部分文書集合の代表語とも
一致しない場合、この反対語を含む文書を文書群から抽
出し、それを新しい部分文書集合とする処理をすべての
部分文書集合について再帰的におこなう。FIG. 13 is a block diagram of a document classification device 200 for describing an embodiment corresponding to claims 3, 4, 16, 17, 29 and 30 of the present invention. Parts having the same functions as in FIG. 1 are given the same numbers as in FIG. 1. When at least antonyms are extracted as the related words generated by the related word extraction unit 105, the antonymous partial document set generation unit 201 performs the following process recursively for all partial document sets: if an antonym held by a given partial document set matches none of the representative words of any partial document set, including its own, the documents containing that antonym are extracted from the document group and made into a new partial document set.

【００８３】ここでは一例として、上記実施例を継承し
て、関連語抽出部１０５にて反対語のみが抽出されるこ
ととし、各部分文書集合が有する反意語セットについて
それが自分を含む他の部分文書集合の代表語と一致する
か否かを判定し、反意語がどの代表語とも一致しない場
合、検索手法を用いて文書群からその反意語を含む文書
を抽出し、それらを新しい部分文書集合とする例を示す
こととし、そのフローチャートを図１４に示す。Here, as an example carrying over the above embodiment, it is assumed that the related word extraction unit 105 extracts only antonyms. For the antonym set of each partial document set, it is determined whether each antonym matches a representative word of any partial document set, including its own; if the antonym matches no representative word, the documents containing it are extracted from the document group using a search method and made into a new partial document set. The flowchart is shown in FIG. 14.
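One pass of this antonym check can be sketched as below. It is a simplification under stated assumptions: documents are given as sets of words, the search is a plain membership test, the recursion over newly created sets is omitted, and the name `antonym_subsets` is hypothetical. FIG. 14 is not reproduced.

```python
def antonym_subsets(documents, representative, antonyms):
    """Create a new partial document set for each antonym that matches no
    representative word of any existing partial document set.

    documents: {doc ID: set of words}; representative: {label: set of words};
    antonyms: {word: list of antonyms} (assumed dictionary layout).
    """
    all_representatives = set().union(*representative.values())
    new_subsets = {}
    for words in representative.values():
        for word in words:
            for antonym in antonyms.get(word, ()):
                if antonym in all_representatives or antonym in new_subsets:
                    continue
                # simple search: every document that contains the antonym
                hits = [doc_id for doc_id, tokens in documents.items()
                        if antonym in tokens]
                if hits:  # an antonym with no hits yields no new set
                    new_subsets[antonym] = hits
    return new_subsets
```

A full implementation along the lines of the text would extract representative words for each new set and repeat the check on them.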

【００８４】例えば、図７（Ａ）に示す文書群を分類し
た結果得られている図７（Ｂ）の部分文書集合３の代表
語セットに着目してみる。この場合、代表語“商用”の
反対語として、“無料、フリー”という単語が得られた
とする。この場合、これらの単語はどの代表語とも一致
せず、単語“無料”で文書群を検索した結果は該当０件
であるが、単語“フリー”で検索した場合は、文書４、
文書５、文書１２が検索される。これをあらたな部分文
書集合とした場合、代表語として、“リナックス、フリ
ー、ディストリビューション”を得ることができる。For example, consider the representative word set of partial document set 3 in FIG. 7(B), obtained by classifying the document group shown in FIG. 7(A). Suppose that the words "無料" (free of charge) and "フリー" (free) are obtained as antonyms of the representative word "商用" (commercial). These words match none of the representative words. Searching the document group for "無料" returns no hits, but searching for "フリー" retrieves document 4, document 5, and document 12. If these are made a new partial document set, "リナックス" (Linux), "フリー" (free), and "ディストリビューション" (distribution) are obtained as its representative words.

【００８５】これにより文書群から任意の部分文書集合
とは反対の意味を有する部分文書集合が文書分類部では
生成されなかった場合にも、反対の意味を有する部分文
書集合を生成することができるため、文書群からより広
範囲な話題を抽出することが可能となる。Thus, even if a partial document set having the opposite meaning to an arbitrary partial document set is not generated from the document group by the document classifying unit, a partial document set having the opposite meaning can be generated. Therefore, it is possible to extract a wider topic from the document group.

【００８６】請求項４，１７，３０の発明では、反対語
からあらたな部分文書集合を求める際に、反対語を生成
した代表語以外の部分文書集合の代表語も合わせて部分
文書集合を求めることにより、より対象の部分文書集合
とは反対の意味をもつ部分文書集合を生成することが可
能となるが、基本的な処理は上記実施例と同様の処理で
求めることができる。すなわち、例えば、図１４に示す
フローチャートにおいて反対語を用いて文書群を検索す
るステップを反対語と反対語を生成した代表語以外の部
分文書集合の代表語を組合わせた論理式を用いればよ
い。According to the inventions of claims 4, 17 and 30, when a new partial document set is obtained from an antonym, a partial document set is also obtained together with a representative word of a partial document set other than the representative word that generated the antonym. This makes it possible to generate a partial document set having a meaning opposite to that of the target partial document set, but the basic processing can be obtained by the same processing as in the above embodiment. That is, for example, in the flowchart shown in FIG. 14, the step of searching for a document group using an antonym may be performed using a logical expression that combines an antonym and a representative word of a partial document set other than the representative word that generated the antonym. .

【００８７】図１５は、本発明の請求項５〜９，１８〜
２２及び３１〜３５に対応する実施例を説明するための
文書分類装置のブロック構成図である。文書入力部３０
１は、キーボード、ＯＣＲ装置、ハードディスク等の補
助記憶装置等の入力手段が文書分類装置３００に直接
に、または、ネットワーク経由で接続され、このような
入力手段から文書や文書群を獲得し、文書データを入力
するインターフェースである。この際、各文書データを
一意に識別するために、例えばユニークな数などの、識
別子を各文書に割り当てる。FIG. 15 is a block diagram of a document classification device for describing an embodiment corresponding to claims 5 to 9, 18 to 22, and 31 to 35 of the present invention. The document input unit 301 is an interface to which input means such as a keyboard, an OCR device, or an auxiliary storage device such as a hard disk are connected, either directly to the document classification device 300 or via a network; it acquires a document or a group of documents from such input means and inputs the document data. At this time, an identifier such as a unique number is assigned to each document so that each document data can be uniquely identified.

【００８８】文書解析部３０２では、入力された文書そ
れぞれに対し形態素解析を適用し、各文書を構成する単
語を品詞情報等とともに抽出する。この際、抽出した単
語を識別するために、抽出した単語のうちユニークな表
記を持つものについては、ユニークな識別子を付置して
おく。さらに、形態素解析の結果得られる単語のうち指
定される品詞をもつ単語について、その前後の単語と適
切に組合わせた単語と置き換え、かつ品詞もまた適切な
ものに置き換える処理を施す。例として、品詞が接頭詞
全般、接尾詞全般、及び助数詞である単語について前記
の結合及び置き換え処理を行う動作を説明する。The document analysis unit 302 applies morphological analysis to each of the input documents, and extracts words constituting each document together with part of speech information and the like. At this time, in order to identify the extracted words, unique identifiers are attached to the extracted words having a unique notation. Further, a word having a specified part of speech out of the words obtained as a result of the morphological analysis is replaced with a word appropriately combined with the preceding and following words, and the part of speech is also replaced with an appropriate word. As an example, a description will be given of an operation of performing the above-described combination and replacement processing on a word whose part of speech is a general prefix, a general suffix, and a classifier.

【００８９】まず、本例では、前記の結合および置き換
え処理を品詞が、１．接頭詞全般、２．数詞接尾詞以外
の接尾詞全般、３．数詞接尾詞もしくは数助詞の場合別
に以下のような規則でおこなうこととする。ただし、本
発明における結合及び置き換え処理の規則はこれらに限
定するものではない。First, in this example, the combination and replacement processing is performed according to the following rules, depending on whether the part of speech is (1) a prefix in general, (2) a suffix in general other than a numeral suffix, or (3) a numeral suffix or a classifier. The rules for combination and replacement in the present invention are not, however, limited to these.

【００９０】 ○接頭詞全般もし｛対象単語の品詞が接頭詞である｝ならば｛計数用変数：ｉに１を代入する繰り返す｛対象単語の先頭に対象単語よりｉ回前に抽出された単語を結合させるもし｛i回前に抽出されている単語の品詞が分類時使用品詞である}ならば{ 繰り返しループを抜ける｝さもなくば｛ｉを１増加する｝｝対象単語の品詞を変更する｝
○ Prefixes in general
  if {the part of speech of the target word is a prefix} then {
    assign 1 to the counting variable i
    repeat {
      join the word extracted i positions before the target word to the head of the target word
      if {the part of speech of the word extracted i positions before is a part of speech used for classification} then { exit the loop }
      else { increase i by 1 }
    }
    change the part of speech of the target word
  }

【００９１】 ○数詞接尾詞以外の接尾詞全般もし｛対象単語の品詞が数詞接尾詞以外の接尾詞である｝ならば｛計数用変数：ｉに１を代入する繰り返す｛対象単語の終端に対象単語よりｉ回後に抽出された単語を結合させるもし｛i回前に抽出されている単語の品詞が分類時使用品詞である}ならば{ 繰り返しループを抜ける｝さもなくば｛ｉを１増加する｝｝対象単語の品詞を変更する｝
○ Suffixes in general other than numeral suffixes
  if {the part of speech of the target word is a suffix other than a numeral suffix} then {
    assign 1 to the counting variable i
    repeat {
      join the word extracted i positions after the target word to the end of the target word
      if {the part of speech of the word extracted i positions before is a part of speech used for classification} then { exit the loop }
      else { increase i by 1 }
    }
    change the part of speech of the target word
  }

【００９２】 ○数詞接尾詞もしくは助数詞もし｛対象単語の品詞が数詞接尾詞もしくは助数詞である｝ならば｛繰り返す｛もし｛対象単語の直前に抽出されている単語の品詞が数詞である｝ならば｛対象単語の先頭に対象単語の直前に抽出された単語を結合させる対象単語のｉ回前に抽出された単語を削除する｝さもなくば｛繰り返しループを抜ける｝｝対象単語の品詞を変更する｝
○ Numeral suffixes or classifiers
  if {the part of speech of the target word is a numeral suffix or a classifier} then {
    repeat {
      if {the part of speech of the word extracted immediately before the target word is a numeral} then {
        join the word extracted immediately before the target word to the head of the target word
        delete the word extracted i positions before the target word
      }
      else { exit the loop }
    }
    change the part of speech of the target word
  }
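The third rule (numeral suffixes or classifiers) can be made runnable as in the sketch below. The part-of-speech tag names and the function name are hypothetical, and the sketch assumes the merged word is retagged as a common noun, as in the 1950年 example of paragraph 0094.

```python
def merge_numeral_classifiers(tokens, numeral="numeral",
                              classifiers=("numeral_suffix", "classifier"),
                              merged_pos="common_noun"):
    """Fold a run of numerals into the classifier that follows it.

    tokens: list of (notation, pos) pairs in extraction order. The tag names
    are placeholders for whatever the morphological analyzer emits.
    """
    out = []
    for notation, pos in tokens:
        if pos in classifiers:
            # absorb every numeral immediately before the target word,
            # deleting each one from the output as it is joined
            while out and out[-1][1] == numeral:
                previous, _ = out.pop()
                notation = previous + notation
            pos = merged_pos  # the target word's part of speech is replaced
        out.append((notation, pos))
    return out
```

Numerals that are not followed by a classifier are left untouched.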

【００９３】図１６に示す６つの文書データを分類対象
文書データとし、この文書データに対して形態素解析を
適用し、単語及びそれらの品詞を抽出したものを図１７
に示す。ただし、本発明では形態素解析系については特
に規定しない。また、分類時使用品詞を普通名詞、サ変
名詞、固有名詞、数詞、形容詞、接頭詞全般、接尾詞全
般、助数詞とした場合の文書データの解析結果を図１８
に示す。The six document data shown in FIG. 16 are the documents to be classified; FIG. 17 shows the words and their parts of speech extracted by applying morphological analysis to them. The present invention does not prescribe any particular morphological analysis system. FIG. 18 shows the analysis result of the document data when the parts of speech used for classification are common nouns, sa-hen (verbal) nouns, proper nouns, numerals, adjectives, prefixes in general, suffixes in general, and classifiers.

【００９４】図１８に示されている結果において、品詞
が接頭詞全般、接尾詞、もしくは数助詞である単語に対
し前記規則に従い、結合・置き換え処理を施した結果を
図１９に示す。例えば、文書１における｛千葉［普通名
詞］、氏［固有名詞接尾詞］｝という文字列は、数詞接
尾詞以外の接尾詞全般の規則を用いて、｛千葉［普通名
詞］、千葉氏［固有名詞］｝という文字列になり、また
｛１［数詞］、９［数詞］、５［数詞］、０［数詞］、
年［助数詞］｝という文字列は、数詞接尾詞もしくは助
数詞の規則を用いて、｛１９５０年［普通名詞］｝とい
う文字列になる。FIG. 19 shows the result of applying the combination and replacement processing, according to the above rules, to the words in FIG. 18 whose part of speech is a prefix in general, a suffix, or a classifier. For example, in document 1 the string {千葉 [common noun], 氏 [proper-noun suffix]} becomes {千葉 [common noun], 千葉氏 [proper noun]} under the rule for suffixes in general other than numeral suffixes, and the string {1 [numeral], 9 [numeral], 5 [numeral], 0 [numeral], 年 [classifier]} becomes {1950年 [common noun]} under the rule for numeral suffixes or classifiers.

【００９５】文書ベクトル空間生成部３０３では、前記
文書解析部にて抽出された各文書データの単語情報をも
とに文書データをベクトル表現するための空間を生成す
る。例として、前記文書解析部での例をもとに、文書デ
ータ全体でユニークな単語の頻度により文書ベクトル空
間を生成することとする場合の各文書データのベクトル
表現を生成する動作を説明する。ただし、本発明では、
ベクトル空間生成手法はこれに限定するものではなく、
例えば、全単語の線形変換によりベクトル空間を生成す
ることもできる。The document vector space generation unit 303 generates a space for representing the document data as vectors, based on the word information of each document data extracted by the document analysis unit. As an example, building on the example of the document analysis unit, the operation of generating the vector representation of each document data is described for the case where the document vector space is generated from the frequencies of the words unique across the entire document data. The vector space generation method of the present invention is not limited to this; for example, a vector space can also be generated by a linear transformation of all words.

【００９６】図１８及び図１９に示す文書解析結果から
ユニークな単語を抽出し、各文書での該当単語の頻度を
計数し、それらの結果を、単語を列方向に、文書データ
を行方向に付置することで、行列表現したものをそれぞ
れ図２０と図２１に示す。これら行列において、列ベク
トルが各文書データのベクトルデータとなる。Unique words are extracted from the document analysis results shown in FIGS. 18 and 19, and the frequency of each word in each document is counted. FIGS. 20 and 21 respectively show these results in matrix form, with the words arranged along the column direction and the document data along the row direction. In these matrices, each column vector is the vector data of one document data.
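The matrix layout described above, with one row per word and one column per document, can be sketched as follows. The function names are hypothetical, and sorting the vocabulary is only a convenience to make the row order reproducible.

```python
def term_document_matrix(doc_counts):
    """Build a word-by-document frequency matrix.

    doc_counts: list of {word: frequency} mappings, one per document.
    Returns (vocabulary, matrix) where matrix[i][j] is the frequency of
    vocabulary[i] in document j.
    """
    vocabulary = sorted({word for counts in doc_counts for word in counts})
    matrix = [[counts.get(word, 0) for counts in doc_counts]
              for word in vocabulary]
    return vocabulary, matrix

def document_vector(matrix, j):
    """Column j of the matrix, that is, the vector of document j."""
    return [row[j] for row in matrix]
```

Reading a column of the matrix gives the frequency vector that the classification step then operates on.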

【００９７】文書分類部３０４では、前記文書ベクトル
空間生成部にて生成された文書データベクトルを統計手
法を用いることで幾つかの集合に分類する。出力部３０
５では、文書分類部３０４で分類された文書データベク
トルの集合をユーザの要求に応じてまたは予め定められ
た条件に従って所定の出力手段に適宜出力する。文書分
類部３０４における統計処理は様々なものが利用可能で
あるが、請求項５の発明ではアルゴリズムの簡潔さやパ
ラメータの有無等の理由からクラスタリング手法を用い
ることに限定している。例として、前記文書ベクトル空
間生成部での例をもとに、クラスタリング手法を用いて
文書ベクトルを分類する動作を説明する。The document classification unit 304 classifies the document data vectors generated by the document vector space generation unit into several sets using a statistical method. The output unit 305 outputs the sets of document data vectors classified by the document classification unit 304 to a predetermined output means as appropriate, in response to a user request or according to predetermined conditions. Various kinds of statistical processing can be used in the document classification unit 304, but the invention of claim 5 is limited to clustering methods, for reasons such as the simplicity of the algorithm and the presence or absence of parameters. As an example, building on the example of the document vector space generation unit, the operation of classifying document vectors using a clustering method is described.

【００９８】ここでは、クラスタリング手法の１つであ
るＷａｒｄ法を用いることとし、また類似測度は標準
化ユークリッド距離測度を使用する。なお、クラスタリ
ング手法に関しては、“多変量解析入門（森北出版）”
に詳しい。図２０及び図２１に示されている文書データ
に対し、Ｗａｒｄ法を適用した結果を図２２と図２３
に示す。ここで、図２０は前記結合・置き換えの処理を
適用した結果で文書ベクトル空間を構成したデータであ
り、図２１は結合・置き換え処理を適用していない結果
で文書ベクトル空間を構成したデータである。また、図
２２と図２３の図中の数値は各クラスタ間の距離であ
る。Here, the Ward method, one of the clustering methods, is used, with the standardized Euclidean distance as the similarity measure. Clustering methods are described in detail in "Introduction to Multivariate Analysis" (Morikita Publishing). FIGS. 22 and 23 show the results of applying the Ward method to the document data shown in FIGS. 20 and 21. Here, FIG. 20 is the data whose document vector space was constructed from the result of applying the combination and replacement processing, and FIG. 21 is the data whose document vector space was constructed from the result without that processing. The numerical values in FIGS. 22 and 23 are the distances between clusters.

【００９９】図２２及び図２３の結果を比較した場合、
文書４の位置の差異が非常に特徴的であり、結合・置き
換えの処理を適用した場合は、文書４は文書２や文書５
と類似していると判断され、結合・置き換えの処理を適
用しない場合は、文書４は文書１や文書６と類似してい
ると判断される。主観的な語彙の適合度などから判断し
て文書４は｛文書２、文書５｝の集合よりも｛文書１、
文書３、文書６｝の集合に含まれる方が適切であると思
われる。従って、この結果から、結合・置き換えの処理
を適用することにより、より質の高い文書ベクトル空間
を構成でき、この文書ベクトル空間で分類処理をおこな
うことで、質の高い文書分類結果を得ることができる。Comparing the results of FIGS. 22 and 23, the difference in the position of document 4 is highly characteristic: when the combination and replacement processing is applied, document 4 is judged to be similar to documents 2 and 5, and when it is not applied, document 4 is judged to be similar to documents 1 and 6. Judging subjectively from factors such as vocabulary fit, document 4 seems more appropriately placed in the set {document 1, document 3, document 6} than in the set {document 2, document 5}. From this result, applying the combination and replacement processing allows a higher-quality document vector space to be constructed, and performing classification in that space yields a higher-quality document classification result.

【０１００】図２４は本発明の請求項１０〜１３，２３
〜２６及び３６〜３９に対応する実施例を説明するため
の文書分類装置のブロック構成図である。文書入力部４
０１は、キーボード、ＯＣＲ装置、ハードディスク等の
補助記憶装置による入力手段が文書分類装置４００に直
接に、または、ネットワーク経由で接続され、このよう
な入力手段から文書や文書群を獲得し、文書データを入
力するインターフェースである。この際、各文書データ
を一意に識別するために、例えばユニークな数などの、
識別子を各文書に割り当てる。FIG. 24 is a block diagram of a document classification device for describing an embodiment corresponding to claims 10 to 13, 23 to 26, and 36 to 39 of the present invention. The document input unit 401 is an interface to which input means such as a keyboard, an OCR device, or an auxiliary storage device such as a hard disk are connected, either directly to the document classification device 400 or via a network; it acquires a document or a group of documents from such input means and inputs the document data. At this time, an identifier such as a unique number is assigned to each document so that each document data can be uniquely identified.

【０１０１】文書解析部４０２では、入力された文書そ
れぞれに対し形態素語解析を適用し、各文書を構成する
単語を品詞情報等とともに抽出する。この際、抽出した
単語を識別するために、抽出した単語のうちユニークな
表記を持つものについては、前記文書データと同様にユ
ニークな識別子を付置しておく。例として、文書データ
に対し形態素解析を適用し、文書データ全体で表記と品
詞がユニークである単語を同定し、それらに一意な識別
番号を付与するとともに、各文書データを、それを構成
する単語の識別番号とその出現頻度を表現するための擬
似コードを図２５に示す。なお、本発明では、形態素解
析系は必要な情報を抽出できるものであれば、どのよう
なものでもよい。The document analysis unit 402 applies morphological analysis to each input document and extracts the words constituting each document together with part-of-speech information and the like. At this time, to identify the extracted words, a unique identifier is attached to each extracted word with a distinct notation, as is done for the document data. As an example, FIG. 25 shows pseudocode that applies morphological analysis to the document data, identifies the words whose notation and part of speech are unique across the entire document data, assigns a unique identification number to each of them, and represents each document data by the identification numbers of its constituent words and their occurrence frequencies. In the present invention, any morphological analysis system may be used as long as it can extract the necessary information.

【０１０２】文書解析結果記憶部４０３では、文書解析
部４０２にて抽出された文書データの形態素解析結果を
適切な形式で記憶する。文書ベクトル空間生成部４０４
では、文書解析部４０２にて抽出された各文書データの
単語情報をもとに文書データをベクトル表現するための
空間を生成する。例として、文書解析部４０２での例を
もとに、文書データ全体でユニークな単語の正規化され
た頻度により文書ベクトル空間を生成する場合の、各文
書データのベクトル表現を生成する擬似コードを図２６
に示す。ただし、本発明では、ベクトル空間生成手法は
これに限定するものではなく、例えば、特異値分解など
を使用して全単語の線形変換によりベクトル空間を生成
することもできる。The document analysis result storage unit 403 stores the morphological analysis results of the document data extracted by the document analysis unit 402 in an appropriate format. The document vector space generation unit 404 generates a space for representing the document data as vectors, based on the word information of each document data extracted by the document analysis unit 402. As an example, building on the example of the document analysis unit 402, FIG. 26 shows pseudocode that generates the vector representation of each document data when the document vector space is generated from the normalized frequencies of the words unique across the entire document data. The vector space generation method of the present invention is not limited to this; for example, a vector space can also be generated by a linear transformation of all words using singular value decomposition or the like.

【０１０３】文書ベクトルデータ記憶部４０５では、文
書ベクトル空間生成部４０４にて生成された文書データ
ベクトルを適切な形式で記憶する。分類数決定部４０６
では、繰り返し文書分類を行う際の分類数を決定する
（分類数を定数×繰返し数とした場合の擬似コードを図
２７に含む）。文書分類部４０７では、文書ベクトル空
間生成部４０４にて生成された文書データベクトルを統
計手法を用いることで分類数決定部集合に分類する。The document vector data storage unit 405 stores the document data vectors generated by the document vector space generation unit 404 in an appropriate format. The classification number determination unit 406 determines the number of classes used when document classification is performed repeatedly (FIG. 27 includes pseudocode for the case where the number of classes is a constant multiplied by the iteration count). The document classification unit 407 uses a statistical method to classify the document data vectors generated by the document vector space generation unit 404 into the number of sets determined by the classification number determination unit 406.

【０１０４】統計処理は様々なものが利用可能である
が、請求項１１の発明ではアルゴリズムの簡潔さやクラ
スタ数の変化により分類構造が動的に変化する特性等か
ら非階層クラスタリング手法を用いることに限定してい
る。例として、クラスタ数を繰返し数と定数Ｎを乗じた
数としてクラスタリング手法を用いて文書ベクトルを分
類する擬似コードを図２７に示す。ここでは、クラスタ
リング手法の１つであるｋｍｅａｎｓ法を一部変更し
たもの用いることとし、また類似測度は余弦測度を使用
する。なお、クラスタリング手法に関しては、“多変量
解析入門（森北出版）”に詳しい。Various kinds of statistical processing can be used, but the invention of claim 11 is limited to non-hierarchical clustering methods, for reasons such as the simplicity of the algorithm and the property that the classification structure changes dynamically as the number of clusters changes. As an example, FIG. 27 shows pseudocode that classifies the document vectors using a clustering method with the number of clusters set to the iteration count multiplied by a constant N. Here, a partially modified version of the k-means method, one of the clustering methods, is used, with the cosine measure as the similarity measure. Clustering methods are described in detail in "Introduction to Multivariate Analysis" (Morikita Publishing).
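The outer loop that varies the number of clusters could be organized as in the sketch below. It is not the pseudocode of FIG. 27: the clustering routine is passed in as a parameter, the stop when the cluster count exceeds the number of documents is an added assumption, and the names `repeated_classification`, `n`, and `max_rounds` are hypothetical.

```python
def repeated_classification(vectors, cluster, n=2, max_rounds=3):
    """Run a clustering routine repeatedly with k = n * round number.

    cluster: any callable (vectors, k) -> list of labels, for example a
    k-means variant. Returns {k: labels} for every round that was run.
    """
    results = {}
    for round_no in range(1, max_rounds + 1):
        k = n * round_no              # number of clusters for this round
        if k > len(vectors):          # assumption: stop once k exceeds the data
            break
        results[k] = cluster(vectors, k)
    return results
```

Keeping the result of every round, rather than only the last one, is what makes it possible to compare partial document sets obtained at different granularities afterwards.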

【０１０５】The document classification result storage unit 408 stores the document classification results generated by the document classification unit 407 in an appropriate format. The repetition determination unit 409 determines whether to continue repeating the document classification (FIG. 27 includes pseudo code for the case where the repetition is limited to a specified maximum count). The partial document set relation calculation unit 410 calculates relation information between the plural partial document sets stored in the document classification result storage unit 408, using the information on the document data stored in the document analysis result storage unit 403 and the document vector data storage unit 405. As an example, the following describes how a similarity relation and an inclusion relation between partial document sets are calculated from the document data and/or from information on the words constituting the document data.
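The repeated classification controlled by units 406 to 409 can be sketched as a simple loop. This is an illustrative outline only: `classify_repeatedly`, `round_robin`, and the dictionary used as a stand-in for the result storage unit are hypothetical names, and the classifier is passed in as a parameter so that any clustering routine could be substituted.

```python
def classify_repeatedly(vectors, classify, n_const, max_iterations):
    """Classify the same vectors repeatedly with a growing number of classes."""
    results = {}                            # k -> labels (stand-in for storage unit 408)
    iteration = 1
    while True:
        k = n_const * iteration             # classification number determination (406)
        results[k] = classify(vectors, k)   # document classification (407)
        if iteration >= max_iterations:     # repetition determination (409)
            break
        iteration += 1
    return results

def round_robin(vectors, k):
    """Hypothetical stand-in classifier: deal documents into k groups in turn."""
    return [i % k for i in range(len(vectors))]

vectors = [[1, 0], [0, 1], [1, 1], [2, 0], [0, 2], [2, 2]]
runs = classify_repeatedly(vectors, round_robin, n_const=2, max_iterations=3)
print(sorted(runs))
```

Every pass is kept, so the relation calculation that follows can compare partial document sets produced at different granularities.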

【０１０６】First, the similarity relation and the inclusion relation between partial document sets are formulated in terms of document data. Each of the partial document sets stored in the document classification result storage unit 408 is assumed to have a unique identification number. The characteristic vector Vm of the m-th partial document set is defined as follows.

【０１０７】
・The number of dimensions of Vm is equal to the total number of document data.
・Each element of Vm corresponds to exactly one document data, with no duplication.
・If the similarity between the document data corresponding to element i and the partial document set is equal to or greater than a threshold, element i is 1.
・If the similarity between the document data corresponding to element i and the partial document set is less than the threshold, element i is 0.

【０１０８】Using the above definition, the relations Rmn and Rnm between the m-th partial document set and the n-th partial document set are defined as follows.

$$R_{mn} = \frac{\langle V_m, V_n \rangle}{\langle V_m, V_m \rangle} \tag{1}$$

$$R_{nm} = \frac{\langle V_m, V_n \rangle}{\langle V_n, V_n \rangle} \tag{2}$$

Here, $\langle \cdot , \cdot \rangle$ denotes the inner product.

【０１０９】The values of Rmn and Rnm make it possible to calculate the similarity relation and the inclusion relation between partial document sets. FIG. 28 shows a geometric interpretation of the values of Rmn and Rnm. That is, when Rmn is close to 1, partial document set m can be said to be included in partial document set n. The closer both Rmn and Rnm are to 1, the more similar partial document set m and partial document set n are. Furthermore, the closer the point (Rmn, Rnm) is to the line Rmn = Rnm, the more nearly equal the proportions in which the two sets include each other's document data.
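As a worked illustration of the characteristic vector Vm and of equations (1) and (2), the sketch below builds two 0/1 vectors from per-document similarity scores and computes Rmn and Rnm. The similarity scores, the threshold of 0.5, and the function names are hypothetical; returning 0.0 for an all-zero vector is an added assumption to avoid division by zero.

```python
def characteristic_vector(similarities, threshold):
    """V_m: one element per document, 1 if its similarity to the set >= threshold."""
    return [1 if s >= threshold else 0 for s in similarities]

def inner(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def relation(vm, vn):
    """Return (Rmn, Rnm) following equations (1) and (2)."""
    shared = inner(vm, vn)
    size_m, size_n = inner(vm, vm), inner(vn, vn)
    r_mn = shared / size_m if size_m else 0.0   # assumption: 0.0 for an empty set
    r_nm = shared / size_n if size_n else 0.0
    return r_mn, r_nm

# Hypothetical similarities of five documents to partial document sets m and n.
vm = characteristic_vector([0.9, 0.8, 0.1, 0.2, 0.05], threshold=0.5)
vn = characteristic_vector([0.95, 0.7, 0.85, 0.6, 0.1], threshold=0.5)
r_mn, r_nm = relation(vm, vn)
print(vm, vn, r_mn, r_nm)
```

Here Rmn is 1.0 and Rnm is 0.5: every document of set m also belongs to set n, but only half of set n belongs to set m, which corresponds to the case where m is included in n.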

【０１１０】Next, the similarity relation and the inclusion relation between partial document sets are formulated in terms of the appearance frequencies of the words constituting the document data. The characteristic vector Wm of the m-th partial document set is defined as follows.

【０１１１】
・The number of dimensions of Wm is equal to the number of unique words in all the document data.
・Each element of Wm corresponds to one unique word, with no duplication.
・The i-th element value of Wm is denoted wm(i).
・The element value of element i is the appearance frequency (number of occurrences) of the word corresponding to element i in all documents whose similarity to the partial document set is equal to or greater than the threshold.
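The definition of Wm can be illustrated with a short sketch that counts word occurrences over the documents whose similarity to the partial document set reaches the threshold. The tokenized documents, the similarity scores, and the name `word_frequency_vector` are hypothetical, and the vocabulary is simply the sorted list of unique words.

```python
from collections import Counter

def word_frequency_vector(docs, similarities, threshold, vocabulary):
    """W_m: occurrences of each vocabulary word in documents with similarity >= threshold."""
    counts = Counter()
    for words, sim in zip(docs, similarities):
        if sim >= threshold:
            counts.update(words)             # add this document's word occurrences
    return [counts[w] for w in vocabulary]   # Counter returns 0 for unseen words

# Hypothetical tokenized documents and their similarities to partial document set m.
docs = [["printer", "toner", "printer"], ["toner", "paper"], ["network", "cable"]]
vocabulary = sorted({w for words in docs for w in words})
wm = word_frequency_vector(docs, [0.9, 0.6, 0.1], threshold=0.5, vocabulary=vocabulary)
print(vocabulary, wm)
```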

【０１１２】Using the above definition, the relations R'mn and R'nm between the m-th partial document set and the n-th partial document set are defined as follows.

$$R'_{mn} = \frac{\sum_k f(w_m(k), w_n(k))}{\sum_k f(w_m(k), w_m(k))} \tag{3}$$

$$R'_{nm} = \frac{\sum_k f(w_m(k), w_n(k))}{\sum_k f(w_n(k), w_n(k))} \tag{4}$$

$$f(w_m(k), w_n(k)) = \begin{cases} 0 & \text{if } w_m(k) \times w_n(k) = 0 \\ w_m(k) \times \left( a + \dfrac{b}{\lvert w_m(k) - w_n(k) \rvert + 1} \right) & \text{if } w_m(k) \times w_n(k) \neq 0 \end{cases} \tag{5}$$

Here, a and b are constants satisfying $a + b = 1$ and $a, b \geq 0$.
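The word-frequency relations can be sketched as follows. Two readings of the printed formula are assumed here and should be treated as assumptions: the difference term in equation (5) is taken to be |wm(k) - wn(k)|, and the 1 is taken to be added inside the denominator, so that f(x, x) = x when a + b = 1. Equation (4) is read as defining R'nm. The numerator follows the text as printed and uses f(wm(k), wn(k)) in both relations. The function names and the sample vectors are hypothetical.

```python
def f(x, y, a=0.5, b=0.5):
    """Weighted co-occurrence term of equation (5), under the assumptions stated above."""
    if x * y == 0:
        return 0.0
    return x * (a + b / (abs(x - y) + 1))

def word_relation(wm, wn, a=0.5, b=0.5):
    """Return (R'mn, R'nm) following equations (3) and (4)."""
    shared = sum(f(x, y, a, b) for x, y in zip(wm, wn))
    norm_m = sum(f(x, x, a, b) for x in wm)
    norm_n = sum(f(y, y, a, b) for y in wn)
    r_mn = shared / norm_m if norm_m else 0.0   # assumption: 0.0 for an empty vector
    r_nm = shared / norm_n if norm_n else 0.0
    return r_mn, r_nm

# Hypothetical word-frequency vectors over a three-word vocabulary.
wm = [2, 0, 1]
wn = [2, 3, 0]
r_mn, r_nm = word_relation(wm, wn)
print(r_mn, r_nm)
```

Only the first word occurs in both sets, so the shared term is f(2, 2) = 2; dividing by the totals 3 and 5 gives R'mn of about 0.67 and R'nm of 0.4.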

【０１１３】The values of R'mn and R'nm can be interpreted in the same way as the relation between Rmn and Rnm shown in FIG. 28, so the similarity relation and the inclusion relation between partial document sets can again be calculated. Furthermore, defining the relation between partial document sets with R'mn and R'nm makes it possible to obtain relations that cannot be obtained at the document data level, and to calculate the relation between partial document sets even when, for example, the contents match but the document data being analyzed differ. The partial document set relation storage unit 411 stores the relation information between partial document sets generated by the partial document set relation calculation unit 410 in an appropriate format. The output unit 412 outputs the relation information stored in the partial document set relation storage unit 411 to output means as appropriate, in response to a user request or according to predetermined conditions.

【０１１４】[0114]

[Effects of the Invention] According to the inventions of claims 1, 14 and 27, a representative word set is extracted from each generated partial document set, related words are obtained for each of those representative words, and information on each partial document set and relation information between partial document sets are generated from this information, so that information effective for analyzing the partial document sets can be provided.

【０１１５】According to the inventions of claims 2, 15 and 28, using a combination of at least one of synonyms, near-synonyms, and antonyms as the related words makes it possible to provide information mainly concerning similarity.

【０１１６】According to the inventions of claims 3, 4, 16, 17, 29 and 30, antonyms are used as the related words of the representative word set of each partial document set. When an antonym matches none of the representative word sets of any partial document set, including the set it was derived from, the documents containing that antonym are extracted from the document set and treated as a new partial document set, so that more analysis information can be extracted from the document set.
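The antonym-based generation of new partial document sets can be sketched as below. This is a single-pass illustration under stated assumptions, not the recursive procedure of the claims: the antonym dictionary, the sample documents, and the name `antonym_subsets` are hypothetical, and each unmatched antonym is handled as a single word rather than as a set.

```python
def antonym_subsets(documents, representative_sets, antonyms):
    """Build new partial document sets from antonyms that match no representative word.

    documents: list of word lists; representative_sets: list of sets of words;
    antonyms: dict mapping a word to a set of its antonyms (hypothetical dictionary).
    Returns a dict mapping each unmatched antonym to the indices of documents containing it.
    """
    all_representatives = set().union(*representative_sets)
    new_sets = {}
    for rep_set in representative_sets:
        for word in sorted(rep_set):
            for ant in sorted(antonyms.get(word, ())):
                if ant in all_representatives or ant in new_sets:
                    continue                  # already covered by some partial document set
                members = [i for i, doc in enumerate(documents) if ant in doc]
                if members:                   # only keep antonyms that occur in some document
                    new_sets[ant] = members
    return new_sets

documents = [["fast", "printer"], ["fast", "copier"], ["slow", "printer"], ["quiet", "copier"]]
representative_sets = [{"fast", "printer"}, {"copier", "quiet"}]
antonyms = {"fast": {"slow"}, "quiet": {"noisy"}}
new_sets = antonym_subsets(documents, representative_sets, antonyms)
print(new_sets)
```

In this example "slow" is an antonym of the representative word "fast" and is not itself a representative word, so the one document containing it forms a new partial document set; "noisy" appears in no document and is skipped.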

【０１１７】According to the inventions of claims 5, 18 and 31, in a document classification device that applies morphological analysis to documents to be classified and classifies them into several document sets based on the analysis results, a word having a designated part of speech among the words obtained by the morphological analysis is replaced with a word formed by appropriately combining it with the words before and after it, and its part of speech is likewise replaced with an appropriate one. This produces a high-quality document vector space, and classifying documents in this space by statistical processing yields high-quality document classification results.

【０１１８】According to the inventions of claims 6, 19 and 32, using a clustering method as the statistical method for document classification makes it possible to obtain high-quality document classification results easily.

【０１１９】According to the inventions of claims 7, 20 and 33, among the words extracted by applying morphological analysis to the documents to be classified, words whose part of speech is a prefix, a suffix, a classifier (counter word), or a similar part of speech are given appropriate combining processing, so that a high-quality document vector space can be obtained.

【０１２０】According to the inventions of claims 8, 21 and 34, in the word combining processing, a new word is generated by continuing to combine words until a word of a specific part of speech appears, so that a high-quality document vector space can be obtained.

【０１２１】According to the inventions of claims 9, 22 and 35, in the word combining processing, for a word whose part of speech is a numeral suffix or a classifier, the plural words to be combined are deleted and their information is not used when the document vector space is generated, so that a high-quality document vector space can be obtained.

【０１２２】According to the inventions of claims 10, 23 and 36, a vector space model of the documents is used and document classification is repeated with the number of partial document sets to be generated as a parameter, which generates a large number of partial document sets. The mutual relations among these sets are then calculated, so that a document classification device can be provided that generates information helpful for understanding the structure of the document set.

【０１２３】According to the inventions of claims 11, 24 and 37, in addition to the above, using a non-hierarchical clustering method as the statistical method for document classification makes it possible to generate a large number of partial document sets easily.

【０１２４】According to the inventions of claims 12, 25 and 38, in addition to the above, calculating the similarity relation and the inclusion relation as the mutual relations among the many generated document sets makes it possible to provide information from which the structure of the document set can be grasped easily.

【０１２５】According to the inventions of claims 13, 26 and 39, in addition to the above, the mutual relations are calculated using only the word-related information among the information held by the many generated document sets, so that relation information with high versatility and reusability can be calculated.

[Brief description of the drawings]

FIG. 1 is a block diagram of a document classification device for explaining an embodiment corresponding to the inventions of claims 1, 2, 14, 15, 27 and 28 of the present invention.

FIG. 2 is a flowchart showing an example of processing for inputting document data.

FIG. 3 is a flowchart showing an example of processing for applying morphological analysis to a document.

FIG. 4 is a diagram for explaining an application example of morphological analysis.

FIG. 5 is a diagram for explaining an example of the result of applying morphological analysis.

FIG. 6 is a flowchart showing an example of processing for classifying a document group based on the information generated by the document analysis unit.

FIG. 7 is a diagram for explaining the classification of documents into partial document sets.

FIG. 8 is a flowchart showing an example of representative word extraction processing.

FIG. 9 is a diagram showing an example of a representative word set obtained according to the flowchart shown in FIG. 8.

FIG. 10 is a flowchart showing an example of processing for extracting related words, using a related word dictionary, for each representative word set of each partial document set.

FIG. 11 is a diagram showing an example of the synonyms of each representative word obtained by the representative word extraction unit and the related word set of each partial document set.

FIG. 12 is a flowchart showing an example of processing for generating information on individual partial document sets and relation information between partial document sets from the extracted or generated representative word sets and related word sets.

FIG. 13 is a block diagram of a document classification device for explaining an embodiment corresponding to claims 3, 4, 16, 17, 29 and 30 of the present invention.

FIG. 14 is a flowchart showing an example of processing for extracting documents containing an antonym to form a new partial document set.

FIG. 15 is a block diagram of a document classification device for explaining an embodiment corresponding to claims 5 to 9, 18 to 22 and 31 to 35 of the present invention.

FIG. 16 is a diagram showing an example of document data to be classified.

FIG. 17 is a diagram showing an example in which words and parts of speech are extracted by applying morphological analysis to the document data shown in FIG. 16.

FIG. 18 is a diagram showing an example of an analysis result of document data.

FIG. 19 is a diagram showing another example of an analysis result of document data.

FIG. 20 is a diagram showing an example in which document data are represented as a matrix by arranging them in the row direction.

FIG. 21 is a diagram showing another example in which document data are represented as a matrix by arranging them in the row direction.

FIG. 22 is a diagram showing the result of applying the Ward method to the document data of FIG. 20.

FIG. 23 is a diagram showing the result of applying the Ward method to the document data of FIG. 21.

FIG. 24 is a block diagram of a document classification device for explaining an embodiment corresponding to claims 10 to 13, 23 to 26 and 36 to 39 of the present invention.

FIG. 25 is a diagram showing an example of pseudo code for expressing the identification numbers of the words of document data and their appearance frequencies.

FIG. 26 is a diagram showing an example of pseudo code for generating a vector representation of each document data.

FIG. 27 is a diagram showing an example of pseudo code for classifying document vectors using a clustering method.

FIG. 28 is a diagram showing a geometric interpretation of the values of Rmn and Rnm.

[Explanation of symbols]

100, 200, 300, 400: document classification device; 101, 301, 401: document input unit; 102, 302, 402: document analysis unit; 103, 304, 407: document classification unit; 104: representative word extraction unit; 105: related word extraction unit; 106: partial document set information generation unit; 107: classification result storage unit; 108, 305, 412: output unit; 201: antonym partial document set generation unit; 303, 404: document vector space generation unit; 403: document analysis result storage unit; 405: document vector data storage unit; 406: classification number determination unit; 408: document classification result storage unit; 409: repetition determination unit; 410: partial document set relation calculation unit; 411: partial document set relation storage unit.

Claims

[Claims]

1. A document classification apparatus for classifying a set of documents according to their contents, comprising: a document input unit for inputting a plurality of documents; a document analysis unit for extracting, from each of the documents input by the document input unit, information on the words constituting the document; a document classification unit for classifying the document set of the plurality of documents into several partial document sets based on the word information of each document extracted by the document analysis unit; a representative word extraction unit that extracts a representative word set from each partial document set classified by the document classification unit; a related word extraction unit that, using a related word dictionary describing related words for arbitrary words, extracts a related word set for each representative word set of each partial document set extracted by the representative word extraction unit; a partial document set information generation unit that generates information on individual partial document sets and relation information between the partial document sets based on the related word set extracted by the related word extraction unit, the representative word set extracted by the representative word extraction unit, and information on the documents belonging to each partial document set; and a classification result storage unit that stores the classification result of the document classification unit together with the information generated by the partial document set information generation unit.

2. The document classification device according to claim 1, wherein the related word set extracted by the related word extraction unit is a combination of at least one of synonyms, near-synonyms, and antonyms.

3. The document classification apparatus according to claim 1, wherein the related word set extracted by the related word extracting unit includes at least an opposite word, and an opposite word extracted from a representative word set of a certain partial document set. If the set does not match the representative word set of any other sub-document set including itself, the process of extracting a document including the inconsistent opposite word set from the document set and generating a new sub-document set is performed for all sub-documents. A document classification apparatus, further comprising a reciprocal partial document set generation unit that recursively repeats a set.

4. The document classification apparatus according to claim 1, wherein the related words extracted by the related word extracting unit include at least an opposite word, and an opposite word set extracted from a representative word set of a certain partial document set. If the word set does not match the representative word set of any other sub-document set including itself, the document set including the word set obtained by removing the non-matching opposite word set and the representative word corresponding to the opposite word set from the representative word set. A document classification apparatus further comprising: a reciprocal partial document set generation unit that recursively repeats a process of generating a new partial document set by extracting the partial document sets from all the partial document sets.

5. A document classification apparatus for classifying documents according to the contents of the documents, comprising: a document input unit for inputting document data; a document analysis unit that analyzes the document data input by the document input unit and extracts analysis information including words and their parts of speech; a document vector space generation unit that generates, from the analysis information of the document data extracted by the document analysis unit, a document vector space for expressing the document data in a multidimensional vector space; and a document classification unit that classifies the document data by using a statistical method in the document vector space generated by the document vector space generation unit, wherein a word having a specific part of speech among the words extracted by the document analysis unit is replaced, based on its part-of-speech information, with a word generated by combining it with one or more words extracted before and after it, and its part-of-speech information is also replaced as appropriate.

6. The document classification device according to claim 5, wherein the document classification unit classifies the document data by using a clustering method as a statistical method.

7. The document classification device according to claim 5, wherein the document analysis unit replaces a word and a part of speech with respect to a word whose part of speech is a prefix, a suffix, a classifier, or a part of speech similar thereto. Document classification apparatus characterized by the above-mentioned.

8. The document classification device according to claim 5, wherein the document analysis unit continues combining words until a word of a specific part of speech appears in the document analysis unit.

9. The document classification device according to claim 5, wherein, for a word whose part of speech is a numeral suffix or a classifier, the document analysis unit deletes the plural words that are combined with that word, and the information of the deleted words is not used in the document classification unit.

10. A document classification device for classifying a document data set according to the contents of the documents, comprising: a document input unit for inputting a document data set; a document analysis unit that applies morphological analysis to all the document data and extracts the words constituting the document data together with their part-of-speech information; a document analysis result storage unit that stores the analysis results of the document data extracted by the document analysis unit; a document vector space generation unit that generates, from the analysis information of the document data extracted by the document analysis unit, a vector space for expressing the document data in a multidimensional vector space; a document vector data storage unit that stores the vector data of each document data in the document vector space generated by the document vector space generation unit; a classification number determination unit that determines the classification number of the document data set from designated conditions; a document classification unit that classifies the document data into partial document sets of the designated classification number by using a statistical method in the document vector space generated by the document vector space generation unit; a classification result storage unit that stores the classification results generated by the document classification unit; a repetition determination unit that determines whether to repeat the processing from the classification number determination unit to the classification result storage unit; a partial document set relation calculation unit that calculates relation information between all the generated partial document sets using the information stored in the document vector data storage unit and the classification result storage unit; and a partial document set relation storage unit that stores the relation information between the partial document sets generated by the partial document set relation calculation unit.

11. The document classification device according to claim 10, wherein the statistical method used in the document classification unit is a non-hierarchical clustering method.

12. The document classification device according to claim 10, wherein the relationship calculated by the partial document set relationship calculation unit is a similarity relationship and an inclusive relationship.

13. The document classification device according to claim 12, wherein the relationship between the partial document sets is calculated using only word information extracted from each partial document set.

14. A document classification method for classifying a set of documents according to their contents, comprising: a document input step of inputting a plurality of documents; a document analysis step of extracting, from each of the documents input in the document input step, information on the words constituting the document; a document classification step of classifying the document set of the plurality of documents into several partial document sets based on the word information of each document extracted in the document analysis step; a representative word extraction step of extracting a representative word set from each partial document set classified in the document classification step; a related word extraction step of extracting, using a related word dictionary that describes related words for arbitrary words, a related word set for each representative word set of each partial document set extracted in the representative word extraction step; a partial document set information generation step of generating information on individual partial document sets and relation information between the partial document sets based on the related word set extracted in the related word extraction step, the representative word set extracted in the representative word extraction step, and information on the documents belonging to each partial document set; and a classification result storage step of storing the classification result of the document classification step together with the information generated in the partial document set information generation step.

15. The document classification method according to claim 14, wherein the related word set extracted in the related word extraction step is a combination of at least one of synonyms, near-synonyms, and antonyms.

16. The document classification method according to claim 14, wherein the related word set extracted in the related word extracting step includes at least an opposite word, and an opposite word extracted from a representative word set of a certain partial document set. If the set does not match the representative word set of any other sub-document set including itself, the process of extracting a document including the inconsistent opposite word set from the document set and generating a new sub-document set is performed for all sub-documents. A document classification method, further comprising a step of generating a reciprocal partial document set recursively repeating the set.

17. The document classification method according to claim 14, wherein the related words extracted in the related word extracting step include at least an opposite word, and an opposite word set extracted from a representative word set of a certain partial document set. If the word set does not match the representative word set of any other sub-document set including itself, the document set including the word set obtained by removing the non-matching opposite word set and the representative word corresponding to the opposite word set from the representative word set. And generating a new partial document set, and recursively repeating the process of generating a new partial document set for all the partial document sets.

18. A document classification method for classifying documents according to the contents of the documents, comprising: a document input step of inputting document data; a document analysis step of analyzing the document data input in the document input step and extracting analysis information including words and their parts of speech; a document vector space generation step of generating, from the analysis information of the document data extracted in the document analysis step, a document vector space for expressing the document data in a multidimensional vector space; and a document classification step of classifying the document data by using a statistical method in the document vector space generated in the document vector space generation step, wherein a word having a specific part of speech among the words extracted in the document analysis step is replaced, based on its part-of-speech information, with a word generated by combining it with one or more words extracted before and after it, and its part-of-speech information is also replaced as appropriate.

19. The document classification method according to claim 18, wherein in the document classification step, the document data is classified by using a clustering method as a statistical method.

20. The document classification method according to claim 18 or 19, wherein in the document analyzing step, the words and the parts of speech are replaced for words whose parts of speech are prefixes, suffixes, classifiers, and similar parts of speech. A document classification method, characterized in that:

21. The document classification method according to claim 18, wherein the combining of the words is continued until a word of a specific part of speech appears in the document analysis step.

22. The document classification method according to claim 18, wherein, in the document analysis step, for a word whose part of speech is a numeral suffix or a classifier, the plurality of words combined with that numeral suffix or classifier are deleted, and the information of the deleted words is not used in the document classification step.
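
A minimal Python sketch of the deletion in claim 22 follows. It assumes that "the words combined with" a numeral suffix or classifier are the numerals immediately preceding it. The tag names and the example tokens are hypothetical.

```python
def drop_counted_expressions(tokens):
    """Sketch: remove numeral + classifier sequences before vectorization.

    tokens: list of (surface, part_of_speech) pairs
    """
    kept = []
    for surface, pos in tokens:
        if pos in ("numeral_suffix", "classifier"):
            # Discard the numerals that this classifier would have been combined with.
            while kept and kept[-1][1] == "numeral":
                kept.pop()
            continue  # the classifier itself is discarded as well
        kept.append((surface, pos))
    return kept


tokens = [
    ("3", "numeral"),
    ("dai", "classifier"),
    ("no", "particle"),
    ("printer", "noun"),
]
filtered = drop_counted_expressions(tokens)
```

Expressions such as quantities then contribute nothing to the document vectors, which is one way to read the requirement that the deleted words are not used in the document classification step.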

23. A document classification method for classifying a document data set according to the contents of the documents, comprising: a document input step of inputting the document data set; a document analysis step of applying morphological analysis to all the document data to extract the words constituting the document data together with their part-of-speech information; a document analysis result storage step of storing the analysis results of the document data extracted in the document analysis step; a document vector space generating step of generating, from the analysis information of the document data extracted in the document analysis step, a vector space for expressing the document data in a multidimensional vector space; a document vector data storing step of storing vector data of each document data in the document vector space generated in the document vector space generating step; a classification number determining step of determining a classification number of the document data set from designated conditions; a document classification step of classifying the document data into partial document sets of the specified classification number by using a statistical method in the document vector space generated in the document vector space generating step; a classification result storing step of storing the classification result generated in the document classification step; a repetition determining step of determining whether to repeat the processing from the classification number determining step to the classification result storing step; a partial document set relation calculation step of calculating relation information between all the generated partial document sets by using the information stored in the document vector data storing step and the classification result storing step; and a partial document set relation storing step of storing the relation information between the partial document sets generated in the partial document set relation calculation step.

24. The document classification method according to claim 23, wherein the statistical method used in the document classification step is a non-hierarchical clustering method.
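
As a rough illustration of classification in a document vector space with a non-hierarchical clustering method, the sketch below uses plain k-means on term-frequency vectors and repeats it for several classification numbers. The choice of k-means, the term-frequency weighting, the deterministic seeding, and all function names are assumptions made for this example. The claims only require a statistical, non-hierarchical clustering method.

```python
import math
from collections import Counter


def build_vectors(tokenized_docs):
    """Turn tokenized documents into term-frequency vectors over a shared vocabulary."""
    vocab = sorted({w for doc in tokenized_docs for w in doc})
    vectors = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        vectors.append([float(counts[w]) for w in vocab])
    return vocab, vectors


def kmeans(vectors, k, max_iter=20):
    """Plain k-means; returns one cluster label per document."""
    # Assumption: seed the centroids with the first k vectors so the run is deterministic.
    centroids = [list(v) for v in vectors[:k]]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: math.dist(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:  # keep the old centroid if a cluster became empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def classify_for_each_k(vectors, classification_numbers):
    """Repeat the classification for every requested classification number."""
    return {k: kmeans(vectors, k) for k in classification_numbers}


docs = [
    ["printer", "toner", "printer"],
    ["stock", "price"],
    ["printer", "paper"],
    ["price", "market", "stock"],
]
vocab, vectors = build_vectors(docs)
results = classify_for_each_k(vectors, [1, 2])
```

Keeping the label lists for every classification number mirrors the repeated storing of classification results, so that relations between partial document sets obtained with different classification numbers can be computed afterwards.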

25. The document classification method according to claim 23, wherein the relations calculated in the partial document set relation calculation step are a similarity relation and an inclusion relation.

26. The document classification method according to claim 25, wherein the relationship between the partial document sets is calculated using only word information extracted from each partial document set.
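
One possible way to compute a similarity relation and an inclusion relation from word information alone is sketched below in Python. The Jaccard coefficient for similarity and the overlap ratio for inclusion are assumptions chosen for illustration. The claims do not specify the formulas, and the function and key names are hypothetical.

```python
def relation_info(words_a, words_b):
    """Sketch: relations between two partial document sets from their word sets only."""
    a, b = set(words_a), set(words_b)
    common = a & b
    union = a | b
    return {
        # Similarity: shared words relative to all words (Jaccard coefficient).
        "similarity": len(common) / len(union) if union else 0.0,
        # Inclusion: how much of one word set is covered by the other.
        "a_in_b": len(common) / len(a) if a else 0.0,
        "b_in_a": len(common) / len(b) if b else 0.0,
    }


def all_relations(word_sets):
    """Compute relation information for every pair of partial document sets."""
    names = sorted(word_sets)
    return {
        (x, y): relation_info(word_sets[x], word_sets[y])
        for i, x in enumerate(names)
        for y in names[i + 1:]
    }


word_sets = {
    "A": {"printer", "toner", "paper"},
    "B": {"printer", "toner"},
    "C": {"stock", "price"},
}
relations = all_relations(word_sets)
```

In this toy data, every word of B also occurs in A, so the inclusion value "b_in_a" for the pair (A, B) is 1.0, while C shares no words with either set.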

27. A computer-readable recording medium on which a program for executing a document classification method for classifying a document set according to its contents is recorded, the method comprising: a document inputting step of inputting a plurality of documents; a document analysis step of extracting, from each of the documents input in the document inputting step, word information constituting that document; a document classification step of classifying the document set of the plurality of documents into several partial document sets based on the word information of each document extracted in the document analysis step; a representative word extraction step of extracting a representative word set from each of the partial document sets classified in the document classification step; a related word extraction step of extracting a related word set for each representative word set of each partial document set extracted in the representative word extraction step, by using a related word dictionary in which related words are described; a partial document set information generation step of generating information on each partial document set and relation information between the partial document sets, based on the related word set extracted in the related word extraction step, the representative word set extracted in the representative word extraction step, and information on the documents belonging to each partial document set; and a classification result storing step of storing the classification result of the document classification step together with the information generated in the partial document set information generation step.

28. A computer-readable recording medium recording a program for executing the document classification method according to claim 27, wherein the related word set extracted in the related word extracting step is a combination of at least one of synonyms, near-synonyms, and opposite words.
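
The representative word extraction and the dictionary-based related word extraction can be sketched in Python as follows. Choosing representative words by document frequency, the nested dictionary layout, and the kind labels "synonym", "near_synonym", and "opposite" are assumptions for this example and are not taken from the claims.

```python
from collections import Counter


def representative_words(partial_docs, top_n=2):
    """Sketch: pick the words that occur in the most documents of a partial document set."""
    counts = Counter(w for doc in partial_docs for w in set(doc))
    # Sort by descending document frequency, then alphabetically for a stable result.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [w for w, _ in ranked[:top_n]]


def related_words(rep_words, dictionary, kinds=("synonym", "near_synonym", "opposite")):
    """Look up the related words of each representative word in a related word dictionary."""
    result = {}
    for w in rep_words:
        entry = dictionary.get(w, {})
        result[w] = {kind: list(entry[kind]) for kind in kinds if kind in entry}
    return result


partial_docs = [
    ["price", "rise", "stock"],
    ["price", "rise"],
    ["price", "market"],
]
# Hypothetical related word dictionary: word -> kind of relation -> related words.
dictionary = {"rise": {"synonym": ["increase"], "opposite": ["fall"]}}

reps = representative_words(partial_docs)
related = related_words(reps, dictionary)
```

Restricting `kinds` to a subset, for example only `("opposite",)`, corresponds to using just one of the related word types.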

29. A computer-readable recording medium storing a program for executing the document classification method according to claim 27, wherein the related word set extracted in the related word extracting step includes at least opposite words, the document classification method further including an opposite word partial document set generation step in which, if the opposite word set extracted from the representative word set of a certain partial document set does not match the representative word set of any partial document set, including its own, documents including the non-matching opposite word set are extracted from the document set to generate a new partial document set, and this process of generating a new partial document set is recursively repeated for all partial document sets.

30. A computer-readable recording medium recording a program for executing the document classification method according to claim 27, wherein the related words extracted in the related word extracting step include at least opposite words, the document classification method further including an opposite word partial document set generation step in which, if the opposite word set extracted from the representative word set of a certain partial document set does not match the representative word set of any partial document set, including its own, documents that include both the non-matching opposite word set and the word set obtained by removing from the representative word set the representative words corresponding to that opposite word set are extracted from the document set to generate a new partial document set, and this process of generating a new partial document set is recursively repeated for all the partial document sets.

31. A computer-readable recording medium on which a program for executing a document classification method for classifying documents according to their contents is recorded, the method comprising: a document input step of inputting document data; a document analysis step of applying morphological analysis to the document data to extract the words constituting the document data together with their part-of-speech information and the like; a document vector space generating step of generating, from the analysis information of the document data extracted in the document analysis step, a document vector space for expressing the document data in a multidimensional vector space; and a document classification step of classifying the document data by using a statistical method in the document vector space generated in the document vector space generating step, wherein a word having a specific part of speech extracted in the document analysis step is replaced, based on the part-of-speech information of the specific part of speech, with a word generated by combining it with one or more words extracted before and after it, and its part-of-speech information is replaced appropriately.

32. A computer-readable recording medium on which a program for executing the document classification method according to claim 31 is recorded, wherein, in the document classification step, the document data is classified by using a clustering method as a statistical method.

33. A computer-readable recording medium recording a program for executing the document classification method according to claim 31 or 32, wherein, in the document analysis step, the replacement of words and parts of speech is performed for words whose parts of speech are prefixes, suffixes, classifiers, and parts of speech similar to these.

34. A computer-readable recording medium on which a program for executing the document classification method according to claim 31 is recorded, wherein, in the document analysis step, the combining of words is continued until a word of a specific part of speech appears.

35. A computer-readable recording medium on which a program for executing the document classification method according to claim 31 is recorded, wherein, in the document analysis step, for a word whose part of speech is a numeral suffix or a classifier, the plurality of words combined with that numeral suffix or classifier are deleted, and the information of the deleted words is not used in the document classification step.

36. A computer-readable recording medium recording a program for executing a document classification method for classifying a document data set according to the contents of the documents, the method comprising: a document input step of inputting a document data set; a document analysis step of applying morphological analysis to all the document data to extract the words constituting the document data together with their part-of-speech information and the like; a document vector space generating step of generating, from the analysis information of the document data extracted in the document analysis step, a vector space for expressing the document data in a multidimensional vector space; a document vector data storing step of storing vector data of each document data in the document vector space generated in the document vector space generating step; a classification number determination step of determining a classification number of the document data set; a document classification step of classifying the document data into partial document sets of the specified classification number by using a statistical method in the document vector space generated in the document vector space generating step; a classification result storage step of storing the classification result generated in the document classification step; a repetition determining step of determining whether to repeat the processing from the classification number determination step to the classification result storage step; a partial document set relation calculation step of calculating relation information between all the generated partial document sets by using the information stored in the document vector data storing step and the classification result storage step; and a partial document set relation storage step of storing the relation information between the partial document sets generated in the partial document set relation calculation step.

37. A computer-readable recording medium having recorded thereon a program for executing the document classification method according to claim 36, wherein the statistical method used in the document classification step is a non-hierarchical clustering method.

38. A computer-readable recording medium on which a program for executing the document classification method according to claim 36 or 37 is recorded, wherein the relations calculated in the partial document set relation calculation step are a similarity relation and an inclusion relation.

39. A computer-readable recording medium recording a program for executing the document classification method according to claim 38, wherein the relation between the partial document sets is calculated using only word information extracted from each partial document set.