JP6173848B2

JP6173848B2 - Document classification device

Info

Publication number: JP6173848B2
Application number: JP2013188860A
Authority: JP
Inventors: 秀樹岩崎; 後藤　和之; 和之後藤; 博司平; 泰成宮部; 松本　茂; 茂松本
Original assignee: Toshiba Corp; Toshiba Digital Solutions Corp
Current assignee: Toshiba Corp; Toshiba Digital Solutions Corp
Priority date: 2013-09-11
Filing date: 2013-09-11
Publication date: 2017-08-02
Anticipated expiration: 2033-09-11
Also published as: JP2015056020A

Description

Embodiments of the present invention relate to a document classification device.

In recent years, as computers have become more powerful, storage media larger, and computer networks more widespread, it has become possible to store, manage, and use large volumes of electronic documents in computer systems. Documents here include, for example, business documents such as forms, proposals, design documents, and meeting minutes, as well as manuals, patents, technical literature, laws, regulations, news articles, e-mails, web pages, and books. If large numbers of documents are simply stored unorganized in a computer's file system or database, it becomes impossible to tell what information exists where, and the information can no longer be put to use.

To address this problem, documents are classified and organized according to their content and use so that information can be used effectively and shared more readily. Documents with similar content are also grouped together in order to analyze and investigate material such as daily and weekly reports that are created and accumulated every day, inquiries sent by customers, defect information on products, and intellectual property such as patents and technical literature, so as to grasp trends in the content or gain new insights. Because classifying documents by hand is labor-intensive, techniques for classifying documents automatically have long been developed.

One example of document classification technology is clustering, which automatically generates groups of documents that share some commonality or similarity (that is, clusters) from the set of document data to be classified, without the user preparing classification rules or categories in advance. There are two representative clustering methods.
In the first method, the features of each document are represented by a feature vector, and the similarity between documents is defined as the similarity between their feature vectors (for example, the inner product or cosine). Clusters are then generated as groups of documents whose similarity is high (see, for example, Patent Document 1). This method is called document clustering. Document features are often expressed using the frequencies of the words that appear in the document text, in which case the feature vector is called a word vector.
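As an illustration of the similarity measure used in document clustering, the following minimal sketch builds word vectors from term frequencies and compares them by cosine. It is not taken from the patent; the function names, the whitespace tokenization, and the sample texts are assumptions made for the example.

```python
import math
from collections import Counter


def word_vector(text: str) -> Counter:
    """Count word occurrences. Whitespace tokenization is a simplification."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse word vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical sample documents.
doc1 = word_vector("battery charge control circuit")
doc2 = word_vector("battery charge monitoring circuit")
doc3 = word_vector("image compression method")

print(cosine_similarity(doc1, doc2))  # three of four words are shared
print(cosine_similarity(doc1, doc3))  # no words are shared
```

Documents whose vectors score close to 1 would fall into the same cluster; the clustering procedure itself is left out of this sketch.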

The second method focuses on the words that appear in documents. Based on the number of documents in which each word appears (the appearance frequency) and the number of documents in which multiple words appear together (the co-occurrence frequency) within the set of document data to be classified, this method first extracts important words that represent the document content well, relationships between words, or groups of words with similar appearance tendencies. Groups of documents are then generated automatically by associating each word with the documents in which it appears (see, for example, Patent Document 2). This method is called word clustering.
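The two counts on which word clustering relies can be sketched as follows. This is an illustrative example only, assuming whitespace-tokenized text; the function name and sample documents are hypothetical.

```python
from collections import Counter
from itertools import combinations


def document_and_cooccurrence_frequency(docs):
    """Return (df, co): df[w] is the number of documents containing w,
    and co[(w1, w2)] is the number of documents containing both words."""
    df = Counter()
    co = Counter()
    for text in docs:
        # Count each word once per document; sorting fixes the pair order.
        words = sorted(set(text.lower().split()))
        df.update(words)
        co.update(combinations(words, 2))
    return df, co


# Hypothetical sample documents.
docs = [
    "battery charge circuit",
    "battery charge control",
    "image compression circuit",
]
df, co = document_and_cooccurrence_frequency(docs)
print(df["battery"], co[("battery", "charge")], co[("circuit", "image")])
```

Pair keys are stored in alphabetical order, so a lookup must use the same ordering.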

A two-axis map (also called a cross tabulation) is an analysis method that uses a classification structure to grasp trends in content or gain new insights. In this method, two classification axes are selected, the intersection of documents (that is, the set of document data classified into both categories) is obtained for each pair of categories taken from the two axes, and the document counts are displayed as a matrix. This gives an overview of the whole document data set and yields insights into, for example, correlations between categories. Prior art on two-axis maps includes a technique that performs clustering separately on the content of each item in a document (for a patent document, items such as the abstract and the claims) and builds a two-axis map from the clustering results (see, for example, Patent Document 3), and a technique that lets the user select arbitrary parts of classification hierarchies created from different viewpoints or by different classification methods and build a two-axis map from them (see, for example, Patent Document 4).
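The cell counts of a two-axis map can be sketched as set intersections. The example below is a minimal illustration, assuming each category is already available as a set of document IDs; the category names and IDs are hypothetical.

```python
def two_axis_map(row_categories, column_categories):
    """Each argument maps a category name to the set of document IDs
    classified into it. Each cell holds the size of the intersection."""
    return {
        (row, col): len(row_docs & col_docs)
        for row, row_docs in row_categories.items()
        for col, col_docs in column_categories.items()
    }


# Hypothetical classification of five documents on two axes.
by_content = {"batteries": {1, 2, 3}, "imaging": {4, 5}}
by_year = {"2004": {1, 4}, "2005": {2, 3, 5}}

cells = two_axis_map(by_content, by_year)
for (row, col), count in sorted(cells.items()):
    print(row, col, count)
```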

Patent Document 1: JP 7-36897 A
Patent Document 2: JP 2000-231560 A
Patent Document 3: JP 2007-108867 A
Patent Document 4: JP 2004-86350 A

In word clustering and document clustering such as those of Patent Documents 1 and 2, a classification structure is generated for a document set based on word significance or content similarity, but the classification axis to be used in the two-axis map is not taken into account. Consequently, even if a two-axis map is built from clustering results as in Patent Document 3, the resulting classification does not necessarily suit the user's purpose. For example, a user who wants to view a two-axis map of a set of patent documents with "by filing year" on the horizontal axis presumably wants to grasp filing trends over time and analyze the direction of the technology. A user who puts "by applicant" on the horizontal axis presumably wants to analyze technical tendencies such as each company's strengths and weaknesses. Meeting these needs requires clustering that takes the classification axis of the two-axis map into account in each case, which the prior art cannot do.
Furthermore, Patent Documents 3 and 4 perform one-directional processing in which the two-axis map is displayed using the clustering result and the classification structure at that time (although Patent Document 4 does allow the classification axes to be narrowed down). Thus, although there is a need to generate, on the two-axis map itself, a classification structure that matches the purpose of the analysis and the user's understanding, the prior art cannot do so.

The problem to be solved by the present invention is to provide a document classification device that generates a classification structure suited to the user's viewpoint and enables classification and analysis that fit the user's purpose.

A document classification device according to an embodiment includes a document storage unit, a category storage unit, a category operation unit, a feature degree calculation unit, a trend vector generation unit, a clustering unit, a category generation processing unit, and a two-axis map display unit. The document storage unit stores document data. The category storage unit stores a hierarchical structure of categories and the classification rules used to classify document data into the categories. The category operation unit receives input of a category to serve as the classification viewpoint and of a target category, which is the category to be classified, and reads from the category storage unit, as an axis category set, the set of axis categories that are subordinate to the viewpoint category. The feature degree calculation unit takes as the target document data set the set of document data, among the document data stored in the document storage unit, that satisfies the classification rule of the target category, and calculates the feature degree of the words contained in the target document data set.
The trend vector generation unit selects words that represent the features of the target document data set based on the feature degrees calculated by the feature degree calculation unit. For each selected word, it calculates, for each axis category in the axis category set, a statistic based on the appearance frequency of the word in the target document data that satisfies the classification rule of that axis category, and generates a trend vector in which each statistic is set as the value of the element corresponding to that axis category. The clustering unit clusters the words based on the similarity of the trend vectors generated by the trend vector generation unit. For each cluster obtained by the clustering, the category generation processing unit generates a feature word category that has the target category as its upper category and has a classification rule using the words belonging to the cluster as filter words, and registers it in the category storage unit. The two-axis map display unit displays a two-axis map whose first-axis classification items are the categories in the axis category set and whose second-axis classification items are the feature word categories generated by the category generation processing unit. In each cell of the map, it displays information indicating the number of document data items, among the document data stored in the document storage unit, that satisfy both the classification rule of the axis category corresponding to the cell and the classification rule of the feature word category corresponding to the cell.

FIG. 1 is a block diagram showing the configuration of the document classification device according to the embodiment.
FIG. 2 is a diagram showing an example of document data stored in the document storage unit according to the embodiment.
FIG. 3 is a diagram showing an example of category data stored in the category storage unit according to the embodiment.
FIG. 4 is a diagram showing an example of feature word category data stored in the category storage unit according to the embodiment.
FIG. 5 is a diagram showing an example of feature degree data stored in the feature degree data storage unit according to the embodiment.
FIG. 6 is a diagram showing an example of focus word list data held internally by the focus word setting unit according to the embodiment.
FIG. 7 is a flowchart showing the flow of the document classification process according to the embodiment.
FIG. 8 is a flowchart showing the flow of the process of displaying the initial two-axis map according to the embodiment.
FIG. 9 is a flowchart showing the flow of the process for category operations on the two-axis map according to the embodiment.
FIG. 10 is a flowchart showing the flow of the process of calculating feature degrees according to the embodiment.
FIG. 11 is a flowchart showing the flow of the process of setting focus words according to the embodiment.
FIG. 12 is a flowchart showing the flow of the process of obtaining corrected feature degrees according to the embodiment.
FIG. 13 is a flowchart showing the flow of the process of acquiring feature degree data according to the embodiment.
FIG. 14 is a flowchart showing the flow of the process of obtaining trend vectors according to the embodiment.
FIG. 15 is a flowchart showing the flow of the feature word clustering process according to the embodiment.
FIG. 16 is a flowchart showing the flow of the process of generating feature word categories according to the embodiment.
FIG. 17 is a flowchart showing the flow of the process of displaying the two-axis map according to the embodiment.
FIG. 18 is a flowchart showing the flow of the process of adding or deleting filter words of a feature word category according to the embodiment.
FIG. 19 is a diagram showing an example of the category structure stored as the initial state of the category storage unit according to the embodiment.
FIG. 20 is a diagram showing a display example of the execution screen for two-axis map display according to the embodiment.
FIG. 21 is a diagram showing a display example of the initial table of the two-axis map according to the embodiment.
FIG. 22 is a diagram showing an initial display example of the two-axis map according to the embodiment.
FIG. 23 is a diagram showing a display example of the feature word clustering execution screen and the focus word setting screen according to the embodiment.
FIG. 24 is a diagram showing a display example of the category structure after feature word clustering according to the embodiment.
FIG. 25 is a diagram showing a display example of the two-axis map after feature word clustering according to the embodiment.
FIG. 26 is a diagram showing a display example of the two-axis map when a filter word is selected according to the embodiment.
FIG. 27 is a diagram showing an editing operation on a feature word category in the two-axis map and a display example of its screen according to the embodiment.
FIG. 28 is a diagram showing a display example of the two-axis map after a feature word category has been edited according to the embodiment.
FIG. 29 is a diagram showing a display example in which the two-axis map according to the embodiment is rendered as a line graph.

Embodiments of the present invention are described below with reference to the drawings.
FIG. 1 is a block diagram showing the configuration of a document classification device 100 according to an embodiment of the present invention. As shown in the figure, the document classification device 100 includes a document storage unit 1, a category storage unit 2, a document classification unit 3, a feature degree data storage unit 4, and a user interface unit 5.

The document storage unit 1 stores the document data to be processed by the document classification device 100. It is implemented by, for example, a file system or a document database. Alternatively, the document storage unit 1 may consist of multiple storage means connected by a computer network.

The category storage unit 2 stores the category data used to classify document data. Category data indicates the category name, the category hierarchy, and the category's classification rule. The category hierarchy represents the upper and lower relationships between categories. A classification rule is the rule used to classify document data into a category and draws on, for example, attributes of the document data or bibliographic information such as creation date, author, and genre. Alternatively, document data may be classified into categories by existing clustering.
The category storage unit 2 also stores the data of feature word categories generated by the document classification unit 3, described later. A feature word category is a category whose classification rule consists of feature words that have similar appearance tendencies across the axis category set, that is, the set of categories serving as the viewpoint for classifying documents.
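One way to picture category data with a name, a position in the hierarchy, and a classification rule is sketched below. This is a hypothetical data structure, not the patent's storage format; the class name, the use of a predicate function as the rule, and the document fields are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass(eq=False)
class Category:
    name: str
    rule: Callable[[dict], bool]  # classification rule: does a document belong here?
    parent: Optional["Category"] = field(default=None, repr=False)  # None for the root
    children: List["Category"] = field(default_factory=list)

    def add_child(self, name: str, rule: Callable[[dict], bool]) -> "Category":
        """Create a lower category and link it into the hierarchy."""
        child = Category(name, rule, parent=self)
        self.children.append(child)
        return child

    def classify(self, docs: List[dict]) -> List[dict]:
        """Return the documents that satisfy this category's rule."""
        return [doc for doc in docs if self.rule(doc)]


# Hypothetical hierarchy using bibliographic attributes as rules.
root = Category("Root", lambda doc: True)
by_year = root.add_child("By filing year", lambda doc: "filing_date" in doc)
y2004 = by_year.add_child("2004", lambda doc: doc["filing_date"].startswith("2004"))

docs = [
    {"filing_date": "2004-05-01", "applicant": "Company A"},
    {"filing_date": "2005-02-10", "applicant": "Company B"},
]
print([d["applicant"] for d in y2004.classify(docs)])
```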

FIG. 19, described later, shows a screen display example giving an overview of the category hierarchy stored as the initial state of the category storage unit 2, and FIG. 3, described later, shows the category data stored in the category storage unit 2.
In FIG. 19, the user interface unit 5 (specifically, the two-axis map display unit 51 described later) displays the category name 303 "Root" (FIG. 3) of the category data 300, for which no upper category is set, as the "Root" category 1600 at the root level. The user interface unit 5 (the two-axis map display unit 51) also displays "By applicant", "By filing year", and "By content" (FIG. 3), which are set in the category names 313, 323, and 333 of the category data 310, 320, and 330 whose upper category is the category data 300, as the "By applicant" category 1601, "By filing year" category 1602, and "By content" category 1603 at the level below the "Root" category 1600. The following description assumes that such an overview is displayed.

The document classification unit 3 shown in FIG. 1 receives an axis category set and a target category as input. The axis category set is the set of child categories (axis categories) of the category that forms one axis of the two-axis map and, as described above, is the set of categories that serves as the viewpoint for classification. A child category is a category one level below a given category, and the axis category set contains one or more child categories (axis categories). The target category is the category that forms the other axis of the two-axis map and is the category to be classified with respect to the viewpoint category. In other words, in this embodiment the document data contained in the target category is classified based on the axis category set, which serves as the viewpoint. Here, the document classification unit 3 classifies the document data that has been classified into the axis categories, using the target category. For example, to classify the document data set in the "By content" category from a by-year viewpoint, the "By content" category is specified as the target category and the "By year" category is specified as the viewpoint category. The axis category set then consists of "2004", "2005", "2006", "2007", and "2008", the child categories of the "By year" category (categories 1621 to 1625 in FIG. 19). When this is represented as a two-axis map, one axis is the "By year" category and the other is the "By content" category.
To determine the feature words used for this classification, the document classification unit 3 calculates a feature degree for each word that appears in the set of document data classified into the input target category (the target document data set). The feature degree is an index value that indicates quantitatively how well a word represents the features of the target document data set. The target document data set contains one or more document data items. The document classification unit 3 also generates, as child categories of the target category, feature word categories based on groups of feature words that have similar appearance tendencies across the axis category set.

The document classification unit 3 includes a feature degree calculation unit 31, a feature degree correction unit 32, and a feature word category generation unit 33.
The feature degree calculation unit 31 calculates, for each word appearing in the target document data set (the set of document data classified into the target category), a feature degree based on the statistical significance of the word's appearance frequency in a predetermined document data set, and stores it in the feature degree data storage unit 4.
The feature degree correction unit 32 receives as input a focus word set, a collection of focus words selected by the user, from the focus word setting unit 53 of the user interface unit 5, described later. A focus word is a word the user has chosen to focus on when feature word categories are generated as child categories of the target category. The feature degree correction unit 32 corrects the feature degree of each word stored in the feature degree data storage unit 4 based on the degree of co-occurrence between that word and the focus word set.
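The passage above does not give formulas for the feature degree or its correction, so the following is only a sketch under loudly stated assumptions. The feature degree is approximated by the log ratio of a word's document-frequency rate in the target set to its rate in the whole collection, and the correction adds the fraction of the word's documents that also contain a focus word. Both formulas, and all names and data, are hypothetical stand-ins.

```python
import math


def doc_freq(docs):
    """Number of documents (lists of words) in which each word appears."""
    df = {}
    for words in docs:
        for w in set(words):
            df[w] = df.get(w, 0) + 1
    return df


def feature_degrees(target_docs, all_docs):
    """ASSUMED measure: log ratio of document-frequency rates,
    target set versus whole collection."""
    df_target, df_all = doc_freq(target_docs), doc_freq(all_docs)
    scores = {}
    for w, n in df_target.items():
        rate_target = n / len(target_docs)
        rate_all = df_all.get(w, n) / len(all_docs)
        scores[w] = math.log(rate_target / rate_all)
    return scores


def corrected_feature_degrees(scores, target_docs, focus_words):
    """ASSUMED correction: add the share of a word's documents
    that also contain at least one focus word."""
    focus = set(focus_words)
    corrected = {}
    for w, s in scores.items():
        containing = [set(d) for d in target_docs if w in d]
        with_focus = sum(1 for d in containing if d & focus)
        co_rate = with_focus / len(containing) if containing else 0.0
        corrected[w] = s + co_rate
    return corrected


# Hypothetical tokenized documents; the first two form the target set.
all_docs = [["battery", "charge"], ["battery", "cell"],
            ["image", "sensor"], ["image", "codec"]]
target_docs = all_docs[:2]

scores = feature_degrees(target_docs, all_docs)
corrected = corrected_feature_degrees(scores, target_docs, ["charge"])
print(sorted(corrected, key=corrected.get, reverse=True))
```

With "charge" as the only focus word, words that co-occur with it rise in the ranking, which is the effect the correction is meant to have.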

The feature word category generation unit 33 generates feature word categories based on groups of feature words with similar appearance tendencies across the axis category set and stores the generated feature word category data in the category storage unit 2. It includes a trend vector generation unit 34, a clustering unit 35, and a category generation processing unit 36. The trend vector generation unit 34 takes as the feature word set the set of words whose feature degree stored in the feature degree data storage unit 4 exceeds a predetermined value. For each word in the feature word set, it calculates the word's appearance frequency within the target document data set for each axis category, generates a trend vector whose elements are statistics based on those frequencies, one element per axis category, and stores the trend vector in the feature degree data storage unit 4. The clustering unit 35 performs clustering based on the similarity of the generated trend vectors and extracts feature word clusters, which are groups of strongly related words. The category generation processing unit 36 takes the feature words contained in a feature word cluster as filter words and generates a feature word category that uses those filter words as its classification condition. The category generation processing unit 36 stores the generated feature word category data in the category storage unit 2.
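The flow from trend vectors to feature word categories can be sketched as follows. This is an illustration under assumptions, not the patent's procedure: the raw document count is used as the statistic, clustering is a greedy single pass with a cosine threshold, and a document matches a category if it contains any filter word. All names and data are hypothetical.

```python
import math


def trend_vector(word, target_docs, axis_categories):
    """One element per axis category: the number of target documents in
    that category containing the word (raw count as the statistic)."""
    return [
        sum(1 for doc in target_docs if in_category(doc) and word in doc["words"])
        for in_category in axis_categories.values()
    ]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cluster_words(vectors, threshold=0.9):
    """ASSUMED method: join the first cluster whose seed word has a
    similar trend vector, otherwise start a new cluster."""
    clusters = []
    for word, vec in vectors.items():
        for cluster in clusters:
            if cosine(vec, vectors[cluster[0]]) >= threshold:
                cluster.append(word)
                break
        else:
            clusters.append([word])
    return clusters


def feature_word_categories(clusters):
    """Each cluster becomes a category whose rule tests for any filter word."""
    return [
        {"filter_words": words,
         "rule": lambda doc, ws=frozenset(words): bool(ws & set(doc["words"]))}
        for words in clusters
    ]


# Hypothetical axis categories (by year) and target documents.
axis = {"2004": lambda d: d["year"] == 2004, "2005": lambda d: d["year"] == 2005}
docs = [
    {"year": 2004, "words": {"nickel", "cell"}},
    {"year": 2004, "words": {"nickel", "charger"}},
    {"year": 2005, "words": {"lithium", "cell"}},
    {"year": 2005, "words": {"lithium", "charger"}},
]
feature_words = ["nickel", "lithium", "cell", "charger"]
vectors = {w: trend_vector(w, docs, axis) for w in feature_words}
clusters = cluster_words(vectors)
categories = feature_word_categories(clusters)
print(vectors)
print(clusters)
```

In this toy data, "nickel" appears only in 2004 and "lithium" only in 2005, while "cell" and "charger" are spread evenly, so the last two share a cluster.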

The feature degree data storage unit 4 is a means of storing feature degree data for each word contained in the target document data set. The feature degree data includes the feature degree generated by the document classification unit 3 and the trend vector for the axis category set.

The user interface unit 5 receives input of two categories, the category that forms the horizontal axis of the two-axis map (hereinafter, the "horizontal axis category") and the category that forms the vertical axis (hereinafter, the "vertical axis category"), and presents the two-axis map to the user. In practice, the map is often presented over the Internet using general-purpose means such as a browser shown on the display of a personal computer (PC). The user interface unit 5 also accepts from the user requests to execute feature word clustering, settings of the focus word set, and editing operations on feature word categories. The user interface unit 5 is implemented by, for example, a graphical user interface (hereinafter, "GUI").

The user interface unit 5 includes a two-axis map display unit 51, a category operation unit 52, and a focus word setting unit 53.
The two-axis map display unit 51 receives input of the horizontal axis category and the vertical axis category and displays, on a PC display or the like, a two-axis map whose row and column items are the lower categories of those two categories (for example, FIG. 22 and FIG. 25, described later). When displaying the two-axis map, the two-axis map display unit 51 shows in each cell a graph corresponding to the number of document data items classified into both the row-item category and the column-item category for that cell. On the two-axis map, it also displays the filter words of each feature word category in the cell of that category's title row (for example, cell 2101 in FIG. 25, described later). When it receives a filter word that the user has selected on the two-axis map, then for the row in which the filter word was selected it displays, distinguished from the graph above, the number of document data items that contain the selected filter word within the set of document data classified into both each lower category of the horizontal axis category and the feature word category of that row (for example, FIG. 26, described later).

The category operation unit 52 accepts a request from the user on the two-axis map to execute feature word clustering and outputs the request to the document classification unit 3. When the user selects a feature word category on the two-axis map, the category operation unit 52 displays the feature words of the selected category (for example, the feature word addition screen 2310 in FIG. 27, described later). It also accepts editing operations from the user, such as adding or deleting filter words of a feature word category, and updates the data of that feature word category stored in the category storage unit 2 accordingly.

The focus word setting unit 53 accepts from the user a plurality of words to focus on in classification as a focus word set and outputs the set to the document classification unit 3. On receiving the focus word set from the focus word setting unit 53, the document classification unit 3 corrects the feature degree in the feature degree data of each word stored in the feature degree data storage unit 4, selects words based on the corrected feature degrees, and generates feature word categories using the selected words as feature words. As a result, both the generated feature word categories and the feature words of a category presented by the category operation unit 52 conform to the words on which the user has focused.

The document classification device 100 is configured as described above and is realized by, for example, a personal computer (PC). The PC includes, for example, a central processing unit (CPU), a memory, a hard disk drive (HDD), a liquid crystal display, a keyboard, and a mouse.

FIG. 2 is a diagram illustrating an example of document data stored in the document storage unit 1. The document storage unit 1 stores a plurality of document data. As shown in the document data 200a in FIG. 2(a), each document data includes data of a document number 201, which is a unique identifier.
Further, the document data 200a includes attribute data according to its purpose and format. For example, FIG. 2(a) shows an example of a document describing patent information, which includes attribute data such as an application date 203 and an applicant 204. The document data 200a also includes text such as a document name 202 and a body 205 as the text data of the document, that is, data written in a natural language such as Japanese or English. In addition, the document data 200a holds data of a word vector 206 whose elements are the numbers of appearances (appearance frequency tf) of the words contained in the document data 200a. The word vector 206 is calculated by the document classification unit 3. Note that the word vector 206 may be held, in association with the document data 200a, in a database different from the document database that holds the original document data.
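The structure of the document data 200a and its word vector 206 can be illustrated with a minimal Python sketch. The field names, the sample values (including the document number "D0001"), and the whitespace tokenizer are assumptions made for illustration; the description does not specify how words are segmented, and Japanese text would require a morphological analyzer.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class DocumentData:
    """Hypothetical in-memory form of the document data 200a."""
    document_number: str   # unique identifier (201)
    document_name: str     # text (202)
    application_date: str  # attribute (203)
    applicant: str         # attribute (204)
    body: str              # text (205)
    word_vector: dict = field(default_factory=dict)  # word -> tf (206)


def build_word_vector(text: str) -> dict:
    """Count how many times each word appears (appearance frequency tf).

    Splitting on whitespace is a stand-in for real word segmentation.
    """
    return dict(Counter(text.lower().split()))


doc = DocumentData(
    document_number="D0001",  # hypothetical value
    document_name="Document classification device",
    application_date="2008/04/01",
    applicant="Company A",
    body="classification of documents and classification of words",
)
doc.word_vector = build_word_vector(doc.body)
```

Storing the vector as a word-to-count mapping keeps it sparse, which suits the option of holding the word vector in a database separate from the document itself.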

The document data 200b shown in FIG. 2(b) is an example in which the document data 200a shown in FIG. 2(a) is described in XML (Extensible Markup Language) format. In this case, the data corresponding to the document number, each attribute, each text, and the word vector of the document data 200a are described in the document data 200b using XML elements (tags) and attributes.

FIG. 3 is a diagram illustrating an example of category data stored in the category storage unit 2. An example of the hierarchical structure of categories is shown in FIG. 19 described later.
FIG. 3 shows six examples of category data: 300, 310, 320, 330, 340, and 350. Each category data includes data of a category number 301, 311, 321, 331, 341, or 351, respectively, which is a unique identifier. Here, categories are created by applicant and by application year, based respectively on the applicant 204 and the application date 203, which are attributes of the document data 200a (or the document data 200b) stored in the document storage unit 1.

In the document classification device 100 of the present embodiment, a plurality of categories form a tree-shaped hierarchical structure (for example, FIG. 19 described later). For this purpose, each category data has data of an upper category 302, 312, 322, 332, 342, or 352, respectively, which represents the parent-child relationship between categories. Since the category data 300 represents the category at the root of the hierarchical structure, “(none)” is set in its upper category 302. The upper category of the “by applicant” category is the “root” category. Therefore, the upper category 312 of the category data 310, which corresponds to the “by applicant” category, is set to “C000”, the value of the category number 301 of the category data 300 corresponding to the “root” category. The category data 300, 310, 320, 330, 340, and 350 also have data of category names 303, 313, 323, 333, 343, and 353, respectively.
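The parent-pointer representation of the category hierarchy can be sketched as follows. This is an illustrative sketch, not the device's actual storage format: the class and function names are assumptions, `None` stands in for “(none)”, and the category number "C0011" is a hypothetical value.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CategoryData:
    category_number: str          # unique identifier (e.g. 301)
    parent_number: Optional[str]  # upper category; None stands for "(none)"
    category_name: str


categories = [
    CategoryData("C000", None, "root"),
    CategoryData("C0001", "C000", "by applicant"),
    CategoryData("C0002", "C000", "by application year"),
    CategoryData("C0003", "C000", "by content"),
    CategoryData("C0011", "C0001", "Company A"),  # hypothetical number
]


def children_of(parent_number, all_categories):
    """Return the categories whose upper category is parent_number."""
    return [c for c in all_categories if c.parent_number == parent_number]


def path_to_root(category, all_categories):
    """Follow upper-category links from a category up to the root."""
    by_number = {c.category_number: c for c in all_categories}
    path = [category]
    while path[-1].parent_number is not None:
        path.append(by_number[path[-1].parent_number])
    return path
```

Because each record stores only its upper category, listing the lower categories of a category is a scan over all records for a matching parent number.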

The classification criterion of each category is described in a classification rule 304, 314, 324, 334, 344, or 354. For example, since the category data 300 of the “root” category represents the root of the hierarchical structure, its classification rule 304 is “true”. A classification rule of “true” means that the rule is satisfied by all document data. As another example, the classification rule 344 of the category data 340 is “./applicant="Company A"”. This classification rule 344 requires that the applicant 204, an attribute of the document data 200a stored in the document storage unit 1, be “Company A”.

The classification rules 304, 314, 324, 334, 344, and 354 are examples in which conditions on the document data 200b described in XML format are written in a query language called XQuery (or XPath). The description format of the classification rules 304, 314, 324, 334, 344, and 354 may be chosen according to how the document storage unit 1 is implemented and the format of the document data; for example, SQL may be used.

The classification rule 354 of the category data 350 is likewise written as an XQuery conditional expression and requires that the “application date” attribute of the document data fall in “2008”. Classification rules act as AND conditions along the category hierarchy. For example, although this differs from the example illustrated in FIG. 3, suppose that the “2008” category indicated by the category data 350 exists as a lower category of the “Company A” category indicated by the category data 340. In this case, the document data classified into the “2008” category are those that match the AND of the classification rule of the “Company A” category, “./applicant="Company A"”, and the classification rule of its lower category “2008”, “./application date>="2008/01/01" and ./application date<="2008/12/31"”. That is, they match the condition “(./applicant="Company A") and (./application date>="2008/01/01" and ./application date<="2008/12/31")”, which selects the document data whose applicant is “Company A” and whose application year is 2008.
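The AND composition of classification rules along the hierarchy can be sketched in Python. For illustration the rules are modeled as plain predicate functions rather than XQuery expressions, and the sample documents are invented; all function names are assumptions.

```python
from datetime import date


# Each classification rule is modeled as a predicate over a document (a dict).
def rule_true(doc):
    return True  # the "root" rule: every document satisfies it


def rule_company_a(doc):
    return doc["applicant"] == "Company A"


def rule_2008(doc):
    return date(2008, 1, 1) <= doc["application_date"] <= date(2008, 12, 31)


def effective_rule(rules_from_root):
    """AND together the rules on the path from the root down to a category."""
    def matches(doc):
        return all(rule(doc) for rule in rules_from_root)
    return matches


# Invented sample data for the hypothetical "Company A" > "2008" hierarchy.
documents = [
    {"applicant": "Company A", "application_date": date(2008, 5, 20)},
    {"applicant": "Company A", "application_date": date(2007, 3, 1)},
    {"applicant": "Company B", "application_date": date(2008, 9, 9)},
]

in_company_a_2008 = effective_rule([rule_true, rule_company_a, rule_2008])
matched = [d for d in documents if in_company_a_2008(d)]
```

Only the first sample document satisfies every rule on the path, so it alone is classified into the nested “2008” category.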

FIG. 4 shows an example of feature word category data, which is the category data of a feature word category generated by the document classification unit 3.
The feature word category data is stored in the category storage unit 2 in the same manner as the other category data shown in FIG. 3. FIG. 4 shows two examples, feature word category data 400 and 410. Like the other category data shown in FIG. 3, the feature word category data 400 and 410 include data of category numbers 401 and 411, upper categories 402 and 412, category names 403 and 413, and classification rules 404 and 414, respectively. In addition, the feature word category data 400 and 410 have data of filter words 405 and 415, which indicate the feature words contained in the feature word clusters extracted by the category generation processing unit 36.

The classification rules 404 and 414 included in the feature word category data are generated by the category generation processing unit 36 based on the filter words 405 and 415. For example, the category generation processing unit 36 generates, as the classification rule, a condition stating that the filter words 405 or 415 are contained in the text information of the document data. In the feature word category data 400 shown in FIG. 4, “classification”, “knowledge”, and “share” are set as the filter words 405. Accordingly, the classification rule 404 of the feature word category data 400 is set to the condition that these filter words 405 are contained in the body 205, which is text information of the document data 200a, that is, “contains(./body, "analysis") and contains(./body, "knowledge") and contains(./body, "share")”.
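Deriving a classification rule from filter words can be sketched as below, here using two of the filter words from the example. The function names and the `./body` path are assumptions for illustration, and the sketch does not escape quotation marks inside filter words, which a real rule generator would need to handle.

```python
def build_classification_rule(filter_words, text_field="./body"):
    """Join one contains() condition per filter word with 'and'."""
    conditions = [f'contains({text_field}, "{word}")' for word in filter_words]
    return " and ".join(conditions)


def matches_filter_words(body, filter_words):
    """Plain-Python counterpart: every filter word occurs in the body text."""
    return all(word in body for word in filter_words)


rule = build_classification_rule(["knowledge", "share"])
```

Because the conditions are joined with `and`, adding a filter word narrows the set of documents classified into the category, while deleting one widens it.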

FIG. 5 shows an example of feature degree data generated by the document classification unit 3 and stored in the feature degree data storage unit 4. The feature degree data 500-1, 500-2, ... shown in FIG. 5 correspond to the individual words extracted from the target document data set by the document classification unit 3, and each has data of a word 510, a document frequency 501, a feature degree 502, a corrected feature degree 503, and a frequency vector / trend vector 504 for the axis categories (category numbers). The feature degree data 500-1, 500-2, ... are collectively referred to as the feature degree data 500.

The document frequency 501 indicates the number of documents (document frequency df), that is, the number of document data in the target document data set in which the word 510 appears. The feature degree 502 indicates the feature degree calculated by the feature degree calculation unit 31. The corrected feature degree 503 indicates the value obtained when the feature degree correction unit 32 corrects the feature degree based on the focus word set designated by the user. The frequency vector / trend vector 504 for the axis categories (category numbers) holds the data of a frequency vector 511 and a trend vector 512. The frequency vector 511 and the trend vector 512 of the feature degree data 500-i (i = 1, 2, ...) are denoted as the frequency vector 511-i and the trend vector 512-i, respectively. Each element (component) of the frequency vector 511 is the number of documents (cf), that is, the number of document data in which the word 510 appears within the intersection of the target document data set and one of the axis categories in the axis category set. Each element of the trend vector 512 is the appearance ratio (cp) of the word 510 in that intersection relative to the target document data set. However, because words whose corrected feature degree is equal to or less than a predetermined value are excluded from feature word clustering, the trend vector generation unit 34 does not obtain a frequency vector or a trend vector for them. FIG. 5 shows an example in which the words 510 “search” and “mail” are excluded from feature word clustering; the frequency vectors 511-3 and 511-5 and the trend vectors 512-3 and 512-5 of the feature degree data 500-3 and 500-5 for these words are empty.
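The following sketch shows one way the document frequency df, the frequency vector (cf), and the trend vector (cp) could be computed. It rests on two assumptions that this passage does not confirm: each target document belongs to exactly one axis category, and cp is read as cf divided by df, that is, the share of the word's documents that fall into each axis category. The sample documents are invented.

```python
def feature_vectors(word, target_docs, axis_categories):
    """Return (df, frequency_vector, trend_vector) for one word.

    target_docs: list of (set_of_words, axis_category_label) pairs
    axis_categories: ordered labels, one vector element per label
    """
    # Axis-category labels of the target documents that contain the word.
    containing = [label for words, label in target_docs if word in words]
    df = len(containing)
    # cf: documents containing the word, counted per axis category.
    frequency = [containing.count(label) for label in axis_categories]
    # Assumed reading of cp: cf / df for each axis category.
    trend = [cf / df if df else 0.0 for cf in frequency]
    return df, frequency, trend


# Invented sample: four target documents spread over two year categories.
target_docs = [
    ({"classification", "knowledge"}, "2007"),
    ({"classification"}, "2008"),
    ({"classification", "mail"}, "2008"),
    ({"knowledge"}, "2008"),
]
df, frequency, trend = feature_vectors(
    "classification", target_docs, ["2007", "2008"]
)
```

Under this reading the elements of a trend vector sum to 1, so words of very different overall frequency remain comparable by how they distribute across the axis categories.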

FIG. 6 is a diagram illustrating an example of focus word list data stored internally by the focus word setting unit 53. The focus word candidates 601 of the focus word list data 600 shown in FIG. 6 hold a list of the filter words that the user selected as focus words on the biaxial map and the character strings that the user entered as focus words. Each focus word candidate 601 is given a focus word setting 602, a flag indicating whether the document classification unit 3 actually applies the candidate as a focus word. In the present embodiment, a focus word setting 602 of “1” means that the candidate is applied as a focus word, and “0” means that it is not applied.
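The flag-based selection can be sketched in a few lines. The list layout, key names, and sample candidates are assumptions for illustration.

```python
# Hypothetical in-memory form of the list data 600: each entry pairs a
# candidate (601) with its setting flag (602).
focus_word_list = [
    {"candidate": "classification", "setting": 1},
    {"candidate": "search", "setting": 0},
    {"candidate": "knowledge", "setting": 1},
]


def applied_focus_words(entries):
    """Return only the candidates whose setting flag is 1 (applied)."""
    return [entry["candidate"] for entry in entries if entry["setting"] == 1]


focus_word_set = applied_focus_words(focus_word_list)
```

Keeping non-applied candidates in the list with a flag of 0 lets the user switch a word back on later without retyping it.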

Hereinafter, an example of the document classification processing performed by the document classification device according to the embodiment of the present invention will be described with reference to FIGS. 7 to 29.
FIG. 7 is a flowchart showing an example of the flow of the document classification processing performed by the document classification device 100 of the present embodiment. First, the biaxial map display unit 51 of the user interface unit 5 receives, through a user operation, input of the horizontal axis category and the vertical axis category of the biaxial map, and performs initial biaxial map display processing (step S1). In the initial biaxial map display processing, the biaxial map display unit 51 acquires from the category storage unit 2 the set of child categories of the horizontal axis category and the set of child categories of the vertical axis category. The biaxial map display unit 51 displays a biaxial map in which the child categories of the horizontal axis category (hereinafter referred to as “horizontal axis child categories”) are the items on the horizontal axis and the child categories of the vertical axis category (hereinafter referred to as “vertical axis child categories”) are the items on the vertical axis. When the horizontal axis category has no child category, the horizontal axis category itself is used as the horizontal axis item, and when the vertical axis category has no child category, the vertical axis category itself is used as the vertical axis item (for example, FIG. 22 described later).

Subsequently, the category operation unit 52 receives a clustering request and input of a target category from the user (step S2). For example, the user selects the target category from the horizontal axis category and the vertical axis category of the biaxial map displayed in step S1. The category operation unit 52 outputs the input target category and an axis category set to the document classification unit 3. The axis category set is the set of lower categories of whichever of the horizontal axis category and the vertical axis category was not selected as the target category. The feature degree calculation unit 31 of the document classification unit 3 extracts, from the target document data set, which is the set of document data classified into the target category, the words that have a predetermined part of speech and are not stop words, calculates their feature degrees, and writes them to the feature degree data storage unit 4 (step S3). The category operation unit 52 receives input of focus words from the user (for example, the focus word setting form 1910 in FIG. 23 described later) and outputs them to the document classification unit 3 (step S4). The feature degree correction unit 32 of the document classification unit 3 corrects the feature degree of each word stored in the feature degree data storage unit 4 in step S3, based on the degree of co-occurrence between that word and the focus word set in the target document data set (step S5).

The trend vector generation unit 34 generates trend vectors for the words whose corrected feature degree is greater than a predetermined value and stores them in the feature degree data storage unit 4 (step S6). The clustering unit 35 clusters the words based on the similarity of their trend vectors and extracts feature word clusters, which are groups of strongly related words (step S7). The category generation processing unit 36 takes the feature words, which are the words contained in a feature word cluster, as filter words and generates a feature word category that uses those filter words as its classification condition. The category generation processing unit 36 stores the data of the generated feature word categories (for example, the feature word category data 400 and 410 in FIG. 4) in the category storage unit 2 (step S8). The category structure is thereby updated (for example, FIG. 24 described later).
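Step S7 can be illustrated with a small sketch that groups words by the similarity of their trend vectors. This passage does not state which similarity measure or clustering algorithm is used, so the cosine similarity, the greedy single-pass grouping, the threshold of 0.9, and the sample vectors below are all assumptions chosen for brevity.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cluster_words(trend_vectors, threshold=0.9):
    """Greedy clustering: a word joins the first cluster whose seed word
    (the cluster's first member) is similar enough; otherwise it starts
    a new cluster."""
    clusters = []
    for word, vector in trend_vectors.items():
        for cluster in clusters:
            if cosine(vector, trend_vectors[cluster[0]]) >= threshold:
                cluster.append(word)
                break
        else:
            clusters.append([word])
    return clusters


# Invented trend vectors over three axis categories.
trend_vectors = {
    "classification": [0.1, 0.2, 0.7],
    "knowledge": [0.1, 0.3, 0.6],
    "battery": [0.8, 0.1, 0.1],
}
clusters = cluster_words(trend_vectors)
```

In this sample, “classification” and “knowledge” peak in the same axis category and end up in one cluster, while “battery” forms its own; each resulting cluster would supply the filter words of one feature word category in step S8.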

The biaxial map display unit 51 displays a biaxial map whose two axes are the axis category set and the set of feature word categories generated in step S8 (step S9). At this time, the biaxial map display unit 51 displays the filter words of each feature word category (for example, FIG. 25 described later).

The category operation unit 52 accepts editing operations from the user, such as adding or deleting filter words (for example, FIG. 27 described later), and updates the feature word category data stored in the category storage unit 2 (step S10). The biaxial map display unit 51 displays a biaxial map (for example, FIG. 28 described later) whose two axes are the axis category set and the set of feature word categories based on the feature word category data updated in step S10 (step S11).

Next, the detailed processing in each processing step of FIG. 7 will be described.
FIG. 8 is a flowchart showing the flow of the processing for displaying the initial biaxial map. The flowchart shows an example of the detailed processing of the initial biaxial map display processing in step S1 of FIG. 7.

First, the biaxial map display unit 51 receives the user's input of the horizontal axis category xAxisCat and the vertical axis category yAxisCat of the biaxial map (step S1001). For this input, in the present embodiment, the biaxial map display unit 51 displays a screen representing the category structure based on the category data and the feature word category data stored in the category storage unit 2. Here, it is assumed that the feature word category data 400 and 410 shown in FIG. 4 have not yet been generated.

FIG. 19 is a diagram illustrating a screen display example of the category structure stored as the initial state of the category storage unit 2. The biaxial map display unit 51 displays the category name 303 “root” of the category data 300, for which no upper category is set, as the “root” category 1600 at the root level. Further, the biaxial map display unit 51 identifies the category data 310, 320, and 330, whose upper category is set to the category number 301 “C000” of the category data 300. The biaxial map display unit 51 displays “by applicant”, “by application year”, and “by content”, which are set in the category names 313, 323, and 333 of the category data 310, 320, and 330, as the “by applicant” category 1601, the “by application year” category 1602, and the “by content” category 1603 in the level below the “root” category 1600.

Further, the biaxial map display unit 51 identifies the category data whose upper category is set to the category number 311 “C0001” of the category data 310. The biaxial map display unit 51 displays the category names “Company A”, “Company B”, “Company C”, “Company D”, and “Company E” indicated by the identified category data as the “Company A” category 1611, the “Company B” category 1612, the “Company C” category 1613, the “Company D” category 1614, and the “Company E” category 1615 in the level below the “by applicant” category 1601.

Similarly, the biaxial map display unit 51 identifies the category data whose upper category is set to the category number 321 “C0002” of the category data 320. The biaxial map display unit 51 displays the category names “2004”, “2005”, “2006”, “2007”, and “2008” indicated by the identified category data as the “2004” category 1621, the “2005” category 1622, the “2006” category 1623, the “2007” category 1624, and the “2008” category 1625 in the level below the “by application year” category 1602. The “Company A” category 1611 and the “2008” category 1625 correspond to the category data 340 and 350 shown in FIG. 3, respectively.
Since there is no category data whose upper category is the category number 331 “C0003” of the category data 330, the biaxial map display unit 51 displays no lower category under the “by content” category 1603.

Further, the biaxial map display unit 51 reads the classification rule from the category data (or feature word category data) of each category stored in the category storage unit 2. Using the read classification rules, the biaxial map display unit 51 counts the number of document data classified into each category and displays the counts. Note that the biaxial map display unit 51 may store, in association with the category data (or feature word category data) of each category, the document numbers of the document data classified into that category.

Next, an example of a method for inputting the horizontal axis category xAxisCat and the vertical axis category yAxisCat using the display of FIG. 19 will be described.
FIG. 20 is a diagram illustrating a display example of the execution screen for biaxial map display, showing the case where the horizontal axis category xAxisCat and the vertical axis category yAxisCat are input using the display of FIG. 19. First, from the category structure displayed by the biaxial map display unit 51, the user selects the two categories to be used as the horizontal axis and the vertical axis of the biaxial map. Here, in FIG. 20, the user selects the “by application year” category 1602 and the “by content” category 1603.

On receiving input of the two selected categories, the biaxial map display unit 51 displays a screen 1710 for selecting which of the two categories is to be the horizontal axis category xAxisCat, which serves as the classification viewpoint. The user selects the radio button 1711 of the category to be used as the classification viewpoint and presses the execute button 1712. The biaxial map display unit 51 thereby receives input of the “by application year” category 1602 selected by the user as the horizontal axis category xAxisCat. The “by content” category 1603, which the user did not select, becomes the vertical axis category yAxisCat.

In this way, the biaxial map display unit 51 receives input of the horizontal axis category xAxisCat and the vertical axis category yAxisCat selected by the user. Although the present embodiment shows an example that uses a GUI such as that in FIGS. 19 and 20 as the input method for the biaxial map, it suffices that the user can specify which of the categories indicated by the category data or feature word category data stored in the category storage unit 2 are to be the horizontal axis category xAxisCat and the vertical axis category yAxisCat when displaying the biaxial map. Therefore, the input is not limited to a GUI and may be given from the command line of a computer system.

In FIG. 8, based on the category data and the feature word category data stored in the category storage unit 2, the biaxial map display unit 51 acquires a horizontal axis child category set xCats, which is the set of child categories of the horizontal axis category xAxisCat, and a vertical axis child category set yCats, which is the set of child categories of the vertical axis category yAxisCat (step S1002). A child category of the horizontal axis category xAxisCat is referred to as a horizontal axis child category xCat, and a child category of the vertical axis category yAxisCat is referred to as a vertical axis child category yCat. A horizontal axis child category xCat corresponds to category data or feature word category data whose upper category is set to the category number of the category data or feature word category data of the horizontal axis category xAxisCat. Similarly, a vertical axis child category yCat corresponds to category data or feature word category data whose upper category is set to the category number of the category data or feature word category data of the vertical axis category yAxisCat.

In the example shown in FIG. 20, the biaxial map display unit 51 acquires, as the horizontal axis child category set xCats, the set of child categories of the “by application year” category 1602, which is the horizontal axis category xAxisCat: {“2004” category 1621, “2005” category 1622, “2006” category 1623, “2007” category 1624, “2008” category 1625}. Each horizontal axis child category xCat corresponds to category data whose upper category is set to the category number 321 of the category data 320 corresponding to the “by application year” category 1602. Since the “by content” category 1603, which is the vertical axis category yAxisCat, has no child category, the biaxial map display unit 51 acquires an empty set as the vertical axis child category set yCats. In other words, the category storage unit 2 stores no category data or feature word category data whose upper category is set to the category number 331 of the category data 330 corresponding to the “by content” category 1603.

The biaxial map display unit 51 creates and displays a biaxial map initial table whose rows are the vertical axis category yAxisCat and each vertical axis child category yCat in the vertical axis child category set yCats, and whose columns are each horizontal axis child category xCat in the horizontal axis child category set xCats (step S1003).

FIG. 21 shows the biaxial map initial table 1800 created in step S1003. Because the biaxial map initial table 1800 also contains a title row and a title column for displaying categories, it has (1 + number of vertical axis categories + number of vertical axis child categories) rows and (1 + number of horizontal axis child categories) columns. In this embodiment there are no vertical axis child categories, so the number of rows is (1 + number of vertical axis categories). Similarly, when there are no horizontal axis child categories, the number of columns is (1 + number of horizontal axis categories). The first row of the biaxial map initial table 1800 (the row containing cell 1802) is the title row, and the first column (the column containing cell 1801) is the title column.
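
As a rough illustration of the row and column counts described above, the following sketch builds an empty table from category names. The function name `build_initial_table` and the use of nested lists are assumptions made for this example; the fallback to the horizontal axis category when it has no children follows the column-count rule stated in the text.

```python
from typing import List, Optional


def build_initial_table(y_axis: str, y_children: List[str],
                        x_axis: str, x_children: List[str]) -> List[List[Optional[str]]]:
    """Build a table with a title row and a title column; data cells start empty."""
    rows = [y_axis] + list(y_children)       # vertical axis category + its children
    cols = list(x_children) or [x_axis]      # fall back to the axis category itself
    table: List[List[Optional[str]]] = [[""] + cols]   # title row
    for label in rows:
        table.append([label] + [None] * len(cols))     # title column + data cells
    return table


table = build_initial_table(
    "by content", [],
    "by application year", ["2004", "2005", "2006", "2007", "2008"],
)
print(len(table), len(table[0]))
```

For the FIG. 21 example this gives 1 + 1 + 0 = 2 rows and 1 + 5 = 6 columns.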

In FIG. 8, the biaxial map display unit 51 selects the cells of the created biaxial map initial table 1800 (each hereinafter referred to as “cell”) one at a time and repeats steps S1005 to S1010 for the selected cell until every cell has been processed (step S1004-NO).

First, the biaxial map display unit 51 determines whether the cell is in the first row or the first column (step S1005). If it is (step S1005-YES), the biaxial map display unit 51 performs steps S1006 to S1008. These steps treat the first row and the first column as the title row and title column of the table and display in the cell the category name and filter words of the category corresponding to the cell.

That is, when the cell being processed is in the first row or the first column, the biaxial map display unit 51 displays the category name of the category cat corresponding to that cell (the vertical axis category yAxisCat, a vertical axis child category yCat, or a horizontal axis child category xCat) (step S1006). It reads the category name from the category data or feature word category data of the category corresponding to the cell. The biaxial map display unit 51 then determines whether the category cat has a filter word set filters (step S1007). Specifically, cat is judged to have a filter word set filters when it corresponds to feature word category data and filter words are set in that data.

When the category cat is determined to have a filter word set filters (step S1007-YES), the biaxial map display unit 51 displays the filter words contained in filters in the cell (step S1008). This filter word set filters is the set of filter words set in the feature word category data of the category cat corresponding to the cell. When cat is determined not to have a filter word set filters (step S1007-NO), or after step S1008, the biaxial map display unit 51 returns to step S1004, selects an unselected cell, and repeats the processing.

When the cell is determined in step S1005 to be in neither the first row nor the first column (step S1005-NO), the biaxial map display unit 51 obtains the number of documents dn, which is the number of document data classified into both the vertical axis category yAxisCat or vertical axis child category yCat corresponding to the cell's row and the horizontal axis child category xCat corresponding to the cell's column (step S1009).

When the document data is XML, the number of documents dn can be obtained in XQuery by combining conditional expressions with a logical AND. For example, for cell 1803 in FIG. 22 (described later), the biaxial map display unit 51 counts the document data that satisfy the logical AND of the classification rule of the corresponding vertical axis category yAxisCat (the “by content” category) and the classification rule of the horizontal axis child category xCat (the “2004” category).

The classification rule 334 set in the category data 330 of the “by content” category is “true”, and the classification rule 304 set in the category data 300 of its upper category, “root”, is also “true”. The classification rule of the vertical axis category yAxisCat, the “by content” category, is therefore “(true) and (true)”.

The classification rule set in the category data of the “2004” category, on the other hand, is “./出願日 >= "2004/01/01" and ./出願日 <= "2004/12/31"” (出願日 is the application date element), and the classification rule 304 set in the category data 300 of its upper category, “root”, is “true”. The classification rule of the horizontal axis child category xCat, the “2004” category, is therefore “(true) and (./出願日 >= "2004/01/01" and ./出願日 <= "2004/12/31")”.

Accordingly, the biaxial map display unit 51 counts the document data that satisfy the logical AND of the classification rule of the vertical axis category yAxisCat (“by content”) and the classification rule of the horizontal axis child category xCat (“2004”), namely “{(true) and (true)} and {(true) and (./出願日 >= "2004/01/01" and ./出願日 <= "2004/12/31")}”, and takes the result as the number of documents dn. The number of documents dn can be calculated with the XQuery count() function.
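
The rule combination can be illustrated outside XQuery as well. The sketch below is written in Python rather than XQuery and rests on several assumptions: classification rules are modeled as predicates over a dictionary, the `application_date` key stands in for the ./出願日 element, and the parent links follow the description in the text. Zero-padded `YYYY/MM/DD` strings compare correctly as plain strings.

```python
from typing import Callable, Dict, List, Optional

Doc = Dict[str, str]
Rule = Callable[[Doc], bool]


def effective_rule(cat: str, rules: Dict[str, Rule],
                   parent: Dict[str, Optional[str]]) -> Rule:
    """AND together the category's own rule and the rules of all its upper categories."""
    chain: List[Rule] = []
    node: Optional[str] = cat
    while node is not None:
        chain.append(rules[node])
        node = parent[node]
    return lambda doc: all(rule(doc) for rule in chain)


def count_documents(docs: List[Doc], y_rule: Rule, x_rule: Rule) -> int:
    """Number of documents dn satisfying both the row rule and the column rule."""
    return sum(1 for doc in docs if y_rule(doc) and x_rule(doc))


rules: Dict[str, Rule] = {
    "root": lambda d: True,
    "by content": lambda d: True,
    "2004": lambda d: "2004/01/01" <= d["application_date"] <= "2004/12/31",
}
parent = {"root": None, "by content": "root", "2004": "root"}

# Hypothetical documents, for illustration only.
docs = [
    {"application_date": "2004/03/15"},
    {"application_date": "2004/12/31"},
    {"application_date": "2005/01/01"},
]
dn = count_documents(
    docs,
    effective_rule("by content", rules, parent),
    effective_rule("2004", rules, parent),
)
print(dn)
```

Two of the three sample documents fall inside 2004, so `dn` is 2 here.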

Next, the biaxial map display unit 51 displays, in the cell of the table created in step S1003, a circle chart whose size corresponds to the number of documents dn calculated in step S1009 (step S1010). The biaxial map display unit 51 then returns to step S1004, selects an unselected cell, and repeats the processing.

When steps S1005 to S1010 have been completed for every cell of the biaxial map initial table 1800 (step S1004-YES), the biaxial map display unit 51 ends the processing of FIG. 8.
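
The per-cell loop of steps S1004 to S1010 can be sketched as below. The dictionary layout of a category, the bracketed rendering of filter words, the empty corner cell, and the `("circle", dn)` tuple standing in for a drawn chart are all assumptions made for illustration; the text does not specify how the circle size is derived from dn.

```python
from typing import Callable, Dict, List


def title_text(cat: Dict) -> str:
    """S1006 to S1008: category name, followed by filter words when the category has any."""
    filters = cat.get("filters") or []
    return cat["name"] + (" [" + ", ".join(filters) + "]" if filters else "")


def render(row_cats: List[Dict], col_cats: List[Dict],
           count_fn: Callable[[Dict, Dict], int]) -> List[List[object]]:
    n_rows, n_cols = 1 + len(row_cats), 1 + len(col_cats)
    grid: List[List[object]] = [[None] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):                # S1004: visit every cell once
        for j in range(n_cols):
            if i == 0 and j == 0:
                grid[i][j] = ""            # corner cell has no category (assumption)
            elif i == 0:                   # S1005-YES: title row
                grid[i][j] = title_text(col_cats[j - 1])
            elif j == 0:                   # S1005-YES: title column
                grid[i][j] = title_text(row_cats[i - 1])
            else:                          # S1005-NO: S1009 count, S1010 chart
                grid[i][j] = ("circle", count_fn(row_cats[i - 1], col_cats[j - 1]))
    return grid


grid = render(
    [{"name": "by content"}],
    [{"name": "2004"}, {"name": "2005", "filters": ["a", "b"]}],
    lambda row_cat, col_cat: 3,            # dummy count for illustration
)
print(grid)
```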

FIG. 22 shows an example of the initial biaxial map that the biaxial map display unit 51 displays when the above processing ends. Because this is the biaxial map in its initial state, in which no feature word category has yet been generated under the “by content” category, only the “by content” category appears in the first column, which holds the vertical axis titles, as shown in the figure.

FIG. 9 is a flowchart showing the flow of processing that the category operation unit 52 performs for category operations on the biaxial map. Through this processing, the category operation unit 52 controls the flow of steps S2 to S11 in FIG. 7. The user requests feature word clustering by inputting a target category through the category operation unit 52. For example, the category operation unit 52 accepts a feature word clustering execution request from the user on the initial biaxial map that the biaxial map display unit 51 displayed through the initial biaxial map display processing of FIG. 8. The category operation unit 52 also accepts the selection of a feature word category generated by feature word clustering. It displays the feature words used as filter words in the selected feature word category (for example, on the feature word addition screen 2310 in FIG. 27, described later) and accepts editing operations from the user, such as adding or deleting filter words of the feature word category. It then updates the category data of the feature word category stored in the category storage unit 2 according to the accepted editing operation.

Accordingly, when the category operation unit 52 receives input of the category cat that the user selected on the biaxial map (step S1101-YES), it performs the corresponding processing depending on whether it further receives a feature word clustering execution request (step S1102), a filter word addition request (step S1105), or a filter word deletion request (step S1111). The processing of FIG. 9 is described in detail below.

First, the category operation unit 52 receives input of the category cat selected by the user (step S1101-YES).
FIG. 23 shows a display example of the feature word clustering execution request screen and the attention word setting screen. Here, the user selects the clustering target category on the biaxial map initial table that the biaxial map display unit 51 displayed as shown in FIG. 22 in the initial biaxial map display processing. In the figure, the user selects cell 1801 and thereby selects the “by content” category, which is the vertical axis category, as the target category. The category operation unit 52 thus receives the “by content” category selected by the user as the category cat. Further, upon receiving a feature word clustering execution request from the user, the category operation unit 52 displays the feature word clustering execution confirmation screen 1930.

In FIG. 9, upon receiving input indicating that the user has selected the “Execute” button 1931 on the execution confirmation screen 1930 (step S1102-YES), the category operation unit 52 performs step S1103. That is, it outputs to the document classification unit 3 the category cat input as the target category, together with an axis category set, which is the set of child categories of the other axis of the biaxial map (the axis not input as the target category), and instructs the document classification unit 3 to execute feature word clustering (step S1103). In this embodiment the vertical axis category is input as the target category, so the category operation unit 52 outputs, as the axis category set, the horizontal axis child category set xCats, which is the set of horizontal axis child categories xCat. The horizontal axis child category set xCats is the axis category set that serves as the viewpoint for classification. When, as shown in FIG. 23, the user selects the “by content” category and chooses to execute feature word clustering, the category operation unit 52 outputs to the document classification unit 3 the “by content” category as the category cat and {“2004” category, “2005” category, “2006” category, “2007” category, “2008” category} as the horizontal axis child category set xCats. Steps S1101 to S1103 correspond to the category operation processing of step S2 in FIG. 7.

The document classification unit 3, having received the category cat and the horizontal axis child category set xCats in step S1103, executes feature word clustering and completes steps S3 to S8 in FIG. 7. The category operation unit 52 then outputs the horizontal axis category xAxisCat and the vertical axis category yAxisCat of the current biaxial map to the biaxial map display unit 51 and updates the display of the biaxial map (step S1104). For example, when the user selects cell 1801 and inputs a feature word clustering execution request as shown in FIG. 23, the horizontal axis category xAxisCat is the “by application year” category and the vertical axis category yAxisCat is the “by content” category. The result of the feature word clustering by the document classification unit 3 is thereby reflected in the biaxial map. Step S1104 corresponds to step S9 in FIG. 7. The category operation unit 52 then repeats the processing from step S1101.

When the category operation unit 52 receives input of the category cat selected by the user (step S1101-YES) and then receives a filter word addition request (step S1102-NO, step S1105-YES), it performs steps S1106 to S1110. When it instead receives a filter word deletion request (steps S1102 and S1105-NO, step S1111-YES), it performs step S1112. After step S1110 or step S1112 ends, the category operation unit 52 outputs the horizontal axis category xAxisCat and the vertical axis category yAxisCat of the displayed biaxial map to the biaxial map display unit 51 and updates the display of the biaxial map (step S1104). Steps S1105 to S1112 correspond to step S10 in FIG. 7, and the subsequent step S1104 corresponds to step S11 in FIG. 7. The category operation unit 52 then repeats the processing from step S1101. These steps are described in detail later.

When no category selection is input (step S1101-NO), or when a category selection has been input (step S1101-YES) but the user inputs none of a feature word clustering execution request, a filter word addition request, a filter word deletion request, or an end request (steps S1102, S1105, S1111, and S1113-NO), the category operation unit 52 repeats the processing from step S1101. When an end request is input (steps S1102, S1105, and S1111-NO, step S1113-YES), it ends the processing.
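
The branching of FIG. 9 can be pictured as a small event loop. The following is only a sketch under stated assumptions: user input is modeled as a sequence of `(kind, payload)` tuples, the event names and the `handlers` and `refresh` callbacks are hypothetical, requests that arrive before any category is selected are ignored, and an end request is honored at any time.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple

Event = Tuple[str, Optional[str]]  # (kind, payload); names are illustrative


def run_category_operations(
    events: Iterable[Event],
    handlers: Dict[str, Callable[[str, Optional[str]], None]],
    refresh: Callable[[], None],
) -> None:
    selected: Optional[str] = None
    for kind, payload in events:
        if kind == "end":                        # S1113-YES: stop processing
            return
        if kind == "select":                     # S1101-YES: remember the category
            selected = payload
        elif selected is not None and kind in handlers:
            handlers[kind](selected, payload)    # S1103, S1106 to S1110, or S1112
            refresh()                            # S1104: redraw the biaxial map


log = []
run_category_operations(
    [
        ("cluster", None),              # ignored: no category selected yet
        ("select", "by content"),
        ("cluster", None),
        ("add_filter", "w"),
        ("end", None),
        ("delete_filter", "w"),         # never reached
    ],
    {
        "cluster": lambda cat, p: log.append(("cluster", cat)),
        "add_filter": lambda cat, p: log.append(("add", cat, p)),
        "delete_filter": lambda cat, p: log.append(("delete", cat, p)),
    },
    lambda: log.append("refresh"),
)
print(log)
```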

Next, steps S3 to S9 in FIG. 7 are described in detail. They correspond to steps S1103 to S1104 in FIG. 9.

FIG. 10 is a flowchart showing the flow of processing in which the feature degree calculation unit 31 calculates feature degrees. The flowchart is an example of the detailed processing of the feature degree calculation in step S3 of FIG. 7. In this processing, the feature degree calculation unit 31 performs morphological analysis on the text information of the target document data set stored in the document storage unit 1 to extract words, and stores in the feature degree data storage unit 4 feature degree data in which the feature degree calculated for each extracted word is set.

First, the document classification unit 3 receives the classification target category tgtCat from the category operation unit 52 of the user interface unit 5 (step S1201). That is, the document classification unit 3 receives, as the target category tgtCat, the category cat that the category operation unit 52 output in step S1103 of FIG. 9. For example, when the user selects cell 1801 and inputs a feature word clustering execution request as shown in FIG. 23, the target category tgtCat is the “by content” category.

The feature degree calculation unit 31 acquires the document data set tgtDocs classified into the target category tgtCat. Specifically, it refers to the category storage unit 2 and reads the classification rules from the category data corresponding to the target category tgtCat and from the category data of its upper categories. From the document data stored in the document storage unit 1, it selects the document data d that satisfy all of the classification rules read, and takes the set of selected document data d as the document data set tgtDocs. The feature degree calculation unit 31 then acquires the text information to be analyzed from all the document data d in the document data set tgtDocs (step S1202). In this embodiment, the document data stored in the document storage unit 1 have the same data format as the document data 200a or 200b shown in FIG. 2, and the text information to be analyzed is the body 205.

The feature degree calculation unit 31 performs morphological analysis on the body 205, which is the text information acquired in step S1202 (step S1203). It selects the words (morphemes) t obtained by the morphological analysis one at a time and performs steps S1205 to S1209 for the selected word t until all words have been processed (step S1204-NO).

The feature degree calculation unit 31 determines whether the word t has one of the predetermined parts of speech to be included in the word vector and is not an unnecessary word (step S1205). For example, it selects words so that words whose part of speech is noun, sa-hen noun (verbal noun), proper noun, or the like are included in the word vector, while conjunctions, adverbs, and the like are not included in the feature vector. The feature degree calculation unit 31 also compares the word t with the unnecessary words registered in advance as words that do not indicate the characteristics of document data. For example, when patent documents are processed, words such as “装置” (device) and “手段” (means) do not represent the characteristics of a document and are therefore treated as unnecessary words. When the selected word t does not have one of the predetermined parts of speech to be included in the word vector, or is an unnecessary word (step S1205-NO), the feature degree calculation unit 31 returns to step S1204, selects an unselected word t, and repeats the processing.
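
The word selection of step S1205 amounts to a part-of-speech check plus a stop-word check. In the sketch below, the part-of-speech labels, the `is_feature_candidate` name, and the assumption that morphological analysis has already produced `(word, part of speech)` pairs are all illustrative; the two unnecessary words are the examples given in the text.

```python
from typing import List, Tuple

# Parts of speech kept for the word vector (labels are illustrative).
ALLOWED_POS = {"noun", "verbal noun", "proper noun"}
# Unnecessary words registered in advance; the text gives these two for patent documents.
UNNECESSARY_WORDS = {"装置", "手段"}  # "device", "means"


def is_feature_candidate(word: str, pos: str) -> bool:
    """S1205: keep the word only if its part of speech is allowed and it is not unnecessary."""
    return pos in ALLOWED_POS and word not in UNNECESSARY_WORDS


# Hypothetical output of a morphological analyzer.
tokens: List[Tuple[str, str]] = [
    ("文書", "noun"),
    ("分類", "verbal noun"),
    ("装置", "noun"),
    ("および", "conjunction"),
]
kept = [word for word, pos in tokens if is_feature_candidate(word, pos)]
print(kept)
```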

When, on the other hand, the selected word t has one of the predetermined parts of speech to be included in the word vector and is not an unnecessary word (step S1205-YES), the feature degree calculation unit 31 calculates the appearance frequency tf, which is the number of times the word t appears in the text information of the document data set tgtDocs acquired in step S1202 (step S1206). It further calculates the document frequency df, which is the number of document data in the document data set tgtDocs whose text information (body 205) contains the word t (step S1207). The feature degree calculation unit 31 then calculates the feature degree s(t) of the word t according to formula (1) below (step S1208).

s(t) = tf × (log(|tgtDocs| / df) + 1) … (1)

In formula (1), |tgtDocs| is the number of document data d contained in the target document data set tgtDocs (the number of documents). This formula is generally called TF-IDF and has long been widely used in information retrieval and document classification. The more frequently the word t appears in the document data d (the larger tf is), or the fewer documents among all the document data contain the word t (the smaller df is), the better the word t is regarded as representing the characteristics of the document data d. In the present invention, TF-IDF is applied to the target document data set to calculate the feature degree of each word.
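
Formula (1) can be checked with a few lines of code. This sketch assumes that each document has already been reduced to a list of selected words and that the logarithm is the natural logarithm; the text does not state the base. The function name `feature_degrees` is introduced here for illustration.

```python
import math
from collections import Counter
from typing import Dict, List


def feature_degrees(docs: List[List[str]]) -> Dict[str, float]:
    """Compute s(t) = tf * (log(|tgtDocs| / df) + 1) for every word in the document set."""
    n_docs = len(docs)
    # tf: total occurrences of each word across the whole document set (S1206)
    tf = Counter(word for doc in docs for word in doc)
    # df: number of documents in which each word appears at least once (S1207)
    df = Counter(word for doc in docs for word in set(doc))
    return {word: tf[word] * (math.log(n_docs / df[word]) + 1) for word in tf}


scores = feature_degrees([["a", "b", "a"], ["a"], ["c"]])
for word, score in sorted(scores.items()):
    print(word, round(score, 3))
```

Note that tf here is counted over the whole target document set rather than per document, as the description of step S1206 states.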

The feature degree calculation unit 31 generates feature degree data in which the word t and the document frequency df and feature degree s(t) calculated for it are set as the word 510, the document frequency 501, and the feature degree 502, respectively, and stores the data in the feature degree data storage unit 4 (step S1209). It then returns to step S1204, selects an unselected word t, and repeats the processing.
When steps S1205 to S1209 have been performed for all the words t obtained by the morphological analysis (step S1204-YES), the feature degree calculation unit 31 ends the processing.

FIG. 11 is a flowchart showing the flow of processing in which the attention word setting unit 53 sets attention words. The flowchart is an example of the detailed processing of the attention word setting in step S4 of FIG. 7. In this processing, the attention word setting unit 53 accepts from the user a plurality of words to focus on when classifying into feature word categories, and outputs them to the document classification unit 3 as attention words.

First, the attention word setting unit 53 displays the attention word setting form 1910, with which the user sets attention words, for example as shown in FIG. 23 (step S1301). The attention word setting form 1910 contains an attention word input field 1911 and an attention word list 1913, which lists the contents of the attention word list data 600. The form also contains an “Add to list” button 1912 for adding the character string entered in the attention word input field 1911 to the attention word list, a “Set as attention word” button 1914 for setting the words selected in the attention word list 1913 as attention words, and a “Cancel” button 1915 for canceling the attention word setting operation.

The attention word candidates 601 of the attention word list data 600 hold a list of the filter words that the user selected as attention words on the biaxial map or the character strings that the user entered in the attention word input field 1911. The initial value of the attention word list data 600 is, however, an empty list. Alternatively, the attention word setting unit 53 may set, as the initial value of the attention word candidates 601, the list of words set in the feature degree data stored in the feature degree data storage unit 4. In that case, the initial values of the attention word setting 602 may be all “0” or all “1”.

In the attention word list 1913, a check box indicating whether the word is actually to be used as an attention word is displayed in front of each word registered as an attention word candidate 601. When the attention word setting form 1910 is first displayed, the attention word setting unit 53 displays as checked by default the check boxes of the attention word candidates 601 whose attention word setting 602 in the attention word list data 600 is “1”.

Next, in FIG. 11, the attention word setting unit 53 accepts the input or selection of attention words from the user (step S1302). The user can select the words to actually use as attention words from the candidates displayed in the attention word list 1913 of the attention word setting form 1910 (by checking their check boxes), or, if a word of interest is not in the attention word list 1913, can enter the word (character string) directly in the attention word input field 1911.

When the attention word setting unit 53 receives a request from the user to add an attention word, that is, when the user enters a character string in the attention word input field 1911 and selects the “Add to list” button 1912 (step S1303-YES), it adds the entered character string to the attention word list 1913 for display (step S1304) and returns to step S1302. When the user makes no request to add an attention word to the list, the attention word setting unit 53 proceeds to step S1305 (step S1303-NO).

When there is no request to add an attention word to the list (step S1303-NO) and the user makes an attention word setting request, that is, when the user selects the “Set as attention word” button 1914, the attention word setting unit 53 performs steps S1306 and S1307 (step S1305-[set attention words]).
That is, the attention word setting unit 53 stores the contents of the attention word list 1913 as the attention word list data 600 (step S1306). Specifically, it sets the words in the attention word list 1913 as the attention word candidates 601. For each word whose check box the user checked in the attention word list 1913, it sets a flag in the attention word setting 602 (“1” in this example); for each unchecked word, it leaves the flag unset (“0” in this example). It stores the result in the attention word list data 600. The attention word setting unit 53 then reads, from the attention word list data 600 updated in step S1306, the words set in the attention word candidates 601 whose attention word setting 602 is flagged. It outputs the set of words read to the document classification unit 3 as the attention word set (step S1307) and ends the attention word setting processing of FIG. 11.

なお、ステップＳ１３０５において、着目語の設定要求またはキャンセル要求のいずれも入力されない場合、着目語設定部５３は、ステップＳ１３０２の処理に遷移する（ステップＳ１３０５−［要求なし］）。あるいは、ステップＳ１３０５において、キャンセル要求があった場合、つまり、ユーザが「キャンセル」ボタン１９１５を選択した場合、着目語設定部５３は、図１１の着目語設定処理を終了する（ステップＳ１３０５−[キャンセル]）。 If neither a request to set the attention words nor a cancel request is input in step S1305, the attention word setting unit 53 returns to step S1302 (step S1305-[no request]). If a cancel request is made in step S1305, that is, if the user selects the “Cancel” button 1915, the attention word setting unit 53 ends the attention word setting process of FIG. 11 (step S1305-[cancel]).
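Steps S1306 and S1307 can be illustrated with a minimal Python sketch. The class name `AttentionWordEntry`, the two function names, and the sample words are hypothetical and chosen only for illustration; the sketch assumes the attention word list data 600 can be modeled as a list of (candidate, flag) records.

```python
from dataclasses import dataclass


@dataclass
class AttentionWordEntry:
    candidate: str  # plays the role of attention word candidate 601
    flag: int       # plays the role of attention word setting 602 (1 = checked, 0 = unchecked)


def store_attention_word_list(list_items):
    """Step S1306 (sketch): store every listed word together with its check-box flag."""
    return [AttentionWordEntry(word, 1 if checked else 0) for word, checked in list_items]


def read_attention_word_set(list_data):
    """Step S1307 (sketch): collect only the candidates whose flag is set."""
    return {entry.candidate for entry in list_data if entry.flag == 1}


# Hypothetical contents of the attention word list: (word, check box state)
list_items = [("internal control", True), ("audit", False), ("risk", True)]
list_data = store_attention_word_list(list_items)
attention_words = read_attention_word_set(list_data)
```

Unchecked candidates stay in the stored list with flag 0, so they remain available as candidates without entering the attention word set.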

図１２は、特徴度補正部３２が補正特徴度を求める処理の流れを示すフローチャートである。同図に示すフローチャートは、図７のステップＳ５における特徴度補正処理の詳細な処理の一例を示す。この処理において、特徴度補正部３２は、ユーザインターフェース部５の着目語設定部５３から着目語集合を受信し、特徴度データ記憶部４に記憶された特徴度データに補正特徴度を設定する。 FIG. 12 is a flowchart showing a flow of processing in which the feature correction unit 32 obtains a corrected feature. The flowchart shown in the figure shows an example of detailed processing of the feature correction processing in step S5 of FIG. In this processing, the feature correction unit 32 receives the attention word set from the attention word setting unit 53 of the user interface unit 5 and sets the correction feature in the feature data stored in the feature data storage unit 4.

最初に、特徴度補正部３２は、ユーザインターフェース部５の着目語設定部５３を介してユーザが指定した着目語ａｔの集合である着目語集合ａｔｓを受信する（ステップＳ１４０１）。具体的には、特徴度補正部３２は、図１１のステップＳ１３０７において着目語設定部５３が出力した着目語集合を受信し、着目語集合ａｔｓとする。このユーザによる着目語集合ａｔｓの入力は、図９のステップＳ１１０３における対象カテゴリ（カテゴリｃａｔ）の入力と同時でもよいし、別のタイミングで行われてもよい。 First, the feature correction unit 32 receives a focused word set ats that is a set of focused words at specified by the user via the focused word setting unit 53 of the user interface unit 5 (step S1401). Specifically, the feature correction unit 32 receives the attention word set output from the attention word setting unit 53 in step S1307 in FIG. 11 and sets it as the attention word set ats. The input of the focused word set ats by the user may be performed simultaneously with the input of the target category (category cat) in step S1103 in FIG. 9 or may be performed at another timing.

特徴度補正部３２は、受信した着目語集合ａｔｓ中の全ての着目語ａｔを１つずつ選択し、ステップＳ１４０３の処理を繰り返す（ステップＳ１４０２−ＮＯ）。すなわち、特徴度補正部３２は、着目語ａｔに対応する特徴度データを取得する特徴度データ取得処理を行う（ステップＳ１４０３）。この処理により、特徴度補正部３２は、特徴度データ記憶部４に着目語ａｔの特徴度データが登録されている場合にはそれを取得し、登録されていない場合は着目語ａｔの特徴度データを生成して特徴度データ記憶部４に登録する。特徴度データが登録されていない着目語ａｔは、例えば、図１０に示す特徴度算出処理において形態素解析により取得されなかった複合語などの単語である。特徴度データ取得処理の詳細については後述する図１３のフローチャートにおいて説明する。 The feature correction unit 32 selects the attention words at in the received attention word set ats one at a time and repeats the process of step S1403 (step S1402-NO). That is, the feature correction unit 32 performs feature data acquisition processing to acquire the feature data corresponding to the attention word at (step S1403). Through this processing, if feature data for the attention word at is registered in the feature data storage unit 4, the feature correction unit 32 acquires it; if not, it generates feature data for the attention word at and registers it in the feature data storage unit 4. An attention word at with no registered feature data is, for example, a compound word that was not obtained by morphological analysis in the feature calculation processing shown in FIG. 10. The feature data acquisition processing is described in detail later with reference to the flowchart of FIG. 13.

着目語集合ａｔｓ中の全ての着目語ａｔについて特徴度データ取得処理が終了すると（ステップＳ１４０２−ＹＥＳ）、特徴度補正部３２は、特徴度データ記憶部４に特徴度データが登録されている全ての単語ｔの中から１つずつ選択し、選択した単語ｔについて以下のステップＳ１４０５〜ステップＳ１４１０の処理を繰り返す（ステップＳ１４０４−ＮＯ）。 When the feature data acquisition processing has finished for all attention words at in the attention word set ats (step S1402-YES), the feature correction unit 32 selects, one at a time, each word t whose feature data is registered in the feature data storage unit 4 and repeats the following steps S1405 to S1410 for the selected word t (step S1404-NO).

まず、特徴度補正部３２は、特徴度データ記憶部４から単語ｔに対応する特徴度データｋｄを取得する特徴度データ取得処理を行う（ステップＳ１４０５）。特徴度データ取得処理の詳細については後述する図１３のフローチャートにおいて説明する。特徴度補正部３２は、ステップＳ１４０５において取得した特徴度データｋｄから特徴度ｓ（ｔ）を取得し、以下の計算式（２）のように、単語ｔの補正特徴度ｍｓ（ｔ）の初期値とする（ステップＳ１４０６）。 First, the feature correction unit 32 performs feature data acquisition processing to acquire the feature data kd corresponding to the word t from the feature data storage unit 4 (step S1405). The feature data acquisition processing is described in detail later with reference to the flowchart of FIG. 13. The feature correction unit 32 obtains the feature degree s(t) from the feature data kd acquired in step S1405 and uses it as the initial value of the corrected feature degree ms(t) of the word t, as in the following formula (2) (step S1406).

ｍｓ（ｔ）＝ｓ（ｔ） …（２） ms (t) = s (t) (2)

続いて、特徴度補正部３２は、着目語集合ａｔｓに含まれる全ての着目語ａｔを１つずつ選択し、選択した着目語ａｔについて、ステップＳ１４０８とステップＳ１４０９の処理を繰り返す（ステップＳ１４０７−ＮＯ）。 Subsequently, the feature correction unit 32 selects the attention words at in the attention word set ats one at a time and repeats the processing of steps S1408 and S1409 for the selected attention word at (step S1407-NO).
まず、特徴度補正部３２は、対象カテゴリに分類された文書データ集合ｔｇｔＤｏｃｓにおける単語ｔと着目語ａｔとの共起度ｃｏ（ｔ，ａｔ）を算出する（ステップＳ１４０８）。文書データ集合ｔｇｔＤｏｃｓは、図１０のステップＳ１２０２と同様の処理により取得するか、図１０のステップＳ１２０２において特徴度算出部３１が取得したものを用いることができる。ここで、文書データ集合ｔｇｔＤｏｃｓにおける単語ｔと着目語ａｔとの共起度ｃｏ（ｔ，ａｔ）は、以下の計算式（３）〜（７）のいずれかによって算出される値である。 First, the feature correction unit 32 calculates the co-occurrence degree co(t, at) between the word t and the attention word at in the document data set tgtDocs classified into the target category (step S1408). The document data set tgtDocs can be acquired by the same processing as step S1202 of FIG. 10, or the set acquired by the feature calculation unit 31 in step S1202 of FIG. 10 can be reused. Here, the co-occurrence degree co(t, at) of the word t and the attention word at in the document data set tgtDocs is a value calculated by any one of the following formulas (3) to (7).

共起数＝｜ｔ∩ａｔ｜ …（３）
Ｄｉｃｅ係数Ｄ＝｜ｔ∩ａｔ｜／（｜ｔ｜＋｜ａｔ｜） …（４）
Ｊａｃｃａｒｄ係数Ｊ＝｜ｔ∩ａｔ｜／｜ｔ∪ａｔ｜ …（５）
Ｓｉｍｐｓｏｎ係数Ｓ＝｜ｔ∩ａｔ｜／ｍｉｎ（ｔ，ａｔ） …（６）
Ｃｏｓｉｎｅ係数Ｃ＝｜ｔ∩ａｔ｜／ｓｑｒｔ（｜ｔ｜×｜ａｔ｜） …（７）

Number of co-occurrences = |t ∩ at| … (3)
Dice coefficient D = |t ∩ at| / (|t| + |at|) … (4)
Jaccard coefficient J = |t ∩ at| / |t ∪ at| … (5)
Simpson coefficient S = |t ∩ at| / min(t, at) … (6)
Cosine coefficient C = |t ∩ at| / sqrt(|t| × |at|) … (7)

上記では、文書データ集合ｔｇｔＤｏｃｓにおいて、テキスト情報に単語ｔを含んだ文書データｄの数（以下、「生起数」という）を｜ｔ｜とし、文書データ集合ｔｇｔＤｏｃｓにおける着目語ａｔの生起数を｜ａｔ｜とする。また、文書データ集合ｔｇｔＤｏｃｓにおいて、単語ｔと着目語ａｔをともにテキスト情報に含んだ文書データｄの数（以下、「共起数」という）を｜ｔ∩ａｔ｜とし、単語ｔと着目語ａｔのうち少なくとも１つをテキスト情報に含んだ文書データｄの数を｜ｔ∪ａｔ｜とする。また、ｍｉｎ（ｔ，ａｔ）は、単語ｔの生起数と着目語ａｔの生起数のうち少ないほうを示し、ｓｑｒｔは平方根を求めることを示す。 In the above, |t| is the number of document data d in the document data set tgtDocs whose text information contains the word t (hereinafter, the “occurrence count”), and |at| is the occurrence count of the attention word at in the document data set tgtDocs. |t ∩ at| is the number of document data d in tgtDocs whose text information contains both the word t and the attention word at (hereinafter, the “co-occurrence count”), and |t ∪ at| is the number of document data d whose text information contains at least one of the word t and the attention word at. min(t, at) denotes the smaller of the occurrence count of the word t and that of the attention word at, and sqrt denotes the square root.
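The co-occurrence measures of formulas (3) to (7) can be sketched in Python as follows. This is an illustration under stated assumptions, not the patented implementation: documents are modeled as plain strings, containment is tested by substring search, the function name `cooccurrence_measures` is hypothetical, and returning 0.0 when a denominator is zero is an assumption the text does not address.

```python
import math


def cooccurrence_measures(tgt_docs, t, at):
    """Compute formulas (3)-(7) for word t and attention word at over tgt_docs."""
    n_t = sum(1 for text in tgt_docs if t in text)                    # |t|
    n_at = sum(1 for text in tgt_docs if at in text)                  # |at|
    n_both = sum(1 for text in tgt_docs if t in text and at in text)  # |t ∩ at|
    n_either = n_t + n_at - n_both                                    # |t ∪ at|

    def ratio(num, den):
        # Assumption: a zero denominator yields 0.0 rather than an error.
        return num / den if den else 0.0

    return {
        "count": n_both,                                   # formula (3)
        "dice": ratio(n_both, n_t + n_at),                 # formula (4), as written in the text
        "jaccard": ratio(n_both, n_either),                # formula (5)
        "simpson": ratio(n_both, min(n_t, n_at)),          # formula (6)
        "cosine": ratio(n_both, math.sqrt(n_t * n_at)),    # formula (7)
    }


# Hypothetical target-category documents
tgt_docs = ["internal control audit", "internal control", "audit report", "risk"]
measures = cooccurrence_measures(tgt_docs, "audit", "internal control")
```

In this toy data both words occur in two documents and co-occur in one, so the Simpson and Cosine coefficients are both 0.5, the Dice coefficient as defined in formula (4) is 0.25, and the Jaccard coefficient is 1/3.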

特徴度補正部３２は、ステップＳ１４０８において算出した単語ｔと着目語ａｔとの共起度ｃｏ（ｔ，ａｔ）を用いて、以下の計算式（８）に基づいて、補正特徴度ｍｓ（ｔ）を更新する（ステップＳ１４０９）。 Using the co-occurrence degree co(t, at) between the word t and the attention word at calculated in step S1408, the feature correction unit 32 updates the corrected feature degree ms(t) according to the following formula (8) (step S1409).

ｍｓ（ｔ）＝ｍｓ（ｔ）×ｃｏ（ｔ，ａｔ）・・・（８） ms (t) = ms (t) × co (t, at) (8)

ステップＳ１４０９の処理の後、特徴度補正部３２は、ステップＳ１４０７からの処理に戻り、着目語集合ａｔｓ中の未選択の着目語ａｔを選択して処理を繰り返す。そして、全ての着目語ａｔについてステップＳ１４０８及びステップＳ１４０９の繰り返し処理が終了すると（ステップＳ１４０７−ＹＥＳ）、特徴度補正部３２は、ステップＳ１４０５において取得した特徴度データｋｄの補正特徴度５０３に補正特徴度ｍｓ（ｔ）を挿入する。特徴度補正部３２は、特徴度データ記憶部４に現在記憶されている単語ｔの特徴度データを、補正特徴度５０３を設定した特徴度データｋｄにより更新する（ステップＳ１４１０）。 After step S1409, the feature correction unit 32 returns to step S1407, selects an unselected attention word at in the attention word set ats, and repeats the processing. When steps S1408 and S1409 have been repeated for all attention words at (step S1407-YES), the feature correction unit 32 sets the corrected feature degree ms(t) in the corrected feature degree 503 of the feature data kd acquired in step S1405. The feature correction unit 32 then updates the feature data of the word t currently stored in the feature data storage unit 4 with the feature data kd in which the corrected feature degree 503 has been set (step S1410).

ステップＳ１４１０の後、特徴度補正部３２はステップＳ１４０４に戻り、未選択の単語ｔを選択して処理を繰り返す。全ての単語ｔについてステップＳ１４０５〜ステップＳ１４１０の処理が終了すると（ステップＳ１４０４−ＹＥＳ）、特徴度補正部３２は特徴度補正処理を終了する。 After step S1410, the feature correction unit 32 returns to step S1404, selects an unselected word t, and repeats the process. When the processing of steps S1405 to S1410 is completed for all words t (step S1404—YES), the feature correction unit 32 ends the feature correction processing.
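The double loop of steps S1404 to S1410 can be sketched as follows. The function name `correct_feature_degrees`, the dictionary-based storage, and the toy co-occurrence table are assumptions made for illustration; the co-occurrence function is passed in so that any of formulas (3) to (7) could be used.

```python
def correct_feature_degrees(feature_degrees, attention_words, co):
    """Return ms(t) for every word t.

    feature_degrees: dict mapping word t to its feature degree s(t)
    attention_words: iterable of attention words at
    co: function (t, at) -> co-occurrence degree co(t, at)
    """
    corrected = {}
    for t, s in feature_degrees.items():   # loop of step S1404
        ms = s                             # formula (2): initial value
        for at in attention_words:         # loop of step S1407
            ms = ms * co(t, at)            # formula (8)
        corrected[t] = ms                  # step S1410: store the result
    return corrected


# Hypothetical co-occurrence degrees for illustration only
co_table = {
    ("audit", "internal control"): 0.5,
    ("audit", "report"): 0.5,
    ("risk", "internal control"): 0.0,
    ("risk", "report"): 1.0,
}
corrected = correct_feature_degrees(
    {"audit": 4.0, "risk": 2.0},
    ["internal control", "report"],
    lambda t, at: co_table[(t, at)],
)
```

Because formula (8) multiplies the degrees together, a word whose co-occurrence degree with any single attention word is 0 ends up with ms(t) = 0, as "risk" does here.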

上記においては、特徴度補正部３２は、着目語設定部５３から着目語集合を受信して上記の特徴度補正処理を行っているが、着目語設定部５３に着目語が１つ入力されるたびにその着目語を受信することにより、上記の特徴度補正処理を逐次実行してもよい。 In the above, the feature correction unit 32 receives the attention word set from the attention word setting unit 53 and performs the feature correction processing. However, one attention word is input to the attention word setting unit 53. The feature correction processing may be executed sequentially by receiving the attention word each time.

図１３は、特徴度補正部３２が特徴度データを取得する処理の流れを示すフローチャートである。同図に示すフローチャートは、図１２のステップＳ１４０３及びステップＳ１４０５における特徴度データ取得処理の一例を示す。この処理において、特徴度補正部３２は、特徴度データ記憶部４から所定の単語ｋの特徴度データを取得する。単語ｋは、ステップＳ１４０３の処理の場合は着目語ａｔであり、ステップＳ１４０５の処理の場合は、単語ｔである。 FIG. 13 is a flowchart showing a flow of processing in which the feature correction unit 32 acquires feature data. The flowchart shown in the figure shows an example of the feature data acquisition processing in steps S1403 and S1405 in FIG. In this process, the feature correction unit 32 acquires feature data of a predetermined word k from the feature data storage unit 4. The word k is the attention word at in the case of the process of step S1403, and is the word t in the case of the process of step S1405.

特徴度補正部３２は、単語ｋの特徴度データ取得要求を受信すると（ステップＳ１５０１）、特徴度データ記憶部４に単語ｋの特徴度データが存在するか否かを判定する（ステップＳ１５０２）。特徴度データ記憶部４に単語ｋに対する特徴度データが記憶されていないと判定した場合（ステップＳ１５０２−ＮＯ）、特徴度補正部３２は、以下のステップＳ１５０３〜ステップＳ１５０７の処理を行い、単語ｋに対する特徴度データｋｄを生成する。 When the feature correction unit 32 receives a feature data acquisition request for the word k (step S1501), it determines whether feature data for the word k exists in the feature data storage unit 4 (step S1502). If it determines that no feature data for the word k is stored in the feature data storage unit 4 (step S1502-NO), the feature correction unit 32 performs the following steps S1503 to S1507 to generate the feature data kd for the word k.

ステップＳ１５０３〜ステップＳ１５０７は、特徴度算出部３１による形態素解析では得られなかった単語を特徴語カテゴリの生成に利用するための処理である。形態素解析によって抽出される（対象とする品詞かつ不要語でない）単語ｔであれば、その単語ｔに対する特徴度データは、図１０示す特徴度算出部３１の処理において生成される。しかし、ユーザインターフェース部５の着目語設定部５３においてユーザは任意の文字列を着目語ａｔとして設定できる。このとき、ユーザが指定する着目語ａｔは、形態素解析によって抽出される単語ｔに含まれるとは限らない。例えば、ユーザが”内部統制”という文字列を着目語ａｔとして設定した場合、特徴度算出部３１が抽出した形態素が”内部”と”統制”であれば、この２つの単語に対応する特徴度データは生成されているが、”内部統制”という単語としては、特徴度データは生成されない。このような問題は、特に”内部統制”のように複数の単語を１つの単語として扱う複合語において生じる。ステップＳ１５０３〜ステップＳ１５０７の処理は、この問題に対処するための処理である。 Steps S1503 to S1507 make words that were not obtained by the morphological analysis of the feature calculation unit 31 available for generating feature word categories. For a word t extracted by morphological analysis (one with a target part of speech that is not a stop word), the feature data is generated in the processing of the feature calculation unit 31 shown in FIG. 10. However, in the attention word setting unit 53 of the user interface unit 5, the user can set an arbitrary character string as an attention word at, so the attention word at specified by the user is not necessarily among the words t extracted by morphological analysis. For example, suppose the user sets the character string “internal control” (内部統制) as an attention word at. If the morphemes extracted by the feature calculation unit 31 are “internal” (内部) and “control” (統制), feature data exists for those two words but not for “internal control” as a single word. This problem arises particularly for compound words such as “internal control”, which treat multiple words as one. Steps S1503 to S1507 address this problem.

具体的には、特徴度補正部３２は、対象カテゴリに分類された文書データ集合ｔｇｔＤｏｃｓの中に含まれる全ての文書データｄのテキスト情報（本文２０５）において単語ｋが出現する数である出現頻度ｋｆを算出する（ステップＳ１５０３）。文書データ集合ｔｇｔＤｏｃｓは、図１０のステップＳ１２０２と同様の処理により取得するか、図１０のステップＳ１２０２において特徴度算出部３１が取得したものを用いることができる。ここで、単語ｋは上述の通り形態素解析によって抽出されない単語であるため、特徴度補正部３２は、形態素解析結果から出現頻度をカウントするのではなく、文字列検索などを使ってカウントする。 Specifically, the feature correction unit 32 generates an appearance frequency that is the number of occurrences of the word k in the text information (body text 205) of all the document data d included in the document data set tgtDocs classified into the target category. kf is calculated (step S1503). The document data set tgtDocs can be acquired by the same processing as step S1202 in FIG. 10 or can be obtained by the feature calculation unit 31 in step S1202 in FIG. Here, since the word k is a word that is not extracted by the morphological analysis as described above, the feature correction unit 32 does not count the appearance frequency from the morphological analysis result, but counts it using a character string search or the like.

次に、特徴度補正部３２は文書データ集合ｔｇｔＤｏｃｓの中で単語ｋがテキストデータ（本文２０５）に出現する文書データｄの数である文書頻度ｄｆを算出する（ステップＳ１５０４）。特徴度補正部３２は、ステップＳ１２０９における計算式（１）と同様の以下の計算式（９）を用いて単語ｋの特徴度ｓ（ｋ）を算出する（ステップＳ１５０５）。 Next, the feature correction unit 32 calculates a document frequency df that is the number of document data d in which the word k appears in the text data (body 205) in the document data set tgtDocs (step S1504). The feature correction unit 32 calculates the feature s (k) of the word k using the following calculation formula (9) similar to the calculation formula (1) in step S1209 (step S1505).

ｓ（ｋ）＝ｋｆ×（ｌｏｇ（｜ｔｇｔＤｏｃｓ｜／ｄｆ）＋１） …（９） s (k) = kf × (log (| tgtDocs | / df) +1) (9)

特徴度補正部３２は、単語ｋと、算出した文書頻度ｄｆ及び特徴度ｓ（ｋ）とをそれぞれ、単語５１０、文書頻度５０１、及び特徴度５０２に設定した特徴度データｋｄを生成し（ステップＳ１５０６）、特徴度データ記憶部４に格納する（ステップＳ１５０７）。特徴度補正部３２は、生成した特徴度データｋｄを特徴度データ取得要求元に出力する（ステップＳ１５０９）。 The feature correction unit 32 generates feature data kd in which the word k, the calculated document frequency df, and the feature s (k) are set to the word 510, document frequency 501, and feature 502, respectively (step S1506) and stored in the feature data storage unit 4 (step S1507). The feature correction unit 32 outputs the generated feature data kd to the feature data acquisition request source (step S1509).

一方、ステップＳ１５０２において、単語ｋの特徴度データが特徴度データ記憶部４に記憶されていると判定した場合（ステップＳ１５０２−ＹＥＳ）、特徴度補正部３２は、特徴度データ記憶部４から単語ｋの特徴度データｋｄを取得する（ステップＳ１５０８）。特徴度補正部３２は、取得した特徴度データｋｄを特徴度データ取得要求元に出力する（ステップＳ１５０９）。 On the other hand, if it is determined in step S1502 that the feature data of the word k is stored in the feature data storage unit 4 (YES in step S1502), the feature correction unit 32 reads the word from the feature data storage unit 4. The feature data kd of k is acquired (step S1508). The feature correction unit 32 outputs the acquired feature data kd to the feature data acquisition request source (step S1509).
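The get-or-create behavior of FIG. 13 can be sketched as follows. The function name `get_feature_data`, the dictionary used as the feature data store, and the record keys are hypothetical. Two further assumptions are labeled in the comments: the natural logarithm is used because the text does not state the base, and a word that appears in no document (df = 0) is given a feature degree of 0.0, a case the text does not cover.

```python
import math


def get_feature_data(store, k, tgt_docs):
    """Return the feature data for word k, generating and storing it if absent."""
    if k in store:                                   # step S1502-YES
        return store[k]                              # steps S1508, S1509
    # Step S1503: appearance frequency kf, counted by string search
    # instead of morphological analysis results.
    kf = sum(text.count(k) for text in tgt_docs)
    # Step S1504: document frequency df.
    df = sum(1 for text in tgt_docs if k in text)
    # Step S1505, formula (9). Assumptions: natural log; df == 0 gives 0.0.
    s = kf * (math.log(len(tgt_docs) / df) + 1) if df else 0.0
    kd = {"word": k, "df": df, "s": s}               # step S1506
    store[k] = kd                                    # step S1507
    return kd                                        # step S1509


# Hypothetical target-category documents
tgt_docs = [
    "internal control audit",
    "internal control and internal control review",
    "risk",
]
store = {}
kd = get_feature_data(store, "internal control", tgt_docs)
```

Here kf is 3 (the compound word appears once in the first document and twice in the second) and df is 2, so s(k) = 3 × (log(3/2) + 1). A second request for the same word returns the stored record without recomputation.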

図１２及び図１３の処理により、図７に示す特徴度補正処理（ステップＳ５）が終了すると、特徴語カテゴリ生成部３３において、傾向ベクトル生成部３４が、傾向ベクトル生成処理を行い（ステップＳ６）、クラスタリング部３５はクラスタリング処理を行い（ステップＳ７）、カテゴリ生成処理部３６が特徴語カテゴリ生成処理を行う（ステップＳ８）。このように、ステップＳ６〜ステップＳ８において、特徴語カテゴリ生成部３３は、特徴度データ記憶部４に記憶されている特徴度データを用いて、軸カテゴリ集合に対して出現傾向の類似した特徴語のグループに基づく特徴語カテゴリを生成する。 When the feature correction processing (step S5) shown in FIG. 7 is completed by the processing of FIG. 12 and FIG. 13, in the feature word category generation unit 33, the trend vector generation unit 34 performs the trend vector generation processing (step S6). The clustering unit 35 performs clustering processing (step S7), and the category generation processing unit 36 performs feature word category generation processing (step S8). In this manner, in step S6 to step S8, the feature word category generation unit 33 uses the feature data stored in the feature data storage unit 4 to use feature words having similar appearance tendencies with respect to the axis category set. A feature word category based on the group is generated.

図１４は、特徴語カテゴリ生成部３３の傾向ベクトル生成部３４が傾向ベクトルを求める処理の流れを示すフローチャートである。同図に示すフローチャートは、図７のステップＳ６における傾向ベクトル生成処理の詳細な処理の一例を示す。この処理において、傾向ベクトル生成部３４は、特徴度データ記憶部４に記憶されている特徴度データに基づいて特徴語集合を抽出し、抽出された特徴語集合に含まれる各特徴語について傾向ベクトルを生成して特徴度データ記憶部４に記憶する。 FIG. 14 is a flowchart showing a flow of processing in which the trend vector generation unit 34 of the feature word category generation unit 33 obtains a trend vector. The flowchart shown in the figure shows an example of detailed processing of the trend vector generation processing in step S6 of FIG. In this process, the trend vector generation unit 34 extracts a feature word set based on the feature data stored in the feature data storage unit 4, and the trend vector for each feature word included in the extracted feature word set. Is stored in the feature data storage unit 4.

最初に、傾向ベクトル生成部３４は、空の特徴語リストｔｌを生成する（ステップＳ１６０１）。傾向ベクトル生成部３４は、特徴度データ記憶部４に特徴度データが格納されている全ての単語ｔを１つずつ選択し、選択した単語ｔについてステップＳ１６０３〜ステップＳ１６０５の処理を行う（ステップＳ１６０２−ＮＯ）。ステップＳ１６０３〜ステップＳ１６０５において、傾向ベクトル生成部３４は、特徴語を抽出し、抽出した特徴語の集合を特徴語リストｔｌに格納する。 First, the trend vector generation unit 34 generates an empty feature word list tl (step S1601). The trend vector generation unit 34 selects all the words t stored with the feature data in the feature data storage unit 4 one by one, and performs the processing from step S1603 to step S1605 on the selected word t (step S1602). -NO). In steps S1603 to S1605, the trend vector generation unit 34 extracts feature words, and stores a set of extracted feature words in the feature word list tl.

具体的には、傾向ベクトル生成部３４は、特徴度データ記憶部４に記憶されている単語ｔの特徴度データを読み出し、読み出した特徴度データから補正特徴度５０３に設定されている補正特徴度ｍｓ（ｔ）を取得する（ステップＳ１６０３）。取得した補正特徴度ｍｓ（ｔ）があらかじめ設定された一定のしきい値ｍｉｎｍｓより大きい（ｍｉｎｍｓ＜ｍｓ（ｔ））場合（ステップＳ１６０４−ＹＥＳ）、傾向ベクトル生成部３４は、当該単語ｔを特徴語リストｔｌに追加する（ステップＳ１６０５）。取得した補正特徴度ｍｓ（ｔ）がしきい値ｍｉｎｍｓ以下（ｍｉｎｍｓ≧ｍｓ（ｔ））の場合（ステップＳ１６０４−ＮＯ）、あるいは、ステップＳ１６０５の処理の後、傾向ベクトル生成部３４は、ステップＳ１６０２に戻り、未選択の単語ｔを選択して処理を繰り返す。 Specifically, the trend vector generation unit 34 reads the feature data of the word t stored in the feature data storage unit 4 and obtains the corrected feature degree ms(t) set in the corrected feature degree 503 (step S1603). If the obtained ms(t) is greater than a preset threshold minms (minms < ms(t)) (step S1604-YES), the trend vector generation unit 34 adds the word t to the feature word list tl (step S1605). If ms(t) is less than or equal to the threshold minms (minms ≧ ms(t)) (step S1604-NO), or after step S1605, the trend vector generation unit 34 returns to step S1602, selects an unselected word t, and repeats the processing.

ここで、しきい値ｍｉｎｍｓは、単語ｔに対する補正特徴度ｍｓ（ｔ）の最小値であり、システム側で事前に設定する値である。このしきい値ｍｉｎｍｓによって、特徴語が抽出される。ただし、本実施形態ではしきい値を補正特徴度ｍｓ（ｔ）の最小値として設定したが、これに限らず、傾向ベクトル生成部３４は、ｍｓ（ｔ）が上位から所定個の単語ｔを特徴語とするという個数指定により特徴語を抽出してもよい。 Here, the threshold minms is the minimum value of the corrected feature degree ms(t) for a word t and is set in advance on the system side. Feature words are extracted according to this threshold. Although this embodiment sets the threshold as the minimum value of ms(t), the extraction is not limited to this: the trend vector generation unit 34 may instead extract feature words by count, taking a predetermined number of words t with the highest ms(t) as the feature words.
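The two extraction strategies described above, by threshold and by count, can be sketched as follows. The function names and sample values are hypothetical, and the corrected feature degrees are assumed to be available as a dictionary.

```python
def select_by_threshold(corrected, minms):
    """Steps S1602-S1605 (sketch): keep words with ms(t) strictly greater than minms."""
    return [t for t, ms in corrected.items() if ms > minms]


def select_top_n(corrected, n):
    """Alternative (sketch): keep the n words with the highest ms(t)."""
    return sorted(corrected, key=corrected.get, reverse=True)[:n]


# Hypothetical corrected feature degrees ms(t)
corrected = {"audit": 1.0, "risk": 0.0, "control": 2.5}
by_threshold = select_by_threshold(corrected, 0.5)
at_boundary = select_by_threshold(corrected, 1.0)
top_one = select_top_n(corrected, 1)
```

A word whose ms(t) equals the threshold is excluded, matching the condition minms < ms(t) of step S1604.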

特徴度データ記憶部４に特徴度データが格納されている全て単語ｔについて繰り返し処理を終了すると（ステップＳ１６０２−ＹＥＳ）、傾向ベクトル生成部３４は、特徴語リストｔｌに含まれる全ての特徴語ｔを１つずつ選択し、選択した特徴語ｔについて、ステップＳ１６０７〜ステップＳ１６１２の処理を繰り返す（ステップＳ１６０６−ＮＯ）。 When the iterative process is completed for all the words t for which the feature degree data is stored in the feature degree data storage unit 4 (step S1602-YES), the tendency vector generation unit 34 includes all the feature words t included in the feature word list tl. Are selected one by one, and the processing of step S1607 to step S1612 is repeated for the selected feature word t (step S1606-NO).

まず、傾向ベクトル生成部３４は、特徴度データ記憶部４から特徴語ｔの特徴度データｋｄを取得する（ステップＳ１６０７）。さらに、傾向ベクトル生成部３４は、カテゴリ操作部５２から入力された横軸子カテゴリ集合ｘＣａｔｓに含まれるカテゴリ数（横軸子カテゴリｘＣａｔの数）と同じ次元数の頻度ベクトルｖｃｆと傾向ベクトルｖｐｔｎを生成する（ステップＳ１６０８）。頻度ベクトルｖｃｆ及び傾向ベクトルｖｐｔｎの各要素は横軸子カテゴリｘＣａｔに対応する。 First, the trend vector generation unit 34 acquires the feature data kd of the feature word t from the feature data storage unit 4 (step S1607). Further, the trend vector generation unit 34 obtains the frequency vector vcf and the trend vector vptn having the same number of dimensions as the number of categories included in the horizontal axis category set xCats input from the category operation unit 52 (the number of horizontal axis categories xCat). Generate (step S1608). Each element of the frequency vector vcf and the trend vector vptn corresponds to the horizontal axis category xCat.

傾向ベクトル生成部３４は、横軸子カテゴリ集合ｘＣａｔｓに含まれる全ての横軸子カテゴリｘＣａｔを１つずつ選択し、選択した横軸子カテゴリｘＣａｔについてステップＳ１６１０、及びステップＳ１６１１の処理を繰り返す（ステップＳ１６０９−ＮＯ）。 The trend vector generation unit 34 selects the horizontal axis categories xCat in the horizontal axis category set xCats one at a time and repeats steps S1610 and S1611 for the selected horizontal axis category xCat (step S1609-NO).
つまり、傾向ベクトル生成部３４は、対象カテゴリｔｇｔＣａｔと横軸子カテゴリｘＣａｔに共通して含まれる文書データ集合について、特徴語ｔがテキスト情報（本文２０５）に出現する文書データの数（以下、「カテゴリ内頻度」という）ｃｆを算出する（ステップＳ１６１０）。対象カテゴリｔｇｔＣａｔと横軸子カテゴリｘＣａｔに共通して含まれる文書データ集合は、対象カテゴリｔｇｔＣａｔの分類ルールと横軸子カテゴリｘＣａｔの分類ルールとの論理積を満たす文書データであり、図８のステップＳ１００９と同様の処理により得られる。傾向ベクトル生成部３４は、特徴度データｋｄから特徴語ｔの文書頻度ｄｆとして文書頻度５０１を取得する。傾向ベクトル生成部３４は、頻度ベクトルｖｃｆの横軸子カテゴリｘＣａｔに対応した要素の値を、ステップＳ１６１０において算出したカテゴリ内頻度ｃｆとし、傾向ベクトルｖｐｔｎの横軸子カテゴリｘＣａｔに対応した要素の値を、ｃｆ／（ｄｆ＋１）とする（ステップＳ１６１１）。傾向ベクトル生成部３４は、ステップＳ１６０９に戻り、未選択の横軸子カテゴリｘＣａｔを選択して処理を繰り返す。 That is, for the set of document data belonging to both the target category tgtCat and the horizontal axis category xCat, the trend vector generation unit 34 calculates the number of document data in which the feature word t appears in the text information (body 205) (hereinafter, the “in-category frequency”) cf (step S1610). The set of document data belonging to both tgtCat and xCat consists of the document data satisfying the logical AND of the classification rule of tgtCat and that of xCat, and is obtained by the same processing as step S1009 of FIG. 8. The trend vector generation unit 34 obtains the document frequency 501 from the feature data kd as the document frequency df of the feature word t. It sets the element of the frequency vector vcf corresponding to xCat to the in-category frequency cf calculated in step S1610, and sets the element of the trend vector vptn corresponding to xCat to cf / (df + 1) (step S1611). The trend vector generation unit 34 then returns to step S1609, selects an unselected horizontal axis category xCat, and repeats the processing.

傾向ベクトル生成部３４は、全ての横軸子カテゴリｘＣａｔについてステップＳ１６０９〜ステップＳ１６１１の処理を行ったと判定した場合（ステップＳ１６０９−ＹＥＳ）、各横軸子カテゴリｘＣａｔについてステップＳ１６１１で算出した要素を並べた頻度ベクトルｖｃｆと傾向ベクトルｖｐｔｎを、特徴度データｋｄの軸カテゴリ（カテゴリ番号）に対する頻度ベクトル／傾向ベクトル５０４に格納する。傾向ベクトル生成部３４は、特徴度データ記憶部４に現在記憶されている特徴語ｔの特徴度データを、頻度ベクトルｖｃｆと傾向ベクトルｖｐｔｎを格納した特徴度データｋｄにより更新する（ステップＳ１６１２）。その後、傾向ベクトル生成部３４は、ステップＳ１６０６に戻り、未選択の特徴語ｔを選択して処理を繰り返す。 When the trend vector generation unit 34 determines that steps S1609 to S1611 have been performed for all horizontal axis categories xCat (step S1609-YES), it stores the frequency vector vcf and the trend vector vptn, whose elements were calculated in step S1611 for each xCat, in the frequency vector / trend vector 504 for the axis category (category number) of the feature data kd. The trend vector generation unit 34 updates the feature data of the feature word t currently stored in the feature data storage unit 4 with the feature data kd containing vcf and vptn (step S1612). It then returns to step S1606, selects an unselected feature word t, and repeats the processing.
そして、ステップＳ１６０６において、特徴語リストｔｌに含まれる全ての単語（特徴語）ｔについて、ステップＳ１６０７〜ステップＳ１６１２の処理を行ったと判定した場合（ステップＳ１６０６−ＹＥＳ）、傾向ベクトル生成部３４は傾向ベクトル生成処理を終了する。 When it determines in step S1606 that steps S1607 to S1612 have been performed for all words (feature words) t in the feature word list tl (step S1606-YES), the trend vector generation unit 34 ends the trend vector generation processing.
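The construction of the frequency vector vcf and the trend vector vptn for one feature word (steps S1608 to S1611) can be sketched as follows. The function name `build_vectors`, the category labels, and the representation of each horizontal axis category as a list of document strings (already restricted to the target category) are assumptions for illustration.

```python
def build_vectors(word, df, docs_by_xcat):
    """Return (vcf, vptn) with one element per horizontal axis category.

    word: the feature word t
    df: document frequency of t (document frequency 501)
    docs_by_xcat: dict mapping each horizontal axis category to the texts of
                  the documents that belong to both tgtCat and that category
    """
    vcf, vptn = [], []
    for xcat, texts in docs_by_xcat.items():             # loop of step S1609
        cf = sum(1 for text in texts if word in text)    # step S1610
        vcf.append(cf)                                   # step S1611: frequency element
        vptn.append(cf / (df + 1))                       # step S1611: trend element
    return vcf, vptn


# Hypothetical horizontal axis categories (for example, years)
docs_by_xcat = {
    "2011": ["audit plan", "budget"],
    "2012": ["audit report", "audit memo"],
}
vcf, vptn = build_vectors("audit", 3, docs_by_xcat)
```

With df = 3, the word appears in one document of the first category and two of the second, giving vcf = [1, 2] and vptn = [0.25, 0.5].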

なお、本実施形態においては、傾向ベクトルの要素の値をｃｆ／（ｄｆ＋１）、つまり対象カテゴリｔｇｔＣａｔに対する横軸子カテゴリｘＣａｔでの単語ｔの「出現割合」としたが、単純に文書頻度（ｄｆ）やカテゴリ内頻度（ｃｆ）としてもよい。もしくは以下のような自己相互情報量やイエーツ補正χ２乗値といった統計量に基づく値でもよい。統計量は従来の技術で算出される。傾向ベクトルの要素の値は、クラスタリング処理において各特徴語の重みづけとなり、クラスタリング結果に反映される。 In this embodiment, the value of each trend vector element is cf / (df + 1), that is, the “appearance ratio” of the word t in the horizontal axis category xCat relative to the target category tgtCat; however, the document frequency (df) or the in-category frequency (cf) may simply be used instead. Alternatively, a value based on a statistic such as the pointwise mutual information or the Yates-corrected chi-square value described below may be used. These statistics are calculated by conventional techniques. The trend vector element values act as the weights of the feature words in the clustering processing and are reflected in the clustering result.

自己相互情報量ＰＭＩは以下の計算式（１０）で算出される。 The pointwise mutual information PMI is calculated by the following formula (10).

自己相互情報量ＰＭＩ＝ｌｏｇ（ａｎ／（（ａ＋ｂ）（ａ＋ｃ））） …（１０） PMI = log(a × n / ((a + b)(a + c))) … (10)

また、イエーツ補正χ２乗値Ｙａｔｅｓは以下の計算式（１１）で算出される。 The Yates-corrected chi-square value Yates is calculated by the following formula (11).

Ｙａｔｅｓ’＝ｎ（｜ａｄ−ｂｃ｜−ｎ／２）＾２／（（ａ＋ｂ）（ｃ＋ｄ）（ａ＋ｃ）（ｂ＋ｄ））
ｉｆ（（ａｄ−ｂｃ）＜０）Ｙａｔｅｓ＝−Ｙａｔｅｓ’
ｅｌｓｅ　Ｙａｔｅｓ＝Ｙａｔｅｓ’ …（１１）

Yates' = n(|ad − bc| − n/2)^2 / ((a + b)(c + d)(a + c)(b + d))
if (ad − bc) < 0: Yates = −Yates'
else: Yates = Yates' … (11)

なお、計算式（１０）、（１１）において、｜ｘＣａｔ｜は、横軸子カテゴリｘＣａｔに分類された文書数、｜ｔｇｔＣａｔ｜は対象カテゴリｔｇｔＣａｔに分類された文書数であり、ａ、ｂ、ｃ、ｄ、ｎは以下のとおりである。 In the calculation formulas (10) and (11), | xCat | is the number of documents classified into the horizontal axis category xCat, | tgtCat | is the number of documents classified into the target category tgtCat, and a, b, c, d, and n are as follows.

自己相互情報量ＰＭＩでは、対象カテゴリｔｇｔＣａｔ中での出現確率と、横軸子カテゴリｘＣａｔ中での出現確率とで偏りの大きい特徴語を高く評価する。また、低頻度語を過大評価する傾向があるため、自己相互情報量ＰＭＩを利用する場合は、文書頻度ｄｆが極端に小さい単語は、特徴語から排除するなどの処理が必要となる。 The pointwise mutual information PMI gives a high score to feature words whose appearance probability in the target category tgtCat differs greatly from that in the horizontal axis category xCat. Because PMI tends to overestimate low-frequency words, using it requires additional handling, such as excluding words with an extremely small document frequency df from the feature words.
一方、イエーツ補正χ２乗値Ｙａｔｅｓは、対象カテゴリｔｇｔＣａｔ中での出現確率に対し、横軸子カテゴリｘＣａｔ中での出現確率が高い単語を高く評価する。結果として、クラスタリングにおいて比較的低頻度の特徴語が強く重みづけされる。しかし、自己相互情報量ＰＭＩと比べるとその傾向は小さい。 The Yates-corrected chi-square value Yates, on the other hand, gives a high score to words whose appearance probability in the horizontal axis category xCat is high relative to that in the target category tgtCat. As a result, relatively low-frequency feature words are weighted strongly in clustering, although this tendency is weaker than with PMI.
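Formulas (10) and (11) can be written as small Python functions. Because the definitions of a, b, c, d, and n are not reproduced in this section, the sketch takes them as explicit arguments and assumes only that they are counts that make every denominator nonzero and keep a positive. The function names are hypothetical, the natural logarithm is an assumption, and the sign is taken from ad − bc, the quantity inside the absolute value of formula (11), which is also an assumption.

```python
import math


def pmi(a, b, c, n):
    """Formula (10): log(a * n / ((a + b) * (a + c))). Assumes a > 0."""
    return math.log(a * n / ((a + b) * (a + c)))


def yates(a, b, c, d, n):
    """Formula (11): signed Yates-corrected chi-square value."""
    magnitude = n * (abs(a * d - b * c) - n / 2) ** 2 / (
        (a + b) * (c + d) * (a + c) * (b + d)
    )
    # Assumption: the sign follows ad - bc, as in the absolute-value term.
    return -magnitude if (a * d - b * c) < 0 else magnitude


# Hypothetical counts for illustration only
pmi_value = pmi(10, 20, 30, 100)
yates_value = yates(10, 20, 30, 40, 100)
```

For these counts ad − bc is negative, so the Yates value is negative, and the PMI is also negative because a × n is smaller than (a + b)(a + c).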

図１５は、特徴語カテゴリ生成部３３のクラスタリング部３５が実行する特徴語クラスタリングの処理の流れを示すフローチャートである。同図に示すフローチャートは、図７のステップＳ７におけるクラスタリング処理の詳細な処理の一例である。この処理において、クラスタリング部３５は、上述した図１４の傾向ベクトル生成処理において生成された傾向ベクトルを用いて、特徴語の類似性に基づく特徴語のクラスタリングを行い、関連の強い特徴語グループである特徴語クラスタを生成する。 FIG. 15 is a flowchart showing a flow of feature word clustering processing executed by the clustering unit 35 of the feature word category generation unit 33. The flowchart shown in the figure is an example of detailed processing of the clustering processing in step S7 of FIG. In this process, the clustering unit 35 performs clustering of feature words based on the similarity of feature words using the trend vectors generated in the above-described trend vector generation process of FIG. 14, and is a strongly related feature word group. Generate feature word clusters.

なお、本実施形態では、クラスタリング手法として、文書クラスタリング手法を応用する。従来の文書クラスタリング手法は、例えば各々の文書データが備える特徴を、特徴ベクトル（ベクトル要素には文書内の単語の出現頻度などが用いられる）によって表し、この特徴ベクトルの類似度（例えば内積や余弦）に基づき、文書同士のまとまりである文書クラスタを生成するという方法である。 In the present embodiment, the document clustering method is applied as the clustering method. In the conventional document clustering method, for example, the features included in each document data are represented by feature vectors (the frequency of appearance of words in the document is used as vector elements), and the similarity (for example, inner product or cosine) of the feature vectors. ) To generate a document cluster that is a group of documents.

本実施形態のクラスタリング部３５は、抽出された特徴語の傾向ベクトルの類似度に基づき、特徴語クラスタを生成する。クラスタリングの手法としては、従来から様々なものが考案されているが、本実施形態では、ｌｅａｄｅｒ−ｆｏｌｌｏｗｅｒ法と呼ばれる比較的単純なクラスタリング手法を用いる。ただし、このクラスタリング手法に限定はされない。 The clustering unit 35 of this embodiment generates feature word clusters based on the similarity of the trend vectors of the extracted feature words. Various clustering methods have been devised; this embodiment uses a relatively simple one called the leader-follower method. However, the clustering method is not limited to this one.

最初に、クラスタリング部３５は、軸カテゴリ（カテゴリ番号）に対する頻度ベクトル／傾向ベクトル５０４に傾向ベクトル５１２が設定されている特徴度データｋｄの集合である特徴度データ集合ｋｄｓを特徴度データ記憶部４から取得する（ステップＳ１７０１）。クラスタリング部３５は、取得した特徴度データ集合ｋｄｓに含まれる単語ｔをクラスタリング対象の単語集合Ｔとし、分類先である特徴語クラスタ集合Ｃの初期値を空集合とする（ステップＳ１７０２）。特徴度データ集合ｋｄｓに含まれる単語ｔとは、特徴度データｋｄに単語５１０として設定されている単語である。クラスタリング部３５は、単語集合Ｔに含まれる全ての単語ｔを１つずつ選択し、選択した単語ｔについてステップＳ１７０４〜ステップＳ１７１５の処理を繰り返す（ステップＳ１７０３−ＮＯ）。 First, the clustering unit 35 acquires from the feature data storage unit 4 a feature data set kds, which is the set of feature data kd in which the trend vector 512 is set in the frequency vector / trend vector 504 for the axis category (category number) (step S1701). The clustering unit 35 takes the words t contained in the acquired feature data set kds as the word set T to be clustered and initializes the feature word cluster set C, the classification destination, to the empty set (step S1702). A word t contained in the feature data set kds is a word set as the word 510 in the feature data kd. The clustering unit 35 selects the words t in the word set T one at a time and repeats steps S1704 to S1715 for the selected word t (step S1703-NO).

First, the clustering unit 35 acquires the tendency vector vptn of the word t from the feature data set kds (step S1704). The clustering unit 35 initializes the classification target feature word cluster cmax, that is, the feature word cluster into which the word t is to be classified, to "none", and initializes the maximum similarity smax for the word t to 0 (step S1705).

クラスタリング部３５は、特徴語クラスタ集合Ｃに含まれる全ての特徴語クラスタｃを１つずつ選択し、選択した特徴語クラスタｃについてステップＳ１７０７〜ステップＳ１７０９の処理を繰り返す（ステップＳ１７０６−ＮＯ）。クラスタリング部３５は、特徴語クラスタ集合Ｃに含まれる全ての特徴語クラスタｃについて処理を終了すると（ステップＳ１７０６−ＹＥＳ）、ステップＳ１７１０の処理を行う。 The clustering unit 35 selects all the feature word clusters c included in the feature word cluster set C one by one, and repeats the processing of step S1707 to step S1709 for the selected feature word cluster c (step S1706-NO). When the clustering unit 35 finishes the processing for all the feature word clusters c included in the feature word cluster set C (step S1706: YES), the clustering unit 35 performs the process of step S1710.

However, when the first word t is processed, C is still the initial empty set, so the clustering unit 35 determines that processing has finished for all feature word clusters c in the feature word cluster set C (step S1706: YES) and then determines whether a classification target feature word cluster cmax exists (step S1710). Because cmax still has its initial value "none" (step S1710: NO), the clustering unit 35 creates a new feature word cluster cnew and adds it to the feature word cluster set C (step S1713). The clustering unit 35 classifies the tendency vector vptn of the word t into the created feature word cluster cnew (step S1714), and sets the feature vector vc of cnew to the tendency vector vptn of the word t (step S1715). At this point the word t is the only word classified into cnew, so the feature vector vc of cnew is identical to the tendency vector vptn of the word t. When step S1715 ends, the clustering unit 35 returns to step S1703, selects a word t that has not yet been selected, and repeats the processing.

In the second and subsequent iterations from step S1703, the feature word cluster set C is no longer empty, so in step S1706 the clustering unit 35 selects the feature word clusters c in C one at a time and repeats the processing of steps S1707 to S1709 (step S1706: NO).

Specifically, the clustering unit 35 calculates the similarity s between the word t and the feature word cluster c from the tendency vector vptn of the word t and the feature vector vc of the feature word cluster c (step S1707). The feature vector vc of the feature word cluster c is initialized in step S1715 to the tendency vector of the first word classified into that cluster, and is updated in steps S1710 to S1715, described later, using the tendency vectors of the words subsequently classified into it. The similarity s is calculated as the similarity between vptn and vc, for example the cosine of the two vectors, (vptn · vc) / (|vptn| × |vc|), where vptn · vc is the inner product of the tendency vector vptn and the feature vector vc, and |vptn| and |vc| are their respective norms.
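The cosine measure used in step S1707 can be sketched in a few lines of Python. This is only an illustrative sketch: the function name cosine_similarity and the choice to return 0 when either vector is a zero vector are assumptions, not something the embodiment specifies.

```python
import math


def cosine_similarity(vptn, vc):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(vptn, vc))      # inner product vptn . vc
    norm_t = math.sqrt(sum(a * a for a in vptn))    # |vptn|
    norm_c = math.sqrt(sum(b * b for b in vc))      # |vc|
    if norm_t == 0.0 or norm_c == 0.0:
        # Assumption: a zero vector is treated as having no similarity.
        return 0.0
    return dot / (norm_t * norm_c)
```

Because the cosine ignores vector length, a cluster feature vector that is a sum of many tendency vectors can still be compared directly with a single word's tendency vector.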

If the similarity s between the word t and the feature word cluster c is equal to or greater than a predetermined threshold smin and is greater than the current maximum similarity smax for the word t (step S1708: YES), the clustering unit 35 sets the classification target feature word cluster cmax to the feature word cluster c and sets smax to the similarity s (step S1709), and then returns to step S1706 and repeats the processing. If, on the other hand, the similarity s is less than the threshold smin or is equal to or less than smax (step S1708: NO), the clustering unit 35 returns to step S1706 without any update and repeats the processing.

After the loop of step S1706 ends (step S1706: YES), the clustering unit 35 determines whether a classification target feature word cluster cmax exists (step S1710). If cmax exists, that is, if among the existing feature word clusters c there is one whose similarity s to the word t is the largest and is equal to or greater than the threshold (step S1710: YES), the clustering unit 35 classifies the word t into cmax (step S1711). The clustering unit 35 then adds the tendency vector vptn of the word t to the feature vector vc of cmax to recalculate vc (step S1712). In other words, the feature vector vc of a feature word cluster c is the sum of the tendency vectors of the words t classified into that cluster, so adding vptn to the feature vector vc of the cluster serving as cmax yields the feature vector of that cluster after the word t has been classified into it.

一方、ステップＳ１７１０にて、分類先特徴語クラスタｃｍａｘが存在しない場合（ステップＳ１７１０−ＮＯ）、クラスタリング部３５は、上記と同様にステップＳ１７１３〜ステップＳ１７１５の処理を行う。すなわち、クラスタリング部３５は、特徴語クラスタｃｎｅｗを新規に作成して特徴語クラスタ集合Ｃに追加するとともに（ステップＳ１７１３）、特徴語クラスタｃｎｅｗに単語ｔの傾向ベクトルｖｐｔｎを分類する（ステップＳ１７１４）。クラスタリング部３５は、特徴語クラスタｃｎｅｗの特徴ベクトルｖｃを単語ｔの傾向ベクトルｖｐｔｎとする（ステップＳ１７１５）。 On the other hand, if the classification target feature word cluster cmax does not exist in step S1710 (step S1710-NO), the clustering unit 35 performs the processing of step S1713 to step S1715 in the same manner as described above. That is, the clustering unit 35 newly creates a feature word cluster cnew and adds it to the feature word cluster set C (step S1713), and classifies the tendency vector vptn of the word t into the feature word cluster cnew (step S1714). The clustering unit 35 sets the feature vector vc of the feature word cluster cnew as the tendency vector vptn of the word t (step S1715).

When step S1712 or step S1715 ends, the clustering unit 35 returns to step S1703 and repeats the processing. When the loop has finished for all words t in the word set T (step S1703: YES), the clustering unit 35 ends the clustering process of FIG. 15.
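Steps S1701 to S1715 can be summarized in the following Python sketch of the leader-follower loop. The data layout (a dict from word to tendency vector, and clusters held as dicts with "words" and "vc" keys) and the names leader_follower and cosine are assumptions made for illustration; only the control flow follows the steps described above.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 for a zero vector (an assumption)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def leader_follower(tendency_vectors, smin):
    """Cluster words by their tendency vectors.

    tendency_vectors: dict mapping word t -> tendency vector vptn (list of floats)
    smin: similarity threshold
    """
    clusters = []                                    # S1702: C starts empty
    for t, vptn in tendency_vectors.items():         # S1703 / S1704
        cmax, smax = None, 0.0                       # S1705
        for c in clusters:                           # S1706
            s = cosine(vptn, c["vc"])                # S1707
            if s >= smin and s > smax:               # S1708
                cmax, smax = c, s                    # S1709
        if cmax is not None:                         # S1710: YES
            cmax["words"].append(t)                  # S1711
            cmax["vc"] = [a + b for a, b in zip(cmax["vc"], vptn)]  # S1712
        else:                                        # S1710: NO
            clusters.append({"words": [t], "vc": list(vptn)})       # S1713 to S1715
    return clusters


# Hypothetical two-dimensional tendency vectors, for illustration only.
vectors = {"search": [1.0, 0.0], "retrieval": [0.9, 0.1], "mining": [0.0, 1.0]}
result = leader_follower(vectors, smin=0.8)
```

Each word is compared only with the clusters that exist when it is processed, so the outcome of this kind of single-pass clustering can depend on the order in which the words are visited.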

As described above, the clustering process of the clustering unit 35 shown in FIG. 15 takes the words for which tendency vectors were generated in the tendency vector generation process shown in FIG. 14, that is, the words determined to be feature words, and generates feature word clusters, each of which is a group of feature words whose tendency vectors are similar to one another.

図１０における特徴度算出処理においては形態素解析によって単語を抽出しているため、文書分類部３で自動生成される特徴語（フィルタ語）は形態素単位となる。一方、着目語設定部５３において、ユーザが直接入力する着目語は形態素単位である必要はなく、任意の文字列を着目語として設定することが可能である。そして、図１１における着目語設定処理においてユーザが着目語として設定した文字列が形態素単位でない場合でも、図１３に示す特徴度データ取得処理によって、それ以降は単語として扱われることになる。さらに、図１２における特徴度補正処理において、文書分類部３の特徴度補正部３２は、着目語設定部５３から入力された着目語集合に応じて単語の特徴度を補正し、図１５において、クラスタリング部３５は、ユーザが着目する単語に則した特徴語クラスタリングを実現する。 In the feature degree calculation processing in FIG. 10, since words are extracted by morphological analysis, feature words (filter words) automatically generated by the document classification unit 3 are in morpheme units. On the other hand, in the focused word setting unit 53, the focused word directly input by the user does not need to be in morpheme units, and an arbitrary character string can be set as the focused word. And even if the character string set as the focused word by the user in the focused word setting process in FIG. 11 is not in morpheme units, the character string is subsequently treated as a word by the feature data acquisition process shown in FIG. Furthermore, in the feature correction processing in FIG. 12, the feature correction unit 32 of the document classification unit 3 corrects the word feature according to the target word set input from the target word setting unit 53, and in FIG. The clustering unit 35 realizes feature word clustering according to the word that the user focuses on.

FIG. 16 is a flowchart showing the flow of processing in which the category generation processing unit 36 of the feature word category generation unit 33 generates feature word categories. The flowchart shows an example of the detailed processing of the feature word category generation process in step S8 of FIG. 7. In this process, the category generation processing unit 36 treats the words included in each feature word cluster generated by the clustering process of FIG. 15 as filter words, generates feature word category data that uses those filter words as the classification condition, and stores the data in the category storage unit 2.

カテゴリ生成処理部３６は、図１５に示すクラスタリングにより生成された特徴語クラスタ集合Ｃを受信する（ステップＳ１８０１）。カテゴリ生成処理部３６は、特徴語クラスタ集合Ｃ中のクラスタに含まれる全ての単語の特徴度データｋｄの集合である特徴度データ集合ｋｄｓを特徴度データ記憶部４から取得する（ステップＳ１８０２）。カテゴリ生成処理部３６は、特徴語クラスタ集合Ｃに含まれる全て特徴語クラスタｃを１つずつ選択し、選択した特徴語クラスタｃについてステップＳ１８０４〜ステップＳ１８０７の処理を繰り返す（ステップＳ１８０３−ＮＯ）。 The category generation processing unit 36 receives the feature word cluster set C generated by the clustering shown in FIG. 15 (step S1801). The category generation processing unit 36 acquires from the feature data storage unit 4 a feature data set kds that is a set of feature data kd of all the words included in the clusters in the feature word cluster set C (step S1802). The category generation processing unit 36 selects all the feature word clusters c included in the feature word cluster set C one by one, and repeats the processing of step S1804 to step S1807 for the selected feature word cluster c (step S1803-NO).

First, the category generation processing unit 36 takes the set of words t included in the feature word cluster c as the filter word set ts and, referring to the feature data set kds, takes the word t with the largest corrected feature degree in ts as the category name cn (step S1804). Suppose, for example, that the set of words t in the feature word cluster c is {"search", "classification", "management"}. In this case, the category generation processing unit 36 uses {"search", "classification", "management"} as the filter word set ts of the feature word cluster c, and uses "search", the word with the largest corrected feature degree in ts, as the category name cn.
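The selection in step S1804 amounts to taking a maximum over the filter word set. A minimal sketch follows, assuming the corrected feature degrees are available as a dict keyed by word; the function name and the numeric values are hypothetical, since the description gives no concrete scores.

```python
def choose_category_name(filter_words, corrected_feature):
    """Return the filter word with the largest corrected feature degree."""
    # max() keeps the first word encountered when several words tie.
    return max(filter_words, key=lambda w: corrected_feature[w])


ts = ["search", "classification", "management"]
# Hypothetical corrected feature degrees, for illustration only.
scores = {"search": 0.9, "classification": 0.6, "management": 0.4}
cn = choose_category_name(ts, scores)
```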

The category generation processing unit 36 generates a classification rule r based on the filter word set ts (step S1805). As shown, for example, in the feature word category data 400 and 410 of FIG. 4, the classification rule r is a rule that classifies a document into the feature word category on the condition that the body 205 (text information) of the document data contains the filter word set ts.

When the target document data is an XML document such as the document data 200b in FIG. 2, the classification rule is expressed in XQuery or XPath. In the above example, if the 本文 (body) element of the document data 200b holds the text information, the classification rule of the feature word cluster c is: contains(./本文,"検索") and contains(./本文,"分類") and contains(./本文,"管理"). Here "検索", "分類", and "管理" are the filter words "search", "classification", and "management", respectively.
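A rule string of this form can be assembled mechanically from the filter word set. The sketch below is one possible way to do so; the function name build_classification_rule is hypothetical, the element name 本文 follows the example above, and the sketch assumes that filter words contain no double quotes, so no escaping is performed.

```python
def build_classification_rule(filter_words, element="本文"):
    """Join one contains() test per filter word with 'and'."""
    # Assumption: filter words contain no double quotes, so no escaping is done.
    return " and ".join(f'contains(./{element},"{w}")' for w in filter_words)


rule = build_classification_rule(["検索", "分類", "管理"])
```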

The category generation processing unit 36 acquires from the document storage unit 1 the document data set docs, which is the set of document data d that satisfy the generated classification rule r (step S1806). The category generation processing unit 36 generates feature word category data based on the target category tgtCat, the category name cn, the filter word set ts, and the classification rule r, and stores it in the category storage unit 2 (step S1807). That is, the category generation processing unit 36 generates the feature word category data in the same format as the feature word category data 400 and 410 shown in FIG. 4. Specifically, it sets the category number, upper category, category name, classification rule, and filter words of the feature word category data to, respectively, a newly assigned arbitrary number, the category number of the category data or feature word category data of the target category tgtCat, the category name cn, the classification rule r, and the filter word set ts. The category generation processing unit 36 also writes the document data set docs in association with the generated feature word category data; the document data set docs is represented, for example, by the document numbers of the document data. Thereafter, the category generation processing unit 36 returns to step S1803, selects a feature word cluster c that has not yet been selected, and repeats the processing.

When the loop of steps S1804 to S1807 has finished for all feature word clusters c in the feature word cluster set C in step S1803 (step S1803: YES), the category generation processing unit 36 ends the feature word category generation process.

FIG. 24 is a diagram showing a display example of the category structure after feature word categories have been generated by executing the feature word clustering process of FIG. 16. As shown in the figure, an "analysis" category 2031, a "search" category 2032, and a "mining" category 2033 have been generated as feature word categories under the "by contents" category 1603. The category data corresponding to the "analysis" category 2031 and the "mining" category 2033 are the feature word category data 400 and 410 shown in FIG. 4, respectively.

図１０〜図１６の処理が終了すると、ユーザインターフェース部５のカテゴリ操作部５２は、図９のステップＳ１１０４の処理を行い、２軸マップ表示部５１に２軸マップ表示を指示する。２軸マップ表示部５１は、カテゴリ操作部５２からの指示を受け、図７のステップＳ６における２軸マップ表示処理を行う。 When the processes of FIGS. 10 to 16 are completed, the category operation unit 52 of the user interface unit 5 performs the process of step S1104 of FIG. 9 and instructs the biaxial map display unit 51 to display the biaxial map. The biaxial map display unit 51 receives an instruction from the category operation unit 52 and performs biaxial map display processing in step S6 of FIG.

図１７は、２軸マップ表示部５１が２軸マップを表示させる処理の流れを示すフローチャートである。同図に示すフローチャートは、図７のステップＳ９における２軸マップ表示処理の詳細な処理の一例を示す。 FIG. 17 is a flowchart showing a flow of processing in which the biaxial map display unit 51 displays the biaxial map. The flowchart shown in the figure shows an example of detailed processing of the biaxial map display processing in step S9 of FIG.

２軸マップ表示部５１は、カテゴリ操作部５２から２軸マップの横軸カテゴリｘＡｘｉｓＣａｔと、縦軸カテゴリｙＡｘｉｓＣａｔとの入力を受ける（ステップＳ１９０１）。２軸マップ表示部５１は、図８のステップＳ１００２と同様の処理により、カテゴリ記憶部２に記憶されているカテゴリデータ及び特徴語カテゴリデータに基づいて、横軸カテゴリｘＡｘｉｓＣａｔの横軸子カテゴリｘＣａｔの集合である横軸子カテゴリ集合ｘＣａｔｓと、縦軸カテゴリｙＡｘｉｓＣａｔの縦軸子カテゴリｙＣａｔの集合である縦軸子カテゴリ集合ｙＣａｔｓを取得する（ステップＳ１９０２）。 The biaxial map display unit 51 receives the horizontal axis category xAxisCat and the vertical axis category yAxisCat of the biaxial map from the category operation unit 52 (step S1901). The biaxial map display unit 51 performs the processing of the horizontal axis category xCat of the horizontal axis category xAxisCat based on the category data and feature word category data stored in the category storage unit 2 by the same processing as Step S1002 of FIG. The horizontal axis category set xCats, which is a set, and the vertical axis category set yCats, which is a set of vertical axis categories yCat of the vertical axis category yAxisCat, are acquired (step S1902).

具体的には、図２４のようなカテゴリ構造である場合、２軸マップ表示部５１は、横軸子カテゴリ集合ｘＣａｔｓとして、横軸カテゴリｘＡｘｉｓＣａｔである「出願年別」カテゴリ１６０２の子カテゴリの集合｛「２００４年」カテゴリ１６２１、「２００５年」カテゴリ１６２２、「２００６年」カテゴリ１６２３、「２００７年」カテゴリ１６２４、「２００８年」カテゴリ１６２５｝を取得する。また、２軸マップ表示部５１は、縦軸カテゴリｙＡｘｉｓＣａｔである「内容別」カテゴリ１６０３の子カテゴリの集合｛「分析」カテゴリ２０３１、「検索」カテゴリ２０３２、「マイニング」カテゴリ２０３３｝を取得する。 Specifically, in the case of the category structure as shown in FIG. 24, the two-axis map display unit 51 sets a set of child categories of the “by application year” category 1602 that is the horizontal axis category xAxisCat as the horizontal axis category set xCats. {"2004" category 1621, "2005" category 1622, "2006" category 1623, "2007" category 1624, "2008" category 1625} are acquired. Further, the biaxial map display unit 51 acquires a set {“analysis” category 2031, “search” category 2032, “mining” category 2033} of child categories of the “by contents” category 1603 which is the vertical axis category yAxisCat.

The biaxial map display unit 51 creates and displays a biaxial map table whose rows are the vertical axis category yAxisCat and each vertical axis child category yCat in the vertical axis child category set yCats, and whose columns are each horizontal axis child category xCat in the horizontal axis child category set xCats (step S1903). Because the biaxial map table also includes the title row and title column in which the categories are displayed, it has (1 + number of vertical axis categories + number of vertical axis child categories) rows and (1 + number of horizontal axis child categories) columns.
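The table layout of step S1903 can be illustrated with a small Python sketch. The nested-list representation and the name build_two_axis_table are assumptions for illustration; the category names are taken from the example of FIG. 24, and the sketch assumes a single vertical axis category.

```python
def build_two_axis_table(y_axis_cat, y_cats, x_cats):
    """Build an empty table with a title row and a title column."""
    table = [[""] + list(x_cats)]                    # title row: horizontal child categories
    for label in [y_axis_cat] + list(y_cats):        # axis category row, then child rows
        table.append([label] + [None] * len(x_cats)) # title column + empty data cells
    return table


table = build_two_axis_table(
    "by contents",
    ["analysis", "search", "mining"],
    ["2004", "2005", "2006", "2007", "2008"],
)
```

With one vertical axis category, three vertical axis child categories, and five horizontal axis child categories, the sketch yields 1 + 1 + 3 = 5 rows and 1 + 5 = 6 columns, in line with the row and column counts stated above.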

２軸マップ表示部５１は、作成した２軸マップテーブルにおける全てのｃｅｌｌを１つずつ選択し、選択したｃｅｌｌについてステップＳ１９０５〜ステップＳ１９１０の処理を繰り返す（ステップＳ１９０４−ＮＯ）。ステップＳ１９０５〜ステップＳ１９１０の処理は、図８のステップＳ１００５〜ステップＳ１０１０と同様の処理である。 The biaxial map display unit 51 selects all the cells in the created biaxial map table one by one, and repeats the processing of step S1905 to step S1910 for the selected cell (step S1904-NO). The processes in steps S1905 to S1910 are the same as those in steps S1005 to S1010 in FIG.

まず、２軸マップ表示部５１はｃｅｌｌが先頭行（１行目）もしくは先頭列（１列目）であるか否か判定する（ステップＳ１９０５）。ｃｅｌｌが先頭行（１行目）もしくは先頭列（１列目）であると判定した場合（ステップＳ１９０５−ＹＥＳ）、２軸マップ表示部５１は、ステップＳ１９０６〜ステップＳ１９０８の処理を行う。すなわち、２軸マップ表示部５１は、選択したｃｅｌｌに対応するカテゴリｃａｔ（縦軸カテゴリｙＡｘｉｓＣａｔ、縦軸子カテゴリｙＣａｔ、または、横軸子カテゴリｘＣａｔ）のカテゴリ名を表示させる（ステップＳ１９０６）。さらに、２軸マップ表示部５１は、ｃｅｌｌに対応するカテゴリｃａｔがフィルタ語集合ｆｉｌｔｅｒｓを持つか否かを判定する（ステップＳ１９０７）。カテゴリｃａｔがフィルタ語集合ｆｉｌｔｅｒｓを持つと判定した場合（ステップＳ１９０７−ＹＥＳ）、２軸マップ表示部５１は、フィルタ語集合ｆｉｌｔｅｒｓに含まれるフィルタ語を当該ｃｅｌｌに表示させる（ステップＳ１９０８）。カテゴリｃａｔがフィルタ語集合ｆｉｌｔｅｒｓを持たないと判定した場合（ステップＳ１９０７−ＮＯ）、あるいは、ステップＳ１９０８の処理の後、２軸マップ表示部５１は、ステップＳ１９０４に戻り、未選択のｃｅｌｌを選択して処理を繰り返す。 First, the biaxial map display unit 51 determines whether or not the cell is the first row (first row) or the first column (first column) (step S1905). When it is determined that the cell is the first row (first row) or the first column (first column) (YES in step S1905), the biaxial map display unit 51 performs the processing from step S1906 to step S1908. That is, the biaxial map display unit 51 displays the category name of the category cat (vertical axis category yAxisCat, vertical axis child category yCat, or horizontal axis child category xCat) corresponding to the selected cell (step S1906). Further, the biaxial map display unit 51 determines whether or not the category cat corresponding to the cell has a filter word set filters (step S1907). When it is determined that the category cat has filter word set filters (step S1907—YES), the biaxial map display unit 51 displays the filter words included in the filter word set filters on the cell (step S1908). When it is determined that the category cat does not have the filter word set filter (step S1907—NO), or after the process of step S1908, the biaxial map display unit 51 returns to step S1904 and selects an unselected cell. Repeat the process.

ステップＳ１９０５において、ｃｅｌｌが先頭行（１行目）でも先頭列（１列目）でもないと判定した場合（ステップＳ１９０５−ＮＯ）、２軸マップ表示部５１は、ｃｅｌｌの行に対応する縦軸カテゴリｙＡｘｉｓＣａｔまたは縦軸子カテゴリｙＣａｔと、ｃｅｌｌの列に対応する横軸子カテゴリｘＣａｔとの両方に分類された文書データの数である文書数ｄｎを求める（ステップＳ１９０９）。２軸マップ表示部５１は、ステップＳ１９０３で作成した２軸マップテーブルのｃｅｌｌに、ステップＳ１９０９において算出した文書数ｄｎに応じた大きさの円ｃｈａｒｔを表示させる（ステップＳ１９１０）。その後、２軸マップ表示部５１は、ステップＳ１９０４に戻り、未選択のｃｅｌｌを選択して処理を繰り返す。 If it is determined in step S1905 that the cell is neither the first row (first row) nor the first column (first column) (NO in step S1905), the biaxial map display unit 51 displays the vertical axis corresponding to the cell row. The number of documents dn which is the number of document data classified into both the category yAxisCat or the vertical axis category yCat and the horizontal axis category xCat corresponding to the cell column is obtained (step S1909). The biaxial map display unit 51 displays a circle chart having a size corresponding to the number of documents dn calculated in step S1909 on the cell of the biaxial map table created in step S1903 (step S1910). Thereafter, the biaxial map display unit 51 returns to step S1904, selects an unselected cell, and repeats the process.
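The document count dn of step S1909 is the size of the intersection of two document sets. The following sketch assumes that each category is represented by the set of document numbers classified into it, which is a simplification chosen for illustration; the function names and the sample document numbers are hypothetical.

```python
def count_documents(row_docs, col_docs):
    """Number of documents classified into both the row and the column category."""
    return len(set(row_docs) & set(col_docs))


def fill_cells(row_cats, col_cats):
    """Map each (row category, column category) pair to its document count dn."""
    return {
        (r, c): count_documents(r_docs, c_docs)
        for r, r_docs in row_cats.items()
        for c, c_docs in col_cats.items()
    }


# Hypothetical document numbers, for illustration only.
rows = {"mining": {1, 2, 5}, "search": {3, 4}}
cols = {"2004": {1, 3}, "2005": {2, 4, 5}}
counts = fill_cells(rows, cols)
```

The resulting counts would then drive the size of the circle drawn in each cell in step S1910.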

２軸マップ表示部５１は、ステップＳ１９０４において全てのｃｅｌｌに対してステップＳ１９０５〜ステップＳ１９１０の処理を終了すると（ステップＳ１９０４−ＹＥＳ）、ステップＳ１９１１の処理を行う。 When the biaxial map display unit 51 finishes the processing of step S1905 to step S1910 for all the cells in step S1904 (step S1904—YES), it performs the processing of step S1911.

FIG. 25 is a diagram showing a display example of the biaxial map table that the biaxial map display unit 51 displays at the end of the processing up to step S1904, in the case where the category structure shown in FIG. 24 results from generating feature word categories by executing the feature word clustering process of FIG. 16. The horizontal axis category (axis category) of the biaxial map shown in the figure is the same as that of the biaxial map shown in FIG. 22. The vertical axis contains the target category, that is, the "by contents" category, and the feature word categories "analysis", "search", and "mining" generated as lower categories of the "by contents" category, and the filter words of each of these feature word categories are displayed. For example, the cell 2101 of the feature word category "mining" displays the filter words "mining, analysis, related word". In addition, when displaying the biaxial map, the biaxial map display unit 51 displays in each cell a graph corresponding to the number of documents classified into both the category of the cell's row item and the category of the cell's column item.

After the loop of step S1904 ends, if the user selects a filter word f from the filter word set filters of a certain category cat (step S1911: YES), the biaxial map display unit 51 performs the processing of steps S1912 to S1915.

２軸マップ表示部５１は、フィルタ語ｆが選択されたカテゴリｃａｔの親カテゴリｐｃａｔをカテゴリ記憶部２から取得する（ステップＳ１９１２）。２軸マップ表示部５１は、このカテゴリｃａｔに該当する行内のタイトル列以外の全てのｃｅｌｌを１つずつ選択し、選択したｃｅｌｌについてステップＳ１９１４及びステップＳ１９１５の処理を繰り返す（ステップＳ１９１３−ＮＯ）。 The biaxial map display unit 51 acquires the parent category pcat of the category cat for which the filter word f is selected from the category storage unit 2 (step S1912). The biaxial map display unit 51 selects all the cells other than the title column in the row corresponding to the category cat one by one, and repeats the processing of step S1914 and step S1915 for the selected cell (step S1913-NO).

The biaxial map display unit 51 reads the classification rules from the category data (or feature word category data) of the horizontal axis child category xCat corresponding to the selected cell and from the category data (or feature word category data) of the upper category of xCat, and takes the logical AND of the rules read as the classification rule xr of the horizontal axis child category xCat. The biaxial map display unit 51 also reads the classification rule pr from the category data (or feature word category data) of the parent category pcat. Referring to the document storage unit 1, the biaxial map display unit 51 then obtains the number of documents fdn, which is the number of target document data in the selected cell that contain the filter word, based on the classification rule xr of xCat, the classification rule pr of pcat, and the selected filter word f (step S1914). Like dn described above, fdn can be obtained from a conjunction of conditions; the conditional expression is "xr and pr and (contains(./本文, f))".
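The count fdn of step S1914 can be sketched as a filter over the stored documents. In this sketch the classification rules xr and pr are modeled as Python predicates and each document as a dict with a "body" field; these representations, the function name, and the sample documents are all assumptions for illustration, whereas the description expresses the rules in XQuery or XPath.

```python
def count_filtered(docs, xr, pr, f):
    """Count documents satisfying xr and pr whose body contains filter word f."""
    return sum(1 for d in docs if xr(d) and pr(d) and f in d["body"])


# Hypothetical documents, for illustration only.
docs = [
    {"year": 2004, "body": "text mining with related word extraction"},
    {"year": 2004, "body": "document search"},
    {"year": 2005, "body": "related word analysis"},
]


def xr(d):
    """Stand-in for the horizontal axis child category rule ("2004")."""
    return d["year"] == 2004


def pr(d):
    """Stand-in for the parent category rule; accepts every document here."""
    return True


fdn = count_filtered(docs, xr, pr, "related word")
```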

In the selected cell of the displayed biaxial map table, the biaxial map display unit 51 displays a circle chart whose size corresponds to the number of documents fdn, in a color different from that of the circle chart displayed in step S1910 (step S1915). The category operation unit 52 returns to step S1913, selects a cell that has not yet been selected, and repeats the processing.
When processing has finished for all cells other than the title column in the row corresponding to the category cat for which the filter word f was selected (step S1913: YES), the category operation unit 52 returns to step S1911.

In the processing of steps S1912 to S1915 above, when the user selects a filter word on the biaxial map, the biaxial map display unit 51 displays, in the cell of each column of the row in which the filter word was selected, a circle chart for the number of documents containing the filter word, distinguished from the circle chart displayed in step S1910.

FIG. 26 is a diagram showing a display example of the biaxial map when a filter word is selected. The example of FIG. 26 shows the case where, after the display of FIG. 25, the user selects the filter word "related word" in the cell 2201 in which the vertical axis child category "mining" is displayed. For the row containing the cell 2201 in which the filter word was selected, the biaxial map display unit 51 indicates, by the shaded portion in the cells of the columns corresponding to the horizontal axis child categories "2004", "2005", "2006", "2007", and "2008", the number of document data containing "related word". For example, the cell 2222 indicates the number of document data containing "related word" among the document data included in both the feature word category "mining" corresponding to the cell 2201 and the horizontal axis child category "2004" corresponding to the cell 2211.

If, in step S1911 of FIG. 17, the biaxial map display unit 51 determines that no filter word selection has been input (step S1911: NO), it returns to step S1911 when no end request has been input by the user (step S1916: NO), and ends the process when an end request is received (step S1916: YES).

Next, the detailed processing of steps S10 and S11 in FIG. 7 is described using the processing flows of FIG. 9 and of FIG. 18, which is described later.
In step S1104 of FIG. 9, when the biaxial map display unit 51, on instruction from the category operation unit 52, displays a biaxial map such as that shown in FIG. 25 through the process of FIG. 17, the category operation unit 52 returns to step S1101. When the category operation unit 52 receives input of a category cat selected by the user on the currently displayed biaxial map (step S1101: YES) and further receives a request to add a filter word (step S1102: NO, step S1105: YES), it performs steps S1106 to S1110.

First, the category operation unit 52 reads, from the feature degree data stored in the feature degree data storage unit 4, the feature degree data kd for which a trend vector is set, and obtains the feature degree data set kds, which is the set of the feature degree data kd thus read (step S1106). The feature degree data set kds is the set of feature degree data of the words determined to be feature words by the document classification unit 3. The category operation unit 52 displays the words in the feature degree data set kds in order of corrected feature degree (step S1107). In this embodiment the feature words are displayed in order of corrected feature degree as described above, but this is not a limitation; they may simply be displayed in order of document frequency or of feature degree.

FIG. 27 shows an editing operation on a feature word category in the biaxial map and a display example of the corresponding screen. In the figure, the category operation unit 52 displays a feature word addition screen 2310 for the feature word category “mining” corresponding to cell 2303, which was selected in step S1101. In the feature word list display field 2311 of the feature word addition screen 2310, the category operation unit 52 displays a list of the feature words, that is, the words included in the feature degree data set kds obtained in step S1106. The category operation unit 52 displays each feature word with a check box. As the initial display, the check boxes of feature words that are already filter words of the selected category cat are shown checked. The feature word addition screen 2310 also includes an input field 2312, in which the user can enter an arbitrary character string as a filter word of the selected category cat, and an “add to filter word” button 2313, with which the user requests execution of the addition.

The category operation unit 52 accepts the user's selection or input of filter words with respect to the displayed feature words (step S1108). Specifically, the user either enters in the input field 2312 a character string to be added as a filter word, or checks, in the feature word list display field 2311, the check box of a feature word to be added as a filter word. If no request to execute the addition of the filter word f is input by the user (step S1109: NO), the category operation unit 52 repeats the processing from step S1105.

In step S1109, when a request to execute the addition of the filter word f is received from the user, specifically when the user selects the “add to filter word” button 2313 in FIG. 27 (step S1109: YES), the category operation unit 52 adds the filter word f to the feature word category data of the selected category cat (step S1110). The filter word f is the character string that the user entered in the input field 2312 in step S1108, or a feature word that the user checked in the feature word list display field 2311. The details of this addition process are described later with the flow of the filter word addition/deletion process of FIG. 18.

After step S1110, the category operation unit 52 performs step S1104 described above and instructs the biaxial map display unit 51 to update the biaxial map. The biaxial map display unit 51 receives from the category operation unit 52 the horizontal axis category xAxisCat and the vertical axis category yAxisCat of the currently displayed biaxial map, performs the process of FIG. 17, and reflects the added filter word in the biaxial map. The category operation unit 52 then returns to step S1101.

When the category operation unit 52 receives input of a category cat selected by the user (step S1101: YES) and further receives from the user a request to delete a filter word f (steps S1102 and S1105: NO, step S1111: YES), it deletes the filter word f from the feature word category data of the selected category cat (step S1112). The details of this deletion process are described later with the flow of the filter word addition/deletion process of FIG. 18.

After step S1112, the category operation unit 52 performs step S1104 described above and instructs the biaxial map display unit 51 to update the biaxial map. The biaxial map display unit 51 receives from the category operation unit 52 the horizontal axis category xAxisCat and the vertical axis category yAxisCat of the currently displayed biaxial map, performs the process of FIG. 17, and reflects the deletion of the filter word in the biaxial map. The category operation unit 52 then returns to step S1101.

FIG. 18 is a flowchart showing the flow of the process by which the category operation unit 52 adds or deletes a filter word of a feature word category. The flowchart shows an example of the detailed processing in steps S1110 and S1112 of FIG. 9.

First, the category operation unit 52 receives as input the word t to be added or deleted and the category cat (step S2001). The category cat is the category selected in step S1101 of FIG. 9. In the case of addition, the word t is the filter word that the user selected or entered in step S1108 of FIG. 9; in the case of deletion, the word t is the filter word f whose deletion was requested in step S1111 of FIG. 9. The category operation unit 52 identifies the feature word category data of the category cat stored in the category storage unit 2 and obtains the filter word set fs, which is the set of filter words set in that feature word category data.

In the case of adding a filter word (step S2002: add), the category operation unit 52 adds the word t to the filter word set fs (step S2003). If the word t already exists in the filter word set fs, however, the category operation unit 52 does nothing. In the case of deleting a filter word (step S2002: delete), the category operation unit 52 deletes the word t from the filter word set fs (step S2004). If the word t does not exist in the filter word set fs, however, the category operation unit 52 does nothing.
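The branch of steps S2002 to S2004 can be sketched in Python as follows. This is only an illustration under stated assumptions: the filter word set fs is modeled as a Python `set`, the operation is passed as the hypothetical strings `"add"` and `"delete"`, and the function name `edit_filter_words` does not appear in the embodiment. The set type happens to give the "do nothing" behavior described for both an already present word and a missing word.

```python
from typing import Set


def edit_filter_words(fs: Set[str], t: str, operation: str) -> Set[str]:
    """Return a copy of the filter word set fs with word t added or deleted."""
    updated = set(fs)
    if operation == "add":
        # Step S2003: adding a word that already exists changes nothing.
        updated.add(t)
    elif operation == "delete":
        # Step S2004: deleting a word that is not present changes nothing.
        updated.discard(t)
    else:
        raise ValueError(f"unknown operation: {operation!r}")
    return updated


fs = {"related word", "cluster"}
after_add = edit_filter_words(fs, "association", "add")
after_repeat_add = edit_filter_words(after_add, "association", "add")
after_delete = edit_filter_words(fs, "related word", "delete")
after_missing_delete = edit_filter_words(fs, "absent", "delete")
```

Returning a new set rather than mutating fs in place is a design choice of this sketch; it keeps the stored filter words unchanged until the category data is actually rewritten.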

After step S2003 or step S2004, the category operation unit 52 generates a classification rule r based on the filter word set fs and updates the classification rule of the category cat (step S2005). For the method of generating the classification rule, see the description of step S1805 in FIG. 16. The category operation unit 52 then replaces the classification rule and the filter words set in the feature word category data of the category cat stored in the category storage unit 2 with, respectively, the classification rule r generated in step S2005 and the filter word set fs updated in step S2003 or step S2004 (step S2006).

This embodiment has described the addition and deletion of filter words; by combining addition and deletion, the category operation unit 52 can also move or copy a filter word of one category to another category.

FIG. 28 shows a display example of the biaxial map after feature word categories have been edited, namely after the user has performed the following editing operations (1) and (2) on the biaxial map shown in FIG. 27.
(1) The user moves the filter word “management” of the “search” category displayed in cell 2302 to the “analysis” category displayed in cell 2301.
(2) The user deletes the filter word “related word” of the “mining” category displayed in cell 2303.
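Editing operations (1) and (2) can be sketched in Python as a combination of deletion and addition on per-category filter word sets. The sketch assumes that each category's filter words are held in a dictionary of sets; the function name `move_filter_word`, the `copy` flag, and the filter words “index”, “trend”, and “cluster” are hypothetical and serve only as illustration.

```python
from typing import Dict, Set


def move_filter_word(
    categories: Dict[str, Set[str]],
    word: str,
    source: str,
    target: str,
    copy: bool = False,
) -> None:
    """Move (or copy) a filter word from one category to another."""
    if word not in categories[source]:
        return  # nothing to move or copy
    categories[target].add(word)  # addition to the target category
    if not copy:
        categories[source].discard(word)  # deletion from the source category


categories = {
    "search": {"management", "index"},
    "analysis": {"trend"},
    "mining": {"related word", "cluster"},
}

# Operation (1): move "management" from "search" to "analysis".
move_filter_word(categories, "management", "search", "analysis")
# Operation (2): delete "related word" from "mining".
categories["mining"].discard("related word")
```

In the device itself, each such change would be followed by regeneration of the affected classification rules and an update of the biaxial map.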

Through steps S1105 to S1112 of FIG. 9 described above, the user can easily edit, on the biaxial map, the feature word categories automatically generated by the document classification unit 3. The document classification device 100 can therefore build a category structure that matches the user's intent and purpose in classification and analysis. When the user selects a filter word during this editing, the document classification device 100 produces a display such as that shown in FIG. 26, from which the user can grasp the appearance tendency of the selected filter word. The document classification device 100 can therefore support classification and analysis work centered on the horizontal axis category specified by the user (the “by application year” category in FIG. 26).

In this embodiment, each cell of the biaxial map is rendered as a so-called bubble chart, in which the number of documents dn is represented by the size of a circle, but the display of the number of documents dn is not limited to this. For example, the number of documents dn corresponding to each cell may be represented by a line graph or a bar graph.

FIG. 29 shows a display example in which the biaxial map is rendered as line graphs. As the figure shows, line graphs are effective for capturing changes in trends over time, such as the transition of filing trends in the case of patent documents. By representing the number of documents dn for a filter word with a graph of a different line type, as shown in the figure, the appearance tendency of each filter word becomes even easier to grasp.

In the above embodiment, the words to be used as feature words are selected based on the corrected feature degree, but they may instead be selected based on the feature degree. In this case, the feature degree data does not contain corrected feature degree data, and the feature degree correction unit 32 does not perform steps S1404 to S1410 of FIG. 12. In step S1603 of FIG. 14, the trend vector generation unit 34 obtains the feature degree of the word t from the feature degree data stored in the feature degree data storage unit 4. In step S1604, the trend vector generation unit 34 performs step S1605 if the obtained feature degree is larger than a predetermined threshold, and returns to step S1602 if it is smaller than the threshold.
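The threshold test of step S1604 can be sketched in Python as follows. The sketch assumes that feature degrees are held in dictionaries keyed by word, and that corrected feature degrees are used when they exist and plain feature degrees otherwise; the function name `select_feature_words` and the sample values are hypothetical. The description does not state how a value exactly equal to the threshold is treated, so this sketch assumes that such a word is not selected.

```python
from typing import Dict, List, Optional


def select_feature_words(
    feature_degree: Dict[str, float],
    threshold: float,
    corrected_feature_degree: Optional[Dict[str, float]] = None,
) -> List[str]:
    """Return the words whose score is strictly larger than the threshold."""
    # Use corrected feature degrees when available, plain ones otherwise.
    scores = (
        corrected_feature_degree
        if corrected_feature_degree is not None
        else feature_degree
    )
    return [word for word, score in scores.items() if score > threshold]


feature_degree = {"mining": 0.9, "the": 0.1, "cluster": 0.5}
corrected = {"mining": 0.3, "the": 0.1, "cluster": 0.8}

by_feature_degree = select_feature_words(feature_degree, threshold=0.4)
by_corrected = select_feature_words(feature_degree, 0.4, corrected)
```

Only the selected words would go on to trend vector generation in step S1605.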

In the above embodiment, a word of interest is input by the user, but this input may be omitted. When no word of interest is input, the feature degree correction unit 32 does not perform, in the process of FIG. 12, the loop of steps S1401 to S1403 and steps S1407 to S1409. After calculating the corrected feature degree ms(t) in step S1406, the feature degree correction unit 32 performs step S1410, which stores the calculated corrected feature degree ms(t) in the feature degree data, and returns to step S1404.

When no word of interest is to be input, the document classification device 100 may also be configured without the word-of-interest setting unit 53. In this case, the process of FIG. 12 is not performed, and in step S1603 of FIG. 14 the trend vector generation unit 34 obtains the feature degree of the word t from the feature degree data stored in the feature degree data storage unit 4. In step S1604, the trend vector generation unit 34 performs step S1605 if the obtained feature degree is larger than a predetermined threshold, and returns to step S1602 if it is smaller than the threshold.

The above embodiment shows an example in which the horizontal axis is the axis category and the vertical axis is the target category, but the vertical axis may be the axis category and the horizontal axis the target category.

According to the document classification device 100 of at least one embodiment described above, having the category operation unit 52 and the feature word category generation unit 33, feature words are selected based on the appearance tendency of words with respect to the classification axis that the user selected in a biaxial map whose two axes are two categories specified by the user, and feature word categories are generated using the selected feature words. Because the user can select a classification axis while viewing a biaxial map that uses the currently generated clustering results and classification structure, the document classification device 100 can generate a classification structure suited to the user's viewpoint and support classification and analysis that fit the user's purpose.

In conventional word clustering, the words used to generate clusters were appropriately selected so that processing could be performed with a relatively small amount of computation, but this selection was made automatically based on the statistical appearance tendency of words in the documents. As a result, words that do not represent the content of the document set well, or that do not match the user's intent or the purpose of the analysis, were often selected as the words used for cluster generation. In such cases, the user had to instruct the classification device so that the desired words would be selected, by excluding unnecessary words, registering important words, and so on, in order to generate clusters that fit the user's purpose, and this work required skill and effort.
According to the document classification device 100 of at least one embodiment described above, however, having the word-of-interest setting unit 53, the device receives the user's designation of a word of interest and generates feature word categories based on feature words strongly related to the designated word of interest. Furthermore, having the category operation unit 52, the document classification device 100 accepts edits to the filter words of the generated feature word categories and updates the biaxial map using the edited filter words. This reduces the effort the user must spend selecting words important for cluster generation, while enabling the generation and correction of a classification structure based on feature words that match the user's interests.

Note that a program for realizing the functions of the document classification device 100 of FIG. 1 in each of the above embodiments may be recorded on a computer-readable recording medium, and the program recorded on this medium may be read into and executed by a computer system so that the system operates as the document classification device 100. The “computer system” here includes an OS and hardware such as peripheral devices. The “computer system” also includes a WWW system provided with a homepage-providing environment (or display environment). The “computer-readable recording medium” refers to portable media such as flexible disks, magneto-optical disks, ROMs, and CD-ROMs, and to storage devices such as hard disks built into a computer system. The “computer-readable recording medium” further includes media that hold a program for a certain period of time, such as the volatile memory (RAM) inside a computer system that serves as a server or client when the program is transmitted over a network such as the Internet or a communication line such as a telephone line.

The program may also be transmitted from a computer system that stores it in a storage device or the like to another computer system via a transmission medium, or by transmission waves in a transmission medium. The “transmission medium” that transmits the program refers to a medium having the function of transmitting information, such as a network (communication network) like the Internet or a communication line such as a telephone line. The program may implement only part of the functions described above. It may also be a so-called difference file (difference program), which implements the functions described above in combination with a program already recorded in the computer system.

Although an embodiment of the present invention has been described, this embodiment is presented as an example and is not intended to limit the scope of the invention. The embodiment can be implemented in various other forms, and various omissions, substitutions, and changes can be made without departing from the gist of the invention. The embodiment and its modifications are included in the scope and gist of the invention, and likewise in the invention described in the claims and its equivalents.

Description of symbols
1 ... document storage unit
2 ... category storage unit
3 ... document classification unit
31 ... feature degree calculation unit
32 ... feature degree correction unit
33 ... feature word category generation unit
34 ... trend vector generation unit
35 ... clustering unit
36 ... category generation processing unit
4 ... feature degree data storage unit
5 ... user interface unit
51 ... biaxial map display unit
52 ... category operation unit
53 ... word-of-interest setting unit
100 ... document classification device

Claims

A document classification device comprising:
a document storage unit that stores document data;
a category storage unit that stores a hierarchical structure of categories and classification rules for classifying the document data into the categories;
a category operation unit that receives input of a category serving as a classification viewpoint and a target category, which is the category to be classified, and reads from the category storage unit, as an axis category set, the set of axis categories that are subcategories of the category serving as the classification viewpoint;
a feature degree calculation unit that takes as a target document data set the set of document data, among the document data stored in the document storage unit, that satisfy the classification rule of the target category, and calculates the feature degree of each word included in the target document data set;
a trend vector generation unit that selects words based on the feature degree calculated by the feature degree calculation unit, calculates, for each selected word, a statistic based on the appearance frequency of the word in the target document data that satisfy the classification rule of each axis category in the axis category set, and generates a trend vector in which the statistic is set as the value of the element corresponding to that axis category;
a clustering unit that clusters the words based on the similarity of the trend vectors generated by the trend vector generation unit;
a category generation processing unit that, for each cluster obtained by the clustering unit, generates a feature word category that has the target category as its parent category and has a classification rule using the words belonging to the cluster as filter words, and registers the feature word category in the category storage unit; and
a biaxial map display unit that displays, in each cell of a biaxial map having the axis categories as the classification items of a first axis and the feature word categories as the classification items of a second axis, information representing the number of document data, among the document data stored in the document storage unit, that satisfy both the classification rule of the axis category corresponding to the cell and the classification rule of the feature word category corresponding to the cell.

The document classification device according to claim 1, wherein
the biaxial map display unit
displays the filter words used in the classification rule of the feature word category and, upon receiving input of a filter word selected from the displayed filter words, displays information representing the number of document data, among the document data stored in the document storage unit, that satisfy the classification rule of the axis category and contain the selected filter word.

The document classification device according to claim 1, wherein
the biaxial map display unit
displays the filter words used in the classification rule of the feature word category and, when an editing operation on the filter words is performed for a feature word category, changes the classification rule of the edited feature word category based on the editing operation, and
displays, in each cell of the biaxial map, information representing the number of document data, among the document data stored in the document storage unit, that satisfy both the classification rule of the axis category corresponding to the cell and the changed classification rule of the feature word category corresponding to the cell.

The document classification device according to any one of claims 1 to 3, further comprising:
a word-of-interest setting unit that receives input of a word of interest, which is a word to focus on when classifying the document data; and
a feature degree correction unit that corrects the feature degree of each word calculated by the feature degree calculation unit, based on the co-occurrence of the word and the word of interest in the target document data set,
wherein the trend vector generation unit selects words that represent features of the documents based on the feature degree corrected by the feature degree correction unit, calculates, for each selected word, a statistic based on the appearance frequency of the word in the target document data that satisfy the classification rule of each axis category in the axis category set, and generates a trend vector in which the statistic is set as the value of the element corresponding to that axis category.

The document classification device according to claim 1, wherein
the biaxial map display unit receives input of a category to serve as the horizontal axis and a category to serve as the vertical axis; reads from the category storage unit the horizontal-axis child categories, which are subcategories of the horizontal-axis category, and the vertical-axis child categories, which are subcategories of the vertical-axis category; uses as column items the horizontal-axis child categories if they were read and the horizontal-axis category itself if they were not; uses as row items the vertical-axis child categories if they were read and the vertical-axis category itself if they were not; and displays, in each cell, information representing the number of document data that satisfy both the classification rule of the category corresponding to the cell's column item and the classification rule of the category corresponding to the cell's row item, and
the category operation unit receives input specifying which of the category corresponding to the row items and the category corresponding to the column items is to be the category serving as the classification viewpoint and which is to be the target category.