JP5450699B2

JP5450699B2 - Document analysis apparatus and document analysis program

Info

Publication number: JP5450699B2
Application number: JP2012056518A
Authority: JP
Inventors: 泰成宮部; 茂松本; 和之後藤; 秀樹岩崎; 幹門小林
Original assignee: Toshiba Corp; Toshiba Solutions Corp
Current assignee: Toshiba Corp; Toshiba Digital Solutions Corp
Priority date: 2012-03-13
Filing date: 2012-03-13
Publication date: 2014-03-26
Anticipated expiration: 2032-03-13
Also published as: JP2013190988A

Description

Embodiments of the present invention relate to a document analysis apparatus and a document analysis program.

With the advancement of information systems in recent years, it has become possible to record and accumulate large volumes of documents such as patent documents, newspaper articles, web pages, and books. This creates a need for techniques that classify and analyze these accumulated document sets so that they can be put to effective use in daily work. Examples include classifying a huge archive of past newspaper articles by category so that many people can use it easily, and analyzing trends in a company's own and other companies' patents related to a technology currently under research and development in order to discover new research and development fields.

When a large number of documents have been classified into categories in this way, there is a need to know the characteristics of each category in order to grasp the content of the documents in it.

One technique for examining the characteristics of a category is to statistically extract feature words that represent the differences between document sets.

However, this method does not take the ranking assigned to the categories into account when extracting words. For example, suppose that a set of documents containing automobile defect information is classified into categories ranked by the risk level of the defect as “extremely dangerous”, “slightly dangerous”, “dangerous”, and “safe”. Here, the category ranked first in risk is “extremely dangerous”, the category ranked second is “slightly dangerous”, the category ranked third is “dangerous”, and the category ranked fourth is “safe”.

When one wants to know why document data was classified into the category ranked “dangerous”, that is, why the rank “dangerous” was assigned to the document data, the statistical method described above cannot always extract the words that are the factors behind the ranking.

For example, consider the automobile defect information described above, classified by risk level into the ranked categories “extremely dangerous”, “dangerous”, and “safe”. Even when the factors behind classification into the “extremely dangerous” category lie in words such as “fuel” and “leak” or “tire” and “detachment”, extracting category feature words with the statistical method described above may yield words such as “pump” and “pipe”, which merely occur frequently in the category and do not indicate “extremely dangerous”.

JP 2007-265068 A
JP 2007-241905 A

The problem to be solved by the present invention is to provide a document analysis apparatus and a program that extract words that are factors in a rise or fall in category rank.

The document analysis apparatus of the embodiment includes a document storage unit, a classification result storage unit, and a factor word extraction unit. The document storage unit stores document data. The classification result storage unit stores a plurality of ranked categories into which the document data is classified, together with the hierarchical structure of the categories. For one category among the plurality of categories, the factor word extraction unit extracts, from the group of words contained in the document data classified into that category, words whose appearance frequency in that category is higher than their appearance frequency in the other categories at the same level of the hierarchy, and whose appearance frequency in those other categories decreases as the rank moves farther from that category.

FIG. 1 is a block diagram showing the hardware configuration of the document analysis apparatus according to the embodiment.
FIG. 2 is a block diagram showing the functional configuration of the document analysis apparatus according to the embodiment.
FIG. 3 is a block diagram showing an example of the document data stored in the document storage unit of the document analysis apparatus according to the embodiment.
FIG. 4 is a block diagram showing an example of the categories stored in the classification result storage unit of the document analysis apparatus according to the embodiment.
FIG. 5 is a flowchart showing an example of the operation of the document analysis apparatus according to the embodiment.
FIG. 6 is a diagram showing a display example of the classification results of the document analysis apparatus according to the embodiment.
FIG. 7 is a diagram showing a display example when a set of ranked categories is input to the document analysis apparatus according to the embodiment.
FIG. 8 is a flowchart showing an example of the processing of the factor word extraction unit of the document analysis apparatus according to the embodiment.
FIG. 9 is a conceptual diagram of the factor word extraction processing performed by the factor word extraction unit of the document analysis apparatus according to the embodiment.
FIG. 10 is a conceptual diagram of the factor word extraction processing of the factor word extraction unit of the document analysis apparatus according to the embodiment.
FIG. 11 is a diagram showing an example of the extraction results of the factor word extraction unit of the document analysis apparatus according to the embodiment.
FIG. 12 is a flowchart showing an example of the processing of the compound factor word extraction unit of the document analysis apparatus according to the embodiment.
FIG. 13 is a flowchart showing an example of the processing of the compound factor word extraction unit of the document analysis apparatus according to the embodiment.
FIG. 14 is a diagram showing an example of the extraction results of the compound factor word extraction unit of the document analysis apparatus according to the embodiment.
FIG. 15 is a flowchart showing an example of the processing of the factor category generation unit of the document analysis apparatus according to the embodiment.
FIG. 16 is a diagram showing an example of the processing results of the factor category generation unit of the document analysis apparatus according to the embodiment.
FIG. 17 is a flowchart showing an example of the processing of the two-axis map visualization unit of the document analysis apparatus according to the embodiment.
FIG. 18 is a diagram showing a display example of the processing results of the two-axis map visualization unit of the document analysis apparatus according to the embodiment.

(First embodiment)
The document analysis apparatus of the embodiment is described below with reference to FIGS. 1 to 18. The apparatus described below can be implemented either as a hardware configuration or as a combination of hardware resources and software. As the software of the combined configuration, a program is used that is installed in advance on the computer of the corresponding apparatus from a network or a storage medium and that causes the computer to realize the functions of that apparatus.

FIG. 1 is a schematic diagram showing the configuration of the document analysis apparatus 10 according to the present embodiment. This computer 10 includes a main storage device 11, a keyboard 12, a mouse 13, a central processing unit 14, and a display device 15. The document analysis apparatus 10 is realized, for example, by a computer to which the document analysis program according to the present embodiment is applied.

The storage device 11 is a storage device that the central processing unit 14 can read from and write to, for example a RAM. The document analysis program of the present embodiment is stored in the storage device 11 in advance.

The keyboard 12 and the mouse 13 are ordinary input devices that, for example, input various kinds of information consisting of data or commands to the central processing unit 14 in response to operations by the operator.

The central processing unit 14 is an ordinary CPU and has, for example, a function of executing the document analysis program stored in the storage device 11, a function of controlling the execution of each program based on information input from the keyboard 12 or the mouse 13, and a function of outputting execution results and the like to the display device 15.

The display device 15 is an ordinary display device that has, for example, a function of displaying and visualizing each architecture model, feature model, and the like being edited, as well as other information output from the central processing unit 14.

The functional configuration of the document analysis apparatus 10 configured as described above is described with reference to FIG. 2.

The document analysis apparatus 10 includes a document storage unit 1, a classification result storage unit 2, a user interface unit 3, and a factor generation unit 4.

The document storage unit 1 stores the data of a plurality of documents (hereinafter referred to as document data) to be analyzed by the document analysis apparatus 10. In the present embodiment, the document data stored in the document storage unit 1 is document data on automobile defect information and is classified by a plurality of categories organized in a hierarchical structure.

An example of the document data stored in the document storage unit 1 is described here with reference to FIG. 3. The document data 20 stored in the document storage unit 1 is document data that reports automobile defect information. It has the items document number 21, title 22, body 23, defect risk level 24, developer 25, and purchase date 26, and stores the content of the document data corresponding to each item.

The document number 21 is an ID unique to the document data. The title 22 is the title of the document data 20. The body 23 is the body text of the document data 20. The defect risk level 24 indicates the risk level of the defect reported by the document data. The developer 25 is the company that developed the automobile in which the defect occurred, and the purchase date 26 is the date on which that automobile was purchased.

The document data 20 need not have all of the items named above, and other items may be added. Although a single piece of document data is shown here as an example, the document storage unit 1 may store a plurality of pieces of document data.
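The record layout described above can be pictured as a small data structure. The following is a minimal sketch only: the class name, field names, and sample values are hypothetical and chosen for illustration; only the six items (21 to 26) come from the description.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DocumentData:
    """One defect report; the fields mirror items 21 to 26 of document data 20."""
    document_number: str  # item 21: ID unique to the document
    title: str            # item 22
    body: str             # item 23
    risk_level: str       # item 24, e.g. "extremely dangerous"
    developer: str        # item 25
    purchase_date: date   # item 26


# Hypothetical sample values, for illustration only.
doc = DocumentData(
    document_number="d01",
    title="Fuel leak",
    body="Fuel leaked from the pipe while driving.",
    risk_level="extremely dangerous",
    developer="Company A",
    purchase_date=date(2011, 4, 1),
)
```

Since the description allows items to be omitted or added, an actual store would more likely use optional fields or a free-form mapping; the fixed dataclass is just the simplest picture.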

The classification result storage unit 2 stores the categories of the document data stored in the document storage unit 1 and the hierarchical structure of these categories. In the present embodiment, the categories stored in the classification result storage unit 2 are assumed to have been created in advance, with the document data already classified into them. The categories may also be created by clustering the document data.

The categories stored in the classification result storage unit 2 and their hierarchical structure are described with reference to FIG. 4. Each of the categories (a) to (f) shown in FIG. 4 has the items category number 301, parent category 302, category name 303, and documents 304.

The category number 301 stores unique data with which the document analysis apparatus 10 identifies the category.

The parent category 302 stores data indicating the parent category of the category, namely the category number of its parent category. For example, the parent category of category (a) is “(none)” because category (a) is located at the top (root) of the hierarchical structure of categories (a) to (f).

The parent category 302 of category (b) (category number “c02”) is the category with category number “c01”, that is, category (a). In other words, category (b) is one of the child categories of category (a). The parent category 302 is set in the same way for categories (c) to (f).

Hereinafter, in the present embodiment, the category that is the direct parent of a given category is called its parent category, and a category that is its direct child is called a child category. Categories that are direct or indirect parents of a given category are collectively called upper categories, and categories that are its direct or indirect children (descendants) are collectively called lower categories.

The category name 303 stores the name that conveys the content of the category to the user.

The documents 304 item stores the document numbers of the document data included in the category. When a plurality of documents belong to the category, their document numbers are all listed.

As shown in category (c), a condition 305 that documents classified into the category must satisfy may be set, and only the document data satisfying the condition stored in the condition 305 may be listed in the documents 304. For categories (a), (b), and (d), the documents 304 item is “(none)”. This means that no documents are classified directly into the category; documents classified indirectly via lower categories may still exist. For example, the documents indirectly classified into category (a) are the union of the documents classified into all lower categories of category (a).
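The category items and the union rule for indirectly classified documents can be sketched as follows. This is an illustrative sketch, not the stored format: the class and function names are hypothetical, the category names and document numbers other than “c01” and “c02” are made up, and the optional condition 305 is left out.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Category:
    number: str            # item 301: unique category number
    parent: Optional[str]  # item 302: parent's category number, None for the root
    name: str              # item 303
    documents: List[str] = field(default_factory=list)  # item 304: directly classified documents


def all_documents(cat_no: str, cats: Dict[str, Category]) -> Set[str]:
    """Documents classified into a category directly or via any lower category."""
    docs = set(cats[cat_no].documents)
    for cat in cats.values():
        if cat.parent == cat_no:  # recurse into each child category
            docs |= all_documents(cat.number, cats)
    return docs


# Hypothetical hierarchy; only the numbering of c01 (root) and c02 follows the text.
cats = {
    "c01": Category("c01", None, "(root)"),
    "c02": Category("c02", "c01", "by defect risk level"),
    "c03": Category("c03", "c02", "extremely dangerous", ["d01", "d02"]),
    "c04": Category("c04", "c02", "dangerous", ["d03"]),
}
root_docs = all_documents("c01", cats)
```

Here `root_docs` collects the documents of every lower category even though the root itself lists none directly, which is the situation described for categories (a), (b), and (d).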

The user interface unit 3 includes a category display operation unit 31 and a two-axis map visualization unit 32, and receives input information and instruction information from the user.

The category display operation unit 31 displays a screen that presents to the user the categories stored in the classification result storage unit 2 and their hierarchical structure (hereinafter referred to as the category display screen). The category display operation unit 31 also functions as an input unit that accepts user operations on the displayed category display screen. The category display screen is described later.

The two-axis map visualization unit 32 takes as a first classification axis all of the ranked categories that are child categories of the category the user has input with the category display operation unit 31, and takes as a second classification axis the categories generated by the factor category generation unit 43 on the basis of the extraction results of the factor word extraction unit 41 and the compound factor word extraction unit 42, both described later. It then displays on a two-axis map the number of documents counted at each intersection of a category on the first axis with a category on the second axis.
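The counting behind such a two-axis map amounts to a cross-tabulation of two category sets. The sketch below assumes each category is represented simply as a set of document numbers; the function name and the sample data are hypothetical.

```python
from typing import Dict, Set, Tuple


def cross_tabulate(
    axis1: Dict[str, Set[str]],
    axis2: Dict[str, Set[str]],
) -> Dict[Tuple[str, str], int]:
    """Count the documents shared by each (axis-1 category, axis-2 category) pair."""
    return {
        (a, b): len(docs_a & docs_b)
        for a, docs_a in axis1.items()
        for b, docs_b in axis2.items()
    }


# Hypothetical document numbers, for illustration only.
risk_axis = {"extremely dangerous": {"d01", "d02"}, "safe": {"d03"}}
factor_axis = {"fuel + leak": {"d01", "d02", "d03"}, "seat belt": {"d04"}}
table = cross_tabulate(risk_axis, factor_axis)
```

Each value in `table` is the number that would be shown in one cell of the map.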

Next, the factor generation unit 4 is described. The factor generation unit 4 includes a factor word extraction unit 41, a compound factor word extraction unit 42, and a factor category generation unit 43.

When the user inputs a category with the category display operation unit 31, the factor word extraction unit 41 extracts, for the ranked categories that are child categories of the input category, a group of words (first word group) that are factors in the rise or fall of category rank.

The compound factor word extraction unit 42 extracts, from the first word group extracted by the factor word extraction unit 41, combinations of words in the first word group that, when combined, are factors in the rise or fall of category rank.

Furthermore, for the group of words not extracted by the factor word extraction unit 41 (second word group), the compound factor word extraction unit 42 likewise extracts combinations of words in the second word group that, when combined, are factors in the rise or fall of category rank.
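As a rough picture of what evaluating word combinations involves, the sketch below counts, for every pair of words in a word group, how many documents in each ranked category contain both words. The procedure of the compound factor word extraction unit itself is described with FIGS. 12 and 13; this counting step, the restriction to pairs, the function name, and the sample data are all assumptions made for illustration.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, List, Set


def pair_document_frequency(
    docs_by_category: Dict[str, List[Set[str]]],
    words: Iterable[str],
) -> Dict[FrozenSet[str], Dict[str, int]]:
    """For each word pair, count per category the documents containing both words."""
    result: Dict[FrozenSet[str], Dict[str, int]] = {}
    for pair in combinations(sorted(set(words)), 2):
        result[frozenset(pair)] = {
            category: sum(1 for doc in docs if set(pair) <= doc)
            for category, docs in docs_by_category.items()
        }
    return result


# Hypothetical documents, each reduced to its set of words.
docs_by_category = {
    "extremely dangerous": [{"fuel", "leak"}, {"fuel", "leak", "pipe"}, {"tire"}],
    "safe": [{"fuel"}, {"leak"}],
}
freq = pair_document_frequency(docs_by_category, ["fuel", "leak"])
```

Per-category counts of this kind could then be judged with the same rank-aware criterion that is applied to single words.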

The factor category generation unit 43 creates new categories based on the words and word combinations extracted by the factor word extraction unit 41 and the compound factor word extraction unit 42, and stores them in the classification result storage unit 2. Document data containing the corresponding word or word combination is classified into each new category.

The document storage unit 1 and the classification result storage unit 2 are realized using storage means of the computer, such as a file system or a database. The user interface unit 3 is realized by the keyboard 12, the mouse 13, and the display device 15. The factor generation unit 4 is realized by the central processing unit 14 executing the program stored in the storage device 11.

The following describes, with reference to FIGS. 5 to 18, an example of the operation of the document analysis apparatus 10. The example assumes the document data on automobile defect information stored in the document storage unit 1 as shown in FIG. 3, with a plurality of pieces of document data stored for each of the ranked categories “extremely dangerous”, “slightly dangerous”, “dangerous”, and “safe”, ordered by the risk level of the defect. The user wants to know why document data was classified into the ranked category “dangerous”, that is, the factors behind the assigned risk level. The apparatus therefore extracts, from the words contained in the document data belonging to the ranked category “dangerous”, words indicating the factors behind classification into that category (hereinafter referred to as factor words).

FIG. 5 is a flowchart showing an example of the operation of the document analysis apparatus 10.

First, the category display screen is displayed on the user interface unit 3 (step S401).

FIG. 6 shows an example of the category display screen displayed in step S401. On the category display screen 50 shown in FIG. 6, the hierarchical structure of the categories is displayed in the left part 501 of the screen, and when one of the categories is selected, a list of the titles of the documents belonging to the selected category is displayed in the upper right part 502. When one of the listed titles is then selected, the body of that document data is displayed in the lower right part 503. In FIG. 6, the “extremely dangerous” category, shown with dot shading, has been selected as the category whose document data the user wants to check, and the titles of the document data belonging to the “extremely dangerous” category are listed in the upper right part 502. Among these, “fuel leak”, also shown with dot shading, has been selected, and the body of the “fuel leak” document data is displayed in the lower right part 503.

Next, from among the parent categories displayed in the left part 501 of the category display screen 50, the user inputs, via the category display operation unit 31, the parent category whose ranked categories are to be analyzed for ranking factors (that is, from which factor words are to be extracted) (step S402).

FIG. 7 shows an example of the category display screen 50 when the parent category is input in step S402. As shown in FIG. 7, in step S402 the user uses the user interface unit 3 to select the “by defect risk level” category 601 as the parent category from which factor words are to be extracted. A message screen 602 then appears with the message “Extract the factors of each category directly under the selected category?”, and when the user presses the “YES” button 603, the factor word extraction processing described later is executed. If the “NO” button 604 is pressed, the processing ends (not shown).

When the parent category is input, the factor generation unit 4 performs factor generation processing (steps S403 and S404). The factor generation processing consists of factor word extraction processing, which extracts single words that are factors in the rise or fall of rank for each ranked category directly under the parent category being analyzed, and compound factor word extraction processing, which extracts combinations of words that are factors in the rise or fall of rank for each ranked category.

The factor word extraction processing performed by the factor word extraction unit 41 in step S403 is described with reference to FIGS. 8 and 9. The case described here is factor word extraction for the “by defect risk level” category of FIG. 6.

First, the factor word extraction unit 41 initializes list1, the list that holds the extraction results of the factor word extraction processing (step S1101). For each ranked category Ci (step S1102), the following is repeated for each word t in all the documents of category Ci (step S1103): it is determined whether the appearance frequency of t in Ci is statistically significantly large (step S1104).

Whether the frequency is statistically significantly large is determined, for example, by a χ² test. In this χ² test, the result is statistically significant when the χ² statistic is larger than the critical value of the χ² distribution with one degree of freedom, which is 3.84 at the 5% significance level and 6.63 at the 1% significance level. Conversely, the result is not statistically significant when the χ² statistic is smaller than the critical value. The χ² statistic is calculated by equation (1) shown in FIG. 9.

In equation (1), x11 is the number of documents in category Ci in which the word t appears. a1 is the number of documents, among all documents, in which the word t appears. b1 is the number of documents in category Ci. n is the total number of documents. In addition, x12 = a1 − x11, x21 = b1 − x11, and x22 = n − a1 − x21.
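Equation (1) itself appears only in FIG. 9. Assuming it is the standard Pearson χ² statistic for a 2×2 contingency table built from x11, x12, x21, and x22 (an assumption, not something the text confirms), the statistic can be computed as in the sketch below. The function name and the sample counts are hypothetical.

```python
def chi_square_2x2(x11: int, a1: int, b1: int, n: int) -> float:
    """Pearson chi-square statistic for the 2x2 table defined by x11, a1, b1, n.

    Assumed form: n * (x11*x22 - x12*x21)^2 / (a1 * (n - a1) * b1 * (n - b1)).
    """
    x12 = a1 - x11        # documents outside Ci that contain t
    x21 = b1 - x11        # documents in Ci that do not contain t
    x22 = n - a1 - x21    # documents outside Ci that do not contain t
    denominator = a1 * (n - a1) * b1 * (n - b1)
    if denominator == 0:  # degenerate table: a row or column total is zero
        return 0.0
    return n * (x11 * x22 - x12 * x21) ** 2 / denominator


# Hypothetical counts: t occurs in 40 of 200 documents, 30 of them inside Ci (50 documents).
chi = chi_square_2x2(x11=30, a1=40, b1=50, n=200)

# When t is spread exactly in proportion to category size, the statistic is zero.
chi_independent = chi_square_2x2(x11=10, a1=40, b1=50, n=200)
```

The resulting value is what would be compared with the thresholds 3.84 and 6.63 cited in the description.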

If the χ² test does not find the frequency significant (No in step S1104), the processing is repeated for the next word. If it is found significant (Yes in step S1104), score(Ci, t), the score of the word t, is calculated, and t and its score are added to list1 (step S1105). score(Ci, t) is calculated by equation (2) in FIG. 9, which is explained below. The ranked categories are denoted (C1, C2, …, Ci, …, Cm).
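The control flow of steps S1101 to S1105 can be sketched as a pair of nested loops. In this sketch the significance test and the score are passed in as callables so that the loop stays independent of their exact formulas; the function name, the representation of a document as a set of words, and the 0-based category index are assumptions made for illustration.

```python
from typing import Callable, List, Set, Tuple


def extract_factor_words(
    docs_by_rank: List[List[Set[str]]],          # ranked categories C1..Cm, each a list of documents
    is_significant: Callable[[int, str], bool],  # stands in for the test of step S1104
    score: Callable[[int, str], float],          # stands in for equation (2)
) -> List[Tuple[int, str, float]]:
    list1: List[Tuple[int, str, float]] = []     # S1101: initialize the result list
    for i, docs in enumerate(docs_by_rank):      # S1102: each ranked category (0-based index)
        vocabulary = set().union(*docs)          # every word in the category's documents
        for t in sorted(vocabulary):             # S1103: each word t
            if is_significant(i, t):             # S1104: keep only significant words
                list1.append((i, t, score(i, t)))  # S1105: record t with its score
    return list1


# Hypothetical data and stand-in callables, for illustration only.
docs_by_rank = [[{"fuel", "leak"}], [{"seat belt"}]]
result = extract_factor_words(
    docs_by_rank,
    is_significant=lambda i, t: t != "leak",
    score=lambda i, t: 1.0,
)
```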

In equation (2), df(Ci, t) is the number of documents in category Ci in which the word t appears, and D is the set of all documents. flag(i, j) is a function (variable) that takes the value −1 when i < j and +1 when i > j.

The idea behind equation (2) shown in FIG. 9 is as follows. For each ranked category, a factor word is taken to be a word that appears frequently in that category and whose appearance frequency decreases as the rank rises or falls away from that category (as the rank becomes more distant). Equation (2) is defined on this basis.

First, df(Ci, t)/df(D, t) in equation (2) gives the appearance frequency of the word t in category Ci. Next, df(Cj+1, t) − df(Cj, t) gives the change in appearance frequency between adjacent categories, and this is multiplied by log(m − |j+1−i|) so that changes between categories ranked close to the factor category are weighted more heavily. flag(i, j) takes the value +1 for ranks before the factor category, so that the score increases when the appearance frequency rises toward it, and −1 for ranks after the factor category, so that the score increases when the appearance frequency falls away from it.
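Equation (2) appears only in FIG. 9, so the sketch below is one possible reading of the terms described above, not the equation itself. Two points are assumptions: the frequency ratio is multiplied by the weighted sum over adjacent category pairs, and flag(i, j) is treated as −1 when j = i, a case the text leaves undefined. The function names and sample frequencies are hypothetical.

```python
import math
from typing import Sequence


def flag(i: int, j: int) -> int:
    """+1 for ranks before the target (j < i), -1 after it (j > i).

    The text does not define j == i; treating it as -1 is an assumption.
    """
    return 1 if j < i else -1


def score(df: Sequence[int], i: int, df_total: int) -> float:
    """Assumed reading of score(Ci, t).

    df[k - 1] holds df(Ck, t) for the ranked categories C1..Cm,
    i is the 1-based rank of the target category, df_total is df(D, t).
    """
    m = len(df)
    if df_total == 0:
        return 0.0
    weighted_sum = 0.0
    for j in range(1, m):                      # adjacent pairs (Cj, Cj+1)
        diff = df[j] - df[j - 1]               # df(Cj+1, t) - df(Cj, t)
        weight = math.log(m - abs(j + 1 - i))  # larger for pairs ranked near Ci
        weighted_sum += flag(i, j) * diff * weight
    return df[i - 1] / df_total * weighted_sum


# Hypothetical frequencies over four ranked categories, target rank i = 2.
peaked = [2, 10, 3, 1]  # peaks at C2 and falls off on both sides
flat = [5, 6, 5, 6]     # no clear peak
peaked_score = score(peaked, 2, sum(peaked))
flat_score = score(flat, 2, sum(flat))
```

Under this reading a word whose frequency peaks at the target category and falls away on both sides scores higher than a word with a flat profile, which is the behavior the description asks of the score.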

FIG. 10 shows an example of the words contained in all the document data belonging to the parent category “by defect risk level” and the appearance frequency of each word in each child category of the category input in step S402. The rows of the table in FIG. 10 are the words contained in the document data, and the columns are the child categories of the category input in step S402.

As shown in FIG. 10, “seat belt” has an appearance frequency of 39 in the “slightly dangerous” category, higher than in any other ranked category. Furthermore, the frequency falls from 39 to 14 between the “slightly dangerous” and “extremely dangerous” categories, from 39 to 12 between “slightly dangerous” and “dangerous”, and from 12 to 5 between “dangerous” and “extremely dangerous”. The word “seat belt” is therefore a word that appears frequently in the “slightly dangerous” category and whose appearance frequency decreases as the rank moves away from that category.

By contrast, the word “fuel” also has its highest appearance frequency in the “slightly dangerous” category, but the frequency falls from 32 to 14 between “slightly dangerous” and “extremely dangerous” and from 32 to 26 between “slightly dangerous” and “dangerous”, and it rises from 26 to 28 between “dangerous” and “extremely dangerous”. That is, because the frequency of “fuel” decreases less than that of “seat belt”, its score for the “slightly dangerous” category tends to be low, and it is less likely to be extracted as a factor.

Although Equation (2) is used as an example in this embodiment, the score is not limited to Equation (2). Any formula may be used as long as it follows the idea that a factor word appears frequently in the category and that its appearance frequency decreases as the rank moves away from the category. For example, the division by df(D, t) may be omitted, or the log may be dropped from log(m − |j+1 − i|).

After the loop over all words is complete (end of the loop of step S1103), list1(Ci) for category Ci is sorted by score(Ci, t), and the top K words (K is specified by the user; 3 here) are kept in list1(Ci) (step S1106). That is, all words other than the top K are deleted.
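The top-K selection of step S1106 can be sketched in a few lines of Python. The function name keep_top_k, the dictionary representation of list1(Ci), and the scores are hypothetical and serve only to illustrate the sort-and-truncate step.

```python
def keep_top_k(scored_words, k):
    """Return only the k highest-scoring words, in descending score order."""
    ranked = sorted(scored_words.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:k])


# Hypothetical scores for one ranked category Ci.
list1_ci = {"tire": 4.2, "falling off": 3.9, "mirror": 0.8, "accelerator": 3.1}
list1_ci = keep_top_k(list1_ci, 3)
```

With K = 3, the lowest-scoring word (“mirror” in this made-up data) is removed and the remaining three stay in score order.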

After the loop over all Ci is complete (end of the loop of step S1102), the factor word extraction unit 41 outputs list1 to the compound factor word extraction unit 42 (step S1107). This completes the factor word extraction process.

FIG. 11 shows an example of list1 output in step S1107. As shown in FIG. 11, list1 contains “tire”, “falling off”, and “accelerator” as factor words for the “extremely dangerous” category; “hose”, “hub”, and “steering wheel” for the “dangerous” category; “seat belt” and “indicator lamp” for the “slightly dangerous” category; and “air conditioning”, “speaker”, and “does not work” for the “safe” category. In this way, single words are extracted as factors for each ranked category.

The description now returns to FIG. 5. When the factor word extraction process by the factor word extraction unit 41 ends in step S403, the compound factor word extraction unit 42 performs the compound factor word extraction process (step S404).

An example of the compound factor word extraction process executed by the compound factor word extraction unit 42 in step S404 of FIG. 5 is described with reference to FIGS. 12 and 13.

The compound factor word extraction process takes the words extracted in step S403 and extracts combinations of those words that, taken together, cause the category rank to rise or fall. In addition, for the set of words that the factor word extraction unit 41 did not extract, it likewise extracts combinations of words that, taken together, cause the category rank to rise or fall.

First, the compound factor word extraction unit 42 initializes list2, the list of results of the compound factor word extraction process (step S1301). It then loops over each ranked category Ci (step S1302) and performs steps S1303 to S1318.

When the compound factor word extraction unit 42 selects one of the ranked categories in step S1302, it sets the variable conbCnt, which counts the number of words to combine, to an initial value of 2; sets termSet, the set of words to be combined, to the output of the factor word extraction unit 41 (that is, the input to the compound factor word extraction unit 42); and sets the variable termConb, which holds the set of word combinations, to an initial value of null (step S1303). It then forms combinations of conbCnt words from the words in termSet (step S1304).

The words combined in step S1304 are limited to words for which the number of documents in category Ci containing the word (hereinafter, the number of appearing documents) is at least a predetermined threshold. This restriction reduces the number of combinations and speeds up processing.

For example, suppose the threshold on the number of appearing documents is 3, conbCnt = 2, and termSet consists of word A, word B, word C, and word D. If the numbers of appearing documents in category Ci for word A, word B, word C, and word D are 3, 4, 2, and 4, respectively, word C falls below the threshold, so combinations that include word C are excluded, and termConb becomes “word A and word B”, “word A and word D”, and “word B and word D”.

Similarly, when conbCnt = 3, termSet is “word A and word B”, “word A and word D”, and “word B and word D”, and the only possible termConb is “word A, word B, and word D”. If the numbers of appearing documents in category Ci for these three pairs are 2, 3, and 4, respectively, “word A and word B” does not meet the threshold. Any combination that includes “word A and word B”, namely “word A, word B, and word D”, is therefore excluded. This approach to forming combinations is known as the Apriori algorithm.
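The two-step example above can be reproduced with a small Python sketch of Apriori-style candidate generation. The helper names frequent and candidates and the dictionary layout are hypothetical; the counts and the threshold of 3 are the ones given in the example.

```python
from itertools import combinations


def frequent(counts, threshold):
    """Keep the word sets whose number of appearing documents meets the threshold."""
    return {frozenset(words) for words, n in counts.items() if n >= threshold}


def candidates(frequent_sets, size):
    """Form combinations of `size` words whose every (size - 1)-word subset is frequent."""
    vocabulary = sorted(set().union(*frequent_sets))
    return [
        combo
        for combo in combinations(vocabulary, size)
        if all(frozenset(sub) in frequent_sets for sub in combinations(combo, size - 1))
    ]


# Numbers of appearing documents in category Ci, as in the example (threshold 3).
single_counts = {("A",): 3, ("B",): 4, ("C",): 2, ("D",): 4}
pairs = candidates(frequent(single_counts, 3), 2)

pair_counts = {("A", "B"): 2, ("A", "D"): 3, ("B", "D"): 4}
triples = candidates(frequent(pair_counts, 3), 3)
```

Here pairs contains the three combinations of A, B, and D, and triples is empty because “word A and word B” falls below the threshold, so the only possible three-word combination is pruned without counting its documents.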

If termConb produced in step S1304 by the Apriori approach is not null (step S1305), the unit loops over each word t of termConb (here, a word combination) (step S1306) and determines whether the number of documents in Ci that contain t is statistically significantly large (step S1307). This significance determination uses the same method as step S1104 in FIG. 8.

If the number of documents in Ci that contain the word t is statistically significantly large (Yes in step S1307), the compound factor word extraction unit 42 calculates score(Ci, t) and adds t and its score to list2 (step S1308).

If the number of documents in Ci that contain the word t is not statistically significantly large (No in step S1307), the process returns to step S1306 and repeats with another word t. The calculation of score(Ci, t) here uses the same method as step S1105 in FIG. 8.

After the loop over all words is complete (end of the loop of step S1306), the compound factor word extraction unit 42 increments conbCnt by one and assigns termConb to termSet (step S1309). The compound factor word extraction unit 42 then repeats steps S1304 to S1309 until termConb is determined to be null in step S1305.

After termConb is determined to be null (No in step S1305), the compound factor word extraction unit 42 sorts list2(Ci) for the category Ci by score(Ci, t), extracts the top K word pairs (2 here), and puts them in list2(Ci); only the words that were not extracted as part of a pair are kept in list1(Ci) as single factor words (step S1310). That is, when a word combination is extracted by the compound factor word extraction process, the words in it are deleted from list1, the output of the factor word extraction process. If fewer than K word pairs are available, all extracted word pairs are put in list2(Ci).
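The loop of steps S1303 to S1310 for a single category can be sketched as follows. This is a minimal sketch, not the flowchart itself: the function name and its parameters are hypothetical, the significance test and the score are passed in as callables because their definitions (steps S1104 and S1105) lie outside this passage, and the removal of paired words from list1 is omitted.

```python
from itertools import combinations


def extract_compound_factors(words, doc_count, is_significant, score, min_docs, top_k):
    """Level-wise search for word combinations, loosely following S1303 to S1310."""
    term_set = {frozenset([w]) for w in words if doc_count(frozenset([w])) >= min_docs}
    comb_cnt = 2
    scored = []
    while True:
        vocabulary = sorted(set().union(*term_set))
        term_comb = set()
        for combo in combinations(vocabulary, comb_cnt):
            subsets_ok = all(
                frozenset(sub) in term_set for sub in combinations(combo, comb_cnt - 1)
            )
            if subsets_ok and doc_count(frozenset(combo)) >= min_docs:
                term_comb.add(frozenset(combo))
        if not term_comb:                     # corresponds to the null check
            break
        for combo in term_comb:
            if is_significant(combo):
                scored.append((score(combo), combo))
        comb_cnt += 1                         # corresponds to S1309
        term_set = term_comb
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [combo for _, combo in scored[:top_k]]


# Toy documents of one category, each reduced to its set of words (hypothetical data).
docs = [{"tire", "falling off"}, {"tire", "falling off"},
        {"tire", "falling off", "road"}, {"road"}, {"tire"}]


def count_docs(term):
    return sum(1 for d in docs if term <= d)


result = extract_compound_factors(
    ["tire", "falling off", "road"], count_docs,
    is_significant=lambda term: True,          # placeholder for the S1104-style test
    score=count_docs,                          # placeholder for the S1105-style score
    min_docs=2, top_k=2,
)
```

In this toy data only “tire” with “falling off” appears together in at least two documents, so it is the single combination returned.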

Steps S1301 to S1310 above extract combinations of the single words extracted by the factor word extraction unit 41 that, taken together, cause the category rank to rise or fall. The subsequent steps S1311 to S1318 target the words that the factor word extraction unit 41 did not extract, and extract combinations of those words that, taken together, cause the category rank to rise or fall.

First, the compound factor word extraction unit 42 sets the variable conbCnt, which counts the number of words to combine, to an initial value of 2; sets the output list3 to null; sets termSet, the set of words to be combined, to those words among all words of Ci that were not output by the factor word extraction unit 41, that is, the words that were not input to the compound factor word extraction unit 42; and sets the variable termConb, which holds the set of word combinations, to an initial value of null (step S1311).

The subsequent steps S1312 to S1316 are performed in the same manner as steps S1304 to S1308.

That is, the compound factor word extraction unit 42 forms combinations of conbCnt words from the words in termSet using the Apriori approach (step S1312). If the resulting termConb is not null (Yes in step S1313), it loops over each word t of termConb (step S1314) and determines whether the number of documents in Ci that contain t is statistically significantly large (step S1315).

If the number of documents in Ci that contain the word t is statistically significantly large (Yes in step S1315), the compound factor word extraction unit 42 calculates score(Ci, t) and adds t and its score to list3 (step S1316).

If the number of documents in Ci that contain the word t is not statistically significantly large (No in step S1315), the process returns to step S1314 and repeats with another word.

After the loop over every word t of termConb is complete (end of the loop of step S1314), the compound factor word extraction unit 42 increments conbCnt by one and assigns termConb to termSet (step S1317). The compound factor word extraction unit 42 then repeats steps S1312 to S1317 until termConb becomes null in step S1313.

If termConb is null (No in step S1313), the compound factor word extraction unit 42 sorts list3(Ci), extracts the top K entries (2 here), and puts them in list2(Ci). After the loop over all categories Ci is complete (end of the loop of step S1302), the compound factor word extraction unit 42 outputs list2 (step S1319). This completes the compound factor word extraction process.

FIG. 14 shows an example of list2 output by the compound factor word extraction unit 42 in step S1319. As shown in FIG. 14, list2 contains the combination of “tire” and “falling off” for the “extremely dangerous” category, the combination of “steering wheel” and “becomes heavy” for the “dangerous” category, the combination of “indicator lamp” and “does not light” for the “slightly dangerous” category, and the combination of “air conditioning” and “does not work” for the “safe” category.

Presenting these word combinations, which represent compound factors, to the user makes it easier for the user to understand the factors behind the ranking.

In list3 of FIG. 14, the “extremely dangerous” category contains the combination of “fuel” and “leak” and the combination of “brake” and “does not work”. The “dangerous” category contains the combination of “fuel” and “remaining amount” and the combination of “brake” and “works poorly”. The “slightly dangerous” category contains the combination of “door”, “lock”, and “cannot”. The “safe” category contains the combination of “door” and “does not open”.

As list3 shows, when factor combinations are extracted from the set of words that the factor word extraction process did not extract, words such as “fuel” and “brake”, which appear in both the “extremely dangerous” and “dangerous” categories, may be extracted. In other words, a word such as “fuel” or “brake” is not a factor for any category on its own, but combining it with “leak” or “remaining amount”, or with “does not work” or “works poorly”, makes it possible to extract word combinations that are factors for “extremely dangerous” or “dangerous”.

The description now returns to FIG. 5. When the compound factor word extraction process ends, the factor category generation unit 43 of the factor generation unit 4 performs a factor category generation process, which generates new categories from the words and word combinations extracted by the factor extraction processes of steps S403 and S404 (step S405).

The factor category generation process performed by the factor category generation unit 43 is described with reference to FIGS. 15 and 16.

FIG. 15 is a flowchart showing an example of the factor category generation process executed by the factor category generation unit 43. In the factor category generation process, the factor category generation unit 43 generates, for each ranked category Ci, new categories based on the words extracted in steps S403 and S404, and classifies the document data belonging to that ranked category into the generated categories.

The factor category generation unit 43 loops over each ranked category Ci (step S1401) and, within it, over each word and word combination for that category stored in list1 and list2, which are output by the factor word extraction unit 41 and the compound factor word extraction unit 42 (step S1402). For each one, it collects the document data containing the word or word combination and generates a category (step S1403). The generated category is named after the word or word combination used to collect the document data.
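The nested loop of steps S1401 to S1403 can be sketched in Python as follows. The data layout (documents reduced to word sets, terms as tuples of words), the function name, and the use of “ + ” to join the words of a combination into a category name are assumptions made for illustration.

```python
def generate_factor_categories(docs_by_category, terms_by_category):
    """Create one new category per (ranked category, word or word combination)."""
    new_categories = {}
    for ci, terms in terms_by_category.items():           # loop of S1401
        for term in terms:                                 # loop of S1402
            name = " + ".join(term)                        # category named after the term
            members = [
                doc_id
                for doc_id, words in docs_by_category[ci].items()
                if all(word in words for word in term)     # S1403: collect matching documents
            ]
            new_categories[(ci, name)] = members
    return new_categories


# Hypothetical documents of one ranked category, each reduced to its set of words.
docs_by_category = {
    "extremely dangerous": {
        "d1": {"tire", "falling off", "highway"},
        "d2": {"accelerator", "stuck"},
        "d3": {"tire", "worn"},
    }
}
terms_by_category = {"extremely dangerous": [("accelerator",), ("tire", "falling off")]}

factor_categories = generate_factor_categories(docs_by_category, terms_by_category)
```

In this made-up data, the category named “tire + falling off” receives only document d1, because d3 mentions “tire” without “falling off”.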

After the loop over all categories Ci is complete (end of the loop of step S1401), the generated categories are registered in the classification result storage unit 2 (step S1404). This completes the factor category generation process.

After the factor category generation process, the newly generated categories are displayed on the user interface unit 3.

FIG. 16 shows an example of the category screen displayed on the user interface unit 3 after the factor category generation process. On the category display screen 51 shown in FIG. 16, categories 901 to 915, into which the documents containing the words and word combinations extracted in steps S403 and S404 are classified, have been generated directly under the “extremely dangerous”, “dangerous”, “slightly dangerous”, and “safe” categories.

Next, step S406 of FIG. 5 is described. The two-axis map visualization unit 32 creates and displays a two-axis map whose axes are the categories generated in step S405 and the set of ranked categories input in step S401 (step S406).

The two-axis map visualization process is described with reference to FIGS. 17 and 18.

The two-axis map visualization unit 32 initializes viewList, the list that stores the data to be displayed on the two-axis map (step S1501). Next, the two-axis map visualization unit 32 loops over each ranked category Ci (step S1502) and, within it, over each new category Cf generated by the factor category generation unit 43 (step S1503). It obtains the document data that appears in both Ci and Cf and stores the number of such documents in viewList in association with the name of the ranked category Ci and the name of the new category Cf (step S1504).

After the loop over all new categories Cf is complete (end of the loop of step S1503), the process repeats for the remaining ranked categories. After the loop over all ranked categories Ci is complete (end of the loop of step S1502), the two-axis map visualization unit 32 outputs viewList (step S1505). This completes the two-axis map visualization process.
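The cross tabulation of steps S1501 to S1505 amounts to counting, for every pair of a ranked category Ci and a new category Cf, the documents that belong to both. A minimal Python sketch, with a hypothetical function name and made-up document identifiers, might look like this:

```python
def build_view_list(ranked_categories, factor_categories):
    """Count the documents shared by each (ranked category, new category) pair."""
    view_list = []                                             # S1501
    for ci_name, ci_docs in ranked_categories.items():         # loop of S1502
        for cf_name, cf_docs in factor_categories.items():     # loop of S1503
            shared = set(ci_docs) & set(cf_docs)                # documents in both Ci and Cf
            view_list.append((ci_name, cf_name, len(shared)))   # S1504
    return view_list                                            # S1505


# Hypothetical document identifiers per category.
ranked_categories = {
    "extremely dangerous": ["d1", "d2", "d3"],
    "dangerous": ["d4", "d5"],
}
factor_categories = {
    "tire + falling off": ["d1", "d3", "d5"],
    "steering wheel + becomes heavy": ["d4"],
}

view_list = build_view_list(ranked_categories, factor_categories)
```

Each tuple in view_list corresponds to one cell of the two-axis map: the two category names identify the cell, and the count is the value shown in it.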

FIG. 18 shows an example of a two-axis map screen that displays the viewList output by the two-axis map visualization unit 32. As shown in FIG. 18, on the two-axis map screen 60 the vertical axis lists the new categories generated by the factor category generation unit 43 and the horizontal axis lists the ranked categories input to the factor word extraction unit 41. Each cell where the two axes cross shows the number of document data items that fall under both the vertical-axis category and the horizontal-axis category. For each ranked category, the cells of the categories that correspond to words and word combinations extracted for it by the factor extraction process are highlighted; here, the highlighting is shown as hatching.

As described above, the document analysis apparatus 10 of this embodiment takes ranked categories, whose ranks are set in advance, and extracts the words that cause the rank to rise or fall on the basis of the rank relationship among the categories. This allows the user to understand the factors behind the ranks assigned to the categories and thus to grasp the content of the document data easily.

The document analysis apparatus 10 of this embodiment can extract not only single factor words but also compound factors, that is, combinations of words. It can also extract combinations of words that become a factor when combined even though none of the words is a factor on its own. This makes it easier for the user to grasp the content of the document data.

The document analysis apparatus 10 of this embodiment also uses the categories generated by the factor category generation unit 43 as classification axis 1 and the preset ranked categories as classification axis 2. It performs a cross tabulation by counting, for every combination of a category on classification axis 1 and a category on classification axis 2, the number of document data items classified into both, and creates a two-axis map from the result. This makes it easier for the user to grasp the content of the document data.

Another example of analysis with the document analysis apparatus 10 of this embodiment is the analysis of product information. When product information is classified into sales categories by age group, the sales factors for a given age group can be extracted.

Although an embodiment of the present invention has been described, the embodiment is presented as an example and is not intended to limit the scope of the invention. This novel embodiment can be implemented in various other forms, and various omissions, replacements, and changes can be made without departing from the gist of the invention. The embodiment and its modifications are included in the scope and gist of the invention, and in the invention described in the claims and its equivalents.

For example, when the user selects a given category, the document analysis apparatus 10 of this embodiment extracts factor words for all ranked categories that are child categories of that category. Alternatively, the user may select a single ranked category, and factor words may be extracted for that ranked category only.

1: document storage unit, 2: classification result storage unit, 3: user interface unit, 31: category display operation unit, 32: two-axis map visualization unit, 4: factor generation unit, 41: factor word extraction unit, 42: compound factor word extraction unit, 43: factor category generation unit

Claims

A document analysis apparatus that includes a document storage unit for storing document data and that analyzes the document data, the apparatus comprising:
a classification result storage unit that stores a plurality of ranked categories for classifying the document data and a hierarchical structure of the categories; and
a factor word extraction unit that, for one category of the plurality of categories, extracts, from the word group included in the document data classified into the category, a word whose appearance frequency in the category is higher than its appearance frequency in the other categories in the same hierarchy as the category, and whose appearance frequency in the other categories decreases as their rank becomes farther from the category.

The document analysis apparatus according to claim 1, wherein the factor word extraction unit creates combinations of the extracted words and, for one category of the plurality of categories, extracts from the created combinations a combination of words whose appearance frequency in the category is higher than its appearance frequency in the other categories and whose appearance frequency in the other categories decreases as their rank becomes farther from the category.

The document analysis apparatus according to claim 1 or 2, wherein the factor word extraction unit creates combinations of words that are included in the document data classified into the category and that are not among the extracted words and, for one category of the plurality of categories, extracts from the created combinations a combination of words whose appearance frequency in the category is higher than its appearance frequency in the other categories and whose appearance frequency in the other categories decreases as their rank becomes farther from the category.

The document analysis apparatus according to any one of claims 1 to 3, further comprising a factor category generation unit that generates, in a hierarchy below the category subject to extraction by the factor word extraction unit, a category for classifying document data that includes the word or the combination of words extracted by the factor word extraction unit.

The document analysis apparatus according to any one of claims 1 to 4, further comprising a two-axis map creation unit that displays the number of document data items classified on both axes, with the categories generated by the factor category generation unit as a first classification axis and the ranked categories stored in the classification result storage unit as a second classification axis.

A document analysis program for a document analysis apparatus that includes a document storage unit for storing document data and that analyzes the document data, the program causing a computer of the document analysis apparatus to realize:
a function of storing a plurality of ranked categories for classifying the document data and a hierarchical structure of the categories; and
a function of extracting, for one category of the plurality of categories, from the word group included in the document data classified into the category, a word whose appearance frequency in the category is higher than its appearance frequency in the other categories in the same hierarchy as the category, and whose appearance frequency in the other categories decreases as their rank becomes farther from the category.