JP6173846B2

JP6173846B2 - Document analyzer

Info

Publication number: JP6173846B2
Application number: JP2013186759A
Authority: JP
Inventors: 後藤　和之; 和之後藤; 秀樹岩崎; 泰成宮部; 松本　茂; 茂松本
Original assignee: Toshiba Corp; Toshiba Digital Solutions Corp
Current assignee: Toshiba Corp; Toshiba Digital Solutions Corp
Priority date: 2013-09-09
Filing date: 2013-09-09
Publication date: 2017-08-02
Anticipated expiration: 2033-09-09
Also published as: JP2015053019A

Description

Embodiments of the present invention relate to a document analysis apparatus.

In recent years, advances in computer performance, larger storage media, and the spread of computer networks have made it possible to collect, store, and use large volumes of electronic documents with computer systems. Valuable knowledge may be buried in such large document collections. For example, the complaint and survey information that customers submit daily about a company's products and services, defect information related to product design and manufacturing, and reputation information on the Internet often contain valuable clues to customer needs and to trends in the occurrence of defects. For this reason, techniques such as automatic document classification, clustering, and text mining have been developed to classify, analyze, and use such unstructured text written in natural language according to its semantic content.

When the goal is to analyze text written in a natural language such as Japanese, conventional analysis methods that rely only on the surface form and frequency of the phrases in the text cannot produce appropriate results that reflect the meaning of the text. In recent years, therefore, analysis methods based on the relations among multiple phrases in a sentence, such as dependency relations, have been devised (see, for example, Patent Documents 1, 2, and 3). With such a method, when analyzing defect information for a product, for example, the relation between the phrase "tank", which denotes a part, and the phrase "crack", which denotes a symptom, is extracted from the sentence "A crack occurred in the tank."

Cross tabulation is one way to present the results of analyzing the text of multiple documents in a form that the user can grasp easily. For two or more analysis axes, it starts from the document sets that correspond to the analysis items of each axis, obtains the subset of documents that corresponds to each combination of analysis items, and displays the number of documents (or a similar value) in a matrix. With the cross tabulation result, the user can grasp the overall content of the document set and examine in detail the correlations among the analysis items. For example, when analyzing defect information for a product, the user selects a "part" axis as one analysis axis and a "symptom" axis as the other. The user can then grasp the overall relationship between the analysis items of the "part" axis, such as "tank", "pipe", and "wiring", and those of the "symptom" axis, such as "crack", "falling off", and "interference". The user can also examine in detail the documents that describe a particular relation, for example defects involving a "crack" in a "tank". Patent Document 1 uses cross tabulation in this way to display characteristic concepts that have a dependency relation.

If the phrases that correspond to each of several analysis axes chosen by the user, and the relations between those phrases, could be extracted without excess or omission from a large volume of text written in natural language, various insights could be obtained from the documents with methods such as the cross tabulation described above. However, the number of phrases in a large document collection, and of combinations of relations between phrases (such as dependency relations), is enormous, and many of them have nothing to do with the analysis axes the user wants. For example, to analyze product defect information along a "part" axis and a "symptom" axis, it is difficult to extract automatically and exhaustively all phrases that correspond to "part" and to "symptom", and many phrases unrelated to either are extracted by mistake.

One option, as described in Patent Documents 1 and 2, is to prepare in advance a dictionary of phrases that correspond to "part" and "symptom" and to extract relations only between known phrases. However, building the dictionary takes considerable effort, and it cannot keep up with the new phrases that appear in documents added every day. Another option, as described in Patent Document 1, is to use rules that extract only the desired relations, for example specific dependency relations. However, unstructured natural-language text is written in diverse expressions, and the combinations of analysis axes that users choose are equally diverse. It is therefore difficult to prepare the necessary rules in advance, and insufficient rules often cause extraction errors and omissions. A third option, as described in Patent Document 3, is to present only important expressions based on the importance of each extracted expression (phrases and their relation). However, Patent Document 3 defines this importance simply as the number of occurrences of the expression divided by the number of documents in the target document subset, so this method cannot correctly extract only, for example, the "part" and "symptom" phrases and the relations between them.

Patent Document 1: JP 2001-75966 A
Patent Document 2: JP 2001-101199 A
Patent Document 3: JP 2004-102397 A

The problem to be solved by the present invention is to provide a document analysis apparatus that can extract, from a large volume of text written in natural language, phrases that correspond to the analysis axes the user has selected for cross tabulation.

A document analysis apparatus according to one aspect of the present invention includes a document storage unit, an analysis axis storage unit, an analysis item generation unit, and a cross tabulation unit. The document storage unit stores a plurality of document data. The analysis axis storage unit stores a plurality of analysis axes, the analysis items of each analysis axis, and the phrases corresponding to each analysis item. The analysis item generation unit receives a first analysis axis and a second analysis axis as input and reads from the analysis axis storage unit a first phrase set, consisting of the phrases corresponding to the analysis items of the first analysis axis, and a second phrase set, consisting of the phrases corresponding to the analysis items of the second analysis axis. It then extracts phrase candidates, which are phrases that co-occur with phrases of the first phrase set in the document data stored in the document storage unit. From these candidates it selects those whose frequency or manner of co-occurrence with phrases of the first phrase set is more similar to that of the phrases in the second phrase set than a predetermined criterion, and it writes new analysis items of the second analysis axis, based on the selected candidates, to the analysis axis storage unit.
The cross tabulation unit reads the analysis items of each of a plurality of analysis axes, and the phrases corresponding to those analysis items, from the analysis axis storage unit. For each combination of the analysis items read for the plurality of analysis axes, it counts the number of document data, among those stored in the document storage unit, that contain the phrases corresponding to the analysis items in the combination, and it displays the counts.
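The candidate selection summarized above can be illustrated with a minimal sketch. It assumes that co-occurrence means appearing in the same document body and that similarity is measured by cosine similarity of co-occurrence counts; both are stand-ins chosen for illustration, not the criterion the patent defines. The function names, sample sentences, and threshold are hypothetical.

```python
from collections import Counter
from math import sqrt


def cooccurrence_profile(phrase, first_set, bodies):
    """For each first-set phrase, count the bodies that contain it together with `phrase`."""
    profile = Counter()
    for body in bodies:
        if phrase in body:
            for anchor in first_set:
                if anchor in body:
                    profile[anchor] += 1
    return profile


def cosine(a, b):
    """Cosine similarity of two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[key] * b[key] for key in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def select_candidates(candidates, first_set, second_set, bodies, threshold):
    """Keep candidates whose profile resembles the pooled profile of the second set."""
    reference = Counter()
    for known in second_set:
        reference.update(cooccurrence_profile(known, first_set, bodies))
    return [
        c for c in candidates
        if cosine(cooccurrence_profile(c, first_set, bodies), reference) >= threshold
    ]


# Hypothetical sample data: "part" phrases form the first set, "symptom" phrases the second.
bodies = [
    "crack in the tank",
    "crack in the pipe",
    "leak from the tank",
    "leak from the pipe",
    "tank inspected yesterday",
]
selected = select_candidates(
    ["leak", "yesterday"], ["tank", "pipe"], ["crack"], bodies, threshold=0.9
)
```

In this toy data, "leak" co-occurs with the part phrases in the same pattern as the known symptom "crack", so it is kept as a new symptom item, while "yesterday" is not.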

FIG. 1 is a block diagram showing the configuration of a document analysis apparatus according to an embodiment of the present invention.
FIG. 2 is a diagram showing examples of document data stored in the document storage unit.
FIG. 3 is a diagram showing examples of analysis axis data stored in the analysis axis storage unit.
FIG. 4 is a diagram showing examples of analysis item data stored in the analysis axis storage unit.
FIG. 5 is a flowchart showing the overall processing flow of the document analysis apparatus.
FIG. 6 is a flowchart showing the flow of the cross tabulation process executed by the cross tabulation unit.
FIG. 7 is a flowchart showing the flow of the analysis item generation process executed by the analysis item generation unit.
FIG. 8 is a flowchart showing the flow of the phrase relation evaluation process executed by the phrase relation evaluation unit of the analysis item generation unit.
FIG. 9 is a diagram showing examples of phrase relations extracted by the phrase relation extraction unit of the analysis item generation unit.
FIG. 10 is a diagram showing an example of a cross tabulation result displayed by the cross tabulation unit, the analysis axes and analysis items operated on through the analysis axis operation unit, and the analysis items generated by the analysis item generation unit.
FIG. 11 is a diagram showing an example of a cross tabulation result displayed by the cross tabulation unit and the analysis axes and analysis items operated on through the analysis axis operation unit.

Hereinafter, embodiments of the present invention will be described with reference to the drawings.
FIG. 1 is a block diagram showing the configuration of a document analysis apparatus 100 according to an embodiment of the present invention. As shown in FIG. 1, the document analysis apparatus 100 includes a document storage unit 1, an analysis axis storage unit 2, a cross tabulation unit 3, an analysis axis operation unit 4, and an analysis item generation unit 5.

The document storage unit 1 stores a plurality of document data. Each document data represents a document that the document analysis apparatus 100 analyzes. The analysis axis storage unit 2 stores analysis axis data and analysis item data, which represent, respectively, the analysis axes and the analysis items created to analyze the documents. Each analysis axis has one or more analysis items. The document storage unit 1 and the analysis axis storage unit 2 can be implemented with the storage means of a conventional computer, such as a file system or a database device.

The cross tabulation unit 3 cross-tabulates the set of document data stored in the document storage unit 1 using a plurality of the analysis axes represented by the analysis axis data stored in the analysis axis storage unit 2, and presents the result to the user in a display such as the one shown in FIG. 10, described later. For this presentation, general-purpose means are often used, such as a browser that communicates over the Internet and is shown on the display of a personal computer. Specifically, the cross tabulation unit 3 reads the analysis items belonging to each of the plurality of analysis axes based on the analysis axis data and analysis item data stored in the analysis axis storage unit 2. For each combination of the analysis items read for the respective analysis axes, it counts the number of document data classified under all of those analysis items at the same time and presents information representing the counts to the user. The analysis axis operation unit 4 lets the user perform operations such as creation, deletion, and movement on the analysis axes and analysis items represented by the analysis axis data and analysis item data stored in the analysis axis storage unit 2. The user can perform these operations on the screen that displays the cross tabulation result, as shown in FIG. 10, described later.
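The counting performed by the cross tabulation unit 3 can be sketched as follows. This is a minimal illustration that assumes documents are matched to an analysis item by plain substring search on the body text; the function name, dictionary layout, and sample documents are hypothetical.

```python
def cross_tabulate(docs, axis1, axis2):
    """Count documents for every pair of analysis items.

    docs:  {document number: body text}
    axisN: {analysis item name: [phrases]}
    """
    def matching(phrases):
        # Documents whose body contains at least one phrase of the item.
        return {doc_id for doc_id, body in docs.items()
                if any(p in body for p in phrases)}

    sets1 = {name: matching(phrases) for name, phrases in axis1.items()}
    sets2 = {name: matching(phrases) for name, phrases in axis2.items()}
    # A cell counts the documents classified under both items at once.
    return {(n1, n2): len(s1 & s2)
            for n1, s1 in sets1.items() for n2, s2 in sets2.items()}


# Hypothetical sample data.
docs = {
    "d01": "A crack occurred in the tank.",
    "d02": "The tank shows a crack near the weld.",
    "d03": "The pipe fell off.",
}
parts = {"tank": ["tank"], "pipe": ["pipe"]}
symptoms = {"crack": ["crack"], "falling off": ["fell off"]}
table = cross_tabulate(docs, parts, symptoms)
```

Each key of `table` is one cell of the matrix, so a display layer only needs to lay the counts out by row and column.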

The analysis item generation unit 5 automatically generates analysis items for a given analysis axis represented by the analysis axis data stored in the analysis axis storage unit 2. In this embodiment, the analysis item generation unit 5 includes a phrase extraction unit 51, a phrase relation extraction unit 52, and a phrase relation evaluation unit 53. The phrase extraction unit 51 extracts phrases from text written in natural language; it can be implemented with conventional techniques such as morphological analysis. The phrase relation extraction unit 52 extracts relations, such as dependency relations, between the phrases extracted from text written in natural language; it can be implemented with conventional techniques such as syntactic parsing. The phrase relation evaluation unit 53 evaluates whether a phrase should be used as an analysis item, based on the relations between the phrases extracted from the document data. Based on this evaluation, the analysis item generation unit 5 generates an analysis item corresponding to the phrase and registers its analysis item data in the analysis axis storage unit 2.

FIG. 2 is a diagram showing examples of document data stored in the document storage unit 1. The document data 200 shown in FIG. 2A includes a document number 201, a report date 202, a target product 203, and a body 204, and the document data 210 shown in FIG. 2B includes a document number 211, a report date 212, a target product 213, and a body 214. The document data 220 shown in FIG. 2C includes a document number 221, a report date 222, a target product 223, and a body 224, and the document data 230 shown in FIG. 2D includes a document number 231, a report date 232, a target product 233, and a body 234. The document data 240 shown in FIG. 2E includes a document number 241, a report date 242, a target product 243, and a body 244, and the document data 250 shown in FIG. 2F includes a document number 251, a report date 252, a target product 253, and a body 254.

The document numbers 201, 211, 221, 231, 241, and 251 are unique data that identify the document data. The bodies 204, 214, 224, 234, 244, and 254 are examples of text data that depend on the type of document. This text is written in a natural language such as Japanese and is the main target of analysis by the document analysis apparatus 100. The report dates 202, 212, 222, 232, 242, and 252 and the target products 203, 213, 223, 233, 243, and 253 are attribute data that the document data 200, 210, 220, 230, 240, and 250 have because they represent documents of product defect information. The report dates indicate when each defect was reported, and the target products indicate the product for which the defect was reported.
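One way to hold such a record in memory is sketched below. The four fields follow the description of FIG. 2; the class name, field names, and all sample values are assumptions made for illustration, since the figure contents are not reproduced in the text.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DocumentData:
    doc_number: str      # unique identifier of the document data
    report_date: date    # attribute: when the defect was reported
    target_product: str  # attribute: product the defect was reported for
    body: str            # natural-language text, the main analysis target


# Hypothetical sample record; none of these values come from the figures.
sample = DocumentData(
    doc_number="d01",
    report_date=date(2013, 4, 1),
    target_product="Product A",
    body="A crack occurred in the tank.",
)
```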

FIG. 3 is a diagram showing examples of analysis axis data stored in the analysis axis storage unit 2. The analysis axis data 300 shown in FIG. 3A includes an analysis axis number 301 and a name 302, and the analysis axis data 310 shown in FIG. 3B includes an analysis axis number 311 and a name 312. The analysis axis numbers 301 and 311 are unique data with which the document analysis apparatus 100 identifies analysis axis data. The names 302 and 312 are data indicating the names of the analysis axes. The analysis axis data 300 shown in FIG. 3A is an example of an analysis axis for analyzing the content of the defect-information document data set of FIG. 2 from the viewpoint of "part", and the analysis axis data 310 shown in FIG. 3B is an example of an analysis axis for analyzing it from the viewpoint of "symptom". In the following, the analysis axis number set in the analysis axis data for analyzing the content of the document data set from the viewpoint of an analysis axis a is written, for example, as "the analysis axis number of analysis axis a".

FIG. 4 is a diagram showing examples of analysis item data stored in the analysis axis storage unit 2. The analysis item data 400 shown in FIG. 4A includes an analysis item number 401, an analysis axis 402, a name 403, a phrase 404, and a document 405, and the analysis item data 410 shown in FIG. 4B includes an analysis item number 411, an analysis axis 412, a name 413, a phrase 414, and a document 415. The analysis item numbers 401 and 411 are unique data with which the document analysis apparatus 100 identifies analysis item data.

The analysis axes 402 and 412 are data that identify the analysis axis to which the analysis item belongs, expressed as the analysis axis number of that axis. Because the value of the analysis axis 402 is "p01", the analysis item data 400 of FIG. 4A represents an analysis item of the analysis axis indicated by the analysis axis data 300 of FIG. 3A. Similarly, because the value of the analysis axis 412 is "p02", the analysis item data 410 of FIG. 4B represents an analysis item of the analysis axis indicated by the analysis axis data 310 of FIG. 3B.

The names 403 and 413 are data indicating the names of the analysis items. The phrases 404 and 414 are data indicating the phrases that determine, in processing such as the cross tabulation described later, the set of document data corresponding to the analysis item, that is, the document data classified under it. In the document data example of FIG. 2, the set of document data whose text, namely the body, contains a phrase indicated by the phrase 404 or 414 is the document data set corresponding to that analysis item. In the example of the analysis item data 400 shown in FIG. 4A, the document 405 holds the document numbers "d01", "d02", "d03", and so on, of the document data whose body contains "tank", the phrase set in the phrase 404. The phrase 414 of the analysis item data 410 shown in FIG. 4B holds two phrases, "亀裂" and "き裂", which are different written forms of the same word, "crack". In this way, several differently written phrases with the same meaning may be set as the phrases of one analysis item. The document 415 holds the document numbers of the document data whose body contains "亀裂" or "き裂", the phrases set in the phrase 414.
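The relation between an analysis item's phrases and its document set can be sketched as below. The record layout mirrors the five fields described for FIG. 4; the class and function names, the item number "c11", and the sample bodies are hypothetical, and matching is assumed to be a plain substring test on the body.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class AnalysisItem:
    item_number: str       # unique identifier of the analysis item data
    axis_number: str       # analysis axis the item belongs to, e.g. "p02"
    name: str
    phrases: list[str]     # written variants that select documents
    doc_numbers: list[str] = field(default_factory=list)


def assign_documents(item: AnalysisItem, docs: dict[str, str]) -> AnalysisItem:
    """Set doc_numbers to the documents whose body contains any phrase of the item."""
    item.doc_numbers = sorted(
        number for number, body in docs.items()
        if any(phrase in body for phrase in item.phrases)
    )
    return item


# Hypothetical bodies; the two spellings of "crack" select the same item.
docs = {
    "d01": "タンクに亀裂が発生した。",
    "d02": "タンクのき裂を確認した。",
    "d03": "パイプが脱落した。",
}
crack = assign_documents(AnalysisItem("c11", "p02", "亀裂", ["亀裂", "き裂"]), docs)
```

Storing both spellings in one item means that "d01" and "d02" land in the same row or column of a cross tabulation.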

FIG. 5 is a flowchart showing the overall flow of the processing executed by the document analysis apparatus 100. The document analysis apparatus 100 accepts user operations on analysis axes and analysis items in steps S501 to S515 and ends the processing when it receives a termination request in step S515. The processing shown in FIG. 5 is executed mainly by the analysis axis operation unit 4, but step S512 is executed by the cross tabulation unit 3 and step S514 by the analysis item generation unit 5. In steps S502, S504, S506, S508, S510, and S514, the analysis axis operation unit 4 or the analysis item generation unit 5 creates, deletes, or moves analysis axes or analysis items and, as a result, changes the analysis axis data and analysis item data (examples of which are shown in FIGS. 3 and 4) stored in the analysis axis storage unit 2.

When the analysis axis operation unit 4 receives from the user an operation to create a new analysis axis (YES in step S501), it creates the analysis axis data of a new analysis axis p and writes it to the analysis axis storage unit 2 (step S502). For example, the analysis axis operation unit 4 sets, in the created analysis axis data, a newly assigned analysis axis number and the name of the analysis axis p entered by the user. The analysis axis operation unit 4 then repeats the processing from step S501.

When the analysis axis operation unit 4 receives from the user an operation to delete an analysis axis (NO in step S501, YES in step S503), it deletes from the analysis axis storage unit 2 the analysis axis data of the analysis axis p specified by the user and the analysis item data of all analysis items of that axis (step S504). Specifically, the analysis axis operation unit 4 identifies the analysis axis data of the analysis axis p and reads its analysis axis number, then deletes that analysis axis data from the analysis axis storage unit 2 together with all analysis item data in which the analysis axis number of the analysis axis p is set. The analysis axis operation unit 4 then repeats the processing from step S501.

When the analysis axis operation unit 4 receives from the user an operation to create an analysis item (NO in steps S501 and S503, YES in step S505), it creates a new analysis item c on the analysis axis p specified by the user and makes the set of document data containing the phrase t specified by the user the document set corresponding to the analysis item c (step S506). Specifically, the analysis axis operation unit 4 creates new analysis item data for the analysis item c, writes it to the analysis axis storage unit 2, and sets in it a newly assigned analysis item number and the analysis axis number of the analysis axis p. It also sets in this analysis item data the name of the analysis item c and the phrase t entered by the user, and the document numbers of the document data whose body contains the phrase t. The analysis axis operation unit 4 then repeats the processing from step S501.

When the analysis axis operation unit 4 receives from the user an operation to delete an analysis item (NO in steps S501, S503, and S505, YES in step S507), it deletes the analysis item data of the analysis item c specified by the user from the analysis axis storage unit 2 (step S508). The analysis axis operation unit 4 then repeats the processing from step S501.

When the analysis axis operation unit 4 receives from the user an operation to move an analysis item (NO in steps S501, S503, S505, and S507, YES in step S509), it moves the analysis item c specified by the user from its original analysis axis p1 to the analysis axis p2 specified by the user (step S510). Specifically, the analysis axis operation unit 4 rewrites the analysis axis number of the analysis axis p1, currently set in the analysis item data of the analysis item c, to the analysis axis number of the destination analysis axis p2. The analysis axis operation unit 4 then repeats the processing from step S501.

When the analysis axis operation unit 4 receives from the user a request to execute cross tabulation (NO in steps S501, S503, S505, S507, and S509, YES in step S511), the cross tabulation unit 3 executes cross tabulation on the analysis axes p1 and p2 specified by the user and displays the result (step S512). The details of this processing are described later with reference to FIG. 6. The analysis axis operation unit 4 then repeats the processing from step S501.

When the analysis axis operation unit 4 receives from the user a request to generate analysis items (NO in steps S501, S503, S505, S507, S509, and S511, YES in step S513), the analysis item generation unit 5 generates analysis items for the analysis axes p1 and p2 specified by the user (step S514). The details of this processing are described later with reference to FIGS. 7 and 8. The analysis axis operation unit 4 then repeats the processing from step S501.

分析軸操作部４は、ユーザから分析軸の作成・削除の操作、分析項目の作成・削除・移動の操作、クロス集計の実行要求、あるいは、分析項目の生成要求以外の操作を受ける（ステップＳ５０１、ステップＳ５０３、ステップＳ５０５、ステップＳ５０７、ステップＳ５０９、ステップＳ５１１、ステップＳ５１３−ＮＯ）。分析軸操作部４は、その操作が終了の要求以外であれば（ステップＳ５１５−ＮＯ）、ステップＳ５０１からの処理を繰り返し、終了の要求であれば（ステップＳ５１５−ＹＥＳ）、処理を終了する。 The analysis axis operation unit 4 may also receive from the user an operation other than creating or deleting an analysis axis, creating, deleting, or moving an analysis item, a cross tabulation execution request, or an analysis item generation request (step S501, step S503, step S505, step S507, step S509, step S511, step S513-NO). If that operation is not an end request (step S515-NO), the analysis axis operation unit 4 repeats the processing from step S501; if it is an end request (step S515-YES), the analysis axis operation unit 4 ends the processing.
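The branching of FIG. 5 can be read as a dispatch loop: receive an operation, run the matching handler, and return to step S501 until an end request arrives. The sketch below is only an illustration under that reading; the operation dictionaries, the `kind` strings, and the handler table are all hypothetical.

```python
def run_operation_loop(operations, handlers):
    """Dispatch user operations until an end request is seen (steps S501 to S515)."""
    results = []
    for op in operations:                 # S501: receive the next operation
        kind = op["kind"]
        if kind == "end":                 # S515-YES: stop processing
            break
        handler = handlers.get(kind)
        if handler is not None:           # e.g. S510, S512, S514
            results.append(handler(op))
        # Any other operation is ignored and the loop returns to S501 (S515-NO).
    return results


# Hypothetical handlers that only report which step would run.
handlers = {
    "move_item": lambda op: ("S510", op["item"], op["to_axis"]),
    "cross_tabulate": lambda op: ("S512", op["axes"]),
    "generate_items": lambda op: ("S514", op["axes"]),
}
operations = [
    {"kind": "move_item", "item": "crack", "to_axis": "symptom"},
    {"kind": "resize_window"},            # not handled, skipped
    {"kind": "cross_tabulate", "axes": ("part", "symptom")},
    {"kind": "end"},
    {"kind": "generate_items", "axes": ("symptom", "part")},  # never reached
]
results = run_operation_loop(operations, handlers)
```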

図６は、クロス集計部３によって実行されるクロス集計処理の流れを表すフローチャートであり、前述の図５のステップＳ５１２における詳細な処理を示す。クロス集計の対象とする分析軸は、図５のステップＳ５１２にて、ユーザによって指定された２つの分析軸ｐ１と分析軸ｐ２である。 FIG. 6 is a flowchart showing the flow of the cross tabulation process executed by the cross tabulation unit 3, and shows the detailed processing of step S512 of FIG. 5 described above. The analysis axes subject to cross tabulation are the two analysis axes p1 and p2 designated by the user in step S512 of FIG. 5.

まず、クロス集計部３は、分析軸ｐ１の分析項目の集合を分析項目集合Ｃ１とし、分析軸ｐ２の分析項目の集合を分析項目集合Ｃ２とする（ステップＳ６０１）。具体的には、クロス集計部３は、ユーザからの入力を受けた分析軸ｐ１の分析軸番号が設定されている分析項目データを特定し、特定した各分析項目データが表す分析項目ｃ１ｉ（ｉ＝１，２，…）の集合を分析項目集合Ｃ１とする。また、クロス集計部３は、ユーザからの入力を受けた分析軸ｐ２の分析軸番号が設定されている分析項目データを特定し、特定した各分析項目データが表す分析項目ｃ２ｊ（ｊ＝１，２，…）の集合を分析項目集合Ｃ２とする。 First, the cross tabulation unit 3 sets a set of analysis items on the analysis axis p1 as an analysis item set C1, and sets a set of analysis items on the analysis axis p2 as an analysis item set C2 (step S601). Specifically, the cross tabulation unit 3 identifies the analysis item data in which the analysis axis number of the analysis axis p1 received from the user is set, and the analysis item c1i (i represented by each identified analysis item data = 1, 2,...) Is an analysis item set C1. Further, the cross tabulation unit 3 identifies the analysis item data in which the analysis axis number of the analysis axis p2 received from the user is set, and the analysis item c2j (j = 1, 1) represented by each identified analysis item data 2,...) Is an analysis item set C2.

クロス集計部３は、分析項目集合Ｃ１中の分析項目ｃ１ｉを全て選択するまで、ｉを１から順に１ずつ増加させて選択した分析項目ｃ１ｉについて、ステップＳ６０３からステップＳ６０７までの処理を繰り返し行う（ステップＳ６０２−繰り返し継続）。 The cross tabulation unit 3 selects the analysis items c1i in the analysis item set C1 one at a time, increasing i by 1 starting from 1, and repeats the processing from step S603 to step S607 for each selected analysis item c1i until all of them have been selected (step S602: continue repetition).

クロス集計部３は、分析項目ｃ１ｉの分析項目データに設定されている語句をｔ１ｉとする（ステップＳ６０３）。例えば、前述の図４（ａ）の分析項目データ４００であれば、語句４０４に設定されている語句「タンク」が語句ｔ１ｉとなる。クロス集計部３は、分析項目集合Ｃ２中の分析項目ｃ２ｊを全て選択するまで、ｊを１から順に１ずつ増加させて選択した分析項目ｃ２ｊについて、ステップＳ６０５からステップＳ６０７までの処理を繰り返し行う（ステップＳ６０４−繰り返し継続）。 The cross tabulation unit 3 sets the word / phrase set in the analysis item data of the analysis item c1i as t1i (step S603). For example, in the case of the analysis item data 400 in FIG. 4A described above, the phrase “tank” set in the phrase 404 becomes the phrase t1i. The cross tabulation unit 3 repeatedly performs the processing from step S605 to step S607 on the selected analysis item c2j by sequentially incrementing j from 1 until it selects all the analysis items c2j in the analysis item set C2. Step S604-Repeatedly continue).

クロス集計部３は、分析項目ｃ２ｊの分析項目データに設定されている語句をｔ２ｊとする（ステップＳ６０５）。クロス集計部３は、語句ｔ１ｉと語句ｔ２ｊを共に含む文書データ集合Ｄ（ｔ１ｉ，ｔ２ｊ）を求める（ステップＳ６０６）。例えば、語句ｔ１ｉが「タンク」であり、語句ｔ２ｊが「亀裂」である場合、クロス集計部３は、「タンク」と「亀裂」を共に本文に含む文書データ（例えば図２（ａ）に示す文書データ２００）の集合を文書データ集合Ｄ（ｔ１ｉ，ｔ２ｊ）とする。クロス集計部３は、クロス集計結果のｉ行ｊ列目の値を、この文書データ集合Ｄ（ｔ１ｉ，ｔ２ｊ）に含まれる文書データの数である文書数｜Ｄ（ｔ１ｉ，ｔ２ｊ）｜とする（ステップＳ６０７）。 The cross tabulation unit 3 sets the word / phrase set in the analysis item data of the analysis item c2j as t2j (step S605). The cross tabulation unit 3 obtains a document data set D (t1i, t2j) that includes both the phrase t1i and the phrase t2j (step S606). For example, when the word t1i is “tank” and the word t2j is “crack”, the cross tabulation unit 3 includes document data including both “tank” and “crack” in the text (for example, as shown in FIG. 2A). A set of document data 200) is set as a document data set D (t1i, t2j). The cross tabulation unit 3 sets the value of the i-th row and j-th column of the cross tabulation result as the number of documents | D (t1i, t2j) | that is the number of document data included in the document data set D (t1i, t2j). (Step S607).

クロス集計部３は、ステップＳ６０４の処理に戻り、分析項目集合Ｃ２中の全ての分析項目ｃ２ｊを選択していない場合は現在のｊの値を１増加させてステップＳ６０５からステップＳ６０７までの処理を繰り返す（ステップＳ６０４−繰り返し継続）。そして、全ての分析項目ｃ２ｊを選択すると、クロス集計部３は繰り返し処理を終了し（ステップＳ６０４−繰り返し終了）、ステップＳ６０２の処理に戻る。 The cross tabulation unit 3 returns to step S604 and, if not all analysis items c2j in the analysis item set C2 have been selected, increases the current value of j by 1 and repeats the processing from step S605 to step S607 (step S604: continue repetition). When all analysis items c2j have been selected, the cross tabulation unit 3 ends the repetition (step S604: end repetition) and returns to step S602.

クロス集計部３は、ステップＳ６０２の処理に戻ると、分析項目集合Ｃ１中の全ての分析項目ｃ１ｉを選択していない場合は現在のｉの値に１を加算してステップＳ６０３からステップＳ６０７までの処理を繰り返す（ステップＳ６０２−繰り返し継続）。そして、分析項目集合Ｃ１中の全ての分析項目ｃ１ｉを選択すると、クロス集計部３は繰り返し処理を終了する（ステップＳ６０２−繰り返し終了）。クロス集計部３は、分析軸ｐ１の分析項目集合Ｃ１と分析軸ｐ２の分析項目集合Ｃ２の各々の分析項目の組み合わせに対応するクロス集計の結果をディスプレイに表示するなどしてユーザに提示する（ステップＳ６０８）。例えば、クロス集計部３は、分析項目集合Ｃ１に含まれる分析項目ｃ１ｉの数以上を行数とし、分析項目集合Ｃ２に含まれる分析項目ｃ２ｊの数以上を列数とするマトリックスのｉ行ｊ列に｜Ｄ（ｔ１ｉ，ｔ２ｊ）｜を表す情報を表示する。さらに、クロス集計部３は、このマトリックスのｉ行の見出しに分析項目ｃ１ｉの分析項目データに設定されている名称を表示し、ｊ列の見出しに分析項目ｃ２ｊの分析項目データに設定されている名称を表示する。また、クロス集計部３は、各行の分析項目をまとめた見出しに分析軸ｐ１の分析軸データに設定されている名称を表示し、各列の分析項目をまとめた見出しに分析軸ｐ２の分析軸データに設定されている名称を表示する。 When the processing returns to step S602 and not all analysis items c1i in the analysis item set C1 have been selected, the cross tabulation unit 3 adds 1 to the current value of i and repeats the processing from step S603 to step S607 (step S602: continue repetition). When all analysis items c1i in the analysis item set C1 have been selected, the cross tabulation unit 3 ends the repetition (step S602: end repetition). The cross tabulation unit 3 then presents to the user, for example by showing it on a display, the cross tabulation result for each combination of an analysis item in the analysis item set C1 of the analysis axis p1 and an analysis item in the analysis item set C2 of the analysis axis p2 (step S608). For example, the cross tabulation unit 3 displays information representing |D(t1i, t2j)| in row i, column j of a matrix that has at least as many rows as there are analysis items c1i in the analysis item set C1 and at least as many columns as there are analysis items c2j in the analysis item set C2. The cross tabulation unit 3 also displays, as the heading of row i, the name set in the analysis item data of the analysis item c1i and, as the heading of column j, the name set in the analysis item data of the analysis item c2j. In addition, the cross tabulation unit 3 displays the name set in the analysis axis data of the analysis axis p1 in the heading that groups the analysis items of the rows, and the name set in the analysis axis data of the analysis axis p2 in the heading that groups the analysis items of the columns.

なお、ステップＳ６０６の処理では単純に、２つの語句を共に含む文書データ集合を求めるとしたが、図４（ｂ）に示す分析項目データ４１０の語句４１４のように、１つの分析項目に対応する語句が複数の場合（例えば、「亀裂」と「き裂」の２つがある場合）がある。その場合、クロス集計部３は、これら複数の語句のうち少なくとも１つを含み、かつ、他方の分析軸の分析項目の語句を含む文書データ集合を求めるようにする。例えば、語句ｔ１ｉが「タンク」であり、語句ｔ２ｊが「亀裂」、「き裂」である場合、クロス集計部３は、「タンク」と「亀裂」を共に本文に含む文書データ、および、「タンク」と「き裂」を共に本文に含む文書データの集合を文書データ集合Ｄ（ｔ１ｉ，ｔ２ｊ）とする。また、語句ｔ１ｉ、語句ｔ２ｊとも複数の語句であれば、クロス集計部３は、語句ｔ１ｉと語句ｔ２ｊの全ての組み合わせを生成し、生成したいずれかの組み合わせの語句ｔ１ｉと語句ｔ２ｊを共に本文に含む文書データの集合を文書データ集合Ｄ（ｔ１ｉ，ｔ２ｊ）とする。 In the process of step S606, a document data set including both of two words / phrases is simply obtained. However, as shown in the word / phrase 414 of the analysis item data 410 shown in FIG. 4 (b), it corresponds to one analysis item. There are cases where there are a plurality of words (for example, there are two cases of “crack” and “crack”). In this case, the cross tabulation unit 3 obtains a document data set including at least one of the plurality of words and including the word of the analysis item of the other analysis axis. For example, when the word t1i is “tank” and the word t2j is “crack” or “crack”, the cross tabulation unit 3 includes document data including both “tank” and “crack” in the text, and “ A set of document data including both “tank” and “crack” in the text is a document data set D (t1i, t2j). If both the phrase t1i and the phrase t2j are a plurality of phrases, the cross tabulation unit 3 generates all combinations of the phrase t1i and the phrase t2j, and both the generated phrase t1i and the phrase t2j are included in the text. A set of document data to be included is a document data set D (t1i, t2j).
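The nested loops of steps S602 to S607, including the case where one analysis item carries several phrases, can be sketched as follows. This is an illustration only: documents are modeled as a hypothetical `{document number: body}` dictionary, phrase matching is plain substring search, and the English strings "crack" and "fracture" stand in for the two spelling variants in the example.

```python
def docs_containing(docs, phrases):
    """Document numbers whose body contains at least one of the phrases."""
    return {no for no, body in docs.items() if any(p in body for p in phrases)}


def cross_tabulate(docs, items1, items2):
    """Return a matrix whose cell (i, j) is |D(t1i, t2j)|.

    items1 and items2 are lists of (name, phrases) pairs for axes p1 and p2.
    """
    table = []
    for _, phrases1 in items1:                      # S602: loop over C1
        d1 = docs_containing(docs, phrases1)        # S603: phrases t1i
        row = []
        for _, phrases2 in items2:                  # S604: loop over C2
            d2 = docs_containing(docs, phrases2)    # S605: phrases t2j
            row.append(len(d1 & d2))                # S606, S607: co-occurrence count
        table.append(row)
    return table


docs = {
    1: "a crack occurred in the tank",
    2: "the pipe has a crack",
    3: "the tank cover dropped off",
    4: "a fracture was found in the tank",
}
parts = [("tank", ["tank"])]
symptoms = [("crack", ["crack", "fracture"]), ("dropout", ["dropped off"])]
table = cross_tabulate(docs, parts, symptoms)
```

Intersecting the two "contains at least one phrase" sets gives the same result as enumerating every phrase combination and taking the union, which is why no explicit combination loop appears.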

さらに、このステップＳ６０６の処理を変形し、クロス集計部３は、２つの語句が後述する所定の関係（係り受け関係など）を持つような文書データに限るようにして文書データ集合Ｄ（ｔ１ｉ，ｔ２ｊ）を求めてもよい。また、ステップＳ６０７ではクロス集計結果のｉ行ｊ列目を文書データ集合の文書データ数としたが、この値は文書データ数に限らず、例えば、全文書データ集合に対する比率（パーセンテージ）としてもよく、その画面上の表示形式についても、例えばバブルチャートなどを用いてもよい。例えば、クロス集計部３は、このようにしてクロス集計を行った結果を、図１０に示すような形で表示し、ユーザに提示する。 Further, the processing of step S606 is modified so that the cross tabulation unit 3 limits the document data set D (t1i, 2) so that the two words are limited to document data having a predetermined relationship (such as dependency relationship) described later. t2j) may be determined. In step S607, the i-th row and j-th column of the cross tabulation result are set as the number of document data in the document data set. However, this value is not limited to the number of document data, and may be, for example, a ratio (percentage) to the total document data set. As the display format on the screen, for example, a bubble chart may be used. For example, the cross tabulation unit 3 displays the result of cross tabulation in this way in the form shown in FIG. 10 and presents it to the user.
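As a small illustration of the ratio variant mentioned for step S607, the counts can be converted to percentages of the whole document set. The function name and the choice to return 0.0 for an empty document set are assumptions made for this sketch.

```python
def to_percentages(table, total_docs):
    """Convert a matrix of document counts into percentages of all documents."""
    if total_docs == 0:
        return [[0.0 for _ in row] for row in table]
    return [[100.0 * count / total_docs for count in row] for row in table]


percent_table = to_percentages([[2, 1]], total_docs=4)
```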

図１０は、クロス集計部３によって表示されるクロス集計の結果と、分析軸操作部４によって操作される分析軸と分析項目、および、分析項目生成部５によって生成される分析項目の例を示す図である。図１０（ａ）は、分析軸「部品」１００１と分析軸「症状」１００２を対象としたクロス集計の結果を示している。分析軸「部品」１００１の分析項目は「タンク」１００３であり、分析軸「症状」１００２の分析項目は「亀裂」１００４、および「脱落」１００５である。そして、異なる軸の２つの分析項目に対応する文書データ集合の文書数が、バブルチャートの円の面積によって表現されている。例えば、分析軸「部品」１００１の分析項目「タンク」１００３と、分析軸「症状」１００２の分析項目「亀裂」１００４の２つの分析項目に対応する文書データ集合の文書数が、バブルチャートの円１００６の面積によって表現されている。 FIG. 10 is a diagram showing examples of the cross tabulation result displayed by the cross tabulation unit 3, the analysis axes and analysis items operated on through the analysis axis operation unit 4, and the analysis items generated by the analysis item generation unit 5. FIG. 10A shows the result of cross tabulation for the analysis axis "component" 1001 and the analysis axis "symptom" 1002. The analysis item of the analysis axis "component" 1001 is "tank" 1003, and the analysis items of the analysis axis "symptom" 1002 are "crack" 1004 and "dropout" 1005. The number of documents in the document data set corresponding to two analysis items on different axes is expressed by the area of a circle in the bubble chart. For example, the number of documents in the document data set corresponding to the analysis item "tank" 1003 of the analysis axis "component" 1001 and the analysis item "crack" 1004 of the analysis axis "symptom" 1002 is expressed by the area of the bubble chart circle 1006.

図７は、分析項目生成部５によって実行される分析項目生成処理の流れを表すフローチャートであり、前述の図５のステップＳ５１４における詳細な処理を示す。この処理の対象とする分析軸は、図５のステップＳ５１４において、ユーザによって指定された２つの分析軸ｐ１と分析軸ｐ２であり、分析軸ｐ１はユーザが指定した所定の分析項目を持つ軸で、分析軸ｐ２が分析項目の生成対象となる軸である。 FIG. 7 is a flowchart showing the flow of analysis item generation processing executed by the analysis item generation unit 5, and shows detailed processing in step S514 of FIG. 5 described above. The analysis axes to be processed are the two analysis axes p1 and p2 specified by the user in step S514 of FIG. 5, and the analysis axis p1 is an axis having a predetermined analysis item specified by the user. The analysis axis p2 is an axis that is an analysis item generation target.

分析項目生成部５は、ユーザによってクロス集計部３で表示されるクロス集計の結果で示された分析項目に関して指示された分析軸ｐ１と分析軸ｐ２を設定すると、分析軸ｐ１の分析項目の集合を分析項目集合Ｃ１とし、分析軸ｐ２の分析項目の集合を分析項目集合Ｃ２とする。具体的には、分析項目生成部５は、分析軸ｐ１の分析軸番号が設定されている分析項目データを特定し、特定した各分析項目データが表す分析項目の集合を分析項目集合Ｃ１とする。また、分析項目生成部５は、分析軸ｐ２の分析軸番号が設定されている分析項目データを特定し、特定した各分析項目データが表す分析項目の集合を分析項目集合Ｃ２とする。 When the analysis item generation unit 5 has set the analysis axis p1 and the analysis axis p2 that the user designated with respect to the analysis items shown in the cross tabulation result displayed by the cross tabulation unit 3, it takes the set of analysis items of the analysis axis p1 as the analysis item set C1 and the set of analysis items of the analysis axis p2 as the analysis item set C2. Specifically, the analysis item generation unit 5 identifies the analysis item data in which the analysis axis number of the analysis axis p1 is set, and takes the set of analysis items represented by the identified analysis item data as the analysis item set C1. Likewise, the analysis item generation unit 5 identifies the analysis item data in which the analysis axis number of the analysis axis p2 is set, and takes the set of analysis items represented by the identified analysis item data as the analysis item set C2.

さらに、分析項目生成部５は、分析項目集合Ｃ１の各分析項目に対応する語句の集合を語句集合Ｔ１とし、分析項目集合Ｃ２の各分析項目に対応する語句の集合を語句集合Ｔ２とする。分析項目に対応する語句とは、その分析項目の分析項目データに設定されている語句である。つまり、分析項目生成部５は、分析項目集合Ｃ１に含まれる各分析項目の分析項目データから読み出した語句の集合を語句集合Ｔ１とし、分析項目集合Ｃ２に含まれる各分析項目の分析項目データから読み出した語句の集合を語句集合Ｔ２とする。 Further, the analysis item generation unit 5 sets a set of words and phrases corresponding to each analysis item in the analysis item set C1 as a phrase set T1, and sets a set of words and phrases corresponding to each analysis item in the analysis item set C2 as a phrase set T2. The phrase corresponding to the analysis item is a phrase set in the analysis item data of the analysis item. That is, the analysis item generation unit 5 sets a phrase set read from the analysis item data of each analysis item included in the analysis item set C1 as a phrase set T1, and uses the analysis item data of each analysis item included in the analysis item set C2. A set of read phrases is referred to as a phrase set T2.

また、分析軸の分析項目は、図５のフローチャートで説明したように、ユーザの操作によって削除または移動が行われることがある。そこで、分析項目生成部５は、図５のステップＳ５０３またはステップＳ５０９において分析項目集合Ｃ２から削除されたことのある各分析項目に対応する語句の集合を削除語句集合Ｔ２ｒｍｖとする（ステップＳ７０１）。つまり、分析項目生成部５は、分析軸ｐ２から削除された、あるいは、分析軸ｐ２から他の分析軸へ移動した各分析項目の分析項目データに設定されていた語句の集合を削除語句集合Ｔ２ｒｍｖとする。 Further, as described in the flowchart of FIG. 5, the analysis item of the analysis axis may be deleted or moved by a user operation. Therefore, the analysis item generation unit 5 sets a set of words / phrases corresponding to each analysis item that has been deleted from the analysis item set C2 in step S503 or step S509 in FIG. 5 as a deleted word / phrase set T2rmv (step S701). That is, the analysis item generation unit 5 deletes a set of words / phrases set in the analysis item data of each analysis item deleted from the analysis axis p2 or moved from the analysis axis p2 to another analysis axis. And

分析項目生成部５は、分析軸ｐ２に新規に生成する分析項目に対応する語句の候補の集合である語句候補集合Ｔ２ｎｅｗを求める（ステップＳ７０２）。語句候補集合Ｔ２ｎｅｗに含まれる語句ｔ２ｎｅｗは、語句抽出部５１が文書データの本文から抽出した語句のうち、語句集合Ｔ１にも、語句集合Ｔ２にも、削除語句集合Ｔ２ｒｍｖにも含まれない語句であることを条件とする。また、語句集合Ｔ１に含まれるいずれかの語句と、語句ｔ２ｎｅｗとを本文に共に含む文書データの集合を文書データ集合Ｄ（Ｔ１，ｔ２ｎｅｗ）としたとき、この文書データ集合Ｄ（Ｔ１，ｔ２ｎｅｗ）が空集合でないことも、語句ｔ２ｎｅｗの条件とする。言い換えれば、語句ｔ２ｎｅｗは、文書データにおいて語句集合Ｔ１のいずれかの語句と共起する語句、すなわち、語句集合Ｔ１のいずれかの語句を含む文書データの集合Ｄ（Ｔ１）から抽出された語句である。 The analysis item generation unit 5 obtains a phrase candidate set T2new, which is a set of phrase candidates corresponding to the analysis item newly generated on the analysis axis p2 (step S702). The phrase t2new included in the phrase candidate set T2new is a phrase that is not included in the phrase set T1, the phrase set T2, or the deleted phrase set T2rmv among the phrases extracted by the phrase extraction unit 51 from the text of the document data. Subject to being. Further, when a document data set D (T1, t2new) is a document data set that includes any one of the phrases included in the phrase set T1 and the phrase t2new in the body, this document data set D (T1, t2new) Is also a condition for the phrase t2new. In other words, the phrase t2new is a phrase extracted from the document data set D (T1) including any phrase in the phrase set T1, that is, a phrase that co-occurs with any phrase in the phrase set T1 in the document data. is there.
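The two conditions on a candidate phrase t2new in step S702 (not already in T1, T2, or T2rmv, and co-occurring with some phrase of T1) can be sketched as a set filter. The sketch assumes the phrase extraction step has already produced a set of extracted phrases, and it uses substring search in place of the phrase extraction unit 51; both are simplifications for illustration.

```python
def candidate_phrases(docs, extracted, t1, t2, t2_removed):
    """Return T2new: extracted phrases that satisfy both conditions of step S702."""
    excluded = set(t1) | set(t2) | set(t2_removed)
    # D(T1): bodies that contain at least one phrase of T1.
    d_t1 = [body for body in docs.values() if any(p in body for p in t1)]
    return {
        t for t in extracted
        if t not in excluded and any(t in body for body in d_t1)
    }


docs = {
    1: "a crack occurred in the tank",
    2: "the pipe has a crack",
    3: "the pedal dropped off",
    4: "a crack was found near the weld of the cover",
}
extracted = {"tank", "crack", "pipe", "pedal", "cover", "weld"}
t2_new = candidate_phrases(
    docs, extracted, t1={"crack"}, t2={"tank"}, t2_removed={"cover"}
)
```

Here "pedal" is dropped because it never appears with "crack", and "cover" is dropped because the user had previously removed it from the axis.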

続いて、分析項目生成部５は、語句候補集合Ｔ２ｎｅｗ中の語句ｔ２ｎｅｗを全て選択するまで未選択の語句ｔ２ｎｅｗを１つずつ選択し、選択した語句ｔ２ｎｅｗについて以下のステップＳ７０４からステップＳ７０６までの処理を繰り返し実行する（ステップＳ７０３−繰り返し継続）。 Subsequently, the analysis item generation unit 5 selects the not-yet-selected phrases t2new in the phrase candidate set T2new one at a time and, until all of them have been selected, repeats the processing from step S704 to step S706 below for each selected phrase t2new (step S703: continue repetition).

まず、分析項目生成部５は、語句ｔ２ｎｅｗと、語句集合Ｔ２に含まれる語句との類似性によるスコアｓｃｒ（Ｔ１，Ｔ２，ｔ２ｎｅｗ）を求める（ステップＳ７０４）。このスコアの算出処理については、後述する図８にて詳細に説明するが、スコアが大きいほど、新規に生成する分析項目としてより適切であることを示す。分析項目生成部５は、ステップＳ７０４において算出されたスコアが所定の閾値未満であれば（ステップＳ７０５−ＹＥＳ）、この語句に対応する分析項目は生成しないものとして、語句ｔ２ｎｅｗを語句候補集合Ｔ２ｎｅｗから除き（ステップＳ７０６）、ステップＳ７０３からの処理を繰り返す。一方、ステップＳ７０４において算出されたスコアが所定の閾値以上であれば（ステップＳ７０５−ＮＯ）、分析項目生成部５は、そのままステップＳ７０３からの処理を繰り返す。 First, the analysis item generation unit 5 obtains a score scr (T1, T2, t2new) based on the similarity between the phrase t2new and the phrase included in the phrase set T2 (step S704). The score calculation process will be described in detail with reference to FIG. 8 to be described later. The larger the score, the more appropriate the analysis item to be newly generated. If the score calculated in step S704 is less than the predetermined threshold (YES in step S705), the analysis item generation unit 5 determines that the analysis item corresponding to this word is not generated, and extracts the phrase t2new from the phrase candidate set T2new. Except (step S706), the processing from step S703 is repeated. On the other hand, if the score calculated in step S704 is equal to or greater than a predetermined threshold (step S705—NO), the analysis item generation unit 5 repeats the processing from step S703 as it is.

分析項目生成部５は、語句候補集合Ｔ２ｎｅｗ中の全ての語句ｔ２ｎｅｗを選択すると、繰り返し処理を終了する（ステップＳ７０３−繰り返し終了）。分析項目生成部５は、ステップＳ７０３の繰り返しの処理が終了した段階で削除されずに語句候補集合Ｔ２ｎｅｗに残っている語句のうち、ステップＳ７０４において求めたスコアｓｃｒ（Ｔ１，Ｔ２，ｔ２ｎｅｗ）が大きい順に所定の個数だけ語句ｔ２ｎｅｗを選び、選ばれなかった語句ｔ２ｎｅｗは語句候補集合Ｔ２ｎｅｗから除く（ステップＳ７０７）。 When the analysis item generation unit 5 selects all the words t2new in the word candidate set T2new, the analysis item generation unit 5 finishes the repetition process (step S703—end of repetition). The analysis item generation unit 5 has a large score scr (T1, T2, t2new) obtained in step S704 among the words remaining in the word candidate set T2new without being deleted at the stage where the repetition processing in step S703 is completed. A predetermined number of phrases t2new are selected in order, and the unselected phrases t2new are removed from the phrase candidate set T2new (step S707).

そして、分析項目生成部５は、語句候補集合Ｔ２ｎｅｗ中の語句ｔ２ｎｅｗを全て選択するまで、未選択の語句ｔ２ｎｅｗを１つずつ選択して、以下のステップＳ７０９の処理を繰り返し実行する（ステップＳ７０８−繰り返し継続）。すなわち、分析項目生成部５は、分析軸ｐ２に分析項目ｃ２ｎｅｗを生成し、選択した語句ｔ２ｎｅｗを分析項目ｃ２ｎｅｗに対応する語句とすると、語句ｔ２ｎｅｗを含む文書データを分析項目ｃ２ｎｅｗに対応する文書集合とする（ステップＳ７０９）。具体的には、分析項目生成部５は、新たな分析項目ｃ２ｎｅｗの分析項目データを生成して分析軸記憶部２に登録する。分析項目生成部５は、この分析項目データに、新たな分析項目番号と、分析軸ｐ２の分析軸番号と、語句ｔ２ｎｅｗを示す名称および語句と、語句ｔ２ｎｅｗを本文に含む文書データの文書番号の集合を設定する。分析項目生成部５は、語句候補集合Ｔ２ｎｅｗ中の全ての語句ｔ２ｎｅｗを選択すると（ステップＳ７０８−繰り返し終了）、図７の分析項目生成処理を終了する。 Then, the analysis item generation unit 5 selects the unselected words t2new one by one until the selection of all the words t2new in the word candidate set T2new, and repeatedly executes the processing of the following step S709 (step S708-). Repeat continuously). That is, the analysis item generation unit 5 generates an analysis item c2new on the analysis axis p2, and if the selected word / phrase t2new is a word / phrase corresponding to the analysis item c2new, document data including the word / phrase t2new is a document set corresponding to the analysis item c2new. (Step S709). Specifically, the analysis item generation unit 5 generates analysis item data of a new analysis item c2new and registers it in the analysis axis storage unit 2. The analysis item generation unit 5 adds to this analysis item data the new analysis item number, the analysis axis number of the analysis axis p2, the name and phrase indicating the phrase t2new, and the document number of the document data including the phrase t2new in the text. Set a set. When the analysis item generation unit 5 selects all the phrases t2new in the phrase candidate set T2new (step S708—end repeatedly), the analysis item generation process of FIG. 7 ends.
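Steps S705 to S709 (discard candidates below the threshold, keep a fixed number of top-scoring phrases, then create one analysis item per remaining phrase) can be sketched as below. The scores are taken as given input, the dictionary keys are hypothetical names for the fields listed in the text, and breaking score ties alphabetically is an assumption added to make the ordering deterministic.

```python
def select_and_generate(scores, docs, axis_no, next_item_no, threshold, max_items):
    """Filter candidate phrases by score and build analysis item records."""
    # S705, S706: drop phrases whose score is below the threshold.
    kept = {t: s for t, s in scores.items() if s >= threshold}
    # S707: keep at most max_items phrases in descending score order.
    top = sorted(kept, key=lambda t: (-kept[t], t))[:max_items]
    items = []
    for offset, phrase in enumerate(top):           # S708, S709
        items.append({
            "item_no": next_item_no + offset,
            "axis_no": axis_no,
            "name": phrase,
            "phrases": [phrase],
            "doc_nos": sorted(no for no, body in docs.items() if phrase in body),
        })
    return items


docs = {
    1: "the pipe has a crack",
    2: "the pedal dropped off",
    3: "pipe and pedal were replaced",
}
scores = {"pipe": 0.9, "weld": 0.6, "bolt": 0.2, "pedal": 0.7}
new_items = select_and_generate(
    scores, docs, axis_no=1, next_item_no=10, threshold=0.5, max_items=2
)
```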

図１０（ｂ）は、図７で説明した分析項目生成処理の結果の例を示す。例えば図１０（ａ）の分析軸「部品」１００１を対象として、分析項目を生成する場合、文書分析装置１００は、分析項目の生成要求の入力を受け、さらに、分析軸ｐ１が分析軸「症状」１００２であり、分析軸ｐ２が分析軸「部品」１００１である旨の入力を受ける。この入力は、例えば、図１０（ａ）の表示において、ユーザが分析軸ｐ１として分析軸「症状」１００２を選択し、分析軸ｐ２として分析軸「部品」１００１を選択することによって行ってもよい。これにより、分析項目生成部５が、図７の分析項目生成処理を行い、分析軸「部品」１００１に分析項目「パイプ」、「ペダル」、「溶接」を生成し、これらの分析項目データを分析軸記憶部２に登録する。その後、文書分析装置１００がユーザからクロス集計の実行要求を受けると、クロス集計部３は、図６に示すクロス集計処理を行う。クロス集計部３は、図１０（ｂ）に示すように、分析軸「部品」１００１に、分析項目「パイプ」１０１１、「ペダル」１０１２、「溶接」１０１３が追加されたクロス集計結果を表示する。 FIG. 10B shows an example of the result of the analysis item generation process described in FIG. For example, when generating an analysis item for the analysis axis “component” 1001 in FIG. 10A, the document analysis apparatus 100 receives an input of an analysis item generation request, and the analysis axis p1 is the analysis axis “symptom”. ”1002, and an input indicating that the analysis axis p 2 is the analysis axis“ component ”1001 is received. This input may be performed, for example, by the user selecting the analysis axis “symptom” 1002 as the analysis axis p1 and selecting the analysis axis “component” 1001 as the analysis axis p2 in the display of FIG. . As a result, the analysis item generation unit 5 performs the analysis item generation processing of FIG. 7 to generate the analysis items “pipe”, “pedal”, and “weld” on the analysis axis “component” 1001, Register in the analysis axis storage unit 2. Thereafter, when the document analysis apparatus 100 receives a cross tabulation execution request from the user, the cross tabulation unit 3 performs a cross tabulation process shown in FIG. As shown in FIG. 10B, the cross tabulation unit 3 displays the cross tabulation result in which the analysis items “pipe” 1011, “pedal” 1012, and “weld” 1013 are added to the analysis axis “component” 1001. .

なお、文書分析装置１００は、図５のステップＳ５１４の処理（図７の分析項目生成処理）の実行後、ユーザからクロス集計の実行要求の入力を受けることなく、ステップＳ５１２（図６のクロス集計処理）の処理を行うようにしてもよい。 Note that the document analysis apparatus 100 does not receive the input of the cross tabulation execution request from the user after the execution of the processing of step S514 in FIG. 5 (analysis item generation processing in FIG. 7), and performs step S512 (cross tabulation in FIG. 6). Processing) may be performed.

図８は、分析項目生成部５の語句関係評価部５３によって実行される語句の関係の評価処理の流れを表すフローチャートであり、前述の図７のステップＳ７０４において、語句ｔ２ｎｅｗのスコアを求める処理の詳細な処理を示す。図８の処理は、語句関係評価部５３が実行するが、ステップＳ８０３とステップＳ８０７は、語句間の関係を用いた処理であるため、語句関係抽出部５２が実行する。 FIG. 8 is a flowchart showing the flow of the phrase relationship evaluation process executed by the phrase relationship evaluation unit 53 of the analysis item generation unit 5. In step S704 of FIG. 7, the process for obtaining the score of the phrase t2new is performed. Detailed processing is shown. The processing in FIG. 8 is executed by the word / phrase relationship evaluation unit 53, but step S 803 and step S 807 are processing using the relationship between words, so the word / phrase relationship extraction unit 52 executes.

まず、語句関係評価部５３は、語句が文書データ集合に含まれる頻度に着目したスコアｆｓｃｒ（Ｔ１，Ｔ２，ｔ２ｎｅｗ）を求める（ステップＳ８０１）。スコアｆｓｃｒ（Ｔ１，Ｔ２，ｔ２ｎｅｗ）は、語句ｔ２ｎｅｗの出現頻度と、語句集合Ｔ２中の語句の出現頻度との類似性に基づく。本実施形態では、語句関係評価部５３は、このスコアｆｓｃｒ（Ｔ１，Ｔ２，ｔ２ｎｅｗ）を、以下の式（１）、式（２）、式（３）で示すような頻度に関する３つの特徴を総合した方法で求める。すなわち、語句関係評価部５３は、分析軸ｐ２の既存の分析項目に対応する語句である語句集合Ｔ２中の語句ｔ２と、新規に生成する分析項目の候補である語句ｔ２ｎｅｗとの、出現頻度に着目した類似度を表す値を式（１）、式（２）、式（３）によってそれぞれ求める。そして、語句関係評価部５３は、式（１）、式（２）、式（３）によって求めたこれらの値を、式（４）に示すように総合した値を語句ｔ２ｎｅｗのスコアとする。しかしながら、本実施形態は、この方法に限定するものではない。 First, the phrase relationship evaluation unit 53 obtains a score fscr (T1, T2, t2new) focusing on the frequency with which the phrase is included in the document data set (step S801). The score fscr (T1, T2, t2new) is based on the similarity between the appearance frequency of the word t2new and the appearance frequency of the word in the word set T2. In the present embodiment, the phrase relationship evaluation unit 53 uses the score fscr (T1, T2, t2new) as the three characteristics regarding the frequency as shown by the following expressions (1), (2), and (3). Obtained in a comprehensive manner. That is, the phrase relationship evaluation unit 53 determines the appearance frequency of the phrase t2 in the phrase set T2 that is a phrase corresponding to the existing analysis item of the analysis axis p2 and the phrase t2new that is a candidate for the newly generated analysis item. A value representing the noticed similarity is obtained by Expression (1), Expression (2), and Expression (3), respectively. Then, the phrase relationship evaluation unit 53 sets the value obtained by Expression (1), Expression (2), and Expression (3) as a score of the phrase t2new, as shown in Expression (4). However, the present embodiment is not limited to this method.

文書記憶部１に記憶されている文書データの集合を文書データ集合Ｄとする。まず式（１）では、｜Ｄ（ｔ）｜と｜Ｄ（Ｔ１，ｔ）｜の比率をｘ（Ｔ１，ｔ）とする。｜Ｄ（ｔ）｜は、文書データ集合Ｄにおいて、ある語句ｔを本文に含む文書データの数であり、｜Ｄ（Ｔ１，ｔ）｜は、文書データ集合Ｄにおいて、語句集合Ｔ１中のいずれかの語句と語句ｔとを共に本文に含む文書データの数である。語句関係評価部５３は、語句ｔを語句ｔ２ｎｅｗとした場合の比率ｘ（Ｔ１，ｔ２ｎｅｗ）と、語句ｔを語句集合Ｔ２中の各語句ｔ２とした場合の比率ｘ（Ｔ１，ｔ２）とを求める。語句関係評価部５３は、語句集合Ｔ２中の語句ｔ２それぞれについて、比率ｘ（Ｔ１，ｔ２ｎｅｗ）と比率ｘ（Ｔ１，ｔ２）の類似度をｆｓｉｍ１（Ｔ１，ｔ２，ｔ２ｎｅｗ）として算出する。これにより、ｆｓｉｍ１（Ｔ１，ｔ２，ｔ２ｎｅｗ）の値が大きいほど、語句ｔ２ｎｅｗのスコアが大きくなるようにする。 A set of document data stored in the document storage unit 1 is referred to as a document data set D. First, in Expression (1), the ratio of | D (t) | and | D (T1, t) | is x (T1, t). | D (t) | is the number of document data including a certain word t in the text in the document data set D. | D (T1, t) | is any of the word sets T1 in the document data set D. This is the number of document data including both the phrase and the phrase t in the text. The phrase relationship evaluation unit 53 obtains a ratio x (T1, t2new) when the phrase t is the phrase t2new and a ratio x (T1, t2) when the phrase t is each phrase t2 in the phrase set T2. . The phrase relationship evaluation unit 53 calculates the similarity between the ratio x (T1, t2new) and the ratio x (T1, t2) as fsim1 (T1, t2, t2new) for each phrase t2 in the phrase set T2. As a result, the score of the word t2new is increased as the value of fsim1 (T1, t2, t2new) is increased.
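The ratio x(T1, t) can be illustrated with a short sketch. Two points are assumptions, because the text does not spell them out: the ratio is taken as |D(T1, t)| / |D(t)|, and the similarity between two ratios is taken as one minus their absolute difference. Expression (1) of the patent may define either differently.

```python
def x_ratio(docs, t1_phrases, t):
    """Share of the documents containing t that also contain some phrase of T1."""
    with_t = [body for body in docs if t in body]
    if not with_t:
        return 0.0
    both = [body for body in with_t if any(p in body for p in t1_phrases)]
    return len(both) / len(with_t)


def fsim1(docs, t1_phrases, t2, t2_new):
    """Assumed similarity of the two ratios: 1 - |x(T1, t2new) - x(T1, t2)|."""
    return 1.0 - abs(x_ratio(docs, t1_phrases, t2_new) - x_ratio(docs, t1_phrases, t2))


docs = [
    "a crack occurred in the tank",
    "the tank was replaced",
    "the pipe has a crack",
    "the pipe was cleaned",
    "the manual was updated",
    "the manual was reprinted",
]
sim_pipe = fsim1(docs, ["crack"], t2="tank", t2_new="pipe")
sim_manual = fsim1(docs, ["crack"], t2="tank", t2_new="manual")
```

With this data "pipe" co-occurs with "crack" in the same proportion as the existing item "tank", so it scores higher than "manual", which never co-occurs with "crack".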

次に式（２）では、ある語句ｔについて、語句集合Ｔ１中の語句のうち、語句ｔと共に本文に含まれる文書データ集合の数が最も大きい語句ｔ１を選択する。そして、｜Ｄ（ｔ１，ｔ）｜と｜Ｄ（Ｔ１，ｔ）｜の比率をｙ（Ｔ１，ｔ）とする。｜Ｄ（ｔ１，ｔ）｜は、文書データ集合Ｄにおいて、選択した語句ｔ１と語句ｔとを本文に含んだ文書データの数である。｜Ｄ（Ｔ１，ｔ）｜は、前述したように、文書データ集合Ｄにおいて、語句集合Ｔ１中のいずれかの語句と語句ｔとを共に本文に含む文書データの数である。語句関係評価部５３は、語句ｔを語句ｔ２ｎｅｗとした場合の比率ｙ（Ｔ１，ｔ２ｎｅｗ）と、語句ｔを語句集合Ｔ２中の各語句ｔ２とした場合の比率ｙ（Ｔ１，ｔ２）とを求める。語句関係評価部５３は、語句集合Ｔ２中の語句ｔ２それぞれについて、比率ｙ（Ｔ１，ｔ２ｎｅｗ）と比率ｙ（Ｔ１，ｔ２）の類似度をｆｓｉｍ２（Ｔ１，ｔ２，ｔ２ｎｅｗ）として算出する。これにより、ｆｓｉｍ２（Ｔ１，ｔ２，ｔ２ｎｅｗ）の値が大きいほど、語句ｔ２ｎｅｗのスコアが大きくなるようにする。 Next, in the expression (2), for a certain phrase t, the phrase t1 having the largest number of document data sets included in the text together with the phrase t is selected from the phrases in the phrase set T1. The ratio of | D (t1, t) | and | D (T1, t) | is y (T1, t). | D (t1, t) | is the number of document data including the selected word t1 and word t in the text in the document data set D. As described above, | D (T1, t) | is the number of document data in the text data set D that includes any word / phrase in the word / phrase set T1 and the word / phrase t in the text. The phrase relationship evaluation unit 53 obtains a ratio y (T1, t2new) when the phrase t is the phrase t2new and a ratio y (T1, t2) when the phrase t is each phrase t2 in the phrase set T2. . The phrase relationship evaluation unit 53 calculates the similarity between the ratio y (T1, t2new) and the ratio y (T1, t2) as fsim2 (T1, t2, t2new) for each phrase t2 in the phrase set T2. Accordingly, the score of the word t2new is increased as the value of fsim2 (T1, t2, t2new) is increased.
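The ratio y(T1, t) measures how strongly the co-occurrences of t concentrate on a single phrase of T1. In the sketch below the ratio is taken as |D(t1, t)| / |D(T1, t)| for the most frequent partner t1, and the similarity of two ratios is again assumed to be one minus their absolute difference; expression (2) itself is not reproduced in the text, so both choices are assumptions.

```python
def y_ratio(docs, t1_phrases, t):
    """Share of the T1 co-occurrences of t that involve its most frequent partner t1."""
    with_t = [body for body in docs if t in body]
    total = sum(1 for body in with_t if any(p in body for p in t1_phrases))
    if total == 0:
        return 0.0
    best = max(sum(1 for body in with_t if p in body) for p in t1_phrases)
    return best / total


def fsim2(docs, t1_phrases, t2, t2_new):
    """Assumed similarity of the two ratios: 1 - |y(T1, t2new) - y(T1, t2)|."""
    return 1.0 - abs(y_ratio(docs, t1_phrases, t2_new) - y_ratio(docs, t1_phrases, t2))


docs = [
    "crack in the tank",
    "crack near the tank",
    "tank leak",
    "pipe leak",
    "pipe leak again",
    "crack in the pipe",
    "pedal leak",
]
symptoms = ["crack", "leak"]
y_tank = y_ratio(docs, symptoms, "tank")
y_pedal = y_ratio(docs, symptoms, "pedal")
sim_pipe = fsim2(docs, symptoms, t2="tank", t2_new="pipe")
```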

続いて式（３）では、ある語句ｔと、語句集合Ｔ１中の各々の語句ｔ１とが共に本文に出現する頻度の度合いをベクトルｖ（Ｔ１，ｔ）で表す。このベクトルｖ（Ｔ１，ｔ）の各要素は語句ｔ１に対応しており、その要素の値ｗ（ｔ１，ｔ）は、｜Ｄ（ｔ１，ｔ）｜と｜Ｄ（ｔ）｜の比率である。｜Ｄ（ｔ１，ｔ）｜は、文書データ集合Ｄにおいて、語句ｔ１と語句ｔとを共に本文に含んだ文書データの数であり、｜Ｄ（ｔ）｜は、前述したように、文書データ集合Ｄにおいてある語句ｔを本文に含む文書データの数である。語句関係評価部５３は、語句ｔを語句ｔ２ｎｅｗとした場合のベクトルｖ（Ｔ１，ｔ２ｎｅｗ）と、語句ｔを語句集合Ｔ２中の各語句ｔ２とした場合のベクトルｖ（Ｔ１，ｔ２）とを算出する。語句関係評価部５３は、語句集合Ｔ２中の語句ｔ２それぞれについて、ベクトルｖ（Ｔ１，ｔ２ｎｅｗ）とベクトルｖ（Ｔ１，ｔ２）のコサイン類似度ｆｓｉｍ３（Ｔ１，ｔ２，ｔ２ｎｅｗ）を算出する。これにより、ｆｓｉｍ３（Ｔ１，ｔ２，ｔ２ｎｅｗ）の値が大きいほど、語句ｔ２ｎｅｗのスコアが大きくなるようにする。 Subsequently, in Expression (3), a vector v (T1, t) represents the degree of frequency that a certain word t and each word t1 in the word set T1 appear in the text. Each element of the vector v (T1, t) corresponds to the phrase t1, and the value w (t1, t) of the element is a ratio of | D (t1, t) | and | D (t) | is there. | D (t1, t) | is the number of document data including both the phrase t1 and the phrase t in the text in the document data set D, and | D (t) | is the document data as described above. This is the number of document data that includes a certain word t in the text in the set D. The phrase relationship evaluation unit 53 calculates a vector v (T1, t2new) when the phrase t is the phrase t2new and a vector v (T1, t2) when the phrase t is each phrase t2 in the phrase set T2. To do. The phrase relationship evaluation unit 53 calculates the cosine similarity fsim3 (T1, t2, t2new) between the vector v (T1, t2new) and the vector v (T1, t2) for each phrase t2 in the phrase set T2. Thus, the larger the value of fsim3 (T1, t2, t2new), the higher the score of the word t2new.
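The vector comparison described for expression (3) can be sketched directly: each element is w(t1, t) = |D(t1, t)| / |D(t)|, and two vectors are compared by cosine similarity. Returning 0.0 when either vector is all zeros is an assumption of this sketch, as are the substring matching and the sample documents.

```python
import math


def cooccurrence_vector(docs, t1_phrases, t):
    """Vector v(T1, t) whose elements are |D(t1, t)| / |D(t)| for each t1."""
    with_t = [body for body in docs if t in body]
    if not with_t:
        return [0.0] * len(t1_phrases)
    return [sum(1 for body in with_t if t1 in body) / len(with_t) for t1 in t1_phrases]


def cosine(u, v):
    """Cosine similarity; defined as 0.0 here if either vector has zero length."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


docs = [
    "crack in the tank", "crack near the tank", "tank leak",
    "crack in the pipe", "another crack in the pipe", "pipe leak",
    "pedal leak",
]
symptoms = ["crack", "leak"]
v_tank = cooccurrence_vector(docs, symptoms, "tank")
v_pipe = cooccurrence_vector(docs, symptoms, "pipe")
v_pedal = cooccurrence_vector(docs, symptoms, "pedal")
fsim3_pipe = cosine(v_pipe, v_tank)
fsim3_pedal = cosine(v_pedal, v_tank)
```

In this data "pipe" has the same symptom profile as "tank" (mostly cracks, some leaks), so its cosine similarity is close to 1, while "pedal" only ever leaks and scores lower.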

そして式（４）では、語句関係評価部５３は、語句集合Ｔ２中の語句ｔ２毎に、式（１）〜式（３）で算出した３つの類似度それぞれに所定の正の定数α１、α２、α３を各々乗じて加算した値を算出し、その中の最大値をスコアｆｓｃｒ（Ｔ１，Ｔ２，ｔ２ｎｅｗ）とする。このスコアの値が大きいということは、語句集合Ｔ１中の語句と共に出現する頻度という観点で、語句ｔ２ｎｅｗとよく類似した語句が、分析軸ｐ２の既存の分析項目に対応する語句として存在することを意味する。 In the expression (4), the phrase relationship evaluation unit 53 determines a predetermined positive constant α1, α2 for each of the three similarities calculated by the expressions (1) to (3) for each phrase t2 in the phrase set T2. , Α3 are respectively multiplied and calculated, and the maximum value among them is set as a score fscr (T1, T2, t2new). The large value of this score means that a phrase that is very similar to the phrase t2new exists as a phrase corresponding to the existing analysis item of the analysis axis p2 in terms of the frequency of appearance together with the phrase in the phrase set T1. means.
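The combination step reads as a weighted sum per existing phrase t2 followed by a maximum over T2. A minimal sketch, assuming the three similarities have already been computed and are passed in as a hypothetical `{t2: (fsim1, fsim2, fsim3)}` dictionary; the weight values are arbitrary examples, and returning 0.0 for an empty T2 is an assumption.

```python
def fscr(similarities, alphas):
    """Maximum over t2 of alpha1*fsim1 + alpha2*fsim2 + alpha3*fsim3."""
    return max(
        (sum(a * s for a, s in zip(alphas, sims)) for sims in similarities.values()),
        default=0.0,
    )


similarities = {
    "tank": (1.0, 0.5, 0.5),
    "valve": (0.5, 0.5, 0.0),
}
score = fscr(similarities, alphas=(0.5, 0.25, 0.25))
```

Taking the maximum rather than the mean means a candidate only needs to resemble one existing item on the axis closely, which matches the interpretation given in the text.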

したがって、語句関係評価部５３は、ステップＳ８０１にて求めたこのスコアｆｓｃｒ（Ｔ１，Ｔ２，ｔ２ｎｅｗ）が所定の閾値未満であると判定した場合（ステップＳ８０２−ＹＥＳ）、語句ｔ２ｎｅｗのスコアｓｃｒ（Ｔ１，Ｔ２，ｔ２ｎｅｗ）に０を設定し、図８の処理を終了する（ステップＳ８０９）。このようにすることで、この語句ｔ２ｎｅｗは、前述の図７のステップＳ７０６にて、候補から除かれるようになる。 Therefore, when the phrase relationship evaluation unit 53 determines that the score fscr (T1, T2, t2new) obtained in step S801 is less than a predetermined threshold (step S802-YES), the score scr (T1) of the phrase t2new , T2, t2new) is set to 0, and the process of FIG. 8 is terminated (step S809). By doing so, the phrase t2new is removed from the candidates in the above-described step S706 of FIG.

語句関係評価部５３が、スコアｆｓｃｒ（Ｔ１，Ｔ２，ｔ２ｎｅｗ）は所定の閾値以上であると判定した場合（ステップＳ８０２−ＮＯ）、語句関係抽出部５２は、語句ｔ２ｎｅｗと語句集合Ｔ２中の語句との類似性によるスコアｅｓｃｒ（Ｔ１，Ｔ２，ｔ２ｎｅｗ）を算出する（ステップＳ８０３）。このスコアｅｓｃｒ（Ｔ１，Ｔ２，ｔ２ｎｅｗ）は、語句が自然言語の文章中に記述されている表現に着目したスコアである。語句関係抽出部５２は、スコアｅｓｃｒ（Ｔ１，Ｔ２，ｔ２ｎｅｗ）を、以下の式（５）と式（６）により、自然言語の表現に関する２つの特徴を総合した方法で求める。すなわち、語句関係抽出部５２は、分析軸ｐ２の既存の分析項目に対応する語句であるＴ２中の語句ｔ２と、新規に生成する分析項目の候補の語句である語句ｔ２ｎｅｗとの表現に着目した類似度を表す値を、これら式（５）、式（６）それぞれにより求める。語句関係抽出部５２は、式（５）、式（６）それぞれにより求めたこれらの類似度を表す値を、式（７）と式（８）によって総合した値をｔ２ｎｅｗのスコアｅｓｃｒ（Ｔ１，Ｔ２，ｔ２ｎｅｗ）とする。しかしながら、本実施形態はこの方法に限定するものではない。 When the phrase relationship evaluation unit 53 determines that the score fscr (T1, T2, t2new) is equal to or greater than a predetermined threshold (step S802-NO), the phrase relationship extraction unit 52 determines whether the phrase t2new and the phrase in the phrase set T2 A score escr (T1, T2, t2new) based on the similarity to is calculated (step S803). This score escr (T1, T2, t2new) is a score that focuses on expressions in which a phrase is described in a natural language sentence. The phrase relationship extraction unit 52 obtains the score escr (T1, T2, t2new) by a method in which two features relating to the expression of natural language are integrated by the following equations (5) and (6). That is, the phrase relationship extraction unit 52 focuses on the expression of the phrase t2 in T2 that is a phrase corresponding to the existing analysis item of the analysis axis p2, and the phrase t2new that is a candidate phrase of the analysis item to be newly generated. A value representing the similarity is obtained by each of these formulas (5) and (6). The phrase relationship extraction unit 52 obtains a value obtained by combining the values obtained by the expressions (5) and (6) according to the expressions (7) and (8) from the expressions (7) and (6), and the score escr (T1, T2new). T2, t2new). However, the present embodiment is not limited to this method.

FIG. 9 is a diagram illustrating examples of the relationships between phrases extracted by the phrase relationship extraction unit 52 of the analysis item generation unit 5. The figure shows example results in which the phrase relationship extraction unit 52 has extracted the relationship between phrases from a sentence that contains both a phrase t1 in the phrase set T1 and a phrase t2 in the phrase set T2. In the present embodiment, this processing of the phrase relationship extraction unit 52 is implemented using conventional syntactic parsing technology; in that case, the clauses that make up the syntax tree and the dependency relations between those clauses are extracted. Each clause is composed of morphemes. From these clauses and dependency relations, the phrase relationship extraction unit 52 obtains a partial syntax tree s(t1, t2) consisting of the clauses that contain the phrase t1 or the phrase t2, the smallest set of relations connecting those clauses, and any other clauses that are linked by those relations to the clause containing t1 and the clause containing t2.

FIG. 9(a) is an example in which the phrase relationship extraction unit 52 has extracted clauses and relations from the body text 204 of the document data 200 shown in FIG. 2(a). When the phrase t1 is "crack" and the phrase t2 is "tank", the clause 902 contains the phrase t1 and the clause 903 contains the phrase t2. Accordingly, the phrase relationship extraction unit 52 obtains, as the partial syntax tree s(t1, t2), the partial syntax tree 901 enclosed in the rectangle, which consists of the clauses 902 and 903, the relations 905 and 906 that connect them, and the clause 904 that is linked to the clauses 902 and 903 by the relations 905 and 906. Similarly, for the candidate phrase t2new, the phrase relationship extraction unit 52 obtains the partial syntax tree s(t1, t2new) of a sentence that contains the phrase t2new and the phrase t1. FIG. 9(b) is an example in which clauses and relations have been extracted from the body text 234 of the document data 230 shown in FIG. 2(d); when the phrase t2new is "pipe", the partial syntax tree 911 enclosed in the rectangle is the partial syntax tree s(t1, t2new). The partial syntax tree 911 consists of the clause 912 containing the phrase t2new "pipe", the clause 913 containing the phrase t1 "crack", the relations 915 and 916 that connect the clauses 912 and 913, and the clause 914 that is linked to the clauses 912 and 913 by the relations 915 and 916.
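
The extraction of a partial syntax tree can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes that a dependency parser has already produced a list of clauses (each a list of morphemes) and a `heads` array giving the clause each clause depends on, that the parse has a single root, and that each phrase is a single morpheme. The function name `partial_syntax_tree` and this encoding are hypothetical.

```python
def partial_syntax_tree(chunks, heads, t1, t2):
    """Return (clause indices, relations) of the subtree linking t1 and t2.

    chunks: list of clauses, each a list of morpheme strings.
    heads:  heads[i] is the index of the clause that clause i depends on,
            or -1 for the root (an assumed encoding of the dependency parse).
    """
    def find(phrase):
        # Index of the first clause containing the phrase, or None.
        for i, chunk in enumerate(chunks):
            if phrase in chunk:
                return i
        return None

    def path_to_root(i):
        path = [i]
        while heads[path[-1]] != -1:
            path.append(heads[path[-1]])
        return path

    i1, i2 = find(t1), find(t2)
    if i1 is None or i2 is None:
        return None  # the sentence does not contain both phrases

    p1, p2 = path_to_root(i1), path_to_root(i2)
    common = next(i for i in p1 if i in p2)  # lowest common ancestor clause
    up1 = p1[: p1.index(common) + 1]
    up2 = p2[: p2.index(common) + 1]
    nodes = sorted(set(up1) | set(up2))
    # Each relation is (dependent clause index, head clause index).
    relations = sorted({(i, heads[i]) for i in up1[:-1] + up2[:-1]})
    return nodes, relations


# The sentence of FIG. 9(a), split into three clauses that both depend on the last one.
chunks = [["タンク", "に"], ["亀裂", "が"], ["発生", "した", "。"]]
heads = [2, 2, -1]
tree = partial_syntax_tree(chunks, heads, "亀裂", "タンク")
print(tree)  # ([0, 1, 2], [(0, 2), (1, 2)])
```

The result contains three clauses and two relations, which matches the shape of the partial syntax tree 901 (clauses 902 to 904 and relations 905 and 906).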

FIGS. 9(c) and 9(d) show an example in which, as in FIGS. 9(a) and 9(b), the phrase t1 is "crack" and the phrase t2 is "tank", but the phrase t2new is "welding". The partial syntax tree 921 shown in FIG. 9(c) consists of the clause 922 containing the phrase t2 "tank", the clause 923 containing the phrase t1 "crack", the relations 925 and 926 that connect the clauses 922 and 923, and the clause 924 that is linked to the clauses 922 and 923 by the relations 925 and 926; it is a partial syntax tree s(t1, t2). The partial syntax tree 931 shown in FIG. 9(d) consists of the clause 932 containing the phrase t2new "welding", the clause 933 containing the phrase t1 "crack", the relations 935 and 936 that connect the clauses 932 and 933, and the clause 934 that is linked to the clauses 932 and 933 by the relations 935 and 936; it is a partial syntax tree s(t1, t2new).

Equations (5) and (6) compute the similarity between a partial syntax tree s(t1, t2) and a partial syntax tree s(t1, t2new). The phrases t2 and t2new are regarded as the same phrase, and the clauses and relations of the two trees are mapped onto each other; the similarity is then based on the number of morphemes shared by clauses that can be mapped to each other (Equation (5)) or on the number of relations that can be mapped to each other (Equation (6)). In Equation (5), count(m∈s) denotes the number of morphemes m contained in the partial syntax tree s, and in Equation (6), count(r∈s) denotes the number of relations r contained in the partial syntax tree s. From the document data, the phrase relationship extraction unit 52 has obtained zero or more partial syntax trees s(t1, t2) for each combination of a phrase t1 in the phrase set T1 and a phrase t2 in the phrase set T2, and zero or more partial syntax trees s(t1, t2new) for each combination of a phrase t1 in the phrase set T1 and the phrase t2new.

First, Equation (5) considers the number mn(s(t1, t)) of morphemes m contained in a partial syntax tree s(t1, t), and the number mc(s(t1, ti), s(t1, tj)) of morphemes m contained in both of two partial syntax trees s(t1, ti) and s(t1, tj). When counting mc(s(t1, ti), s(t1, tj)), the morphemes corresponding to the phrases ti and tj are treated as equal. The phrase relationship extraction unit 52 counts mn(s(t1, t2)) for each partial syntax tree s(t1, t2) and mn(s(t1, t2new)) for each partial syntax tree s(t1, t2new). Further, for every combination of a partial syntax tree s(t1, t2) and a partial syntax tree s(t1, t2new) that share the phrase t1, the phrase relationship extraction unit 52 counts mc(s(t1, t2), s(t1, t2new)) and calculates the similarity esim1(s(t1, t2), s(t1, t2new)) between the phrases t2 and t2new as the ratio of this count to the larger of mn(s(t1, t2)) and mn(s(t1, t2new)).

Equation (6), on the other hand, considers the number rn(s(t1, t)) of relations r contained in a partial syntax tree s(t1, t), and the number rc(s(t1, ti), s(t1, tj)) of relations r contained in both of two partial syntax trees s(t1, ti) and s(t1, tj). The phrase relationship extraction unit 52 counts rn(s(t1, t2)) for each partial syntax tree s(t1, t2) and rn(s(t1, t2new)) for each partial syntax tree s(t1, t2new). Further, for every combination of a partial syntax tree s(t1, t2) and a partial syntax tree s(t1, t2new) that share the phrase t1, the phrase relationship extraction unit 52 counts rc(s(t1, t2), s(t1, t2new)) and calculates the similarity esim2(s(t1, t2), s(t1, t2new)) between the phrases t2 and t2new as the ratio of this count to the larger of rn(s(t1, t2)) and rn(s(t1, t2new)).
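
The two pairwise similarities described for Equations (5) and (6) can be sketched as below. This is an illustrative reading of the prose, not the exact formulas: shared morphemes and relations are counted as a multiset intersection, the `PLACEHOLDER` token that makes t2 and t2new compare equal is hypothetical, and the relation labels (particle of the dependent clause, head word of the governing clause) are an assumed encoding, since the text does not detail how relations are judged mappable. Returning 0.0 for empty trees is also an assumption.

```python
from collections import Counter

PLACEHOLDER = "<T2>"  # hypothetical token standing in for both t2 and t2new


def mask(morphemes, phrase):
    """Replace the phrase (t2 or t2new) so that both are treated as equal."""
    return [PLACEHOLDER if m == phrase else m for m in morphemes]


def overlap(a, b):
    """Size of the multiset intersection of two sequences."""
    return sum((Counter(a) & Counter(b)).values())


def ratio(common, n_a, n_b):
    """Shared count divided by the larger of the two totals (0.0 if both are empty)."""
    larger = max(n_a, n_b)
    return common / larger if larger else 0.0


def esim1(morphemes_a, phrase_a, morphemes_b, phrase_b):
    """Morpheme-based similarity in the spirit of Equation (5)."""
    a, b = mask(morphemes_a, phrase_a), mask(morphemes_b, phrase_b)
    return ratio(overlap(a, b), len(a), len(b))


def esim2(relations_a, relations_b):
    """Relation-based similarity in the spirit of Equation (6)."""
    return ratio(overlap(relations_a, relations_b), len(relations_a), len(relations_b))


# Morphemes of the partial syntax trees 901 and 911 (FIG. 9(a) and 9(b)).
tank = ["タンク", "に", "亀裂", "が", "発生", "した", "。"]
pipe = ["パイプ", "に", "亀裂", "が", "発生", "した", "。"]
# Assumed relation labels: (particle of the dependent clause, head of the governing clause).
rel_tank = [("に", "発生"), ("が", "発生")]
rel_pipe = [("に", "発生"), ("が", "発生")]

print(esim1(tank, "タンク", pipe, "パイプ"))  # 1.0
print(esim2(rel_tank, rel_pipe))  # 1.0
```

Dividing by the larger of the two totals keeps both values in the range 0 to 1 and penalizes a short tree that happens to be fully contained in a much longer one.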

Then, in Equation (7), for each partial syntax tree s(t1, t2new) containing the phrases t1 and t2new, the phrase relationship extraction unit 52 multiplies the two similarities obtained by Equations (5) and (6) by predetermined positive constants β1 and β2, respectively, adds the products, and selects the maximum of this value among the partial syntax trees s(t1, t2) containing the phrases t1 and t2. The phrase relationship extraction unit 52 then adds up the selected maxima and divides the sum by the number |s(t1, t2new)| of partial syntax trees s(t1, t2new); this average is taken as the similarity esim(t1, t2, t2new) between the phrases t2 and t2new.
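
The aggregation described for Equation (7) might be sketched as follows, under one reading of the text: for each tree s(t1, t2new), take the best weighted match among the trees s(t1, t2), then average over the trees s(t1, t2new). The function name `esim`, the callback parameters `pair_sim1` and `pair_sim2` (standing in for Equations (5) and (6)), the default weights, and the 0.0 result for empty inputs are all assumptions made for illustration.

```python
def esim(trees_t2, trees_t2new, pair_sim1, pair_sim2, beta1=1.0, beta2=1.0):
    """Average, over trees s(t1, t2new), of the best weighted match in s(t1, t2).

    pair_sim1 / pair_sim2: callables taking (tree of t2, tree of t2new) and
    returning the morpheme-based and relation-based similarities.
    """
    if not trees_t2 or not trees_t2new:
        return 0.0  # assumed convention when no partial syntax trees exist
    total = 0.0
    for s_new in trees_t2new:
        total += max(
            beta1 * pair_sim1(s, s_new) + beta2 * pair_sim2(s, s_new)
            for s in trees_t2
        )
    return total / len(trees_t2new)


# Toy check with integers standing in for trees.
same_or_half = lambda a, b: 1.0 if a == b else 0.5
zero = lambda a, b: 0.0
print(esim([1, 2], [1, 3], same_or_half, zero))  # 0.75
```

In the toy check, the first new tree finds an exact match (1.0) and the second only a partial one (0.5), so the average is 0.75.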

In Equation (8), the phrase relationship extraction unit 52 takes, as the score escr(T1, T2, t2new), the maximum of the similarities esim(t1, t2, t2new) calculated by Equation (7) over all combinations of a phrase t1 in T1 and a phrase t2 in T2. A large value of this score means that, in terms of the expressions that appear together with the phrases in the phrase set T1, a phrase closely resembling t2new already exists among the phrases corresponding to the existing analysis items of the analysis axis p2.
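
The equations themselves are not reproduced in this text. Based only on the prose descriptions of Equations (5) to (8), they can be reconstructed roughly as below. This is a hedged reconstruction: the notation S(t1, t) for the set of partial syntax trees and the primes on s are introduced here for readability, and the original typeset formulas may differ in form.

```latex
\begin{align}
\mathit{esim1}(s, s') &= \frac{\mathit{mc}(s, s')}{\max\bigl(\mathit{mn}(s),\ \mathit{mn}(s')\bigr)} \tag{5} \\
\mathit{esim2}(s, s') &= \frac{\mathit{rc}(s, s')}{\max\bigl(\mathit{rn}(s),\ \mathit{rn}(s')\bigr)} \tag{6} \\
\mathit{esim}(t_1, t_2, t_{2new}) &= \frac{1}{\lvert S(t_1, t_{2new}) \rvert}
  \sum_{s' \in S(t_1, t_{2new})} \ \max_{s \in S(t_1, t_2)}
  \bigl(\beta_1\, \mathit{esim1}(s, s') + \beta_2\, \mathit{esim2}(s, s')\bigr) \tag{7} \\
\mathit{escr}(T_1, T_2, t_{2new}) &= \max_{t_1 \in T_1,\ t_2 \in T_2} \mathit{esim}(t_1, t_2, t_{2new}) \tag{8}
\end{align}
```

Here s ranges over the partial syntax trees s(t1, t2) and s' over the partial syntax trees s(t1, t2new).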

For example, the morphemes of the partial syntax tree 901 in FIG. 9(a) are 「タンク」 (tank), 「に」, 「亀裂」 (crack), 「が」, 「発生」, 「した」, and 「。」, and those of the partial syntax tree 911 in FIG. 9(b) are 「パイプ」 (pipe), 「に」, 「亀裂」, 「が」, 「発生」, 「した」, and 「。」. Therefore, if the phrase t2 "tank" and the phrase t2new "pipe" are regarded as equal, all morphemes and all relations of the partial syntax tree 901 and the partial syntax tree 911 match. Hence the similarity esim1(s(t1, t2), s(t1, t2new)) = 7/7 = 1 and the similarity esim2(s(t1, t2), s(t1, t2new)) = 2/2 = 1, which are the maximum possible values.

Likewise, for the partial syntax tree 921 in FIG. 9(c) and the partial syntax tree 931 in FIG. 9(d), if the phrase t2 "tank" and the phrase t2new "welding" are regarded as equal, six of the seven morphemes match (only 「底部」 (bottom) and 「箇所」 (location) differ) and all relations match. Hence the similarity esim1(s(t1, t2), s(t1, t2new)) = 6/7 and the similarity esim2(s(t1, t2), s(t1, t2new)) = 2/2 = 1, so the similarity is fairly high in this case as well.

In FIG. 8, when the phrase relationship evaluation unit 53 determines that the score escr(T1, T2, t2new) obtained in step S803 is less than a predetermined threshold (step S804-YES), it sets the score scr(T1, T2, t2new) of the phrase t2new to 0 and ends the processing of FIG. 8 (step S809). As a result, the phrase t2new is removed from the candidates in step S706 of FIG. 7 described above.

When the phrase relationship evaluation unit 53 determines that the score escr(T1, T2, t2new) is equal to or greater than the predetermined threshold (step S804-NO), it performs steps S805 through S808 to exclude phrases that resemble the deleted phrase set T2rmv described in step S701 of FIG. 7. That is, for the deleted phrase set T2rmv, the phrase relationship evaluation unit 53 calculates the score fscr(T1, T2rmv, t2new) and the phrase relationship extraction unit 52 calculates the score escr(T1, T2rmv, t2new). The larger these values, the more closely the phrase t2new resembles a phrase corresponding to a previously deleted analysis item; in that case the score scr(T1, T2, t2new) of t2new is set to 0 so that it tends to be excluded from the analysis item candidates.

Specifically, the phrase relationship evaluation unit 53 calculates the score fscr(T1, T2rmv, t2new) by the same processing as in step S801 described above (step S805). When the phrase relationship evaluation unit 53 determines that the calculated score fscr(T1, T2rmv, t2new) is equal to or greater than a predetermined threshold (step S806-YES), the phrase relationship extraction unit 52 obtains the score escr(T1, T2rmv, t2new) by the same processing as in step S803 described above (step S807). When the phrase relationship evaluation unit 53 determines that the calculated score escr(T1, T2rmv, t2new) is equal to or greater than a predetermined threshold (step S808-YES), it sets the score scr(T1, T2, t2new) of the phrase t2new to 0 and ends the processing of FIG. 8 (step S809).

When the phrase relationship evaluation unit 53 determines that the score fscr(T1, T2rmv, t2new) calculated in step S805 is less than the predetermined threshold (step S806-NO), or that the score escr(T1, T2rmv, t2new) calculated in step S807 is less than the predetermined threshold (step S808-NO), it judges the phrase t2new to be appropriate as an analysis item candidate. The phrase relationship evaluation unit 53 multiplies the score fscr(T1, T2, t2new) calculated in step S801 and the score escr(T1, T2, t2new) calculated in step S803 by positive constants α and β, respectively, adds the products, takes the sum as the final score scr(T1, T2, t2new), and ends the processing (step S810).
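
The control flow of FIG. 8 (steps S801 to S810) can be summarized in a short sketch. The scoring functions `fscr` and `escr` are passed in as callables because their internals are described elsewhere; the function name `score_candidate`, the four separate threshold parameters, and the default weights are assumptions for illustration (the text only says "a predetermined threshold" at each step).

```python
def score_candidate(fscr, escr, T1, T2, T2rmv, t2new,
                    f_min, e_min, f_rmv, e_rmv, alpha=1.0, beta=1.0):
    """Return the final score scr(T1, T2, t2new) following the flow of FIG. 8."""
    f = fscr(T1, T2, t2new)                      # S801
    if f < f_min:                                # S802-YES
        return 0.0                               # S809
    e = escr(T1, T2, t2new)                      # S803
    if e < e_min:                                # S804-YES
        return 0.0                               # S809
    if fscr(T1, T2rmv, t2new) >= f_rmv:          # S805, S806-YES
        if escr(T1, T2rmv, t2new) >= e_rmv:      # S807, S808-YES
            return 0.0                           # S809: too close to deleted phrases
    return alpha * f + beta * e                  # S810


# Toy stand-ins: scores depend only on which phrase set is compared against.
toy_f = lambda T1, T, t: {"kept": 0.75, "removed": 0.5}[T]
toy_e = lambda T1, T, t: {"kept": 0.5, "removed": 0.125}[T]
print(score_candidate(toy_f, toy_e, None, "kept", "removed", "x",
                      0.25, 0.25, 0.25, 0.25))  # 1.25
```

A candidate is zeroed out only when both scores against the deleted phrase set reach their thresholds; failing either check lets it through to the weighted sum.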

By executing the processing described with reference to FIGS. 7 and 8 while alternately switching the target analysis axis and performing cross tabulation, the document analysis apparatus 100 can build up the analysis items of each analysis axis step by step, as shown in FIG. 10. For example, as described above, FIG. 10(b) shows the result of the document analysis apparatus 100 generating analysis items for the analysis axis "component" 1001 from the state of FIG. 10(a). However, automatically generated analysis items are not necessarily all appropriate, especially in the early stages of the work; an analysis item may be generated from a phrase that does not belong to the analysis axis "component" 1001, such as the analysis item "welding" 1013 in FIG. 10(b). In such a case, the user can delete the inappropriate analysis item using the analysis axis operation unit 4, as described above. That is, the analysis axis operation unit 4 of the document analysis apparatus 100 deletes the analysis item "welding" 1013, which the user has designated as inappropriate, by the processing of step S508 in FIG. 5. Alternatively, the analysis axis operation unit 4 can make the analysis item "welding" 1013 available as an analysis item of another analysis axis by the processing of step S510 in FIG. 5. FIG. 11 shows an example.

FIG. 11 is a diagram illustrating an example of the cross tabulation result displayed by the cross tabulation unit 3 after an analysis item has been moved, together with the analysis axes and analysis items operated on by the analysis axis operation unit 4.

In the cross tabulation result shown in FIG. 11(a), the user moves the analysis item "welding" 1101, which is inappropriate for the analysis axis "component" 1100, to another analysis axis, "process". As a result, the cross tabulation unit 3 can display a cross tabulation result that uses the analysis item "welding" 1112 on the analysis axis "process" 1111, as shown in FIG. 11(b). The destination analysis axis may be specified, for example, in another cross tabulation result displayed by the cross tabulation unit 3. Once an analysis item has been moved in this way, the processing described for step S702 of FIG. 7 and steps S805 to S808 of FIG. 8 makes the moved phrase, and phrases similar to it, less likely to be extracted in subsequent generation of analysis items for the source analysis axis ("component" 1100 in the example of FIG. 11). Conversely, in the generation of analysis items for the destination analysis axis ("process" in the example of FIG. 11), phrases similar to the moved phrase become more likely to be extracted.
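
The bookkeeping behind deleting and moving analysis items can be sketched with a small data structure. This is a simplified illustration: the class name `AnalysisAxes` and its methods are hypothetical, and each analysis item is reduced to a single phrase, whereas the apparatus stores analysis items together with their corresponding phrases.

```python
from collections import defaultdict


class AnalysisAxes:
    """Tracks, per analysis axis, the current phrases and the deleted ones."""

    def __init__(self):
        self.items = defaultdict(set)    # axis name -> phrases of current analysis items
        self.removed = defaultdict(set)  # axis name -> phrases deleted or moved away

    def add(self, axis, phrase):
        self.items[axis].add(phrase)

    def delete(self, axis, phrase):
        # Remember the phrase so that later generation can avoid it and similar phrases.
        self.items[axis].discard(phrase)
        self.removed[axis].add(phrase)

    def move(self, src, dst, phrase):
        # Moving counts as a deletion on the source axis and an addition on the destination.
        self.delete(src, phrase)
        self.items[dst].add(phrase)


axes = AnalysisAxes()
axes.add("component", "tank")
axes.add("component", "welding")
axes.move("component", "process", "welding")
print(sorted(axes.items["component"]), sorted(axes.items["process"]),
      sorted(axes.removed["component"]))
# ['tank'] ['welding'] ['welding']
```

Keeping a per-axis record of removed phrases is what lets later generation for the source axis treat them as the deleted phrase set, while the destination axis gains the phrase as a positive example.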

As described above, FIG. 10(c) shows the cross tabulation result after the user, in the state of FIG. 10(b), has deleted or moved the analysis item "welding" 1013 from the analysis axis "component" 1001 and then generated analysis items for the analysis axis "symptom" 1002. In this example, phrases relating to defect symptoms, such as "breakage" 1021 and "interference" 1022, have been extracted for the analysis axis "symptom" 1002, and the corresponding analysis items have been created.

In this way, analysis items can be generated by alternately switching the analysis axis, as in the progression from FIG. 10(a) to FIG. 10(b) to FIG. 10(c). Alternatively, analysis items may be generated at once for both the analysis axis "component" 1001 and the analysis axis "symptom" 1002, for example from the state of FIG. 10(c). In the example shown in FIG. 10(d), "cable" 1031 and "wiring" 1032 have been generated as analysis items of the analysis axis "component" 1001 and, at the same time, "disconnection" 1033 has been generated as an analysis item of the analysis axis "symptom" 1002. Generating the analysis items of multiple analysis axes at once in this way can easily be achieved with a small modification to the processing flow of FIG. 7. For example, after completing the processing of FIG. 7 for the analysis axes p1 and p2 designated by the user, the analysis item generation unit 5 swaps the analysis axes p1 and p2 and performs the processing of FIG. 7 again.

According to the embodiment described above, at the first stage of the analysis work, a small number of analysis items using known phrases are created for each of the first and second analysis axes, for example manually by the user. The analysis item generation unit 5 then automatically extracts phrases belonging to the second analysis axis and generates new analysis items, and the user checks the result using the cross tabulation unit 3. If necessary, phrases belonging to the first analysis axis are then extracted automatically to generate new analysis items, and the user again checks the result using the cross tabulation unit 3. The analysis work can proceed in this manner. By repeating these operations, multiple analysis items are created in turn for the first and second analysis axes. Using the cross tabulation results of these analysis items, the user can grasp the overall content of the document data set, discover unknown phrases (that is, new analysis items), and examine in detail their correlation with other analysis items.
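
The cross tabulation that the user inspects at each step can be sketched as follows. This is a minimal illustration under stated assumptions: documents are plain strings, an analysis item matches a document when any of its phrases occurs as a substring, and the function name `cross_tabulate` and the sample data are hypothetical.

```python
from itertools import product


def cross_tabulate(documents, axis_a, axis_b):
    """Count documents per pair of analysis items.

    axis_a, axis_b: dict mapping an analysis item name to its set of phrases.
    Returns a dict mapping (item of axis_a, item of axis_b) to a document count.
    """
    table = {}
    for (item_a, phrases_a), (item_b, phrases_b) in product(axis_a.items(), axis_b.items()):
        table[(item_a, item_b)] = sum(
            1
            for doc in documents
            if any(p in doc for p in phrases_a) and any(p in doc for p in phrases_b)
        )
    return table


docs = ["crack found in tank", "crack found in pipe", "tank interferes with frame"]
component = {"tank": {"tank"}, "pipe": {"pipe"}}
symptom = {"crack": {"crack"}, "interference": {"interferes", "interference"}}
table = cross_tabulate(docs, component, symptom)
print(table[("tank", "crack")], table[("pipe", "interference")])  # 1 0
```

A real system would match phrases on morphological analysis results rather than raw substrings, which matters in particular for Japanese text without word delimiters.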

Furthermore, when the analysis item generation unit 5 automatically generates an analysis item that is inappropriate for the first or second analysis axis, the user can check this immediately using the cross tabulation unit 3. With the analysis axis operation unit 4, the user can easily not only delete the inappropriate analysis item but also reuse it as an analysis item of another analysis axis. Once the analysis axis operation unit 4 has deleted a phrase as inappropriate for an analysis item of a given analysis axis, the analysis item generation unit 5 thereafter does not extract the deleted phrase, or phrases similar to it, as phrases corresponding to that analysis axis. Repeating this work therefore allows appropriate phrases for each analysis axis to be extracted with high accuracy.

According to the document analysis apparatus 100 of at least one embodiment described above, the analysis item generation unit 5 makes it possible to create the analysis items of each analysis axis used when analyzing a large number of documents along multiple analysis axes. The user does not need to examine the individual sentences in a large number of documents and manually search for phrases corresponding to analysis items, nor to prepare such phrases in advance in the form of a dictionary or the like, nor to specify in advance, in the form of rules or the like, the phrases or inter-phrase relations to be extracted. The user's effort in the analysis work is therefore greatly reduced, and even free-form text containing unknown phrases or unknown expressions of relations can be analyzed in a way that reflects its meaning.

Also, according to the document analysis apparatus 100 of at least one embodiment described above, the analysis axis operation unit 4 makes it possible, when the process of generating analysis items yields an analysis item that is inappropriate for the target analysis axis but useful for analyzing the documents, to make it an analysis item of another analysis axis. The user can thereby also perform analysis using an analysis axis different from the one originally targeted. In addition, the analysis item generation unit 5 does not generate, on an analysis axis, analysis items that use a phrase deleted from that axis or a phrase similar to a deleted phrase. This prevents analysis items from being generated from phrases that are likely to be inappropriate, so that analysis items based on appropriate phrases can be generated accurately for each analysis axis.

A program for realizing the functions of the document analysis apparatus 100 of FIG. 1 in each of the embodiments described above may be recorded on a computer-readable recording medium, and a computer system may be made to operate as the document analysis apparatus 100 by reading and executing the program recorded on that medium. The "computer system" here includes an OS and hardware such as peripheral devices. The "computer system" also includes a WWW system provided with a homepage providing environment (or display environment). The "computer-readable recording medium" refers to portable media such as flexible disks, magneto-optical disks, ROMs, and CD-ROMs, and to storage devices such as hard disks built into a computer system. The "computer-readable recording medium" further includes media that hold the program for a certain period of time, such as the volatile memory (RAM) inside a computer system that serves as a server or client when the program is transmitted over a network such as the Internet or a communication line such as a telephone line.

The program may also be transmitted from a computer system that stores it in a storage device or the like to another computer system via a transmission medium, or by transmission waves in a transmission medium. Here, the "transmission medium" that transmits the program refers to a medium having the function of transmitting information, such as a network (communication network) like the Internet or a communication line such as a telephone line. The program may realize only part of the functions described above. It may also be a so-called difference file (difference program), which realizes the functions described above in combination with a program already recorded in the computer system.

Although an embodiment of the present invention has been described above, this embodiment is presented as an example and is not intended to limit the scope of the invention. The embodiment can be implemented in various other forms, and various omissions, substitutions, and changes can be made without departing from the gist of the invention. The embodiment and its modifications are included in the scope and gist of the invention, and likewise in the invention described in the claims and its equivalents.

1 ... Document storage unit
2 ... Analysis axis storage unit
3 ... Cross tabulation unit
4 ... Analysis axis operation unit
5 ... Analysis item generation unit
51 ... Phrase extraction unit
52 ... Phrase relationship extraction unit
53 ... Phrase relationship evaluation unit
100 ... Document analysis apparatus

Claims

A document analysis apparatus comprising:
a document storage unit that stores a plurality of document data;
an analysis axis storage unit that stores a plurality of analysis axes, analysis items of each analysis axis, and phrases corresponding to the analysis items;
an analysis item generation unit that receives a first analysis axis and a second analysis axis as input; reads, from the analysis axis storage unit, a first phrase set consisting of the phrases corresponding to the analysis items of the first analysis axis and a second phrase set consisting of the phrases corresponding to the analysis items of the second analysis axis; extracts, as phrase candidates, phrases that co-occur with the phrases included in the first phrase set in the document data stored in the document storage unit; selects, from the phrase candidates, those whose frequency or expression of co-occurrence with the phrases included in the first phrase set in the document data is more similar than a predetermined criterion to that of the phrases included in the second phrase set; and writes new analysis items of the second analysis axis that use the selected phrase candidates to the analysis axis storage unit; and
a cross tabulation unit that reads, from the analysis axis storage unit, the analysis items of each of a plurality of analysis axes and the phrases corresponding to those analysis items; counts, for each combination of the analysis items read for the plurality of analysis axes, the number of document data stored in the document storage unit that contain the phrases corresponding to the analysis items making up the combination; and displays the count results.

The document analysis apparatus according to claim 1, further comprising an analysis axis operation unit that, based on an input instruction to delete or move an analysis item, performs on the analysis axis storage unit a process of deleting the analysis item or a process of rewriting the analysis axis to which the analysis item belongs to another analysis axis,
wherein the analysis item generation unit excludes, from the phrase candidates, the phrases included in a deleted phrase set corresponding to the analysis items deleted or moved from the second analysis axis, and the phrase candidates whose frequency or expression of co-occurrence with the phrases included in the first phrase set in the document data is more similar than a predetermined criterion to that of the phrases included in the deleted phrase set.

The document analysis apparatus according to claim 1, wherein the analysis item generation unit takes, as the first analysis axis, an analysis axis having a predetermined analysis item selected from among the analysis items whose count results are displayed by the cross tabulation unit, and takes, as the second analysis axis, the analysis axis for which analysis items are to be generated.