JP7272846B2 - Document analysis device and document analysis method

Info

Publication number: JP7272846B2
Application number: JP2019064867A
Authority: JP
Inventors: 新司飯塚; 大介菊地
Original assignee: 株式会社日立ソリューションズ東日本
Priority date: 2019-03-28
Filing date: 2019-03-28
Publication date: 2023-05-12
Anticipated expiration: 2039-03-28
Also published as: JP2020166426A

Description

The present invention relates to document analysis technology.

When analyzing a large number of documents, reading all of them to understand their contents takes a great deal of effort. Accordingly, as shown in FIG. 1, there is a need to extract several documents with commonly occurring content (D_A, D_B, ...) from a large document group D1, D2, ... (L1), and to grasp the contents of the document group D1, D2, ... by reading the documents D_A, D_B, .... This is expected to solve the problem that reading all of a large number of documents takes a great deal of effort.
To address this need, conventional techniques use, for example, document clustering as shown in FIG. 2. Document clustering classifies a large number of documents into clusters, each of which is a collection of documents with similar content. In FIG. 2, the document group D1, D2, ... is classified into clusters such as cluster 1, which contains documents similar in content to document D_A, cluster 2, which contains documents similar in content to document D_B, and so on (L2). The main content (Sa1) of the original documents is then grasped by understanding the documents in each cluster (L3).

In the technique described in Patent Document 1 below, each sentence of a question document is labeled on a rule basis, and the discourse structure of the question document is analyzed. Question documents related to a user-specified keyword are then extracted and grouped based on the discourse structure. This allows a representative sentence for the questions in each group to be listed as an FAQ (representative question) candidate.

Japanese Patent No. 5574842 ("FAQ candidate extraction system and FAQ candidate extraction program")

When documents are collected on a scale of thousands to tens of thousands, reading every document in each cluster to understand the clusters obtained by clustering takes an enormous amount of time.
If, instead, a document is picked arbitrarily to grasp the content of each cluster, that document's content is not necessarily content that is common in the original documents. In clustering, every document is assigned to some cluster, whether or not its content is common, so a cluster mixes documents with common content and documents without. A problem with document clustering is therefore that it cannot identify the documents with common content that one actually wants to extract (hereinafter, "representative documents"). The best representative document is one that has many similar documents across the whole of the original document set.

In addition, extracting representative documents requires manual work.
For example, the technique described in Patent Document 1 also has the problem that the FAQ extraction work must be performed manually.

An object of the present invention is to provide a technique that enables representative documents to be extracted with high efficiency and high accuracy.

According to one aspect of the present invention, there is provided a document analysis device comprising: a document data preprocessing unit that vectorizes documents by distributed representation and calculates document vectors; an isolated document removal unit that removes, as isolated documents, those preprocessed documents for which fewer than a predetermined number of documents have a similarity higher than a predetermined threshold; a clustering processing unit that clusters the documents remaining after the isolated documents have been removed, taking document similarity into account; and a representative document extraction unit that extracts representative documents from the clusters formed by the clustering processing unit.
Because of the processing of the isolated document removal unit, an isolated document, that is, a document for which fewer than the predetermined number of documents have a similarity higher than the predetermined threshold, is never selected as a representative document.
Preferably, the representative document extraction unit extracts the document whose document vector has the smallest cosine distance to the cluster center point.

Preferably, the isolated document removal unit judges a document to be isolated, and excludes it, when the number of similar documents whose document vectors lie closer to it than a cosine distance threshold d is less than a threshold n.

Preferably, the clustering processing unit performs clustering that takes document similarity into account by clustering the document vectors with a method for clustering real-valued vectors, and
initializes each cluster center point to a vector randomly drawn from the document vectors calculated by the document data preprocessing unit, where, in the random draw, the probability that a document vector is drawn is proportional to the α-th power of the minimum cosine distance between that document vector and the already initialized cluster center points, and to the β-th power of the number of document vectors whose cosine distance to that document vector is less than the threshold d.

Preferably, the device further comprises an evaluation index calculation unit that, taking n and d used by the isolated document removal unit, α and β used by the clustering processing unit, and the number of clusters k as a parameter set, calculates a first index, which is the proportion of documents that are similar to a representative document, and a second index, which is an evaluation index of clustering accuracy measured against clustering ground-truth data; and a scatter plot display unit that plots and displays the pairs of the first index and the second index. The parameters are then reset based on a plot point selected from those displayed by the scatter plot display unit, either automatically according to a predetermined criterion or manually at the user's discretion.

According to another aspect of the present invention, there is provided a computer-implemented document analysis method comprising: a document data preprocessing step of vectorizing documents by distributed representation and calculating document vectors; an isolated document removal step of removing, as isolated documents, those preprocessed documents for which fewer than a predetermined number of documents have a similarity higher than a predetermined threshold; a clustering processing step of clustering the documents remaining after the isolated documents have been removed, taking document similarity into account; and a representative document extraction step of extracting representative documents from the clusters formed in the clustering processing step.

According to the present invention, the work of grasping the contents of a large number of documents can be made more efficient.

FIG. 1 is a diagram showing an example of a method for grasping the contents of documents.
FIG. 2 is a diagram showing an example of a document clustering technique.
FIG. 3 is a functional block diagram showing a configuration example of a document analysis system and a document analysis device applicable to the document analysis technique according to the first embodiment of the present invention.
FIG. 4 is a flowchart showing an example of the processing flow of the document analysis technique according to this embodiment.
FIG. 5 is a flowchart showing a detailed example of the document data preprocessing in FIG. 4.
FIG. 6(a) is a diagram showing representative document extraction when isolated documents are not removed. FIG. 6(b) is a diagram showing representative document extraction when isolated documents are removed.
FIG. 7 is a flowchart showing a detailed example of the processing in step S2.
FIG. 8(a) is a diagram showing arbitrary extraction of a representative document, and FIG. 8(b) is a diagram showing automatic extraction in which the document with the smallest cosine distance to the cluster center point is taken as the representative document.
FIG. 9 is a flowchart showing the details of the processing in step S4.
FIG. 10 is a functional block diagram showing a configuration example of a document analysis device according to the second embodiment of the present invention, and corresponds to FIG. 3.
FIG. 11 is a flowchart showing the flow of system processing according to the second embodiment of the present invention.
FIG. 12 is a diagram showing an example of a scatter plot illustrating a parameter setting example.

In this specification, a distributed representation of a document is, for example, the document's content vectorized as a numeric vector of real numbers.
Document similarity is, for example, the cosine distance between document vectors.
An isolated document is a document for which fewer than a predetermined number of documents have a similarity higher than a predetermined threshold.
A representative document is, for example, a document that has many similar documents among the original documents.
The document analysis technique according to embodiments of the present invention is described in detail below with reference to the drawings.
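
The definitions above use cosine distance as the similarity measure but do not spell out a formula. The following minimal sketch assumes the common convention of 1 minus the cosine similarity, so that a smaller value means more similar documents. The function name `cosine_distance` is introduced here for illustration only.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cos(u, v) for two equal-length real-valued vectors.

    Assumed convention (not stated in the specification):
    0 means same direction, 1 means orthogonal, 2 means opposite.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)
```

Under this convention, "closer than the threshold d" reads as `cosine_distance(u, v) < d`.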

(First embodiment)
FIG. 3 is a functional block diagram showing a configuration example of a document analysis system Y and a document analysis device X applicable to the document analysis technique according to the first embodiment of the present invention. FIG. 4 is a flowchart showing an example of the processing flow of the document analysis technique according to this embodiment. Tables 1 to 10 show examples of the data tables used for document analysis.

Table 1 lists the data tables of the document analysis system. A is document-related data consisting of (1) document data and (2) document vectors. B is preprocessing-related data consisting of (1) boilerplate rules, (2) named entity rules, and (3) a distributed representation model. C is clustering-related data consisting of (1) document clusters and (2) representative documents. D is parameter-tuning-related data consisting of (1) parameter sets.
Tables 2 to 10 below each show an example of one of the data tables included in the list of Table 1.

Table 2 shows an example of A1, the document data table. The document data table A1 stores, for each ID, the original text data of a document such as a question or an answer.

Table 3 shows an example of A2, the document vector table. The document vector table A2 stores the document vector element values v1, v2, ... for each document ID.
This is document vector data calculated from the distributed representation model B3 described later, and the document ID corresponds to the ID in the document data table A1. There is one column of vector element values v1, v2, ... per vector dimension.

Table 4 shows an example of B1, the boilerplate rule table. The boilerplate rule table B1 is a list of boilerplate sentences; any sentence in a document that matches or resembles an entry is excluded.
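
As a rough illustration of how a rule table like B1 might be applied, the sketch below drops sentences that match or closely resemble a boilerplate entry. The specification does not say how "similar" is judged; the use of `difflib.SequenceMatcher`, the `min_ratio` cut-off, and the function name are all assumptions made for this example.

```python
from difflib import SequenceMatcher


def remove_boilerplate_sentences(sentences, boilerplate_rules, min_ratio=0.9):
    """Keep only sentences that do not match or resemble a boilerplate entry.

    min_ratio is a hypothetical similarity cut-off (1.0 means identical).
    """
    kept = []
    for sentence in sentences:
        is_boilerplate = any(
            SequenceMatcher(None, sentence, rule).ratio() >= min_ratio
            for rule in boilerplate_rules
        )
        if not is_boilerplate:
            kept.append(sentence)
    return kept
```

A character-level ratio is only one possible choice; a deployment could equally compare sentence vectors.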

Table 5 shows an example of B2, the named entity rule table. The named entity rule table B2 replaces each part of a sentence that matches a regular expression with, for example, "(label)". For example, "The construction date is 2019/1/23." becomes "The construction date is (date)." The "(" and ")" around the label are only an example; they are symbols that, in the sentence after replacement, distinguish a label word such as "date" from words that appear in the original text.
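
The replacement described for table B2 can be sketched with Python's `re` module. The rule list and the date pattern below are hypothetical stand-ins for the contents of Table 5, which are not reproduced in this text.

```python
import re

# Hypothetical rule table: (compiled regular expression, label) pairs.
NAMED_ENTITY_RULES = [
    (re.compile(r"\d{4}/\d{1,2}/\d{1,2}"), "date"),
]


def apply_named_entity_rules(sentence, rules=NAMED_ENTITY_RULES):
    """Replace every match of each rule's pattern with "(label)"."""
    for pattern, label in rules:
        sentence = pattern.sub(f"({label})", sentence)
    return sentence
```

With the example from the text, `apply_named_entity_rules("The construction date is 2019/1/23.")` yields `"The construction date is (date)."`.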

Table 6 shows an example of B3, the distributed representation model table. The distributed representation model table B3 holds word vector data created from a corpus (such as Wikipedia), with one column of word vector element values per vector dimension.

Table 7 shows an example of C1, the document cluster table. The document cluster table C1 associates the clusters formed by clustering with the documents that belong to them. The document ID corresponds to the ID of the document data in the document data table A1.

Table 8 shows an example of C2, the representative document table. The representative document table C2 manages the representative document of each cluster selected by the representative document extraction process. The representative document ID corresponds to the ID of the document data in A1.

Table 9 shows an example of D1, the parameter set table. The parameter set table D1 is a list of values (n, α, β, d, k) that can be set for the various parameters of clustering, representative document extraction, and so on. An "in use" column indicates which parameter set is currently in use.

Table 10 shows an example of D2, the ground-truth data table. The ground-truth data table D2 associates human-created ground-truth clusters with the documents that belong to them (document IDs). The document ID corresponds to the ID of the document data in the document data table A1.

As shown in FIG. 3, the document analysis device X according to this embodiment has, for example, a document analysis processing unit X1 and a database (storage device) DB. The document analysis system Y comprises, for example, the document analysis device X and a terminal device (user terminal) Z connected to the document analysis device X via a network NT. The network NT may be wired or wireless.
The document analysis processing unit X1 has a document data management unit 1 that manages the document data DB1 and other data in the database (storage device) DB, a preprocessing-related function unit 3 that performs processing related to preprocessing of the document data, a clustering processing unit 5, a representative document extraction unit 7, and a representative document content display unit 11. The preprocessing-related function unit 3 has an isolated document removal unit 3a.

The database (storage device) DB has a document data DB1 that stores the document data table A1, a document vector DB2 that stores the document vector table A2, a document cluster DB3 that stores the document cluster table C1, a representative document DB4 that stores the representative document table C2, a preprocessing-related data DB5 that stores the boilerplate rule table B1, the named entity rule table B2, and the distributed representation model table B3, and a parameter set DB6 that stores the parameter set table D1.

Next, the flow of document analysis processing by the document analysis processing unit X1 is described. As shown in FIG. 4, when processing starts (START), in step S1 the preprocessing-related function unit 3 obtains the documents of the document data table A1 stored in the document data DB1 and the preprocessing-related data (boilerplate rule table B1, named entity rule table B2, and distributed representation model table B3) stored in the preprocessing-related data DB5, and preprocesses the document data. The preprocessing includes, among other things, calculating the document vectors A2 based on the distributed representation model. The document vectors A2 are stored in the document vector DB2.
Next, in step S2, the isolated document removal unit 3a removes isolated documents, that is, documents for which fewer than a predetermined number of documents have a similarity higher than a predetermined threshold. The removal process includes, for example, judging a document to be isolated and excluding it when few other documents have document vectors at a small cosine distance from it. The document vectors remaining after the isolated documents are removed are stored in the document vector DB2.

Next, in step S3, the clustering processing unit 5 performs clustering based on the parameter set D1 stored in the parameter set DB6 and the document vectors A2 stored in the document vector DB2. The clustering takes the cosine distance between document vectors into account and forms clusters from groups of documents that are close in cosine distance. The correspondence between the formed clusters and the documents that belong to them is stored in the document cluster DB3.
Next, in step S4, the representative document extraction unit 7 extracts representative documents from the document clusters in the document cluster table C1, based on the parameter set D1 stored in the parameter set DB6. For a given cluster, representative document extraction selects the document with the smallest cosine distance to the cluster center point. The extracted representative documents are stored in the representative document DB4 as data in the representative document table C2. Representative documents are extracted per cluster in order to avoid overlapping content among the representative documents.

Next, in step S5, the representative document content display (control) unit 11 displays a summary of the representative document contents using the representative document table C2 stored in the representative document DB4.
This completes the processing (END).

Next, each of the above processes is described in detail.
FIG. 5 is a flowchart showing a detailed example of the document data preprocessing in FIG. 4.
When processing starts (START), the following steps are performed.
Step S11: Split the document into sentences
Step S12: Morphological analysis of the sentences
Step S13: Remove symbols from the sentences
Step S14: Remove boilerplate sentences from the document
Step S15: Extract named entities from the sentences
Step S16: Calculate the distributed representation of the document
Step S17: Summarize the document content

Known techniques can be used for the above steps. As an example, steps S13, S14, and S17 can use the techniques of the unnecessary word removal processing unit, the unnecessary sentence removal processing unit, and the summary generation unit, respectively, of Japanese Patent Application No. 2018-162525. Step S15 can use a known named entity extraction technique such as a Hidden Markov Model or a Conditional Random Field. Step S16 can use the technique for calculating distributed representations of documents known as Doc2Vec or Paragraph Vector (Quoc Le, Tomas Mikolov, "Distributed representations of sentences and documents," International Conference on Machine Learning, 2014).
Next, the isolated document removal process is described. Removing isolated documents, that is, documents for which fewer than a predetermined number of documents have a similarity higher than a predetermined threshold, means, for example, judging a document to be isolated and excluding it when the number of documents whose cosine distance from it is less than a predetermined threshold d is less than a predetermined number n.
The cosine distance threshold d used to judge isolated documents is the same as the parameter d used in the clustering process.
The parameter d is the threshold that serves as the criterion for judging whether documents are similar. Isolated document removal aims to judge as "isolated" those documents that have few similar documents in the whole document set, so the parameter d, which is the criterion for document similarity, is used as the threshold.
The threshold on the number of documents is called "threshold n" because it is referred to as a parameter to be tuned in the second embodiment below.

FIG. 6(a) is a diagram showing representative document extraction when isolated documents are not removed. The number of clusters is fixed at four.
In clustering, every document is assigned to some cluster. If clustering is performed without removing isolated documents, the isolated documents are therefore also assigned to clusters. A cluster containing an isolated document consists of the isolated document and the documents similar to it. For example, when the predetermined number n is 1, fewer than one document lies closer to the isolated document than the cosine distance threshold d, so no document similar to the isolated document exists. The cluster therefore contains no documents similar to the isolated document and consists of the isolated document alone. As a result, extracting a representative document from the clusters can select an isolated document.

FIG. 6(b) is a diagram showing representative document extraction when isolated documents are removed. As shown in FIG. 6(b), removing isolated documents from the clustering targets in advance ensures that an isolated document is never selected as a representative document.

FIG. 7 is a flowchart showing a detailed example of the processing in step S2. As shown in FIG. 7, first, in step S2-1, one unprocessed document is identified from among the documents registered in the document data DB1. In step S2-2, the number of documents whose cosine distance from the identified document is less than the predetermined threshold d is counted. In step S2-3, the number of documents counted in step S2-2 is compared with the predetermined number n. The predetermined number n can be set in advance according to whether a document should be regarded as isolated.

In step S2-3, it is determined whether the counted number of documents is smaller than the predetermined number n. If No in step S2-3, the process proceeds to step S2-4, where the document is not excluded and is stored in the document vector DB2. The process then proceeds to step S2-6.

If Yes in step S2-3, the process proceeds to step S2-5, where the document is excluded as an isolated document. The process then proceeds to step S2-6, where it is determined whether the current document count n (the count of documents processed so far; this loop counter reuses the symbol n and is distinct from the predetermined number n compared in step S2-3) equals the total number of documents m registered in the document data DB1. If Yes in step S2-6, the processing ends (END). If No in step S2-6, the process proceeds to step S2-7, sets n = n + 1, and returns to step S2-1.

Repeating the above processing excludes the isolated documents.
Performing isolated document removal before clustering in this way prevents an isolated document from being selected as a representative document.
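
The loop of step S2 can be sketched as follows. This is a minimal illustration under two assumptions: cosine distance is taken as 1 minus the cosine similarity, and a document is not counted among its own similar documents (consistent with the n = 1 example above, where an isolated document has no similar document at all). The names `cosine_distance` and `remove_isolated_documents` are hypothetical.

```python
import math


def cosine_distance(u, v):
    """Assumed definition: 1 minus the cosine similarity of u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def remove_isolated_documents(vectors, d, n):
    """Return the documents that are not isolated.

    vectors: dict mapping document ID to document vector.
    A document is kept when at least n other documents lie closer
    to it than the cosine distance threshold d (steps S2-2 to S2-5).
    """
    kept = {}
    for doc_id, vec in vectors.items():
        similar_count = sum(
            1
            for other_id, other in vectors.items()
            if other_id != doc_id and cosine_distance(vec, other) < d
        )
        if similar_count >= n:
            kept[doc_id] = vec
    return kept
```

The pairwise comparison is quadratic in the number of documents; for tens of thousands of documents an approximate nearest-neighbour index may be preferable, which is outside what the specification describes.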

Next, the clustering processing in step S3 will be described.
In this embodiment, an improved version of the k-means++ algorithm can be used.
The improved algorithm modifies conventional k-means++ in the following respects.
1) It uses the spherical k-means method, which takes the cosine distance between document vectors into account (see Kurt Hornik, Ingo Feinerer, Martin Kober, Christian Buchta, "Spherical k-Means Clustering," Journal of Statistical Software, September 2012, Volume 50, Issue 10).
2) The initial values of the cluster center points are drawn at random from the document vectors with the probability given by Equation 1.
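Equation 1 itself is not reproduced in this text. Going by the wording of the claims (a probability proportional to the α-th power of the minimum cosine distance to the already initialized cluster center points and to the β-th power of the number of document vectors within cosine distance d), one plausible reconstruction is the following. The symbols x_i, C, D, N_d and dist_cos are notation assumed here, and counting x_i itself in N_d is also an assumption.

```latex
P(x_i) \;=\; \frac{D(x_i)^{\alpha}\, N_d(x_i)^{\beta}}
                  {\sum_{j} D(x_j)^{\alpha}\, N_d(x_j)^{\beta}},
\qquad
D(x_i) \;=\; \min_{c \in C} \operatorname{dist}_{\cos}(x_i, c),
\qquad
N_d(x_i) \;=\; \bigl|\{\, x_j : \operatorname{dist}_{\cos}(x_i, x_j) \le d \,\}\bigr|
```

Here C denotes the set of cluster center points chosen so far.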

The parameters are made variable for the following purposes.
α: to improve clustering accuracy and convergence speed
β: to make the initial values of the cluster center points more likely to be chosen from regions where document vectors are concentrated
d: to adjust the cosine-distance threshold below which document vectors are judged to be similar
When α = 2 and β = 0, the algorithm using Equation 1 corresponds to conventional k-means++.
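A minimal sketch of center initialization with these parameters is given below. It assumes the selection weight D(x)^α · N_d(x)^β suggested by the claim wording, a uniformly random first center (the text does not say how the first center is chosen), a density count that includes the vector itself, and α > 0. The function and argument names are illustrative only.

```python
import math
import random

def cosine_distance(u, v):
    """Cosine distance 1 - cos(u, v) between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def seed_centers(vectors, k, alpha=2.0, beta=0.0, d=0.1, rng=None):
    """Return the indices of k document vectors chosen as initial centers."""
    rng = rng or random.Random()
    # N_d(x): number of vectors within cosine distance d of x (x included).
    density = [
        sum(1 for v in vectors if cosine_distance(u, v) <= d) for u in vectors
    ]
    # Assumption: the first center is drawn uniformly, as in k-means++.
    centers = [rng.randrange(len(vectors))]
    while len(centers) < k:
        weights = []
        for i, x in enumerate(vectors):
            nearest = min(cosine_distance(x, vectors[c]) for c in centers)
            nearest = max(nearest, 0.0)  # guard against negative rounding error
            weights.append(nearest ** alpha * density[i] ** beta)
        # random.choices raises ValueError if every weight is zero,
        # e.g. when k exceeds the number of distinct vectors.
        centers.append(rng.choices(range(len(vectors)), weights=weights, k=1)[0])
    return centers
```

With the defaults alpha=2 and beta=0 the weights reduce to the squared-distance weighting of k-means++; raising beta shifts the draw toward dense regions.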

Next, the automatic extraction of representative documents in step S4 will be described.
FIG. 8(a) is a diagram showing arbitrary extraction of a representative document, and FIG. 8(b) is a diagram showing automatic extraction in which the document whose document vector has the smallest cosine distance to the cluster center point is taken as the representative document.
As shown in FIG. 8(a), with arbitrary extraction the document vector of the representative document may lie far from the cluster center. The problem is that few documents in the cluster then have document vectors at a small cosine distance from that of the representative document.

FIG. 9 is a flowchart showing the details of the processing in step S4.
As shown in FIG. 9, in step S4-1 of step S4, the cluster center point is obtained. In step S4-2, the cosine distances between the cluster center point and the document vector of each document in the cluster are compared. In step S4-3, the document whose document vector has the smallest cosine distance to the cluster center point is taken as the representative document. The process then ends (END).
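Steps S4-1 to S4-3 amount to an arg-min over the cluster members. The sketch below assumes the cluster center point is already available from the clustering step and is passed in as `center`; the function names are hypothetical.

```python
import math

def cosine_distance(u, v):
    """Cosine distance 1 - cos(u, v) between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def representative_index(cluster_vectors, center):
    """Index of the cluster member closest to the center (steps S4-2, S4-3)."""
    return min(
        range(len(cluster_vectors)),
        key=lambda i: cosine_distance(cluster_vectors[i], center),
    )
```

Ties are resolved in favour of the first member encountered, which is a property of `min` rather than something the text specifies.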

With the above processing, as shown in FIG. 8(b), representative documents are extracted based on the distance from the cluster center point, so the document whose document vector is closest to the cluster center point is selected as the representative. In the k-means method, clusters are formed so as to minimize the distance between the cluster center point and the vectors belonging to the cluster. Consequently, a cluster contains many documents whose document vectors are at a small cosine distance from the cluster center point. Taking the document whose document vector has the smallest cosine distance to the cluster center point as the representative document therefore has the advantage of increasing the number of documents in the cluster whose document vectors are at a small cosine distance from that of the representative document. As a result, the extracted representative document satisfies the desirable property of being a document with common content, one for which many similar documents exist in the original large document set.

As described above, according to this embodiment, representative documents are extracted from the clusters automatically, so there is no need to read all the documents in a cluster. The processing is therefore simplified.
Because the extracted representative documents have content that is common in the original large document set, reading them makes it possible to grasp the main content of the original documents.
This makes the work of grasping the content of a large number of documents more efficient.

(Second embodiment)
Next, a document analysis technique according to the second embodiment of the present invention will be described. The document analysis technique according to this embodiment adds a parameter tuning support function to the first embodiment.
FIG. 10 is a functional block diagram showing a configuration example of the document analysis apparatus according to this embodiment, and corresponds to FIG. 3. In addition to the components of the document analysis apparatus of FIG. 3, the document analysis apparatus X of FIG. 10 has an evaluation index calculation unit 15, a parameter setting unit 17, and a scatter diagram display unit 21. In addition to the databases of FIG. 3, it also has a correct data database DB7 that stores correct data D2 (see DBa and Table 10).

The evaluation index calculation unit 15 calculates the ratio of the number of documents similar to a representative document (first index) using the representative documents C2, and an evaluation index of clustering accuracy (second index) using the correct data D2. The parameter setting unit 17 automatically selects the plot that maximizes the weighted sum of the first index and the second index calculated by the evaluation index calculation unit 15. The parameters can also be selected at the user's discretion. The scatter diagram display unit 21 displays the scatter diagram described later.

FIG. 11 is a flowchart showing the flow of system processing according to this embodiment. Steps that perform the same processing as in FIG. 4 are given the same reference signs, and their description is omitted.

In FIG. 11, after step S1, it is determined in step S9 whether processing has been completed for all the parameter sets registered in the parameter set DB6. If it has not (No), the parameters with the next ID are acquired from the parameter set DB in step S10.
Next, in step S2a, the isolated document removal unit 3a removes isolated documents. This processing uses the thresholds n and d included in the acquired parameter set.

Next, in step S3a, the clustering processing unit 5 performs clustering. This processing uses the parameters n, α, β, d, and k of the acquired parameter set.
Next, in step S4, the representative document extraction unit 7 extracts the representative documents and registers them in the representative document DB4.

Next, in step S6, the evaluation index calculation unit 15 performs evaluation index calculation processing, which calculates the following two indices.
1) Ratio of the number of documents similar to a representative document (first index)
Among the documents registered in the document data DB1, this is the proportion of documents for which the cosine distance between the document's vector and the document vector of at least one representative document registered in the representative document DB4 is less than or equal to the threshold d.
2) Evaluation index of clustering accuracy (second index)
The document cluster table C1 is compared with the correct data D2 to calculate an evaluation index of clustering accuracy. For example, the Adjusted Rand Index or Adjusted Mutual Information can be used as the evaluation index.
The process then returns to step S9.
If Yes in step S9, on the other hand, the process proceeds to step S7, where the scatter diagram display unit 21 plots all the pairs of evaluation indices on a scatter diagram.
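The two indices of step S6 can be sketched in plain Python as follows. The first function follows the definition given for the first index. The second implements the standard Adjusted Rand Index as one of the two options named in the text. The function names, the list-based inputs, and the convention of returning 1.0 in the degenerate case are assumptions of this sketch.

```python
import math
from collections import Counter

def cosine_distance(u, v):
    """Cosine distance 1 - cos(u, v) between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def similar_ratio(vectors, representatives, d):
    """First index: share of documents within distance d of any representative."""
    hits = sum(
        1
        for v in vectors
        if any(cosine_distance(v, r) <= d for r in representatives)
    )
    return hits / len(vectors)

def adjusted_rand_index(labels_true, labels_pred):
    """Second index (one option): Adjusted Rand Index; needs at least 2 items."""
    n = len(labels_true)
    cells = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(math.comb(c, 2) for c in cells.values())
    sum_true = sum(math.comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(math.comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_true * sum_pred / math.comb(n, 2)
    maximum = (sum_true + sum_pred) / 2
    if maximum == expected:  # degenerate labelings, e.g. one cluster on both sides
        return 1.0
    return (sum_cells - expected) / (maximum - expected)
```

When correct data exists for only part of the collection, the two label lists would be restricted to the documents that have a correct label before the Adjusted Rand Index is computed.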

Next, in step S8, the parameters are set as follows.
1) The plot that maximizes the weighted sum of the first index and the second index is selected automatically.
2) Alternatively, any plot can be selected at the user's discretion.
Next, step S3 (clustering processing) and step S4 (representative document extraction processing) are performed as in the first embodiment.
This completes the parameter tuning support processing, in which the parameters are reset repeatedly.
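The loop of FIG. 11 over the registered parameter sets can be outlined as below. The two callables `run_pipeline` and `evaluate` are hypothetical placeholders for steps S2a, S3a, and S4 and for step S6 respectively. The sketch only shows how one scatter-plot point per parameter set is collected.

```python
def sweep(parameter_sets, run_pipeline, evaluate):
    """Evaluate every parameter set and collect scatter-plot points.

    run_pipeline(params) stands in for isolated-document removal,
    clustering, and representative extraction (steps S2a, S3a, S4).
    evaluate(result) stands in for step S6 and returns the pair
    (first_index, second_index).
    """
    points = []
    for params in parameter_sets:         # steps S9 and S10
        result = run_pipeline(params)     # steps S2a, S3a, S4
        first, second = evaluate(result)  # step S6
        points.append((params, first, second))
    return points                         # plotted in step S7
```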

In the second embodiment, a parameter tuning support function is added to the first embodiment.
For example, the effect of the parameter settings is visualized with a scatter diagram that uses the similar-document ratio (first index) and the clustering evaluation index (second index).

FIG. 12 shows an example of a scatter diagram illustrating parameter settings. As shown in FIG. 12, the scatter diagram plots, for example, the similar-document ratio (%) (first index) on the horizontal axis and the clustering evaluation index (second index) on the vertical axis. The indices for each plot are calculated with a different parameter set (a combination of values of n, α, β, d, and k, the number of clusters).

Because it is not possible to cover every possible combination of values, the combinations of parameter values to be plotted are assumed to be registered in the parameter set DB in advance. As shown in the table configuration of the parameter set DB (DB10) in Table 9, each record, identified by one ID, holds one combination of parameter values, and the table columns are the parameter names (n, α, β, d, k). An "in use" column is also added so that it is clear which parameters are in use.
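One record of the parameter set table could be modelled as below. The field types, the class name, and the rule that exactly one record is flagged as in use are assumptions of this sketch. The text specifies only the columns ID, n, α, β, d, k and an "in use" flag.

```python
from dataclasses import dataclass

@dataclass
class ParameterSet:
    id: int               # record ID
    n: int                # minimum number of similar documents
    alpha: float          # exponent on the distance term
    beta: float           # exponent on the density term
    d: float              # cosine-distance threshold
    k: int                # number of clusters
    in_use: bool = False  # "in use" column

def mark_in_use(records, chosen_id):
    """Flag the chosen record as in use and clear the flag on all others."""
    for record in records:
        record.in_use = record.id == chosen_id
    return records
```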

The correct data for clustering is assumed to be registered in advance in the "correct data (D2) DB". The table configuration of the correct data (D2) DB is shown in Table 10 above. The document ID corresponds to the ID of the document data in the document data table A1. The documents registered in the correct data D2 need not be all of the documents registered in the document data table A1; a manually extracted subset of the document data can be used.

In this way, optimization of the parameters of the processing algorithm is supported as follows.
1) Automatic selection of a plot based on a predetermined criterion
2) Manual selection of a plot at the user's discretion

The dashed line L11 in FIG. 12 is a straight line representing the criterion of maximizing the weighted sum of the similar-document ratio (first index) and the clustering evaluation index (second index). The weights of the first and second indices used in the weighted sum are positive real values and are assumed to be registered in advance. By changing the weights, the user can adjust which of the two indices is emphasized. The ratio of the weights represents the slope of the straight line L11. The plot P1 on this straight line L11 can be selected automatically.
As indicated by P2, another plot on the scatter diagram can also be selected at the user's discretion.
The above configuration reduces the labor of parameter adjustment.
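Automatic selection by the weighted-sum criterion can be sketched as follows. The tuple layout `(params, first_index, second_index)` and the names `w1` and `w2` are assumptions of this sketch, and both indices are assumed to be expressed on comparable scales (for example, the first index as a fraction rather than a percentage).

```python
def select_plot(points, w1=1.0, w2=1.0):
    """Return the point that maximizes w1 * first_index + w2 * second_index.

    points: iterable of (params, first_index, second_index) tuples.
    w1, w2: positive weights for the first and second index.
    """
    if w1 <= 0 or w2 <= 0:
        raise ValueError("weights must be positive")
    return max(points, key=lambda p: w1 * p[1] + w2 * p[2])
```

Raising `w1` relative to `w2` favours parameter sets whose representatives cover more documents, and raising `w2` favours agreement with the correct data.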

Examples of applications of the document analysis technique include the following.
1) FAQ (representative question) creation
Similar questions in a set of questions are clustered and classified.
A representative question is extracted from each class, and a summary is written.
2) Failure information analysis
Similar reports in a set of maintenance and inspection reports are clustered and classified.
A representative report is extracted from each class, and reading it reveals the nature of common failures.

According to this embodiment, the extraction of representative documents from document clusters can be made more efficient and more accurate. High efficiency is achieved by automating the representative document extraction processing.
High accuracy is achieved by maximizing the number of documents similar to the representative document.
In addition, optimization of the parameters of the processing algorithm can be supported, which reduces the labor of parameter adjustment.

The processing and control can be realized by software processing on a CPU (Central Processing Unit) or GPU (Graphics Processing Unit), or by hardware processing on an ASIC (Application Specific Integrated Circuit) or FPGA (Field Programmable Gate Array).
The configurations and the like illustrated in the above embodiments are not limiting and can be changed as appropriate within the range in which the effects of the present invention are obtained. Other modifications can also be made as appropriate without departing from the scope of the object of the present invention.

Each component of the present invention can be selected or omitted as desired, and an invention having the selected configuration is also included in the present invention.
A program for realizing the functions described in this embodiment may be recorded on a computer-readable recording medium, and the processing of each unit may be performed by having a computer system read and execute the program recorded on that medium. The "computer system" here includes an OS and hardware such as peripheral devices.
If a WWW system is used, the "computer system" also includes the home page providing environment (or display environment).

The "computer-readable recording medium" refers to portable media such as flexible disks, magneto-optical disks, ROMs, and CD-ROMs, and to storage devices such as hard disks built into computer systems. It also includes media that hold a program dynamically for a short time, such as a communication line used when a program is transmitted over a network such as the Internet or over a communication line such as a telephone line, and media that hold a program for a certain period, such as volatile memory inside the computer system that serves as the server or client in that case. The program may realize only part of the functions described above, or may realize those functions in combination with a program already recorded in the computer system. At least part of the functions may be implemented in hardware such as an integrated circuit.

The present invention can be used in document analysis apparatuses.

X document analysis apparatus
X1 document analysis processing unit
Y document analysis system
NT network
DB database (storage device)
1 document data management unit
3 preprocessing-related function unit
3a isolated document removal unit
5 clustering processing unit
7 representative document extraction unit
11 representative document content display unit
15 evaluation index calculation unit
17 parameter setting unit
21 scatter diagram display unit

Claims

1. A document analysis apparatus comprising:
a document data preprocessing unit that vectorizes documents by distributed representation and calculates document vectors;
an isolated document removal unit that removes, as an isolated document, any document preprocessed by the document data preprocessing unit for which the number of documents whose similarity to it is higher than a predetermined threshold is less than a predetermined number;
a clustering processing unit that clusters, taking document similarity into account, the documents from which isolated documents have been removed by the isolated document removal unit; and
a representative document extraction unit that extracts a representative document from the clusters formed by the clustering processing unit,
wherein the isolated document removal unit determines that a document is isolated, and excludes it, when the number of similar documents, namely documents whose document vectors are closer to its document vector than a threshold d in cosine distance, is less than a threshold n,
wherein the clustering processing unit performs the clustering that takes document similarity into account by clustering the document vectors with a method for clustering real-valued vectors, and initializes the cluster center points with vectors randomly drawn from the document vectors calculated by the document data preprocessing unit, and
wherein, in the random drawing, the probability that a document vector is drawn is proportional to the α-th power of the minimum cosine distance between that document vector and the already initialized cluster center points, and to the β-th power of the number of document vectors whose cosine distance to that document vector is closer than the threshold d.

2. The document analysis apparatus according to claim 1, further comprising:
an evaluation index calculation unit that, taking as a parameter set the values n and d used in the processing of the isolated document removal unit, the values α and β used in the processing of the clustering processing unit, and the number of clusters k, calculates a first index, which is the ratio of the number of documents similar to the representative document, and a second index, which is an evaluation index of clustering accuracy obtained by comparison with correct data for the clustering; and
a scatter diagram display unit that plots and displays pairs of the first index and the second index,
wherein the parameters are reset based on a plot selected from the plots displayed by the scatter diagram display unit, either automatically according to a predetermined criterion or manually at the user's discretion.

3. The document analysis apparatus according to claim 1, wherein the representative document extraction unit extracts the document whose document vector has the smallest cosine distance to the cluster center point.

4. A computer-based document analysis method comprising:
a document data preprocessing step of vectorizing documents by distributed representation and calculating document vectors;
an isolated document removal step of removing, as an isolated document, any document preprocessed in the document data preprocessing step for which the number of documents whose similarity to it is higher than a predetermined threshold is less than a predetermined number;
a clustering processing step of clustering, taking document similarity into account, the documents from which isolated documents have been removed in the isolated document removal step; and
a representative document extraction step of extracting a representative document from the clusters formed in the clustering processing step,
wherein, in the isolated document removal step, a document is determined to be isolated, and is excluded, when the number of similar documents, namely documents whose document vectors are closer to its document vector than a threshold d in cosine distance, is less than a threshold n,
wherein, in the clustering processing step, the clustering that takes document similarity into account is performed by clustering the document vectors with a method for clustering real-valued vectors, and the cluster center points are initialized with vectors randomly drawn from the document vectors calculated in the document data preprocessing step, and
wherein, in the random drawing, the probability that a document vector is drawn is proportional to the α-th power of the minimum cosine distance between that document vector and the already initialized cluster center points, and to the β-th power of the number of document vectors whose cosine distance to that document vector is closer than the threshold d.