JP6885211B2

JP6885211B2 - Information analyzer, information analysis method and information analysis program

Info

Publication number: JP6885211B2
Application number: JP2017119773A
Authority: JP
Inventors: 倉科　守; 添田　武志
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2017-06-19
Filing date: 2017-06-19
Publication date: 2021-06-09
Anticipated expiration: 2037-06-19
Also published as: JP2019003553A

Description

The present disclosure relates to an information analyzer, an information analysis method, and an information analysis program.

Techniques for performing clustering in text analysis have been disclosed (see, for example, Patent Documents 1 and 2). In text analysis, for example, the text information to be analyzed is divided into elements such as name and cause, morphological analysis and feature word extraction are performed on each element, and elements in the text whose feature words have similar content are then classified into the same cluster.

Patent Document 1: Japanese Unexamined Patent Publication No. 2013-065097
Patent Document 2: International Publication No. 2016/147219

When clustering is performed, a cluster number is assigned to each cluster for convenience. However, neither the order nor the magnitude of these numbers carries any meaning, and clusters with adjacent numbers have no semantic correlation. Consequently, even when a diagram is created and visualized by path analysis based on this cluster information, the paths become complicated and the diagram is hard to grasp intuitively. Additional measures, such as limiting the analysis target by filtering, therefore become necessary.

In one aspect, an object of the present disclosure is to provide an information analyzer, an information analysis method, and an information analysis program that make it possible to impart semantic correlation between neighboring clusters.

In one aspect, an information analyzer includes: a target variable extraction unit that extracts, from the same text sample, a plurality of factors indicating feature words for a plurality of elements, and assigns a target variable consisting of an ordinal scale to one or more of the plurality of elements; a cluster classification unit that performs cluster classification by performing multidimensional correspondence analysis on the plurality of elements using the target variable; and a sorting unit that assigns a temporary identifier to each cluster formed by the cluster classification unit after the multidimensional correspondence analysis, calculates, for each element other than the element to which the target variable is assigned, the average value of the ordinal scale over all text samples for each temporary identifier, sorts the identifiers of the clusters in ascending order of the average value, and assigns new identifier numbers.

Semantic correlation can be imparted between neighboring clusters.

FIG. 1: (a) and (b) are diagrams illustrating path analysis.
FIG. 2: (a) is a block diagram for explaining the hardware configuration of the information analyzer according to the first embodiment, and (b) is a functional block diagram of each unit.
FIG. 3: A flowchart showing an example of the information analysis processing performed by the information analyzer.
FIG. 4: A diagram illustrating the cluster classification result for "cause" when "stop time" is used as the target variable.
FIG. 5: (a) to (c) are diagrams illustrating the process of calculating the main cluster number from the temporary cluster number.
FIG. 6: (a) lists the "temporary cluster number", "target variable average value", and "main cluster number" for "name"; (b) lists the same items for "cause"; (c) lists the same items for "countermeasure".
FIG. 7: A diagram illustrating the created diagram.
FIG. 8: A diagram obtained by performing cluster classification without using "stop time" as the target variable (comparative example).
FIG. 9: (a) is a diagram illustrating the first index, and (b) is a diagram illustrating the second index.
FIG. 10: (a) lists the "temporary cluster number", "target variable average value", and "main cluster number" for "name"; (b) lists the same items for "cause"; (c) lists the same items for "countermeasure".
FIG. 11: A diagram illustrating the created diagram.

Prior to the description of the embodiment, an outline of path analysis is given. FIGS. 1(a) and 1(b) are diagrams illustrating path analysis. In path analysis, a plurality of factors, each representing feature words, are first extracted for each element from a single text sample containing text information. The same processing is then performed on all text samples. After factors have been extracted for all elements of all text samples, text clustering classifies, element by element, the factors extracted from the text samples into a plurality of clusters. Specifically, for element A, the factors of the text information are classified into cluster A1, cluster A2, cluster A3, and so on. These cluster numbers are assigned merely for convenience, but the factors assigned to any one cluster consist of mutually similar feature word information. In each cluster of FIGS. 1(a) and 1(b), the numbers of the samples in which the corresponding factor appears are listed. For example, text sample No. 1 is classified into cluster A1 of element A.

First, for each element, the clusters are arranged in ascending order of cluster number. For example, for element A, the clusters are arranged in the order cluster A1, cluster A2, cluster A3, and so on. Next, for each sample, a diagram is created by connecting the factor of element A, the factor of element B, the factor of element C, and so on with a line (path). The connections between the elements contained in each text sample can thereby be visualized, with the cluster numbers serving as coordinates.

However, in the order in which the classified clusters are arranged (cluster A1, cluster A2, cluster A3, and so on), adjacent clusters have no correlation with each other. In this case, as illustrated in FIG. 1(a), the paths in the created diagram become tangled. For example, sample No. 1 runs from cluster A1 of element A through cluster B2 of element B to cluster C4 of element C. Sample No. x runs from cluster A2 of element A through cluster B4 of element B to cluster C1 of element C. When the classified clusters have no correlation in this way, a path connecting the elements (horizontal direction) moves a large distance within each element (vertical direction). The resulting diagram is hard to grasp intuitively. Obtaining useful findings from the path analysis results therefore requires additional data processing, such as data filtering.

In contrast, FIG. 1(b) is an example in which an ordinal scale is assigned to each cluster and the clusters are rearranged accordingly. In this example, adjacent clusters are correlated with each other, so when the factors of each sample are connected by a path, the amount of movement within each element (vertical direction) is small as the path passes between elements (horizontal direction). The resulting diagram is intuitive and easy to understand. The following embodiment describes an information analyzer, an information analysis method, and an information analysis program that can produce intuitive path analysis results by making it possible to impart semantic correlation between neighboring clusters.

FIG. 2(a) is a block diagram for explaining the hardware configuration of the information analyzer 100 according to the first embodiment. As illustrated in FIG. 2(a), the information analyzer 100 includes a CPU 101, a RAM 102, a storage device 103, an input device 104, a display device 105, and the like. These devices are connected by a bus or the like. The CPU (Central Processing Unit) 101 is a central processing unit and includes one or more cores. The RAM (Random Access Memory) 102 is a volatile memory that temporarily stores programs executed by the CPU 101, data processed by the CPU 101, and the like. The storage device 103 is a non-volatile storage device. As the storage device 103, for example, a ROM (Read Only Memory), a solid state drive (SSD) such as a flash memory, or a hard disk driven by a hard disk drive can be used. The input device 104 is a keyboard, a mouse, or the like. The display device 105 is a liquid crystal display, an electroluminescence panel, or the like, and displays the processing results of the information analyzer 100 and similar information.

FIG. 2(b) is a functional block diagram of the units realized when the CPU 101 executes a program stored in the storage device 103. As illustrated in FIG. 2(b), the information analyzer 100 functions as an analysis processing unit 10, a database unit 20, and the like. The analysis processing unit 10 includes a target variable extraction unit 11, a cluster classification unit 12, a sorting unit 13, a diagram creation unit 14, and the like. The database unit 20 includes a text information storage unit 21 and the like.

FIG. 3 is a flowchart showing an example of the information analysis processing performed by the information analyzer 100. The information analysis processing performed by the information analyzer 100 is described below with reference to FIG. 2(b) and FIG. 3.

First, the target variable extraction unit 11 reads a plurality of text samples containing text information from the text information storage unit 21 (step S1). In this embodiment, a production line report is described as an example of a sample containing text information. In a report, a factor is written as text for each element, as shown below. In the following example, "report number", "date and time of occurrence", "stop time", "name", "countermeasure", and "cause" are elements, while the "1" of "report number" and the "January 1, 2017" of "date and time of occurrence" are factors. Because each sample reports on a different event, each element contains a different factor from sample to sample. For example, a report reads as follows.
[Report number] 1
[Date and time of occurrence] January 1, 2017
[Stop time] 15 minutes
[Name] Part A
[Countermeasure] Replacement
[Cause] Disconnection
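
As an illustration of how a report in the bracketed layout above might be split into elements and factors, the following is a minimal sketch. The function name `parse_report` and the regular expression are assumptions made for this example; the patent does not describe how the text is parsed.

```python
import re

# Hypothetical pattern for one "[element] factor" line of a report.
FIELD_PATTERN = re.compile(r"^\[(?P<element>[^\]]+)\]\s*(?P<factor>.*)$")


def parse_report(text: str) -> dict[str, str]:
    """Return a mapping from element name to the factor text of one report."""
    record = {}
    for line in text.splitlines():
        match = FIELD_PATTERN.match(line.strip())
        if match:
            record[match.group("element")] = match.group("factor").strip()
    return record


SAMPLE = """\
[Report number] 1
[Date and time of occurrence] January 1, 2017
[Stop time] 15 minutes
[Name] Part A
[Countermeasure] Replacement
[Cause] Disconnection
"""

report = parse_report(SAMPLE)
print(report["Stop time"])  # 15 minutes
```

Each key of the returned dictionary corresponds to an element and each value to its factor, which is the unit the later steps operate on.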

Next, the target variable extraction unit 11 extracts a plurality of elements from each sample, replaces one of the elements with an ordinal scale, and sets it as the target variable (step S2). In this embodiment, "stop time", "name", "cause", and "countermeasure" are extracted as the plurality of elements, and "stop time" among them is set as the target variable. As an example, with stop time selected as the target variable, samples with a stop time of 30 min or less are classified into cluster 1, samples of 31 min to 60 min into cluster 2, samples of 61 min to 90 min into cluster 3, samples of 91 min to 120 min into cluster 4, and samples of 121 min or more into cluster 5 (step S3). Here, for example, the number of samples assigned to each cluster is made approximately uniform across the clusters. These cluster numbers are referred to as the cluster numbers of the target variable.
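
The binning of stop time into an ordinal scale in step S3 can be sketched as follows. The thresholds are the ones given in the text; the function name `stop_time_cluster` and the assumption that stop times are whole minutes are choices made for this example.

```python
# Upper bounds (in minutes) of clusters 1 to 4, as given in the embodiment.
UPPER_BOUNDS = [30, 60, 90, 120]


def stop_time_cluster(minutes: int) -> int:
    """Map a stop time in whole minutes to the target variable cluster number."""
    for cluster_number, upper in enumerate(UPPER_BOUNDS, start=1):
        if minutes <= upper:
            return cluster_number
    # Anything of 121 min or more falls into the last cluster.
    return len(UPPER_BOUNDS) + 1


print([stop_time_cluster(m) for m in (15, 30, 31, 120, 121)])
```

The sample report above, with a stop time of 15 minutes, would fall into cluster 1 under this mapping.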

Next, the cluster classification unit 12 performs cluster classification by carrying out multidimensional correspondence analysis in which the elements "name", "cause", and "countermeasure" are the text analysis targets and "stop time", which has an ordinal scale, is the target variable, and assigns a temporary cluster number to each factor (step S4). Here, a temporary cluster number is an identifier for identifying a formed cluster. In the cluster classification of step S4, the classification is based on the feature words contained in the factors of each element. In this embodiment, for example, the expectation-maximization (EM) algorithm is used for the multidimensional correspondence analysis in the text analysis.

FIG. 4 is a diagram illustrating the cluster classification result for "cause" when "stop time" is used as the target variable. As illustrated in FIG. 4, factors containing feature words such as "head failure" and "head inoperable" are classified into temporary cluster number 1, and factors containing feature words such as "fall" and "bolt" are classified into temporary cluster number 2. In this way, a temporary cluster number is assigned to each factor belonging to each element. Cluster classification results for "name" and "countermeasure" are obtained by following the same procedure.

Next, the sorting unit 13 associates each "temporary cluster number" with the target variable "cluster number" of the corresponding text sample and calculates, for each temporary cluster number, the average of the target variable cluster numbers over all samples (step S5). In the example of FIG. 5(a), which shows the cluster classification result for "name" with "stop time" as the target variable, target variable cluster 1 is associated with temporary cluster number 2 for sample 1, and cluster 3 is associated with temporary cluster number 9 for sample 3. FIG. 5(b) then illustrates the calculation of the average value for temporary cluster number 1 in the same classification result. Similarly, in the cluster classification result for "cause" with "stop time" as the target variable, cluster 1 is associated with temporary cluster number 2 for sample 1 and cluster 3 is associated with temporary cluster number 6 for sample 3, and the average value is calculated for each temporary cluster number.
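
The averaging in step S5 amounts to a group-by over temporary cluster numbers. The following is a minimal sketch; the function name `mean_target_by_temp_cluster` and the sample pairs are invented for illustration and are not taken from FIG. 5.

```python
from collections import defaultdict


def mean_target_by_temp_cluster(pairs):
    """Average the target variable cluster numbers per temporary cluster number.

    pairs: iterable of (temporary_cluster_number, target_cluster_number),
    one pair per text sample for a single element such as "name".
    """
    totals = defaultdict(lambda: [0, 0])  # temp number -> [sum, count]
    for temp_number, target_number in pairs:
        totals[temp_number][0] += target_number
        totals[temp_number][1] += 1
    return {temp: total / count for temp, (total, count) in totals.items()}


# Illustrative data only: five samples of one element.
pairs = [(2, 1), (9, 3), (2, 2), (9, 5), (1, 4)]
means = mean_target_by_temp_cluster(pairs)
print(means)
```

The same function would be applied separately to each element other than the target variable.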

Next, the cluster classification unit 12 assigns "main cluster numbers" such that a temporary cluster with a larger average target variable cluster number receives a larger main cluster number (step S6). FIG. 5(c) is a diagram illustrating the main cluster numbers in the cluster classification result for "name" with "stop time" as the target variable. FIG. 6(a) lists the "temporary cluster number", the "average value of the target variable cluster numbers", and the "main cluster number" for "name". FIG. 6(b) lists the same items for "cause". FIG. 6(c) lists the same items for "countermeasure".
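
Step S6 can be read as ranking the temporary clusters by their average and using the rank as the new number. The sketch below assumes that ties are broken by the temporary cluster number, which the source does not specify; the function name and the input values are likewise illustrative.

```python
def assign_main_cluster_numbers(mean_by_temp):
    """Map each temporary cluster number to a main cluster number.

    Temporary clusters are ranked in ascending order of their average
    target variable cluster number, so a larger average yields a larger
    main cluster number. Tie-breaking by temporary number is an assumption.
    """
    ordered = sorted(mean_by_temp, key=lambda temp: (mean_by_temp[temp], temp))
    return {temp: rank for rank, temp in enumerate(ordered, start=1)}


# Illustrative averages only.
main_numbers = assign_main_cluster_numbers({1: 2.8, 2: 1.5, 9: 4.0})
print(main_numbers)
```

After this renumbering, neighboring main cluster numbers correspond to similar average stop time levels, which is what gives adjacent clusters their semantic correlation.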

Next, the diagram creation unit 14 creates a path analysis result based on the main cluster numbers obtained for "name", "cause", and "countermeasure" (step S7). FIG. 7 is a diagram illustrating the created diagram. In the diagram, the factor of "name", the factor of "cause", the factor of "countermeasure", and the factor of "stop time" are connected by a line for each sample. "Stop time" is arranged in the order of the target variable cluster numbers. In FIG. 7, the more paths that follow the same route, the thicker the line appears. The created diagram is displayed on the display device 105, after which execution of the flowchart ends. As illustrated in FIG. 7, the amount of vertical movement of the paths between elements is small in the created diagram, which makes the diagram intuitive and easy to understand.
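
The data behind such a diagram can be sketched as a count of how many samples traverse each segment between adjacent elements, with the count standing in for line thickness. The column order, the function name `count_paths`, and the three samples are assumptions for illustration; the patent does not specify a drawing routine.

```python
from collections import Counter

# Assumed left-to-right column order of the diagram.
ELEMENT_ORDER = ["name", "cause", "countermeasure", "stop time"]


def count_paths(samples, element_order=ELEMENT_ORDER):
    """Count the samples passing through each segment between adjacent elements.

    samples: list of dicts mapping element name to cluster number.
    Returns a Counter keyed by (left element, left cluster,
    right element, right cluster).
    """
    segments = Counter()
    for sample in samples:
        for left, right in zip(element_order, element_order[1:]):
            segments[(left, sample[left], right, sample[right])] += 1
    return segments


# Illustrative samples only.
samples = [
    {"name": 1, "cause": 1, "countermeasure": 2, "stop time": 1},
    {"name": 1, "cause": 1, "countermeasure": 2, "stop time": 2},
    {"name": 2, "cause": 3, "countermeasure": 2, "stop time": 2},
]
segments = count_paths(samples)
for segment, weight in sorted(segments.items()):
    print(segment, weight)
```

A plotting layer could then draw one line per segment with a width proportional to its count.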

FIG. 8 is a diagram obtained by performing cluster classification without using "stop time" as the target variable (comparative example). In the example of FIG. 8, the amount of vertical movement between elements is large. As a result, the paths are tangled and the diagram is hard to grasp intuitively.

Next, the diagram of FIG. 7 (embodiment) and the diagram of FIG. 8 (comparative example) are evaluated. For the evaluation, for example, the cluster dispersion is used as the first index and the link mobility is used as the second index.

The cluster dispersion is the total number of links connecting the elements. A link is a connection between the same pair of clusters in adjacent elements that is traversed by one or more paths. As illustrated in FIG. 9(a), one or more paths run from name "1" to three destinations: cause "1", cause "2", and cause "6". In this case, three links extend from name "1". Similarly, six links extend from name "2". These counts are computed as (1, 0) = 3, (2, 0) = 6, and so on, and the total number of links running from element to element is obtained over all elements. The smaller the total number of links, the fewer and more consolidated the paths appear, so this total is used as the cluster dispersion. As shown in Table 1, the calculated total was 168 links for the example of FIG. 7 and 167 links for the example of FIG. 8, so the dispersion was almost the same in both cases.
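
The first index can be sketched as a count of distinct links, where several paths over the same pair of clusters contribute only once. The function name `cluster_dispersion` and the four paths below are illustrative assumptions and do not reproduce the data behind Table 1.

```python
def cluster_dispersion(paths):
    """Count the distinct links between adjacent elements.

    paths: list of tuples of main cluster numbers, one entry per element
    column, for example (name, cause, countermeasure).
    """
    links = set()
    for path in paths:
        for column, (start, end) in enumerate(zip(path, path[1:])):
            # Paths sharing the same start and end in a column form one link.
            links.add((column, start, end))
    return len(links)


# Illustrative paths only: (name, cause, countermeasure).
paths = [(1, 1, 2), (1, 2, 2), (1, 1, 2), (2, 6, 1)]
print(cluster_dispersion(paths))
```

In this toy data the first and third paths are identical, so they add no extra links.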

Next, the link mobility is the difference between the main cluster numbers at the start point and the end point of a link. A large mobility means that the clusters are far apart (dispersed). As illustrated in FIG. 9(b), link A connects name "1" and cause "1", so the difference between the main cluster numbers, that is, the mobility, is 0. Link B connects name "1" and cause "2", so the difference between the main cluster numbers, that is, the mobility, is 1. The sum of the mobility over all links is calculated in this way and used as the link mobility. As shown in Table 1, the total was 521 for the example of FIG. 7 and 596 for the example of FIG. 8. The dispersion, as measured by link mobility, was thus about 15% smaller in the example of FIG. 7 than in the example of FIG. 8. This confirms that the diagram of FIG. 7 is more intuitive and easier to understand than that of FIG. 8.
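
The second index can be sketched in the same way, summing the absolute difference in main cluster number over the distinct links. Treating the difference as an absolute value is an assumption consistent with the examples for links A and B; the function name and the paths are illustrative only.

```python
def link_mobility(paths):
    """Sum the cluster number differences over the distinct links.

    paths: list of tuples of main cluster numbers, one entry per element
    column. Each distinct (column, start, end) link is counted once.
    """
    links = set()
    for path in paths:
        for column, (start, end) in enumerate(zip(path, path[1:])):
            links.add((column, start, end))
    return sum(abs(end - start) for _, start, end in links)


# Illustrative paths only: (name, cause, countermeasure).
paths = [(1, 1, 2), (1, 2, 2), (1, 1, 2), (2, 6, 1)]
print(link_mobility(paths))
```

For the totals reported in the text, (596 - 521) / 596 is roughly 0.126 and (596 - 521) / 521 is roughly 0.144, so the stated reduction of about 15% appears to be a rounded figure.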

According to this embodiment, factor information consisting of feature words is extracted for each element of each text sample, and a group of target variables with an ordinal scale is extracted for at least one element. Multidimensional correspondence analysis is then performed on each of the extracted elements other than the target variable, and cluster classification is carried out. Reflecting the ordinal scale of the target variable in the classification result in this way gives the classified clusters a correlation with one another. When a path analysis diagram is created and visualized based on this cluster information, the complexity of the paths is resolved and the diagram becomes intuitive and easy to understand.

In this embodiment, mutually contiguous numerical ranges, such as those of stop time, are used as the ordinal scale, but the scale is not limited to this. Besides numerical ranges, an interval scale, a ratio scale, or the like can also be applied.
(Modification)

When the average of the target variable cluster numbers is calculated, the influence of the ordinal scale of the target variable may be increased. For example, long stop times may be weighted by setting the cluster number to "1" for a stop time of 30 min or less, "2" for 31 min to 60 min, "3" for 61 min to 90 min, "8" for 91 min to 120 min, and "10" for 121 min or more. The clustering results in this case are shown in FIGS. 10(a) to 10(c). FIG. 10(a) lists the "temporary cluster number", the "average value of the target variable cluster numbers", and the "main cluster number" for "name". FIG. 10(b) lists the same items for "cause". FIG. 10(c) lists the same items for "countermeasure". FIG. 11 shows a diagram created from the cluster numbers obtained for "name", "cause", and "countermeasure".
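
The effect of the weighting can be sketched by comparing averages under the plain values 1 to 5 and the weighted values 1, 2, 3, 8, 10 given above. The helper names and the stop times of the two temporary clusters are invented for illustration.

```python
from statistics import mean

BOUNDS = [30, 60, 90, 120]     # upper bounds in minutes, from the embodiment
PLAIN = [1, 2, 3, 4, 5]        # unweighted target variable cluster numbers
WEIGHTED = [1, 2, 3, 8, 10]    # weighted values from the modification


def ordinal_value(minutes, values):
    """Return the ordinal value assigned to a stop time under a value list."""
    for upper, value in zip(BOUNDS, values):
        if minutes <= upper:
            return value
    return values[-1]  # 121 min or more


def cluster_mean(stop_times, values):
    """Average ordinal value of the samples in one temporary cluster."""
    return mean(ordinal_value(m, values) for m in stop_times)


# Illustrative stop times (minutes) of two temporary clusters.
cluster_x = [20, 130]   # one short stop and one very long stop
cluster_y = [80, 85]    # two medium stops

print(cluster_mean(cluster_x, PLAIN), cluster_mean(cluster_y, PLAIN))
print(cluster_mean(cluster_x, WEIGHTED), cluster_mean(cluster_y, WEIGHTED))
```

Under the plain values the two clusters tie at an average of 3, whereas under the weighted values the cluster containing the very long stop moves clearly above the other, so it would receive a larger main cluster number.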

A comparison of FIGS. 6(a) to 6(c) with FIGS. 10(a) to 10(c) shows that the main cluster numbers differ. The first index and the second index described above were calculated for the result of FIG. 11, and the results are shown in Table 2. As shown in Table 2, the first index took a smaller value, and the second index took an even smaller value. This is because the weighting allows the influence of the ordinal scale to be reflected more strongly.

In each of the above examples, the target variable extraction unit 11 functions as an example of a target variable extraction unit that extracts, from the same text sample, a plurality of factors indicating feature words for a plurality of elements and assigns a target variable consisting of an ordinal scale to one or more of the plurality of elements. The cluster classification unit 12 functions as an example of a cluster classification unit that performs cluster classification by performing multidimensional correspondence analysis on the plurality of elements using the target variable. The sorting unit 13 functions as an example of a sorting unit that assigns a temporary identifier to each cluster formed by the cluster classification unit after the multidimensional correspondence analysis, calculates, for each element other than the element to which the target variable is assigned, the average value of the ordinal scale over all text samples for each temporary identifier, sorts the identifiers of the clusters in ascending order of the average value, and assigns new identifier numbers. The diagram creation unit 14 functions as an example of a diagram creation unit that arranges, for each element, the clusters sorted by the sorting unit and creates a diagram by connecting, with lines between adjacent elements, the factors belonging to the same text sample.

Although embodiments of the present invention have been described in detail above, the present invention is not limited to these specific embodiments, and various modifications and changes are possible within the scope of the gist of the present invention as set forth in the claims.

10 Analysis processing unit
11 Target variable extraction unit
12 Cluster classification unit
13 Sorting unit
14 Diagram creation unit
20 Database unit
21 Text information storage unit
100 Information analyzer

Claims

1. An information analyzer comprising:
a target variable extraction unit that extracts, from the same text sample, a plurality of factors indicating feature words for a plurality of elements, and assigns a target variable consisting of an ordinal scale to one or more of the plurality of elements;
a cluster classification unit that performs cluster classification by performing multidimensional correspondence analysis on the plurality of elements using the target variable; and
a sorting unit that assigns a temporary identifier to each cluster formed by the cluster classification unit after the multidimensional correspondence analysis, calculates, for each element other than the element to which the target variable is assigned, the average value of the ordinal scale over all text samples for each temporary identifier, sorts the identifiers of the clusters in ascending order of the average value, and assigns new identifier numbers.

2. The information analyzer according to claim 1, further comprising a diagram creation unit that arranges, for each element, the clusters sorted by the sorting unit and creates a diagram by connecting, with lines between adjacent elements, the factors belonging to the same text sample.

3. The information analyzer according to claim 1 or 2, wherein the target variable extraction unit applies a larger weight to the ordinal scale as the value of the ordinal scale becomes larger.

4. An information analysis method in which a computer executes:
a process of extracting, from the same text sample, a plurality of factors indicating feature words for a plurality of elements, and assigning a target variable consisting of an ordinal scale to one or more of the plurality of elements;
a process of performing cluster classification by performing multidimensional correspondence analysis on the plurality of elements using the target variable; and
a process of assigning a temporary identifier to each cluster formed after the multidimensional correspondence analysis, calculating, for each element other than the element to which the target variable is assigned, the average value of the ordinal scale over all text samples for each temporary identifier, sorting the identifiers of the clusters in ascending order of the average value, and assigning new identifier numbers.

5. An information analysis program that causes a computer to execute:
a process of extracting, from the same text sample, a plurality of factors indicating feature words for a plurality of elements, and assigning a target variable consisting of an ordinal scale to one or more of the plurality of elements;
a process of performing cluster classification by performing multidimensional correspondence analysis on the plurality of elements using the target variable; and
a process of assigning a temporary identifier to each cluster formed after the multidimensional correspondence analysis, calculating, for each element other than the element to which the target variable is assigned, the average value of the ordinal scale over all text samples for each temporary identifier, sorting the identifiers of the clusters in ascending order of the average value, and assigning new identifier numbers.