JP2009230323A

JP2009230323A - Information analyzing device and program

Info

Publication number: JP2009230323A
Application number: JP2008073181A
Authority: JP
Inventors: Junichi Takeda; 隼一武田; Hitoshi Ikeda; 仁池田; Motofumi Fukui; 基文福井
Original assignee: Fuji Xerox Co Ltd
Current assignee: Fujifilm Business Innovation Corp
Priority date: 2008-03-21
Filing date: 2008-03-21
Publication date: 2009-10-08

Abstract

PROBLEM TO BE SOLVED: To provide an information analyzing device capable of adaptively determining the representative vector of a group with respect to the characteristic of the distribution of components of a feature vector concerning an analysis object even if the characteristic is changed.

SOLUTION: The feature vectors are acquired which are classified into one of the plurality of groups. One of a plurality of kinds of representative vector determining methods is selected based on evaluation about the distribution of the components with values equal to or larger than prescribed component thresholds of the feature vectors. The representative vector of each group is determined by the selected representative vector determining method. The value to indicate the degree of holding each of the plurality of kinds of features by the corresponding analysis object is defined as the component in the feature vector.

COPYRIGHT: (C)2010, JPO&INPIT

Description

The present invention relates to an information analysis apparatus and a program.

With the progress of the information society in recent years, huge amounts of digitized information have come to be accumulated in computers. As a result, finding valuable information within the accumulated mass and understanding its overall structure have become increasingly difficult. To address these difficulties, there is a need to classify the information systematically and present it to the user.

One proposed way to present such information is to visualize the classified information as a graph or map in a two-dimensional or three-dimensional space. This is effective because the relationships among the classified information can be grasped intuitively. One such visualization technique, known in the art, represents each piece of information as an analysis object with a multidimensional feature vector, analyzes the feature vectors, and shows the distribution of the analysis objects on a map.

For example, Non-Patent Document 1 discloses the following information analysis and visualization method for a large collection of documents. First, a feature vector whose components correspond to the keywords contained in a document is created for each document, and the analysis objects are classified into groups by clustering the feature vectors. Next, the centroid of each group is calculated as the representative vector of that group, and the representative vectors are mapped to a two-dimensional space by principal component analysis. The feature vector of each document is then mapped by multidimensional scaling so as to preserve its distances to the already-mapped representative vectors of all groups. With this two-stage mapping, each analysis object is plotted near the group to which it belongs, so the relationships among the grouped analysis objects are visualized.

Non-Patent Document 1: James A. Wise, "The Ecological Approach to Text Visualization", Journal of the American Society for Information Science, (USA), November 1999

However, with the conventional method, it is sometimes difficult to analyze the relationships among the representative vectors when the characteristics of the distribution of the components of the feature vectors under analysis change.

For example, when the representative vector of each group is determined as the centroid of its feature vectors, the distinguishing features of each group become inconspicuous if the components of the feature vectors overlap heavily between groups.

The present invention has been made in view of the above problem, and its object is to provide an information analysis apparatus and a program that can adaptively determine representative vectors according to the characteristics of the distribution of the components of the feature vectors of the analysis objects, even when those characteristics change.

The invention of claim 1 is an information analysis apparatus comprising: acquisition means for acquiring, for each of a plurality of analysis objects each classified into one of a plurality of groups, a feature vector whose components are values indicating the degree to which the analysis object has each of a plurality of types of features; selection means for selecting one of a plurality of types of representative vector determination methods based on an evaluation of the distribution of components having a value equal to or greater than a predetermined component value threshold in the feature vectors acquired by the acquisition means for all or some of the plurality of analysis objects; and determination means for determining the representative vector of each group, by the representative vector determination method selected by the selection means, based on the feature vectors acquired by the acquisition means for the analysis objects classified into that group.

The invention of claim 2 is the invention of claim 1, wherein the plurality of types of representative vector determination methods include at least one limited representative vector determination method, in which the amount of information used to determine the representative vector of a group is limited to part of the feature vector information of that group, and wherein the selection means selects one of the limited representative vector determination methods when a value indicating the number of components having a value equal to or greater than the predetermined component value threshold in the feature vectors acquired by the acquisition means is larger than a predetermined component number threshold.

The invention of claim 3 is the invention of claim 1 or 2, wherein the plurality of types of representative vector determination methods include: a reference representative vector determination method that determines the centroid vector of the feature vectors of each group; a low-overlap representative vector determination method that makes the degree to which components having a value equal to or greater than the predetermined component value threshold overlap between the representative vectors of different groups lower than with the reference representative vector determination method; and a high-overlap representative vector determination method that makes that degree of overlap higher than with the reference representative vector determination method; and wherein the selection means evaluates the degree to which components having a value equal to or greater than the predetermined component value threshold overlap between the feature vectors of different groups, selects the low-overlap representative vector determination method when that degree of overlap is equal to or greater than a predetermined degree, and selects the high-overlap representative vector determination method when that degree of overlap is smaller than the predetermined degree.

The invention of claim 4 is the invention of claim 3, wherein the selection means calculates a degree of closeness between the feature vectors of different groups and evaluates the degree of overlap from that degree of closeness.

The invention of claim 5 is the invention of claim 3 or 4, wherein the low-overlap representative vector determination method determines the representative vector of each group using only a subset of the feature vectors acquired by the acquisition means for the analysis objects classified into that group, the subset being selected based on the representative vector of the group as determined by the reference representative vector determination method.

The invention of claim 6 is the invention of any one of claims 1 to 5, further comprising coordinate information generation means for generating coordinate information by projecting the representative vector of each group determined by the determination means onto the same coordinate system as the space in which a map is generated.

The invention of claim 7 is a program for causing a computer to function as: acquisition means for acquiring, for each of a plurality of analysis objects each classified into one of a plurality of groups, a feature vector whose components are values indicating the degree to which the analysis object has each of a plurality of types of features; selection means for selecting one of a plurality of types of representative vector determination methods based on an evaluation of the distribution of components having a value equal to or greater than a predetermined component value threshold in the feature vectors acquired by the acquisition means for all or some of the plurality of analysis objects; and determination means for determining the representative vector of each group, by the representative vector determination method selected by the selection means, based on the feature vectors acquired by the acquisition means for the analysis objects classified into that group.

According to the inventions of claims 1 and 7, representative vectors can be determined adaptively even when the characteristics of the distribution of the components of the feature vectors of the analysis objects change.

According to the invention of claim 2, when the value indicating the number of components having a value equal to or greater than the predetermined component value threshold in the feature vectors of the analysis objects is larger than the predetermined component number threshold, representative vectors whose relationships can be analyzed more appropriately can be determined. This is because, in that case, the feature indicated by each individual component is unlikely to be information that represents the feature vector.

According to the invention of claim 3, representative vectors whose relationships can be analyzed more appropriately can be determined by reducing the overlap between representative vectors when the degree to which components having a value equal to or greater than the predetermined component value threshold overlap between the feature vectors of different groups is equal to or greater than the predetermined degree, and by increasing the overlap between representative vectors when that degree of overlap is smaller than the predetermined degree.

According to the invention of claim 4, the degree of closeness between the feature vectors of different groups can be used as an index for evaluating the degree of overlap between the feature vectors of different groups.

According to the invention of claim 5, a method that determines the representative vector of each group using only a subset of feature vectors, selected based on the vector determined by the reference representative vector determination method, can be used as the representative vector determination method that reduces the overlap between representative vectors.

According to the invention of claim 6, the coordinate information of the representative vectors, which serves as the source information for the map, can be generated adaptively even when the characteristics of the distribution of the components of the feature vectors of the analysis objects change.

An embodiment of the present invention will be described with reference to the drawings. FIG. 1 shows an example of the configuration of an information analysis apparatus 1 according to the embodiment. As shown in FIG. 1, the information analysis apparatus 1 includes a CPU 11, a memory 12, an input unit 13, and an output unit 14.

The CPU 11 operates according to a program stored in the memory 12. The program may be provided on an information recording medium such as a CD-ROM or DVD-ROM, or may be provided via a network such as the Internet.

The memory 12 consists of memory elements such as RAM and ROM and of recording devices such as a hard disk. The memory 12 stores the program. The memory 12 also stores information input from each unit and calculation results.

The input unit 13 consists of means for communicating with an external computer, an external recording device such as removable media, and a keyboard, mouse, or the like for accepting instructions from the user. Under the control of the CPU 11, the input unit 13 outputs externally input information to the CPU 11 and the memory 12.

The output unit 14 is means for communicating with an external computer or means for display output to the user. Under the control of the CPU 11, the output unit 14 outputs the processing results of the CPU 11 to the outside.

FIG. 2 is a functional block diagram showing the functions realized by the information analysis apparatus 1. Functionally, the information analysis apparatus 1 includes a feature vector acquisition unit 21, a determination method selection unit 22, a representative vector determination unit 23, a coordinate information generation unit 24, and a drawing unit 25. These functions are realized by the CPU 11 executing the program stored in the memory 12 and controlling the input unit 13 and the output unit 14.

The feature vector acquisition unit 21 is realized mainly by the CPU 11, the memory 12, and the input unit 13. The feature vector acquisition unit 21 receives, from the input unit 13, information on a plurality of feature vectors that have been classified into a plurality of groups in advance, and stores that information in the memory 12.

The analysis objects and feature vectors that make up the input data are described here. In this embodiment, the analysis objects are electronic documents such as patent documents. One feature vector corresponds to one analysis object, and each component of the feature vector represents the degree to which the analysis object has one of a plurality of types of features. In this embodiment, the features of an analysis object are, for example, keywords extracted from text by morphological analysis, or keywords extracted from bibliographic items such as F-terms in patent documents, and the degree of each feature is the appearance frequency of the corresponding keyword. In this case, the value of each component of the feature vector is the appearance frequency of a keyword extracted from the electronic document.

FIG. 3 shows an example of a method for generating feature vectors from analysis objects. FIGS. 3(a), 3(b), and 3(c) illustrate feature vector generation when the analysis object is text, and FIGS. 3(d), 3(e), and 3(f) illustrate feature vector generation when the analysis object is a set of bibliographic items.

FIG. 3(a) shows the texts of analysis objects 1 and 2. The texts in this example are short and consist of only a few sentences, but they may be long texts made up of multiple chapters. Morphological analysis is applied to these texts, keywords that carry no meaning, such as particles, are removed, and the appearance frequencies are calculated. FIG. 3(b) shows an example of the resulting appearance frequencies. From these, a feature vector is generated for each analysis object, with component values determined from the appearance frequency of each keyword (feature). FIG. 3(c) shows an example of the resulting feature vectors. In FIG. 3(c), the feature vectors are presented as a table so that the correspondence between keywords (features) and component values is easy to follow. For example, the table shows that the feature vector of analysis object 1 is (1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0). Vectors are presented in the same way hereafter.

FIG. 3(d) shows the bibliographic keywords of analysis objects 3 and 4. Here, the F-term classification used in patent documents serves as the example of bibliographic keywords. For bibliographic keywords, the keywords (features) are extracted directly based on the document structure and their appearance frequencies are calculated (FIG. 3(e) shows an example of the resulting frequencies), and feature vectors are generated (FIG. 3(f) shows an example of the resulting feature vectors). For ease of understanding, the feature vectors shown in FIG. 3 are not normalized to unit length, but they may be normalized. Feature vectors can be generated in the same way even when the number of analysis objects is large.
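The frequency-based construction described for FIG. 3 can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: it assumes the documents have already been tokenized into keywords (morphological analysis and removal of particles are outside the sketch), and the function name `build_feature_vectors` and the sample keywords are hypothetical.

```python
from collections import Counter


def build_feature_vectors(keyword_lists):
    """Turn per-document keyword lists into aligned frequency vectors."""
    # Build the vocabulary in first-seen order so component positions are stable.
    vocabulary = []
    seen = set()
    for keywords in keyword_lists:
        for keyword in keywords:
            if keyword not in seen:
                seen.add(keyword)
                vocabulary.append(keyword)

    # One vector per document: component i is the frequency of vocabulary[i].
    vectors = []
    for keywords in keyword_lists:
        counts = Counter(keywords)
        vectors.append([counts[keyword] for keyword in vocabulary])
    return vocabulary, vectors


# Hypothetical, already-tokenized documents (not the texts of FIG. 3).
docs = [
    ["image", "sensor", "image", "lens"],
    ["lens", "motor", "gear"],
]
vocabulary, vectors = build_feature_vectors(docs)
```

The same function applies unchanged to bibliographic keywords such as F-terms, since only the source of the keyword list differs.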

The feature vectors are also classified into a plurality of groups in advance. The classification method is not restricted: the classification may be performed by a known clustering technique such as the K-means method, or it may be performed manually.

The feature vector acquisition unit 21 need not receive the information from outside via the input unit 13; it may instead internally acquire feature vector information that another program or the like has recorded in the memory 12 beforehand.

The determination method selection unit 22 is realized mainly by the CPU 11. The determination method selection unit 22 selects one of a plurality of types of representative vector determination methods based on an evaluation of the distribution of components having a value equal to or greater than a predetermined component value threshold in the plurality of feature vectors stored in the memory 12. FIG. 4 shows the processing flow of the determination method selection unit 22 in this embodiment.

S31 is a step that evaluates, as the above distribution of components, the number of components of the feature vectors having a value equal to or greater than the predetermined component value threshold. Specifically, the processing proceeds as follows.

The determination method selection unit 22 counts, for each acquired feature vector, the number of components whose value is greater than 0. It then takes the average of these counts over all acquired feature vectors and determines whether the average exceeds a predetermined component number threshold.

As an example, the case where the predetermined component number threshold is set to 5 is described with reference to FIG. 3. For the bibliographic items of FIG. 3(f), the feature vectors of analysis objects 3 and 4 have 3 and 4 components with a value greater than 0, respectively. The average count is 3.5, which is judged to be smaller than the predetermined component number threshold. For the text of FIG. 3(c), on the other hand, the feature vectors of analysis objects 1 and 2 have 10 and 6 components satisfying the above condition, respectively. The average count is 8, which is judged to be larger than the predetermined component number threshold.
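The arithmetic of step S31 can be reproduced with a short sketch. Only the vector of analysis object 1 is quoted in the text; the other three vectors below are hypothetical stand-ins chosen solely to have the stated numbers of non-zero components (6, 3, and 4), and the function name is an assumption.

```python
def mean_active_components(vectors, value_threshold=0):
    """Average, over all vectors, of the number of components above the threshold."""
    counts = [sum(1 for x in v if x > value_threshold) for v in vectors]
    return sum(counts) / len(counts)


# Analysis object 1 is the vector quoted for FIG. 3(c); the remaining vectors
# are hypothetical and only reproduce the stated non-zero counts.
text_vectors = [
    [1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0],  # 10 non-zero components
    [0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0],  # 6 non-zero (hypothetical)
]
bibliographic_vectors = [
    [1, 1, 1, 0, 0],  # 3 non-zero (hypothetical)
    [0, 1, 1, 1, 1],  # 4 non-zero (hypothetical)
]

COMPONENT_COUNT_THRESHOLD = 5  # the value used in the worked example

text_mean = mean_active_components(text_vectors)              # (10 + 6) / 2
biblio_mean = mean_active_components(bibliographic_vectors)   # (3 + 4) / 2
```

With the threshold of 5, the text vectors fall on the "larger than the threshold" branch and the bibliographic vectors on the other branch, matching the judgments in the example.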

Components whose value is greater than 0 are counted because, in this embodiment, the criterion is whether or not a keyword appears. In this case, the predetermined component value threshold is 0.

If the average number of components described above is equal to or less than the predetermined component number threshold, a method that uses all the information held by the feature vectors of each group is selected as the method for determining the representative vector of each group. This is because, in this case, each individual component is likely to represent the feature vector. An example of such information is manually indexed information such as the bibliographic items of FIG. 3(d). The representative vector determination method selected in this embodiment calculates the centroid of the feature vectors classified into each group (S32).
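The centroid used in S32 is the component-wise mean of the feature vectors of a group. A minimal sketch, with a hypothetical function name and sample group:

```python
def centroid(vectors):
    """Component-wise mean of equally long vectors (the group's centroid)."""
    n = len(vectors)
    # zip(*vectors) iterates over components, i.e. over columns.
    return [sum(column) / n for column in zip(*vectors)]


# Hypothetical group of two feature vectors.
group = [
    [1, 0, 2],
    [3, 0, 0],
]
representative = centroid(group)
```

Every feature vector of the group contributes to every component of the result, which is what "using all the information" amounts to here.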

If, on the other hand, the average number of components described above is larger than the predetermined component number threshold, a further judgment is made on the distance between groups (S33). This is because, in this case, each individual component is unlikely to represent the feature vector, and unnecessary features are likely to be mixed in. In S33, another kind of component distribution is evaluated, and a representative vector generation method that reduces unnecessary features (a limited representative vector determination method) is selected (S34, S35). Details are given in the description of S33 below.

The predetermined component number threshold may be a fixed value set in advance, or it may be a value calculated from the feature vectors acquired by the feature vector acquisition unit 21, for example (number of dimensions of the feature vector ÷ 3). Although this embodiment uses the average number of components of the feature vectors as the judgment condition, any value that represents the number of components of the feature vectors may be used, for example the median or the minimum.

S33 is a step that evaluates, as the distribution of components in the feature vectors, the degree to which components having a value equal to or greater than the predetermined component value threshold overlap between feature vectors in different groups. Specifically, the processing proceeds as follows.

First, the reference representative vector of each group is calculated, and the Euclidean distances between the reference representative vectors are obtained. Here, the centroid of the feature vectors of each group is used as the reference representative vector. The average of the distances between reference representative vectors over all pairs of groups indicates the degree of overlap.

The Euclidean distance indicates the degree of overlap for the following reason, among others. In this embodiment, each component of a feature vector is the appearance frequency of a keyword, so the value of each component of a reference representative vector is 0 or more. When many components overlap, the differences between the values of the overlapping components become small, and the Euclidean distance, which is the square root of the sum of the squared differences, tends to become short as well. For such non-negative vectors of a given length, the Euclidean distance is greatest when the vectors are orthogonal to each other; in that case the inner product of the vectors is 0, that is, no component overlaps. The distance is not limited to the Euclidean distance and may be calculated by another known method, such as the cosine distance. Nor does the distance between groups have to be calculated from the reference representative vectors; for example, the feature vector closest to the other group may be chosen from each group, and the distance between those two vectors calculated.
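The overlap measure of S33, the mean Euclidean distance between group centroids over all pairs of groups, can be sketched as follows. The function names and the sample groups are hypothetical; the sample uses two groups with no shared non-zero component, so their centroids are orthogonal.

```python
import math
from itertools import combinations


def centroid(vectors):
    """Component-wise mean of the vectors in one group."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def euclidean(a, b):
    """Square root of the sum of squared component differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_centroid_distance(groups):
    """Average Euclidean distance between centroids over all pairs of groups."""
    centroids = [centroid(group) for group in groups]
    pairs = list(combinations(centroids, 2))
    return sum(euclidean(a, b) for a, b in pairs) / len(pairs)


# Hypothetical groups whose non-zero components do not overlap at all.
groups = [
    [[1, 0], [1, 0]],
    [[0, 1], [0, 1]],
]
distance = mean_centroid_distance(groups)  # sqrt(2) for these unit centroids
```

A smaller value of `distance` would indicate that the groups share more of their non-zero components.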

If the average distance is smaller than a predetermined distance, it is judged that the degree to which components having a value equal to or greater than the predetermined component value threshold overlap between feature vectors in different groups is equal to or greater than the predetermined degree, and a representative vector determination method that reduces the degree of overlap between the representative vectors in components equal to or greater than the predetermined component value threshold (the low-overlap representative vector determination method) is selected. The method selected in this embodiment takes the feature vector closest to the centroid of each group as the representative vector (S34).

FIG. 5 shows an example in which the feature vector closest to the group centroid is determined as the representative vector. In this figure, four feature vectors V11, V12, V13, and V14 are classified into group 1, and four feature vectors V21, V22, V23, and V24 are classified into group 2. The representative vectors are V14 and V22, the feature vectors closest to their respective group centroids. The representative vectors have fewer non-zero components than the group centroids, so the features of each group are expressed more clearly. Furthermore, choosing the feature vector closest to the centroid as the representative vector makes the degree of overlap between the representative vectors of group 1 and group 2 smaller than that between their centroid vectors (for example, it increases the distance between them). The representative vector determination method is not limited to selecting the feature vector closest to the centroid of each group; several vectors close to the centroid of each group may be selected and their centroid determined as the representative vector.
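The nearest-to-centroid selection of S34, together with the variant that averages several vectors near the centroid, can be sketched as below. The function names, the parameter `k`, and the sample group are hypothetical and are not the vectors of FIG. 5; Euclidean distance is assumed as the closeness measure.

```python
import math


def centroid(vectors):
    """Component-wise mean of the vectors in one group."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def euclidean(a, b):
    """Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def representative_near_centroid(vectors, k=1):
    """Centroid of the k feature vectors closest to the group centroid.

    With k=1 this is simply the single closest feature vector.
    """
    c = centroid(vectors)
    nearest = sorted(vectors, key=lambda v: euclidean(v, c))[:k]
    return centroid(nearest)


# Hypothetical group: the centroid is [1.5, 0.25, 0.75, 0.25], which has four
# non-zero components, while the closest member has only one.
group = [
    [2, 0, 0, 1],
    [2, 1, 0, 0],
    [2, 0, 0, 0],
    [0, 0, 3, 0],
]
rep = representative_near_centroid(group)
```

In this sample the selected representative keeps only the component shared by most members, which illustrates how the method drops components that the centroid would otherwise smear in.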

By using a representative vector determination method such as the one described above, the influence of components that have a value equal to or greater than the predetermined component value threshold (that is, that indicate a feature) in common across multiple groups can be removed, and the proportion of components that indicate a feature only within one group can be increased. This clarifies the differences between the representative vectors of the groups and yields representative vectors that are easier to analyze.

On the other hand, if the average of the distances is equal to or greater than the predetermined distance, it is judged that the degree to which components having a value equal to or greater than the predetermined component value threshold overlap between feature vectors in different groups is smaller than the predetermined degree, and a representative vector determination method that increases the degree of overlap of such components between representative vectors (a high-overlap representative vector determination method) is selected. The representative vector determination method selected in this embodiment creates a representative vector by extracting and keeping the top N most frequent components within each group (N is an integer of 1 or more) and setting all other components to 0 (S35).
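The choice between the two methods can be sketched as a threshold test on the mean distance between the group centroids (the reference representative vectors). This is a sketch under stated assumptions: Euclidean distance, the function names, and the returned labels are hypothetical, and a smaller mean distance is taken to indicate more overlap.

```python
import math
from itertools import combinations

def centroid(vectors):
    """Component-wise mean of a group's feature vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_centroid_distance(groups):
    """Average distance over all pairs of group centroids."""
    cents = [centroid(g) for g in groups]
    pairs = list(combinations(cents, 2))
    return sum(euclidean(a, b) for a, b in pairs) / len(pairs)

def select_method(groups, distance_threshold):
    """Far-apart centroids -> little overlap -> high-overlap method (S35);
    otherwise -> low-overlap method (S34)."""
    if mean_centroid_distance(groups) >= distance_threshold:
        return "high_overlap"
    return "low_overlap"

# Two hypothetical groups whose centroids are 4.0 apart.
groups = [[[0, 0], [0, 2]], [[4, 0], [4, 2]]]
print(select_method(groups, 3.0))  # high_overlap
print(select_method(groups, 5.0))  # low_overlap
```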

FIG. 6 is a diagram showing an example in which a representative vector is determined by extracting frequent components from the feature vectors of a group. In the example of this figure, the representative vectors are created with N set to 3. Four feature vectors V11, V12, V13, and V14 are classified into group 1, and four feature vectors V21, V22, V23, and V24 are classified into group 2. The centroid vectors of the two groups share four components, whereas the vectors generated by keeping the top three frequent components share three, so the number of overlapping components decreases. However, in the normalized representative vectors, the values of the components that overlap with the other representative vector become larger, so the degree of overlap can be increased (for example, the distance can be made smaller). The components omitted from the figure are components that do not appear in common in multiple groups (that is, do not overlap). Although N is 3 in the example of this figure, another number smaller than the number of dimensions of the feature vector, such as 5 or 10, may be used.

By extracting frequent components in this way, the influence of components that have a value equal to or greater than the predetermined component value threshold (that is, indicate a feature) only within one group and do not indicate a feature in other groups can be removed. The proportion of components that indicate a feature in common across multiple groups, that is, overlapping components, increases, and representative vectors that are easier to analyze can be determined.
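The top-N extraction can be illustrated with the sketch below. Several details are assumptions, since the text does not fix them: a component's frequency is taken as the number of group members in which it reaches the threshold, the kept values are the group means, ties are broken by component index, and the result is normalized to unit length.

```python
import math

def top_n_representative(vectors, n, threshold=1.0):
    """Keep the n most frequent components of a group and zero the rest."""
    dims = len(vectors[0])
    # Frequency: how many vectors of the group have this component >= threshold.
    freq = [sum(1 for v in vectors if v[j] >= threshold) for j in range(dims)]
    # Indices of the n most frequent components (ties: lower index first).
    keep = set(sorted(range(dims), key=lambda j: (-freq[j], j))[:n])
    mean = [sum(v[j] for v in vectors) / len(vectors) for j in range(dims)]
    rep = [mean[j] if j in keep else 0.0 for j in range(dims)]
    # Normalize so that the kept components carry all of the vector's length.
    norm = math.sqrt(sum(x * x for x in rep))
    return [x / norm for x in rep] if norm else rep

# Hypothetical group: components 0, 1 and 4 occur most often.
group = [[1, 1, 0, 0, 1], [1, 0, 1, 0, 1], [1, 1, 0, 0, 0], [1, 1, 0, 1, 1]]
rep = top_n_representative(group, 3)
print([round(x, 3) for x in rep])
```

Because the rarely occurring components are zeroed before normalization, the remaining (frequent) components take larger values, which is the effect described for the normalized representative vectors.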

In this embodiment, the distance between the reference representative vectors of the groups is used to evaluate the degree to which components having a value equal to or greater than the predetermined component value threshold overlap between feature vectors of different groups, but another index may be used. For example, for the two groups whose degree of overlap is to be judged, the determination method selection unit 22 checks, component by component, whether both groups contain a feature vector satisfying the condition that the value of that component is equal to or greater than the predetermined component value threshold. The number of components for which such feature vectors are confirmed to exist in both groups is then counted, and the counted value may be used to judge the degree of overlap between the two groups.
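The count-based index described above might look like the following sketch. The function name and the sample data are hypothetical; the threshold comparison uses "equal to or greater than", as in the text.

```python
def shared_component_count(group_a, group_b, threshold=1.0):
    """Count components that reach the threshold in at least one feature
    vector of group_a and in at least one feature vector of group_b."""
    dims = len(group_a[0])

    def active(group):
        # Components for which some vector of the group satisfies the condition.
        return {j for j in range(dims) if any(v[j] >= threshold for v in group)}

    return len(active(group_a) & active(group_b))

# Hypothetical groups: components 0 and 2 are active in both.
a = [[1, 0, 2, 0], [0, 0, 1, 0]]
b = [[0, 1, 1, 0], [1, 0, 0, 0]]
print(shared_component_count(a, b))  # 2
```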

In this embodiment, the determination method selection unit 22 makes the two judgments of S31 and S33, but it may make only one of them. Further, when only the judgment of S33 is made, instead of selecting one of two representative vector determination methods depending on whether the degree of overlap described above is equal to or greater than, or less than, the predetermined degree, one of a larger number of types of representative vector determination methods may be selected according to the magnitude of the degree of overlap. For example, as the degree of overlap described above increases, one of the low-overlap representative vector determination method, the method of determining the reference representative vector as the representative vector, and the high-overlap representative vector determination method may be selected, in that order.

In the embodiment described above, the determination method selection unit 22 selects the representative vector determination method based on the plurality of feature vectors stored in the memory 12. Instead, it may receive an instruction from the operator of the information analysis apparatus via the input unit 13 and select the representative vector determination method specified by that instruction.

The representative vector determination unit 23 is realized mainly by the CPU 11. Using the representative vector determination method selected by the determination method selection unit 22, the representative vector determination unit 23 determines a representative vector for each group from the plurality of feature vectors stored in the memory 12, and stores the determined representative vectors in the memory 12. The representative vector determination methods are as described in the explanation of the determination method selection unit 22.

The coordinate information generation unit 24 converts the representative vector of each group and the feature vectors classified into each group into coordinates in the coordinate system of the space in which the map is generated. The representative vectors and the feature vectors are high-dimensional vectors with one dimension per type of keyword (feature), so they must be projected onto a lower-dimensional (two-dimensional, three-dimensional, or similar) coordinate system in which the map is generated (hereinafter referred to as the map coordinate system).

The coordinate information generation unit 24 performs a two-stage process: it first projects the representative vectors onto the map coordinate system, and then projects the feature vectors onto the map coordinate system so as to preserve their distances from the representative vectors. As a result, not only the relationships between the representative vectors of the groups but also the relationships of the analysis targets in each group are preserved in the mapping. The two-stage mapping is described below, taking as an example the case where the map coordinate system is a two-dimensional space.

An example of the projection of the representative vectors, which is the first-stage mapping, is as follows. FIG. 7 is a diagram showing an example of a distribution image of the feature vectors of each group in a multidimensional space. The figure schematically represents the multidimensional data in three dimensions. The feature vectors are classified into four groups A, B, C, and D. Each region enclosed by a two-dot chain line indicates the region in which the feature vectors of a group exist, and each circle indicates the coordinates of a representative vector in the multidimensional space. FIG. 8 is a diagram showing an example of an image obtained by projecting the representative vectors of the groups shown in FIG. 7 onto the map coordinate system. This projection can be performed by a known method such as principal coordinate analysis or principal component analysis. Hereinafter, the space onto which the vectors are mapped is referred to as the map target space.
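As one illustration of the first-stage projection, the sketch below projects representative vectors onto their first two principal components. It uses principal component analysis computed by power iteration with deflation, chosen here only to keep the example dependency-free; the function name, the iteration count, and the sample vectors are assumptions.

```python
import math

def pca_project_2d(vectors, iterations=200):
    """Project vectors onto their two leading principal axes."""
    n, d = len(vectors), len(vectors[0])
    mean = [sum(v[j] for v in vectors) / n for j in range(d)]
    centered = [[v[j] - mean[j] for j in range(d)] for v in vectors]
    # Scatter matrix (covariance up to a constant factor).
    cov = [[sum(c[a] * c[b] for c in centered) for b in range(d)] for a in range(d)]
    axes = []
    for _axis_index in range(2):
        start = [1.0 / (k + 1) for k in range(d)]  # deterministic start vector
        s = math.sqrt(sum(x * x for x in start))
        axis = [x / s for x in start]
        for _step in range(iterations):
            nxt = [sum(cov[a][b] * axis[b] for b in range(d)) for a in range(d)]
            norm = math.sqrt(sum(x * x for x in nxt))
            if norm == 0.0:
                break
            axis = [x / norm for x in nxt]
        axes.append(axis)
        # Deflation: remove the found axis so the next pass finds the second one.
        lam = sum(axis[a] * cov[a][b] * axis[b] for a in range(d) for b in range(d))
        cov = [[cov[a][b] - lam * axis[a] * axis[b] for b in range(d)] for a in range(d)]
    return [[sum(c[j] * ax[j] for j in range(d)) for ax in axes] for c in centered]

# Four hypothetical representative vectors (groups A to D) in a 4-dimensional space.
reps = [[0.0, 0.0, 0.0, 0.0], [3.0, 0.0, 0.0, 0.0],
        [0.0, 1.0, 0.0, 0.0], [3.0, 1.0, 0.0, 0.0]]
for point in pca_project_2d(reps):
    print([round(x, 3) for x in point])
```

Power iteration can converge slowly when the two leading eigenvalues are close; a library eigen-solver would normally be preferred in practice.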

Next, an example of the projection of the feature vectors, which is the second-stage mapping, is as follows. The coordinate information generation unit 24 performs the following mapping process for each group. First, two representative vectors other than the representative vector of the group to be processed (hereinafter referred to as the "target group") are selected. The two representative vectors may be selected, for example, in ascending order of distance from the representative vector of the target group. For example, if group A in FIG. 8 is the target group, the representative vectors of group B and group D are selected. Then, a total of (M+3) vectors, namely the M feature vectors classified into the target group, the representative vector of that group, and the two other selected representative vectors, are projected onto the same two-dimensional space as the map coordinate system using principal coordinate analysis or the like. The information projected here is called temporary map information. Next, the coordinate information of each feature vector included in the temporary map information created for each group is subjected to an affine transformation, and the result is used as the coordinates in the map target space corresponding to that feature vector. When each temporary map is affine-transformed, the transformation is chosen so that, for each of the three representative vectors included in the temporary map information, the affine-transformed coordinates coincide with the coordinates projected in the first-stage mapping. Instead of the second-stage mapping method described so far, the known method described in Non-Patent Document 1 may be used, in which each feature vector is mapped by multidimensional scaling so as to preserve its distances from the already-mapped representative vectors of all groups.
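The affine transformation in the second stage is fully determined by the three representative vectors, since three non-collinear point correspondences fix a two-dimensional affine map. The sketch below derives that map and applies it to a feature vector's temporary-map coordinates; the function name and the coordinate values are hypothetical.

```python
def affine_from_three_points(src, dst):
    """Return a function p -> A*p + t that maps each src anchor to its dst anchor.

    src: three anchor coordinates in the temporary map.
    dst: the same three anchors as placed by the first-stage mapping.
    """
    (x0, y0), (x1, y1), (x2, y2) = src
    (u0, v0), (u1, v1), (u2, v2) = dst
    # Edge matrix P of the source triangle and its determinant.
    a, b = x1 - x0, x2 - x0
    c, d = y1 - y0, y2 - y0
    det = a * d - b * c
    if det == 0:
        raise ValueError("source anchors are collinear")
    # Inverse of P.
    ia, ib, ic, id_ = d / det, -b / det, -c / det, a / det
    # Edge matrix Q of the destination triangle.
    qa, qb = u1 - u0, u2 - u0
    qc, qd = v1 - v0, v2 - v0
    # A = Q * P^-1, t = dst0 - A * src0.
    m00, m01 = qa * ia + qb * ic, qa * ib + qb * id_
    m10, m11 = qc * ia + qd * ic, qc * ib + qd * id_
    tx = u0 - (m00 * x0 + m01 * y0)
    ty = v0 - (m10 * x0 + m11 * y0)

    def transform(p):
        x, y = p
        return (m00 * x + m01 * y + tx, m10 * x + m11 * y + ty)

    return transform

# Anchors: representative vectors of the target group and two other groups.
temp_anchors = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]   # temporary map
map_anchors = [(2.0, 3.0), (4.0, 3.0), (2.0, 6.0)]    # first-stage coordinates
to_map = affine_from_three_points(temp_anchors, map_anchors)
print(to_map((1.0, 1.0)))  # (4.0, 6.0)
```

Applying `to_map` to every feature vector coordinate in a group's temporary map places the group on the final map while keeping its three anchors exactly where the first stage put them.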

The drawing unit 25 generates, from the coordinates in the map target space corresponding to the representative vectors and feature vectors generated by the coordinate information generation unit 24, an image drawn so that the user can recognize it, and outputs the image via the output unit 14. Instead of being output to the output unit 14, the image may be converted into a predetermined image data format such as JPEG, stored in the memory 12, and provided to the user.

FIG. 9 is a diagram showing an example of an image in which the drawing unit 25 has projected the feature vectors of each group onto the map coordinate system. In FIG. 9, the coordinates corresponding to the individual feature vectors are not plotted. Instead, the density of analysis targets within a given range is represented by shading.

In the embodiments described so far, patent documents have mainly been described as the analysis targets, but the analysis targets are not limited to these as long as they have a plurality of types of features. For example, they may be HTML documents or XML documents on the Web, audio information recorded on an electronic medium, and so on.

In addition, although the appearance frequency of each keyword (feature) in the analysis target is used as is as a component of the feature vector, the component value may instead be expressed as, for example, −1 if the appearance frequency is 0 and +1 if the appearance frequency is high. In this case, the predetermined component value threshold may be set to −1.

FIG. 1 is a block diagram showing a configuration example of an information analysis apparatus according to an embodiment of the present invention.
FIG. 2 is a diagram showing the functional blocks of the information analysis apparatus according to the embodiment of the present invention.
FIG. 3 is a diagram showing an example of a method of generating a feature vector from an analysis target.
FIG. 4 is a diagram showing an example of the processing flow of the determination method selection unit.
FIG. 5 is a diagram showing an example in which the feature vector closest to the centroid of a group is determined as the representative vector.
FIG. 6 is a diagram showing an example in which a representative vector is determined by extracting frequent components from the feature vectors of a group.
FIG. 7 is a diagram showing an example of a distribution image of the feature vectors of each group in a multidimensional space.
FIG. 8 is a diagram showing an example of an image in which the representative vector of each group is projected onto the map coordinate system.
FIG. 9 is a diagram showing an example of an image in which the feature vectors of each group are projected onto the map coordinate system.

Explanation of symbols

1 information analysis apparatus, 11 CPU, 12 memory, 13 input unit, 14 output unit, 21 feature vector acquisition unit, 22 determination method selection unit, 23 representative vector determination unit, 24 coordinate information generation unit, 25 drawing unit.

Claims

An information analysis apparatus comprising:
acquisition means for acquiring, for each of a plurality of analysis targets each classified into one of a plurality of groups, a feature vector whose components are values indicating the degree to which the analysis target has each of a plurality of types of features;
selection means for selecting one of a plurality of types of representative vector determination methods based on an evaluation of the distribution of components having a value equal to or greater than a predetermined component value threshold in the feature vectors acquired by the acquisition means for all or some of the plurality of analysis targets; and
determination means for determining, by the representative vector determination method selected by the selection means, a representative vector of each group based on the feature vectors acquired by the acquisition means for the analysis targets classified into that group.

The information analysis apparatus according to claim 1, wherein
the plurality of types of representative vector determination methods include at least one limited representative vector determination method in which the amount of information used for determining the representative vector of a group, out of the information of the feature vectors of that group, is limited, and
the selection means selects one of the limited representative vector determination methods when a value indicating the number of components having a value equal to or greater than the predetermined component value threshold in the feature vectors acquired by the acquisition means is greater than a predetermined component number threshold.

The information analysis apparatus according to claim 1, wherein
the plurality of types of representative vector determination methods include a reference representative vector determination method that determines the centroid vector of the feature vectors of each group, a low-overlap representative vector determination method that makes the degree to which components having a value equal to or greater than a predetermined component value threshold overlap between the representative vectors of different groups lower than with the reference representative vector determination method, and a high-overlap representative vector determination method that makes that degree of overlap higher than with the reference representative vector determination method, and
the selection means evaluates the degree to which components having a value equal to or greater than the predetermined component value threshold overlap between feature vectors of different groups, selects the low-overlap representative vector determination method when the degree of overlap is equal to or greater than a predetermined degree, and selects the high-overlap representative vector determination method when the degree of overlap is smaller than the predetermined degree.

The information analysis apparatus according to claim 3, wherein the selection means calculates a degree of proximity between feature vectors of different groups and evaluates the degree of overlap according to the degree of proximity.

The information analysis apparatus according to claim 3 or 4, wherein the low-overlap representative vector determination method is a method of determining the representative vector of each group using only some of the feature vectors acquired by the acquisition means for the analysis targets classified into that group, those feature vectors being selected based on the representative vector of the group determined by the reference representative vector determination method.

The information analysis apparatus according to claim 1, further comprising coordinate information generation means for generating coordinate information obtained by projecting the representative vector of each group determined by the determination means onto the same coordinate system as the space in which a map is generated.

A program for causing a computer to function as:
acquisition means for acquiring, for each of a plurality of analysis targets each classified into one of a plurality of groups, a feature vector whose components are values indicating the degree to which the analysis target has each of a plurality of types of features;
selection means for selecting one of a plurality of types of representative vector determination methods based on an evaluation of the distribution of components having a value equal to or greater than a predetermined component value threshold in the feature vectors acquired by the acquisition means for all or some of the plurality of analysis targets; and
determination means for determining, by the representative vector determination method selected by the selection means, a representative vector of each group based on the feature vectors acquired by the acquisition means for the analysis targets classified into that group.