JP2003288352A - Information analytic display device and information analytic display program

Info

Publication number: JP2003288352A
Application number: JP2003012661A
Authority: JP
Inventors: Yasuki Iizuka; 泰樹飯塚; Takao Fukushige; 貴雄福重
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 2002-01-23
Filing date: 2003-01-21
Publication date: 2003-10-10

Abstract

PROBLEM TO BE SOLVED: Conventional devices that classify a data set and arrange it in a space in a form suited to that data set require a large amount of computation, and the result is hard to read.

SOLUTION: For a data set expressed as a matrix, whichever of the row data and the column data is fewer in number is arranged in the space by a self-organizing method, and the other data is arranged using only that arrangement result, so the data set can be classified and arranged with less computation. Furthermore, because both kinds of data are displayed, one kind can serve as labels for the other, which gives a result that is easy to read.

COPYRIGHT: (C)2004, JPO

Description

Detailed Description of the Invention

[0001]

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to data analysis and the display of analysis results using a computer. In particular, it can be applied to displaying document search results classified by keyword, and to analyzing and displaying relationships such as those between customers and products in market analysis.

[0002]

2. Description of the Related Art

A word-document relation matrix is often used for classifying and analyzing documents. This matrix takes words as rows and documents as columns, and records how many times each word appears in each document (see FIG. 3). Extracting a row of the matrix gives the vector representation of a word, and extracting a column gives the vector representation of a document. The distance between two words A and B can therefore be defined by the distance between their vectors, the cosine of the vectors, or their inner product. The distance between documents can likewise be defined by vector distance, cosine, inner product, and so on. That is, the distance between words is expressed by comparing vectors whose elements are documents, and the distance between documents by comparing vectors whose elements are words. The distance between a document and a word can also be defined by how often the word appears in the document.
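The word-document matrix described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the sample documents, the whitespace tokenization, and the helper names `build_matrix`, `column`, and `cosine` are all assumptions made for the example.

```python
import math

def build_matrix(docs, vocab):
    """Rows are words, columns are documents; each cell is an occurrence count."""
    return [[doc.split().count(word) for doc in docs] for word in vocab]

def column(matrix, k):
    """Vector representation of document k (one column of the matrix)."""
    return [row[k] for row in matrix]

def cosine(u, v):
    """Cosine of the angle between two count vectors (0.0 if either is all zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical sample data, chosen only to make the counts easy to follow.
docs = ["baseball game baseball", "baseball soccer", "stock market stock"]
vocab = ["baseball", "soccer", "stock"]

m = build_matrix(docs, vocab)
word_sim = cosine(m[0], m[1])                    # words compared over documents
doc_sim = cosine(column(m, 0), column(m, 2))     # documents compared over words
```

A row comparison measures how similarly two words are distributed over the documents, and a column comparison measures how similar two documents are in their vocabulary, which is the symmetry the paragraph describes.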

[0003] Another example of data that can be expressed as such a matrix is the relationship between customers and purchased products in marketing. With products as rows and customers as columns, recording how much of each product each customer has purchased captures the customer-product relationship (see FIG. 9). Here too, a vector representation can be extracted for each customer or for each product. Because a customer's vector indicates that customer's product preferences, customers whose vectors are close to each other can be said to have similar preferences. Again, the distance between customers is expressed by vectors whose elements are products, and the distance between products by vectors whose elements are customers.

[0004] In these examples, documents and words stand in the row-column relationship of a matrix, and so do customers and products. Many combinations of data have this kind of relationship; the description below uses the document-word relationship as its example.

[0005] With the spread of information technology and the Internet, the number of electronic documents continues to grow explosively. Newspaper articles and patent documents, for example, have already reached an enormous number in existing publications alone, and they will certainly continue to increase. To use these documents effectively, it is essential to have search, classification, and analysis means that can accurately select the target documents.

[0006] Existing means for classifying the documents in a search result fall broadly into the following methods.

[0007] (1) The first method creates classification criteria in advance and classifies documents against them. FIG. 17 is a flowchart outlining the operation and processing of this method. As preparation, the classification criteria are first created manually (1701). Once created, classification criteria can normally be reused any number of times. Next, documents are searched (1702), the resulting document set is automatically classified against the criteria (1703), and the results are displayed by category (1704). This method suits material such as newspaper articles, whose categories can be decided in advance.

[0008] (2) The second method arranges a set of documents in a space using only the distances between documents. Repeating the computation until the arrangement converges yields a self-organizing classification. Well-known ways to realize this method include the SOM (self-organizing map) (reference book: T. Kohonen, "Self-Organizing Maps", Springer-Verlag Tokyo, ISBN 4-431-70700-X (1996)) and arrangement by the spring model (reference: Peter Eades, "A heuristic for graph drawing", Congressus Numerantium, Vol. 42 (1984); for an application to document analysis: Isamu Watanabe, "Visual Text Mining", Journal of the Japanese Society for Artificial Intelligence, Vol. 16, No. 2 (2001)).

[0009] The spring model was originally a layout method for undirected graphs, but it can be applied to the classification and arrangement of documents and words. To arrange documents, for example, each document is regarded as a node of a graph, and the nodes are regarded as connected by springs that correspond to the distances (or similarities) between the documents. FIG. 28 shows an example of the initial state.

[0010] In FIG. 28, each node is a document and the zigzag lines schematically represent springs. The method then finds the stable state of this system of nodes and springs, that is, the state in which each spring has settled near its natural length, neither overstretched nor overcompressed. As a result, similar documents are placed close together and dissimilar documents far apart. FIG. 29 shows such an example.

[0011] In the example of FIG. 29, one can see at a glance that documents A, B, and C are similar to one another and that document D is similar to none of them.

[0012] Methods such as the SOM and the spring model produce an arrangement suited to the document set of each search result, so the classification can adapt flexibly to each search. Because these methods classify by self-organization, however, the result does not necessarily follow criteria that a person can recognize. The resulting clusters are therefore usually given labels.

[0013] The flowchart of FIG. 18 shows this process. Documents are first searched (1801); the retrieved documents are self-organized and arranged (1802); the arrangement is divided into clusters (1803); each cluster is labeled (1804); and finally the arrangement and the labels are displayed (1805). Japanese Patent Laid-Open No. 8-263514 is an example that applies the SOM as the self-organizing method; SOM results are often displayed as a set of cells, as in FIG. 22. Spring-model results are often displayed as an arrangement of data in a space, as in FIG. 20.

[0014] (3) The third method classifies documents by how close they are to keywords. FIG. 19 is a flowchart outlining the operation and processing of this method. Documents are first searched (1901); for the resulting documents, a person supplies keywords or keywords are extracted automatically (1902); the keywords are placed at fixed points in a space (1903); each document in the search result is then placed in the same space according to which keywords it is close to (1904); and finally the arrangement is displayed (1905). Japanese Patent Laid-Open No. 2000-76279 is an example of this method.

(4) Japanese Patent Laid-Open No. 10-171823 discloses a technique that groups vectorized documents into a suitable number of clusters and applies a mapping means only to the cluster centers that represent those clusters, thereby arranging documents in a space of fixed dimension according to how near or far their semantic content is. In this technique, the documents to be analyzed are first vectorized by vectorization means 3503, and the vectorized documents are classified into clusters by clustering means 3504. Next, cluster center extraction means 3505 extracts a vector representing each cluster, and cluster center mapping means 3506 arranges the cluster centers in a low-dimensional space while preserving their mutual distances as far as possible. The documents in each cluster are then arranged on the basis of these positions and the classification result obtained by clustering means 3504. When a document is placed, its vector is compared with the centers of the clusters located near the cluster to which it belongs.

[0015]

Problems to Be Solved by the Invention

However, the classification means of (1) can classify only according to predetermined criteria. This may suit dividing newspaper articles into categories such as economy and sports, but document search often calls for classifying results along a new axis each time. Even if sports are divided into professional and amateur, the Olympics, which professionals are now allowed to enter, would need a different axis. Because classification changes with the situation, a method that relies on criteria prepared in advance has inherent limits.

[0016] To perform a self-organizing classification, the classification means of (2) must repeat its computation over all pairs of documents until they settle into suitable positions (until convergence). When the number of documents to classify is enormous, computing until the positions converge is very costly and not very practical.

[0017] Take the spring model as an example. FIG. 28 was a model with four nodes. FIG. 30 is a schematic of a model with eight nodes, with the springs drawn as straight lines. As the figure shows, merely doubling the nodes from four to eight increases the number of springs roughly fourfold (from 6 to 28). When all N documents are connected to one another by springs, the number of springs is N × (N - 1) ÷ 2, so it is on the order of N squared.
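The growth in the number of springs can be checked with a few lines of arithmetic. The sketch below simply evaluates N × (N - 1) ÷ 2; the function name `spring_count` and the sample sizes are chosen for illustration only.

```python
def spring_count(n):
    """Number of springs when every pair among n nodes is connected: n(n-1)/2."""
    return n * (n - 1) // 2

# Sample sizes are arbitrary; 4 and 8 match the node counts of FIG. 28 and FIG. 30.
counts = {n: spring_count(n) for n in (4, 8, 100, 1000)}
for n, c in counts.items():
    print(f"{n} nodes -> {c} springs")
```

For 4 and 8 nodes this gives 6 and 28 springs, and for 1000 documents it already gives 499500, which is why a full pairwise layout becomes costly as the document set grows.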

[0018] Moreover, even if the spring model succeeds in arranging documents in a space as in FIG. 20, deciding how to divide them into clusters is a delicate problem. And even if they can be clustered as in FIG. 21, a label (character string) expressing the meaning of each cluster cannot always be assigned properly, because the clusters come from computations on multidimensional vectors and there is no guarantee that they form a classification that people find easy to understand. Even if one tries, as in the invention disclosed in Japanese Patent Laid-Open No. 2000-82068, to extract and display post-classification labels from document titles, a suitable label cannot always be extracted, for example when the labels of the clustered documents all differ from one another or when many documents with the same title lie in different clusters. Thus, after classification and arrangement, the labeling problem alone cannot be solved without further costly computation. The same applies to the SOM.

[0019] The classification means of (3) presupposes that the keywords are displayed at fixed positions, equally spaced from one another. When a person chooses the keywords, a suitable number of suitable classification words cannot always be set. The six keywords chosen, for example, are not necessarily contrasting terms suited to classifying the document set. When classifying sports-related newspaper articles, one might specify "baseball", "ball games", "high school baseball", and "J-League", whose levels of concept and abstraction are uneven, or terms that do not suit the document set as categories. When a computer extracts the keywords, even if suitable keywords are extracted, placing them at equal intervals risks classifying the given document set into clusters that differ from its true characteristics. If, as in the invention disclosed in Japanese Patent Laid-Open No. 2000-76279, six keywords are assumed to be placed in a hexagon, the arrangement will not be appropriate when only one of the six keywords has a distinctive meaning for that document set.

In the technique of (4), each document is compared, when placed, with the centers of the clusters located near its own cluster, but not with all cluster centers. Consequently, even if a document vector assigned to one cluster actually resembles a cluster center placed outside that neighborhood, the influence of that center is ignored, and the mapping is unlikely to reflect the document's characteristics accurately. Furthermore, no labels are assigned when the documents are placed, so the displayed arrangement may be hard for users to interpret visually. Producing a labeled display requires costly computation, such as determining labels from the arrangement result and associating the labels with the data.

[0020] The present invention solves the above problems of the prior art. It classifies a document set by self-organization using only information from the document set itself, and does so at high speed while producing labels that suit the given document set and reflect its actual content, so that users find them easy to use. A further object is to provide an apparatus that carries out this method.

[0021]

Means for Solving the Problems

To achieve this object, the present invention applies when the relationship between two data sets can be expressed as the relationship between the rows and columns of a matrix. The data in one set are called data objects A and the data in the other set data objects B; the operation is shown in FIG. 1. First, using only data objects A, the data objects A are arranged in a space (for example, a space of three or fewer dimensions that people can grasp visually) so as to preserve the distance relationships among them (101). Next, using only the distances between data objects A and data objects B, and without using the distances among data objects B, the data objects B are arranged in the same space so as to preserve their distance relationships to data objects A (102). Finally, data objects A and B are displayed (103).

[0022] That is, of the two kinds of data objects, the less numerous kind is taken as data objects A. Data objects A are arranged by, for example, self-organizing classification, and the large number of data objects B are then arranged using the arrangement of data objects A.
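The saving from this two-stage procedure can be illustrated by counting the distance relationships each approach has to consider. This is only a rough sketch under stated assumptions: the function names are hypothetical, the sample sizes (10 keywords as data objects A, 1000 documents as data objects B) are invented, and the count ignores how many iterations each layout needs.

```python
def pairs_all_documents(n_docs):
    """Distance pairs when every document is laid out against every other document."""
    return n_docs * (n_docs - 1) // 2

def pairs_two_stage(n_keywords, n_docs):
    """Distance pairs when only A-A pairs and A-B pairs are used (no B-B pairs)."""
    return n_keywords * (n_keywords - 1) // 2 + n_keywords * n_docs

# Hypothetical sizes: 10 keywords (data objects A), 1000 documents (data objects B).
full = pairs_all_documents(1000)
staged = pairs_two_stage(10, 1000)
print(full, staged)
```

With these sample sizes the full pairwise layout involves 499500 document pairs, while the two-stage procedure involves 45 keyword pairs plus 10000 keyword-document pairs. The second count grows linearly in the number of documents as long as the number of keywords stays small.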

[0023] The information analysis display device of the present invention can therefore use the arrangement of data objects A to arrange the large number of data objects B in the space at high speed. In addition, because the arranged data objects A serve as labels indicating the attributes of the data distribution, the user can narrow down document search results and analyze relationships among the data. Labeling and data mapping are thus achieved at the same time, and because the mapping result is already labeled, the analysis can be displayed in a form that users find visually easy to understand.

[0024]

BEST MODE FOR CARRYING OUT THE INVENTION

(First Embodiment)

A first embodiment of the present invention is described below. The first embodiment is a document classification device that searches documents and classifies and arranges the results.

[0025] FIG. 2 is a block diagram of the device of the first embodiment.

[0026] As shown in FIG. 2, the device comprises: input means 201 for entering search conditions; output means 202 for outputting search results and classification and arrangement results; search means 203 for performing searches; search result storage means 204 for storing search results; document data storage means 205 for storing document data; word data storage means 206 for storing word information; matrix storage means 207 for storing the word-document matrix data; distance calculation means 208 for calculating, from the matrix, the distances between words, between documents, and between words and documents; keyword extraction means 209 for extracting keywords from a document set; analysis arrangement means 210 for arranging data in a space based on the distances between the data; and space storage means 211 for storing information about the space. The analysis arrangement means 210 consists of word data mapping means 210a and document data mapping means 210b. The word data mapping means 210a corresponds to the label mapping means recited in the claims, and the document data mapping means 210b corresponds to the data mapping means recited in the claims.

[0027] As shown in FIG. 3, the matrix storage means 207 records words and documents as the rows and columns of a matrix. Matrix element (i, k) is the number of times word i appears in document k.

[0028] FIG. 4 is a flowchart outlining the use and operation of the device.

(Step 401) The user first enters search conditions through the input means 201.

(Step 402) The search means 203 then performs the search and stores the resulting document set in the search result storage means 204.

(Step 403) The keyword extraction means 209 extracts keywords from this search result. Keywords (or characteristic words) are extracted using, for example, the known technique of Japanese Patent Laid-Open No. 11-25108.

(Step 404) For the extracted keywords, the distance calculation means 208 calculates the distances between keywords from the information in the matrix storage means 207, and the word data mapping means 210a places the keywords in the two-dimensional or three-dimensional space of the space storage means 211. The keywords are placed using the spring model, an existing method; details are given later. Each keyword placed in the space is represented by its written form. FIG. 5 shows an example in which only keywords have been placed in a two-dimensional space.

(Step 405) Next, the distance calculation means 208 calculates, from the information in the matrix storage means 207, the distances between the placed keywords and the documents retrieved in step 402, and the document data mapping means 210b places the documents in the two-dimensional or three-dimensional space of the space storage means 211. Details are given later. FIG. 6 shows an example arrangement in a two-dimensional space; the circles in the figure are individual documents.

(Step 406) Finally, the arrangement result is output from the output means 202.

[0029] The placement of keywords in the space in step 404 is now described.

[0030] The distances between keywords (words) can be calculated from the matrix storage means 207. That is, from the vector representations of the words, the distance can be defined by the distance between vectors, the cosine between vectors, or the inner product of vectors. The distance calculation means 208 calculates the distances between all keywords to be placed in advance and holds them temporarily in the form of a triangular matrix, as in FIG. 31.
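A triangular matrix of keyword distances, as described above, might be held as follows. This is a sketch under assumptions: the text leaves the distance definition open (vector distance, cosine, or inner product), so "1 minus cosine" is just one choice made here, and the names `cosine_distance`, `triangular_distances`, and `lookup` as well as the sample vectors are hypothetical.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity; one possible distance (an assumption, see lead-in)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 1.0
    return 1.0 - dot / (nu * nv)

def triangular_distances(vectors):
    """Row i holds the distances from item i to every item j < i (no diagonal)."""
    return [[cosine_distance(vectors[i], vectors[j]) for j in range(i)]
            for i in range(len(vectors))]

def lookup(tri, i, j):
    """Read a symmetric distance out of the lower-triangular storage."""
    if i == j:
        return 0.0
    if i < j:
        i, j = j, i
    return tri[i][j]

# Hypothetical keyword vectors: rows of a word-document count matrix.
keyword_vectors = [[2, 1, 0], [0, 1, 0], [0, 0, 2]]
tri = triangular_distances(keyword_vectors)
```

Because the distance is symmetric, storing only the entries below the diagonal halves the memory needed while still letting `lookup` answer any pair.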

[0031] All keywords are then connected by springs whose lengths correspond to the distances calculated from the vectors. In the initial state the keywords are placed at arbitrary positions, for example equally spaced on a single circle in a two-dimensional space. FIG. 28 and FIG. 30 show this schematically. Each spring exerts no force when its length equals the distance dm calculated from the vectors; stretched beyond that length it exerts an attractive force, and compressed below it a repulsive force. FIG. 32 shows this, with distance on the horizontal axis and the strength of attraction on the vertical axis. The attraction grows as the distance grows. At distances shorter than the proper distance dm the attraction is negative, that is, a repulsive force acts. Because the initial placement was made arbitrarily, without regard to the spring forces, a force acts on each node (word) in the initial state.
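The initial placement and the spring force described above can be sketched as below. The text only states that the force is zero at dm, attractive beyond it, and repulsive below it; the linear (Hooke-style) law, the `stiffness` parameter, and the function names here are assumptions for illustration.

```python
import math

def circle_layout(n, radius=1.0):
    """Place n nodes at equal intervals on a circle in 2D (the example initial state)."""
    return [(radius * math.cos(2 * math.pi * i / n),
             radius * math.sin(2 * math.pi * i / n)) for i in range(n)]

def spring_force(current, natural, stiffness=1.0):
    """Signed attraction between two nodes joined by a spring.

    Positive means attraction (spring stretched past its natural length dm),
    negative means repulsion (spring compressed). The linear form is assumed.
    """
    return stiffness * (current - natural)

start = circle_layout(4)
stretched = spring_force(2.0, 1.0)    # longer than dm: pulls the nodes together
compressed = spring_force(0.5, 1.0)   # shorter than dm: pushes the nodes apart
```

Any force curve that crosses zero at dm and increases with distance would match the description of FIG. 32; the linear one is simply the easiest to write down.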

[0032] Next, in this system of nodes and springs, each node is moved little by little toward a position where the system is stable, until the whole system reaches a stable state. This series of calculations is described with reference to FIG. 33.

(Step 3301) First, the distance calculation means 208 calculates the distances between keywords from the information in the matrix storage means 207 and holds them temporarily in the form of a triangular matrix, as in FIG. 31.

(Step 3302) The keywords are regarded as nodes joined by springs whose lengths correspond to the distances, and the nodes are placed at initial positions in a two-dimensional (or three-dimensional) space. Any initial positions will do, but placement on a circle for a two-dimensional space, or on a single sphere for a three-dimensional space, works well.

(Step 3303) Steps 3304 and 3305 below are repeated a prescribed number of times R.

(Step 3304) Step 3305 below is performed for every node.

(Step 3305) For one node i, the forces of all springs acting on the node are calculated and combined. The resultant force points in a single direction, so the node is moved in that direction by a small distance k × α(r) × f that depends on the magnitude of the resultant. Here f is the magnitude of the resultant force, and k is a constant that converts force into distance.

[0033] The α(r) used in step 3305 is a parameter that decreases as the iterations started in step 3303 proceed. It is given, for example, by the following equation.

[0034]

[Equation 1] α(r) = 1 - (r / R)

[0035] Here R is the total number of iterations and r is the index of the current iteration. With such a parameter, the distance moved becomes shorter with each iteration, and as a result the contact points converge to positions at which the whole system is stable.
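The keyword placement of steps 3301 to 3305 can be sketched as follows. This is only an illustrative sketch, not the implementation of the embodiment: the function names `alpha` and `place_keywords`, the default values of `R` and `k`, and the linear spring force (force equal to the difference between the current gap and the target distance) are assumptions introduced here; the relationship between distance and force actually intended is the one shown in FIG. 32.

```python
import math

def alpha(r, R):
    """Damping factor of Equation 1: falls linearly from 1 toward 0."""
    return 1.0 - r / R

def place_keywords(dist, R=200, k=0.05):
    """Place n contact points in 2-D so that pairwise gaps approach dist[i][j].

    dist is a symmetric table of target distances (hypothetical input format).
    """
    n = len(dist)
    # Step 3302: initial positions on the circumference of a circle.
    pos = [[math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n)]
           for i in range(n)]
    for r in range(R):                       # step 3303: repeat R times
        for i in range(n):                   # step 3304: every contact point
            fx = fy = 0.0
            for j in range(n):               # step 3305: combine spring forces
                if i == j:
                    continue
                dx = pos[j][0] - pos[i][0]
                dy = pos[j][1] - pos[i][1]
                gap = math.hypot(dx, dy) or 1e-9
                # Assumed linear spring: pulls when stretched, pushes when compressed.
                stretch = gap - dist[i][j]
                fx += stretch * dx / gap
                fy += stretch * dy / gap
            # Move along the resultant by k * alpha(r) * f; (fx, fy) already
            # carries both the direction and the magnitude f.
            pos[i][0] += k * alpha(r, R) * fx
            pos[i][1] += k * alpha(r, R) * fy
    return pos
```

For example, `place_keywords([[0, 1, 1], [1, 0, 1], [1, 1, 0]])` starts from three points on the unit circle and draws them together into a triangle whose sides are close to 1.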

[0036] Note that the keyword placement in step 404 may use SOM, an existing method, instead of the spring model.

[0037] This concludes the detailed description of step 404. Next, the document placement in step 405 is described.

[0038] The distance from a word to a document can be calculated from the matrix shown in FIG. 3. The distance between a word p and a document q can be defined as a quantity proportional to the reciprocal of the matrix element (p, q). That is, if word p appears many times in document q, the matrix element (p, q) is large, so its reciprocal is small; in other words, the word and the document are closer. However, if a document b is much larger than a document a (for example, it contains many more words), the raw values cannot be compared directly. The matrix elements therefore need normalization, such as division by the size of each document. These calculations are performed by the distance calculation means 208 based on the information in the matrix storage means 207.
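A minimal sketch of this distance definition follows. The function name `word_doc_distances`, the use of each column sum as the "size" of a document, and the use of `None` for a word that does not occur in a document (no spring) are assumptions made for illustration; the text only requires some normalization by document size and a distance proportional to the reciprocal.

```python
def word_doc_distances(counts):
    """counts[p][q] = occurrences of word p in document q (rows: words, columns: documents)."""
    n_words, n_docs = len(counts), len(counts[0])
    # Assumed normalization: document size = total of the counts in its column.
    doc_size = [sum(counts[p][q] for p in range(n_words)) for q in range(n_docs)]
    dist = [[None] * n_docs for _ in range(n_words)]
    for p in range(n_words):
        for q in range(n_docs):
            if counts[p][q] > 0 and doc_size[q] > 0:
                freq = counts[p][q] / doc_size[q]   # normalized matrix element
                dist[p][q] = 1.0 / freq             # reciprocal: frequent means close
    return dist
```

With `counts = [[2, 0], [2, 10]]`, word 1 occurs ten times in document 1 but makes up all of that document, so its distance is 1.0, while both words are at distance 2.0 from the shorter document 0.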

[0039] The document placement in step 405 is calculated by applying the spring model. Here we consider a system in which each document is connected to the already placed words by springs whose lengths correspond to the word-document distances. FIG. 7 shows an example for a single document; in FIG. 7 the springs are drawn schematically as straight lines. This system differs from those of FIG. 28 and FIG. 30 in that, in the system of FIG. 7, all of the words have already been placed and are fixed. It therefore suffices to move only the one document and find the document position at which the springs are stable. In a system such as that of FIG. 28, by contrast, the stable positions of the system (the springs) had to be found while moving every contact point.

[0040] This calculation is described with reference to FIG. 34.
(Step 3401) The distance calculation means 208 calculates the distance from each word to each document based on the information in the matrix storage means 207 and stores the result temporarily.
(Step 3402) The following processing is performed for every document.
(Step 3403) The document is placed at an initial position. Any initial position will do.
(Step 3404) Step 3405 below is repeated a prescribed number of times T.
(Step 3405) The forces of all springs acting on the document are calculated and combined. The resultant force points in one direction, and the contact point (here, the document) is moved in that direction by a small distance k × α(r) × f that depends on the magnitude of the resultant. Here f is the magnitude of the resultant force, k is a constant for converting force into distance, and α(r) is a damping parameter of the same kind as the one used in step 3305.
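Steps 3403 to 3405 for a single document can be sketched as follows. As before this is an illustration under assumptions, not the embodiment itself: `place_document`, its default parameters, the linear spring force, and the convention that a distance of `None` means "no spring to this keyword" are all introduced here.

```python
import math

def place_document(keyword_pos, dists, T=100, k=0.1, start=(0.0, 0.0)):
    """Place one document among fixed keywords.

    keyword_pos: fixed (x, y) position of each keyword.
    dists: target distance from the document to each keyword, or None.
    """
    x, y = start                                      # step 3403: any initial position
    for r in range(T):                                # step 3404: repeat T times
        fx = fy = 0.0
        for (kx, ky), d in zip(keyword_pos, dists):   # step 3405: combine spring forces
            if d is None:
                continue
            dx, dy = kx - x, ky - y
            gap = math.hypot(dx, dy) or 1e-9
            stretch = gap - d                         # assumed linear spring
            fx += stretch * dx / gap
            fy += stretch * dy / gap
        a = 1.0 - r / T                               # same kind of damping as Equation 1
        x += k * a * fx
        y += k * a * fy
    return x, y
```

Only the document moves; the keyword positions are read but never updated, which is what keeps the cost per document proportional to the number of keywords. For instance, with keywords fixed at (0, 0) and (4, 0) and a target distance of 2 to each, a document started at (1, 0) settles near (2, 0).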

[0041] This concludes the detailed description of step 405.

[0042] The characteristic feature of the present invention is the order of step 404 and step 405. That is, the small number of keywords is placed first, using a spring model or the like, and then the large number of documents is placed based solely on their positional relationship to the keywords, which are already fixed. Distances between documents are not calculated.

[0043] In general, when documents are retrieved from a large database, the number Q of retrieved documents often reaches several hundred. If an existing method were used, calculating the distances between documents and classifying them by self-organization with a spring model or the like, Q × Q calculations would basically be required. Moreover, until positions at which the mutual distances are in balance are found, that is, until a stable state of the spring system is found, the distance calculation must be repeated while each document is moved by small steps, so Q × Q × R calculations are needed (R being the number of iterations until convergence). Q is on the order of hundreds, and R is on the order of hundreds to tens of thousands.

[0044] Moreover, the displayed result is a classification of documents alone, as in FIG. 20 and FIG. 22, and would be unintelligible to the user without suitable labels. If this labeling requires on the order of P × Q × S calculations, where P is the number of labels and S is a constant, then on the order of (Q × Q × R + P × Q × S) calculations are required in total.

[0045] In the present invention, by comparison, the labels are the keywords themselves. With P keywords, their self-organizing placement likewise requires P × P × R calculations, and placing the Q documents using the fixed keywords requires P × Q × T calculations (T being a constant), so on the order of (P × P × R + P × Q × T) calculations are required in total.

[0046] The number of keywords that a user can make use of for classification is at most about 10 to 30. Assuming, then, that the number of documents is on the order of hundreds and the number of keywords is on the order of tens,

[0047]

[Equation 2] Q = 10 × P

[0048] the above relationship can be assumed. Accordingly, whereas the existing method requires on the order of (100 × P × P × R + 10 × P × P × S) calculations, the present invention finishes in the smaller number of (P × P × R + 10 × P × P × T) calculations. Depending on S and T, the amount of computation is roughly one tenth to one hundredth. Naturally, less memory is used as well.
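The two operation counts can be compared numerically with a small sketch. The function names and the sample values (P = 20, Q = 200, R = 1000, S = T = 10) are chosen here purely for illustration; only the two formulas come from the text.

```python
def existing_cost(P, Q, R, S):
    """Document-to-document layout plus labeling: Q*Q*R + P*Q*S."""
    return Q * Q * R + P * Q * S

def proposed_cost(P, Q, R, T):
    """Keyword layout plus document placement against fixed keywords: P*P*R + P*Q*T."""
    return P * P * R + P * Q * T

P, Q, R, S, T = 20, 200, 1000, 10, 10   # illustrative values with Q = 10 * P
old = existing_cost(P, Q, R, S)         # 40_040_000
new = proposed_cost(P, Q, R, T)         # 440_000
print(old / new)                        # 91.0
```

With these values the proposed ordering needs about one ninetieth of the calculations. When T is comparable to R the second term dominates and the saving shrinks toward a factor of ten, which matches the stated range of one tenth to one hundredth.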
The calculation is completed with a smaller number of times (P × R + 10 × P × P × T) times. Although it depends on S and T, the calculation amount is about 1/10 to 1/100. Naturally, the memory used can also be small.

[0049] In FIG. 6 the documents are simply placed around the words. To present the information to the user more clearly, it is also possible to highlight the words that attract more documents, that is, the words that are close to more documents under a vector distance calculation on the matrix (distance, cosine, inner product, and so on). This is because the matrix storage means 207 and the distance calculation means 208 of FIG. 2 make it possible to calculate immediately which word is close to how many documents. FIG. 15 shows an example in which the keywords that attract more documents, that is, at least a certain threshold number of documents, are highlighted. This makes it possible to analyze which keywords the document set has been classified by.

[0050] Conversely, it is also possible to highlight the keywords that did not attract many documents. As in the previous example, the matrix storage means 207 and the distance calculation means 208 of FIG. 2 make it possible to calculate immediately which word is close to how many documents. A keyword that did not attract many documents is an effective word for narrowing down search results, so highlighting it gives the user guidance on which words to search with. FIG. 16 shows an example in which the keywords that did not attract many documents are highlighted.
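Both kinds of highlighting reduce to counting, for each keyword, the documents that lie within some distance of it and comparing the count with a threshold. The sketch below assumes a precomputed word-document distance table `dist[p][q]` (with `None` where a word does not occur in a document); the function names, the `max_distance` cut-off, and the `attract_many` switch are hypothetical names introduced for illustration.

```python
def nearby_counts(dist, max_distance):
    """For each keyword (row), count the documents closer than max_distance."""
    return [sum(1 for d in row if d is not None and d < max_distance)
            for row in dist]

def keywords_to_highlight(dist, max_distance, threshold, attract_many=True):
    """Return the row indices of the keywords to highlight.

    attract_many=True  selects keywords with at least `threshold` nearby documents.
    attract_many=False selects keywords with fewer than `threshold` nearby documents.
    """
    counts = nearby_counts(dist, max_distance)
    if attract_many:
        return [p for p, c in enumerate(counts) if c >= threshold]
    return [p for p, c in enumerate(counts) if c < threshold]
```

The first mode corresponds to the display of FIG. 15 and the second to that of FIG. 16; the same counts serve both.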

[0051] Data that has been classified and placed can be selected with a GUI, as in FIG. 23, and used for further analysis. Possible operations include classifying and placing again only the selected document set, as in FIG. 24; searching again using only the selected word set as search terms; extracting and displaying keywords from the selected document set alone; and deleting the selected document set or word set and then performing the classification and placement, or the search, again.

[0052] It is also possible to select a word and highlight the documents close to that word (FIG. 25), or to display the range of documents close to that word (FIG. 26).

[0053] It is also possible to select a word and move it within the layout, thereby changing the document placement dynamically (FIG. 27).

[0054] In this embodiment, about 10 to 20 keywords are placed first and the documents are placed afterwards. However, to analyze a few dozen documents, for example, the documents alone may be placed first by a self-organizing method, and then several dozen to several hundred keywords may be placed using only the placed documents. In this case the keywords can be clustered by means of the placed documents. Clustering keywords from documents in this way can also be used for purposes such as idea generation support. Thus, by first placing the smaller set of data objects by a self-organizing method and then placing a larger set of data objects of another kind by relying on the placed ones, data placement and analysis become possible with little computation.

[0055] As described above, in this embodiment only the keywords are first placed in the space by self-organization, and then each document is placed using only its distances to the placed keywords. The calculation therefore finishes at far lower cost than calculating over all combinations of documents, each document is still placed appropriately, and using the words as labels makes the placement result easy to read, so the practical effect is large.

[0056] (Second Embodiment) The second embodiment of the present invention is described below. The second embodiment is a market data analysis apparatus.

[0057] FIG. 8 is a block diagram of the apparatus of the second embodiment.

[0058] As shown in FIG. 8, this apparatus comprises: input means 801 for inputting search conditions; output means 802 for outputting search results and classification and placement results; search means 803 for performing searches; data temporary storage means 804 for storing data such as search results; customer data storage means 805 for storing customer data; product data storage means 806 for storing product information; matrix storage means 807 for storing the matrix data of products and customers; distance calculation means 808 for calculating, using the matrix, the distances between products, between customers, and between products and customers; sales data storage means 809 for storing sales data; analysis arrangement means 810 for placing the data in a space according to the distances between them; and space storage means 811 for storing information on the space. The analysis arrangement means 810 comprises product data mapping means 810a and customer data mapping means 810b. The product data mapping means 810a corresponds to the label mapping means recited in the claims, and the customer data mapping means 810b corresponds to the data mapping means recited in the claims.

[0059] As shown in FIG. 9, the matrix storage means 807 records customers and products as the rows and columns of a matrix. The matrix element (i, k) is the total number of units of product i purchased by customer k. The sales data storage means 809 records when, and by which customer, each product was purchased.

[0060] FIG. 10 is a flowchart outlining the use and operation of this apparatus.
(Step 1001) The user first inputs a search condition from the input means 801. A search condition is, for example, "the 20 best-selling products of the last three months and the customers who purchased them".
(Step 1002) The search means 803 then searches the sales data storage means 809 and stores the resulting sets of products and customers in the data temporary storage means 804.
(Step 1003) For the retrieved products, the distance calculation means 808 calculates the distances between products based on the information in the matrix storage means 807, and the product data mapping means 810a places the products in the two-dimensional or three-dimensional space of the space storage means 811. An existing method such as the spring model or SOM (self-organizing map) is used to place the products in the space. A product placed in the space is represented by its product name. FIG. 11 shows an example in which only the products are placed in a two-dimensional space.
(Step 1004) Next, the distance calculation means 808 calculates the distances between the placed products and the customers retrieved in step 1002 based on the information in the matrix storage means 807, and the customer data mapping means 810b places the customers in the two-dimensional or three-dimensional space of the space storage means 811. FIG. 12 shows an example of placement in a two-dimensional space. Each circle in this figure is an individual customer.
(Step 1005) Finally, the placement result is output from the output means 802.

[0061] In this way, for data expressed as a matrix, the method of the present invention determines the initial placement using only the row data set or only the column data set, and determines the placement of the other data from the already placed data alone, which makes analysis and placement of the data in a space possible with little computation. For marketing data consisting of customers and products as well, a large number of customers under analysis can be clustered by the products they purchased. If only the customer data were classified by self-organizing clustering or the like, the result would be hard to interpret, as in FIG. 13; displaying the names of the purchased products makes it possible to present the result clearly, as in FIG. 12.

[0062] Of course, in marketing analysis and the like, part of the data display window can be selected with the mouse, as in FIG. 14, so that only the data in that part is extracted and subjected to more detailed analysis.

[0063] As described above, in this embodiment only the products are first placed in the space, and then each item of customer data is placed using only its distances to the placed products. The calculation therefore finishes at far lower cost than calculating over all combinations of customers, each customer is still placed appropriately, and using the products as labels makes the placement result easy to read, so the practical effect is large.

[0064]

[Effects of the Invention] As described above, according to the present invention, for a pair of data sets that can be expressed as a matrix, such as words and documents or products and customers, the smaller data set is placed in the space first and the larger data set is then placed in the space using that placement. An appropriate placement is thereby achieved with little computation, and because both data sets are placed in the space, one of them can be used as labels, giving a data placement result that is easy for the user to understand. In addition, since each item of the larger data set is compared with all of the labels when it is placed, the influence of distant labels as well as nearby ones is taken into account. As a result, faster and more accurate data analysis and data classification become possible, and the practical effect is very large.

[Brief description of drawings]

FIG. 1 is a flowchart showing an outline of the processing of the present invention.

FIG. 2 is a block diagram showing the configuration of the document classification device according to the first embodiment of the present invention.

FIG. 3 is a conceptual diagram showing the relationship matrix between documents and words according to the first embodiment of the present invention.

FIG. 4 is a flowchart showing an outline of operation and processing according to the first embodiment of the present invention.

FIG. 5 is a schematic diagram showing an example of a placement result for words only according to the first embodiment of the present invention.

FIG. 6 is a schematic diagram showing an example of an output result according to the first embodiment of the present invention.

FIG. 7 is a conceptual diagram showing the document placement calculation according to the first embodiment of the present invention.

FIG. 8 is a block diagram showing the market data analysis apparatus according to the second embodiment of the present invention.

FIG. 9 is a conceptual diagram showing the relationship matrix between products and customers according to the second embodiment of the present invention.

FIG. 10 is a flowchart showing an outline of operation and processing according to the second embodiment of the present invention.

FIG. 11 is a schematic diagram showing an example of a placement result for products only according to the second embodiment of the present invention.

FIG. 12 is a schematic diagram showing an example of an output result according to the second embodiment of the present invention.

FIG. 13 is a conceptual diagram showing that the data is hard to read when product names are not displayed in the second embodiment of the present invention.

FIG. 14 is a schematic diagram showing a GUI operation on an output result according to the second embodiment of the present invention.

FIG. 15 is a schematic diagram showing an example of word highlighting according to the second embodiment of the present invention.

FIG. 16 is a schematic diagram showing an example of word highlighting according to the second embodiment of the present invention.

FIG. 17 is a flowchart showing an outline of the processing of a first example of a conventional document classification method.

FIG. 18 is a flowchart showing an outline of the processing of a second example of a conventional document classification method.

FIG. 19 is a flowchart showing an outline of the processing of a third example of a conventional document classification method.

FIG. 20 is a schematic diagram showing an example of the result of classifying and placing data or documents with a conventional spring model.

FIG. 21 is a schematic diagram showing an example in which the result of classifying and placing data or documents with a conventional spring model is divided into clusters.

FIG. 22 is a schematic diagram showing an example of a classification result display using conventional SOM.

FIG. 23 is a schematic diagram showing a GUI operation on an output result according to the first embodiment of the present invention.

FIG. 24 is a schematic diagram showing the operation of reclassifying and re-placing a subset according to the first embodiment of the present invention.

FIG. 25 is a schematic diagram showing the related-document highlighting processing according to the first embodiment of the present invention.

FIG. 26 is a schematic diagram showing the related-document range display processing according to the first embodiment of the present invention.

FIG. 27 is a schematic diagram showing the operation and processing of moving a word and dynamically re-placing documents according to the first embodiment of the present invention.

FIG. 28 is a schematic diagram showing the physical model of the spring model and its initial placement.

FIG. 29 is a schematic diagram showing the physical model of the spring model and an example of its final result.

FIG. 30 is a schematic diagram showing the physical model of the spring model with eight contact points and its initial placement.

FIG. 31 is a schematic diagram showing the matrix that stores the calculated distances between words according to the first embodiment of the present invention.

FIG. 32 is a schematic diagram showing the relationship between distance and force in the spring model.

FIG. 33 is a flowchart showing the word placement processing according to the first embodiment of the present invention.

FIG. 34 is a flowchart showing the document placement processing according to the second embodiment of the present invention.

FIG. 35 is a flowchart showing an outline of the processing of a fourth example of a conventional document classification method.

[Explanation of symbols]

201 Input means
202 Output means
203 Search means
204 Search result storage means
205 Document data storage means
206 Word data storage means
207 Matrix storage means
208 Distance calculation means
209 Keyword extraction means
210 Analysis arrangement means
211 Space storage means
801 Input means
802 Output means
803 Search means
804 Data temporary storage means
805 Customer data storage means
806 Product data storage means
807 Matrix storage means
808 Distance calculation means
809 Sales data storage means
810 Analysis arrangement means
811 Space storage means

Claims

[Claims]

1. An information analysis display device comprising analysis arrangement means, the analysis arrangement means comprising: label mapping means for taking, of two data sets stored in storage means, the set with the smaller number of data items as a set A used as labels indicating data attributes of a data distribution, and mapping each data object A, being the data included in the set A, into a space of three or fewer dimensions while maintaining the relative distance relationships among the data objects A; and data mapping means for taking the set with the larger number of data items as a set B that is the target of data analysis, and mapping the data objects B, being the data included in the set B, into the space while maintaining the relative distance relationships between the data objects B and the data objects A.

2. The information analysis display device, wherein the data mapping means maps the data objects B with the positions in the space of the data objects A, as obtained by the label mapping means, held fixed.

3. The information analysis display device, further comprising output means for visually displaying the positions in the space of at least one of the data objects A and B arranged by the analysis arrangement means.

4. The information analysis display device according to claim 1, wherein one of the data objects A and B is word data and the other is text data.

5. The information analysis display device according to claim 1, wherein one of the data objects A and B is product data and the other is customer data.

6. An information analysis display device comprising: matrix storage means for storing a matrix in which, of two data sets, the data included in the data set used as labels indicating data attributes of a data distribution are data objects A and the data included in the other data set, which is the target of data analysis, are data objects B, with the data objects A as rows, the data objects B as columns, and values indicating the relationship between the data objects A and B as elements; distance calculation means for calculating the distances between data objects A, and between data objects A and B, stored by the matrix storage means; space storage means for storing a space in which these data objects are arranged; output means for visually outputting information on the space holding the arrangement result; and analysis arrangement means for arranging the data objects in the space, wherein the analysis arrangement means comprises label mapping means for arranging the data objects A in the space using only the distance relationships among the data objects A calculated by the distance calculation means, and data mapping means for determining, on the basis of that arrangement, the positions of the data objects B using the distance relationships between the data objects A and B calculated by the distance calculation means and arranging the data objects B in the space, and wherein the output means displays the arrangement result of at least one of the data objects A and B.

7. The information analysis display device according to claim 6, wherein one of the data objects A and B is word data and the other is document data, and the value indicating the relationship between the data objects A and B is the frequency with which the word appears in the document.

8. The information analysis display device according to claim 6, wherein one of the data objects A and B is product data and the other is customer data, and the value indicating the relationship between the data objects A and B is the frequency with which the customer purchased the product.

9. The information analysis display device according to claim 3, wherein the output means highlights a data object A that has more data objects B than a threshold at a distance shorter than a predetermined distance.

10. The information analysis display device according to claim 3, wherein the output means highlights a data object A that has fewer data objects B than a threshold at a distance shorter than a predetermined distance.

11. An information analysis display program for causing a computer, in order to perform data analysis and display on the computer, to execute: a label mapping step of taking, of two data sets stored in storage means, the set with the smaller number of data items as a set A used as labels indicating data attributes of a data distribution, and arranging the data objects A, being the data included in the set A, in a space of three or fewer dimensions while maintaining the relative distance relationships among the data objects A; a data mapping step of taking the set with the larger number of data items as a set B that is the target of data analysis, and arranging the data objects B, being the data of the set B, in the space while maintaining the relative distance relationships between the data objects B and the data objects A; and a step of visually displaying the positions of at least one of the data objects A and B mapped in the space.

12. An information analysis display program for data analysis and display on a computer, the program causing the computer to function as: matrix storage means for storing a matrix in which, of two data sets, the data included in the data set used as labels indicating data attributes of a data distribution are data objects A, the data included in the other data set, which is to be analyzed, are data objects B, the data objects A are rows, the data objects B are columns, and the elements are values indicating the relationship between data objects A and B; distance calculation means for calculating the distances between data objects A and between data objects A and B stored by the matrix storage means; space storage means for storing a space in which these data objects are arranged; label mapping means for mapping the data objects A in the space using only the distance relationships between the data objects A calculated by the distance calculation means; data mapping means for determining, based on that arrangement, the arrangement of the data objects B using the distance relationships between data objects A and B calculated by the distance calculation means, and mapping the data objects B in the space; and output means for displaying the mapping result of at least one of the data objects A and B.

13. An information analysis display method for performing data analysis of two data sets and displaying the analysis result, comprising: a label mapping step of taking, from the two data sets stored in storage means, the set having the smaller number of data as a set A used as labels indicating data attributes of a data distribution, and arranging the data objects A, which are the data included in set A, in a space of three dimensions or less while maintaining the relative distance relationships among the data objects A; a data mapping step of taking the set having the larger number of data as a set B to be analyzed, and arranging the data objects B, which are the data of set B, in the space while maintaining the relative distance relationships between the data objects B and the data objects A; and a step of visually displaying the arrangement positions of at least one of the mapped data objects A and B.

14. An information analysis display method for performing data analysis of two data sets and displaying the analysis result, comprising: a matrix storing step of storing a matrix in which the data included in one of the two data sets are data objects A, the data included in the other set are data objects B, the data objects A are rows, the data objects B are columns, and the elements are values indicating the relationship between data objects A and B; a distance calculation step of calculating the distances between the data objects A and between the data objects A and B stored by the storage means; a space storage step of storing a space in which the data objects A and B are mapped; a label mapping step of mapping the data objects A in the space using only the distance relationships between the data objects A calculated by the distance calculation means; a data mapping step of determining, based on the mapping result of the data objects A, the arrangement of the data objects B using the distance relationships between data objects A and B calculated by the distance calculation means, and mapping the data objects B in the space; and an output step of displaying the mapping result of at least one of the data objects A and B.
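The two-stage procedure recited in the method claims (label mapping followed by data mapping) can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation described in this publication: the label mapping here swaps in a simple gradient-descent stress minimization in 2D instead of the self-organizing method mentioned in the abstract, and the data mapping places each data object B at the frequency-weighted centroid of the label positions, which is one hypothetical way of using the A-to-B relationship. All function names and parameters are illustrative.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cos(u, v); an all-zero vector is maximally distant."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 if nu == 0 or nv == 0 else 1.0 - dot / (nu * nv)


def map_labels(dist, steps=500, lr=0.05):
    """Label mapping: place data objects A in 2D using only A-A distances.

    Assumption: gradient descent on sum((d_ij - dist_ij)^2), starting
    from points spread evenly on a unit circle (deterministic).
    """
    n = len(dist)
    pos = [[math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n)]
           for i in range(n)]
    for _ in range(steps):
        for i in range(n):
            gx = gy = 0.0
            for j in range(n):
                if i == j:
                    continue
                dx = pos[i][0] - pos[j][0]
                dy = pos[i][1] - pos[j][1]
                d = math.hypot(dx, dy) or 1e-9  # avoid division by zero
                err = d - dist[i][j]
                gx += err * dx / d
                gy += err * dy / d
            pos[i][0] -= lr * gx
            pos[i][1] -= lr * gy
    return pos


def map_data(matrix, label_pos):
    """Data mapping: place each data object B using the fixed label layout.

    Assumption: a column's position is the centroid of the label
    positions weighted by the matrix elements; the labels are not moved.
    """
    placed = []
    for j in range(len(matrix[0])):
        weights = [row[j] for row in matrix]
        total = sum(weights)
        if total == 0:
            placed.append(None)  # unrelated to every label
            continue
        x = sum(w * p[0] for w, p in zip(weights, label_pos)) / total
        y = sum(w * p[1] for w, p in zip(weights, label_pos)) / total
        placed.append((x, y))
    return placed


def analyze(matrix):
    """Run both stages: rows are data objects A, columns are data objects B."""
    n = len(matrix)
    dist = [[cosine_distance(matrix[i], matrix[j]) for j in range(n)]
            for i in range(n)]
    label_pos = map_labels(dist)
    return label_pos, map_data(matrix, label_pos)
```

Because the iterative layout runs only over the smaller set A and each data object B is then placed in a single pass, the cost of the second stage grows linearly with the size of set B, which reflects the reduced computation the abstract describes.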