JP2002245067A

JP2002245067A - Information retrieval unit

Info

Publication number: JP2002245067A
Application number: JP2001037163A
Authority: JP
Inventors: Hiroyoshi Konaka; 裕喜小中; Shinichiro Tsudaka; 新一郎津高; Ryuichi Kobune; 隆一小船; Hidekazu Arita; 英一有田
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 2001-02-14
Filing date: 2001-02-14
Publication date: 2002-08-30

Abstract

PROBLEM TO BE SOLVED: To obtain an information retrieval unit that calculates a degree of similarity reflecting the relationships between keywords, and thereby improves the precision of classification or retrieval. SOLUTION: The unit is provided with a document database 10 storing a plurality of document data, a vector generating means 20 for generating a feature vector for each document data, a classifying means 30 for calculating the degree of similarity between the feature vectors and classifying the document data, and an output means 40 for outputting the classification result of the document data. The vector generating means 20 analyzes each document data, extracts keywords and the relationships between the keywords, and generates the feature vector based on the appearance frequencies of both.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical field of the invention] The present invention relates to the classification and retrieval of digitized document data, and more particularly to an information retrieval apparatus that automatically classifies and retrieves document data.

[0002]

[Prior art] For the classification and retrieval of digitized document data, an information retrieval apparatus has been proposed in, for example, Japanese Patent Application Laid-Open No. H11-110395. That apparatus aggregates the appearance frequencies of synonyms to generate feature vectors, and calculates the similarity between the feature vectors to classify the document data. The same publication also proposes assigning a weight to each of a plurality of words that are in a synonym relationship.

[0003] Further, Japanese Patent Application Laid-Open No. H10-198691, for example, discloses generating a feature vector by aggregating the appearance frequencies of synonyms. It also discloses registering adjacent word pairs and synonym pairs found in a document database and using them in the calculation of the feature vector.

[0004]

[Problem to be solved by the invention] Conventional information retrieval apparatuses configured in this way do not perform classification or retrieval that reflects relationships between keywords, such as dependency relations between words and phrases. As a result, they could not classify or retrieve documents with high accuracy.

[0005] The present invention has been made to solve the above problem. Its object is to obtain an information retrieval apparatus that enables similarity calculation reflecting not only keywords but also the relationships between keywords, and that can thereby improve the accuracy of classification or retrieval.

[0006]

[Means for solving the problem] An information retrieval apparatus according to the present invention has a document database that stores a plurality of document data, vector generating means that generates a feature vector for each document data, classifying means that calculates the similarity between the feature vectors and classifies the document data, and output means that outputs the classification result of the document data. The vector generating means analyzes each document data to extract keywords and relationships between the keywords, and generates the feature vector based on the appearance frequencies of both.

[0007] Another information retrieval apparatus according to the present invention has a document database that stores a plurality of document data, search expression input means that inputs a search expression, vector generating means that generates a feature vector for each document data and for the search expression, similarity calculating means that calculates the similarity between the feature vector for the search expression and the feature vector for each document data, and output means that outputs document data having a feature vector with a high similarity. The vector generating means analyzes each document data and the search expression to extract keywords and relationships between the keywords, and generates the feature vectors based on their appearance frequencies.

[0008] The vector generating means uses dependency relations as the relationships between keywords.

[0009] The vector generating means uses closeness in distance between keywords as the relationship between keywords.

[0010] Further, in place of the appearance frequency of each keyword included in a keyword group belonging to the same category, or of each relationship between keywords that includes such a keyword, the vector generating means uses the sum of those appearance frequencies as the appearance frequency of a keyword representing the category, or of a relationship between keywords that includes the representative keyword.

[0011] Furthermore, the vector generating means generates the feature vector based on weights that the user specifies for the appearance frequencies of the keywords and of the relationships between keywords.

[0012]

[Embodiments of the invention] Embodiment 1. FIG. 1 is a block diagram showing a configuration example of an information retrieval apparatus for classification according to the present invention. In the figure, a document database 10 stores a plurality of document data. Each document data stored in the document database 10 contains at least text data.

[0013] The vector generating means 20 generates a feature vector for each document data. Specifically, it performs morphological analysis or the like on the text data of each document data, extracts keywords (applying stop-word processing and the like as necessary), and also extracts the relationships between keywords.

[0014] Next, suppose that K keywords and R relationships between keywords have been extracted from the entire document database 10 consisting of N documents. The feature vector Vi of each document i (1 ≤ i ≤ N) is then represented, for example, as a (K + R)-dimensional vector. When the index of a keyword or of a relationship between keywords is denoted by j (1 ≤ j ≤ K + R), the component Vij of each dimension j of the feature vector Vi can be calculated, for example by the tf-idf method, with the following equation.

[0015] Vij = TFij * log(N / DFj)

[0016] Here, TFij is the number of times the keyword or relationship between keywords corresponding to component j appears in document i, and DFj is the number of times the keyword or relationship between keywords corresponding to component j appears in all N documents of the document database 10. The feature vectors are generated in this way.
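
The calculation in paragraphs [0014] to [0016] can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the function name `build_feature_vectors`, the representation of each document as a list of feature strings (keywords such as "search" and relation labels such as "search->index"), and the sample data are all hypothetical. DFj follows the definition in [0016], that is, the total number of occurrences across all N documents, so log(N / DFj) becomes negative when a feature occurs more than N times in total.

```python
import math
from collections import Counter


def build_feature_vectors(docs):
    """Return (index, vectors) for a list of documents.

    Each document is a list of feature strings: extracted keywords and
    labels for relationships between keywords (hypothetical format).
    """
    n = len(docs)
    # TF_ij: occurrences of feature j in document i.
    tf = [Counter(doc) for doc in docs]
    # DF_j as defined in [0016]: occurrences of feature j in all N documents.
    df = Counter()
    for counts in tf:
        df.update(counts)
    # One dimension per keyword or relation (K + R dimensions in total).
    index = sorted(df)
    vectors = [
        [counts[f] * math.log(n / df[f]) for f in index]
        for counts in tf
    ]
    return index, vectors


docs = [
    ["search", "index", "search->index"],
    ["search", "rank"],
    ["cluster"],
]
index, vectors = build_feature_vectors(docs)
for row in vectors:
    print([round(x, 3) for x in row])
```

Keeping keywords and relations in one shared index is what lets a single similarity measure take both into account.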

[0017] The classifying means 30 calculates the similarity between documents and uses the result to cluster the documents. The similarity between documents can be calculated, for example, as the cosine of the angle between the feature vectors of the documents calculated as described above. For clustering, the similarity computed from the feature vectors generated as described above is used in the similarity calculation required by a clustering algorithm such as the K-means method. This makes possible clustering that reflects not only the keywords used in conventional clustering but also the relationships between keywords.
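
The cosine similarity mentioned above can be sketched as follows. The function names `cosine_similarity` and `similarity_matrix` are hypothetical, and returning 0.0 for an all-zero vector is an assumption made here to avoid division by zero; the text does not say how such vectors are handled. A clustering algorithm such as K-means would consume these similarities, but the clustering step itself is not shown.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumption: a vector with no non-zero component matches nothing.
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_matrix(vectors):
    """Pairwise similarities between all document vectors."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]


# Keyword dimensions first, then one relation dimension (sample data).
vectors = [[1.0, 1.0, 2.0], [1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
for row in similarity_matrix(vectors):
    print([round(x, 3) for x in row])
```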

[0018] The classification result can be output by the output means 40.

[0019] Specific examples of the relationships between keywords that the vector generating means 20 extracts when generating a feature vector for each document data include dependency relations obtained as a result of syntactic analysis, and pairs of keywords that are close to each other in distance.

[0020] First, regarding dependency relations, consider for example the sentence "A ga B wo C suru" (A performs C on B). This sentence contains the dependency relations "A ga C suru" (A performs C) and "B wo C suru" (C is performed on B). These may be distinguished including their case, but it is also conceivable to ignore the case and treat them as "A→C" and "B→C", or to ignore the direction as well and treat them as "A&C" and "B&C" (regarding "A→C" and "C→A" as the same). In this case, the number of occurrences of such dependency relations is used as TFij or DFj for the relationships between keywords.
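
The three ways of identifying a dependency relation described in [0020] (with case, direction only, or neither) can be sketched as follows. The input format, a list of (dependent, case, head) triples, is an assumption: it stands in for the output of a syntactic analyzer that is not shown, and the names `relation_label` and `count_relations` as well as the label syntax are hypothetical.

```python
from collections import Counter


def relation_label(dependent, head, case=None, mode="directed"):
    """Build a label for one dependency relation.

    mode="case":       keep the case particle, e.g. "A-ga->C"
    mode="directed":   ignore the case, e.g. "A->C"
    mode="undirected": ignore the direction too, e.g. "A&C"
    """
    if mode == "case":
        return f"{dependent}-{case}->{head}"
    if mode == "directed":
        return f"{dependent}->{head}"
    if mode == "undirected":
        first, second = sorted((dependent, head))
        return f"{first}&{second}"
    raise ValueError(f"unknown mode: {mode}")


def count_relations(dependencies, mode="directed"):
    """Count occurrences of each relation label (usable as TF or DF)."""
    return Counter(
        relation_label(dep, head, case, mode)
        for dep, case, head in dependencies
    )


# "A ga B wo C suru" yields two dependencies on the predicate C.
deps = [("A", "ga", "C"), ("B", "wo", "C")]
print(count_relations(deps, mode="case"))
print(count_relations(deps, mode="directed"))
print(count_relations(deps, mode="undirected"))
```

Coarser labels merge more occurrences into one dimension, which raises the chance that two documents share a non-zero relation dimension.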

[0021] On the other hand, when keywords that are close to each other are used as the relationship between keywords, specific examples of the distance include the number of characters, morphemes, clauses (bunsetsu), sentences, or paragraphs between the keywords. Here too, the direction may or may not be taken into account. In this case, the TFij or DFj for the relationships between keywords is, for example, the number of occurrences in which this distance is smaller than a user-specified value.
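
A proximity-based relation of the kind described in [0021] can be sketched as follows, measuring distance as the difference in token (morpheme) positions. The function name `count_nearby_pairs`, the token-list input, and the decision to skip pairs of the same keyword are assumptions made for illustration; the text leaves these details open.

```python
from collections import Counter


def count_nearby_pairs(tokens, keywords, max_distance, directed=False):
    """Count keyword pairs whose token distance is below max_distance."""
    positions = [(i, t) for i, t in enumerate(tokens) if t in keywords]
    counts = Counter()
    for a in range(len(positions)):
        i, left = positions[a]
        for b in range(a + 1, len(positions)):
            j, right = positions[b]
            if j - i >= max_distance:
                break  # later keywords are even farther away
            if left == right:
                continue  # assumption: a keyword is not paired with itself
            if directed:
                counts[f"{left}->{right}"] += 1
            else:
                first, second = sorted((left, right))
                counts[f"{first}&{second}"] += 1
    return counts


tokens = ["search", "the", "index", "and", "rank", "search"]
keywords = {"search", "index", "rank"}
print(count_nearby_pairs(tokens, keywords, max_distance=3))
print(count_nearby_pairs(tokens, keywords, max_distance=3, directed=True))
```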

[0022] Furthermore, classification that reflects keyword categories, such as synonym relations, is also possible. Taking dependency relations as an example, suppose that keywords a0 and a1 belong to category A and that a0 and a1 each have a dependency relation with b. Then the dimensions relating to keywords a0 and a1, namely "a0", "a0→b", "a1", and "a1→b", can be merged into "A" and "A→b" when generating the feature vector.

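The merging of dimensions by category described in [0022] can be sketched as follows. The function name `merge_by_category`, the "->" label syntax for relations, and the `category_of` mapping are hypothetical; how categories are obtained (for example from a synonym dictionary) is not specified here.

```python
from collections import Counter


def merge_by_category(counts, category_of):
    """Replace each keyword by its category label and sum the frequencies.

    counts:      Counter of keyword and relation frequencies
    category_of: dict mapping a keyword to its category representative
    """
    merged = Counter()
    for feature, freq in counts.items():
        parts = feature.split("->")
        label = "->".join(category_of.get(p, p) for p in parts)
        merged[label] += freq
    return merged


counts = Counter({"a0": 2, "a0->b": 1, "a1": 3, "a1->b": 4, "b": 5})
category_of = {"a0": "A", "a1": "A"}
print(merge_by_category(counts, category_of))
```

With the sample data, "a0" and "a1" collapse into "A" and the two relations collapse into "A->b", while "b" is left unchanged.
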
[0023] Note that a dimension of the feature vectors being compared during classification contributes to the similarity only if its component is non-zero in both vectors (co-occurrence). In general, however, co-occurrence of a relationship between keywords is less probable than co-occurrence of a mere keyword, so relationships between keywords tend to contribute less to the similarity than keywords do. By making the weights of the dimensions for relationships between keywords larger than those of the keyword dimensions, the contributions of the two to the similarity evaluation can be balanced.

[0024] It is also possible for the user to select a keyword and increase the weight of the dimension of that keyword and of the dimensions of the relationships between keywords that include it, so that classification emphasizes the keywords the user is interested in.
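
The two kinds of weighting described in [0023] and [0024] can be sketched together as follows. The function name `apply_weights`, the parameters `relation_weight` and `boosts`, and the multiplicative way the weights are combined are assumptions; the text only states that the weights of certain dimensions are made larger.

```python
def apply_weights(index, vector, relation_weight=1.0, boosts=None):
    """Scale the components of one feature vector.

    index:           feature label for each dimension ("A" or "A->B")
    relation_weight: factor applied to every relation dimension
    boosts:          dict of user-selected keyword -> factor, applied to
                     the keyword itself and to relations that contain it
    """
    boosts = boosts or {}
    weighted = []
    for feature, value in zip(index, vector):
        parts = feature.split("->")
        weight = relation_weight if len(parts) > 1 else 1.0
        for keyword, boost in boosts.items():
            if keyword in parts:
                weight *= boost
        weighted.append(value * weight)
    return weighted


index = ["A", "B", "A->B", "B->C"]
vector = [1.0, 1.0, 1.0, 1.0]
print(apply_weights(index, vector, relation_weight=2.0, boosts={"A": 3.0}))
```

The same weights would need to be applied to every vector being compared, so that the similarity stays consistent across documents.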

[0025] Embodiment 2. FIG. 2 is a block diagram showing a configuration example of an information retrieval apparatus for retrieval, which is another embodiment of the present invention. The document database 10 and the output means 40 have the same functions as in FIG. 1.

[0026] The search expression input means 50 has a function of inputting a search condition as a search expression (which may be a sentence expressing the search expression). The vector generating means 20 has a function of generating feature vectors for all documents stored in the document database 10 from the appearance frequencies of keywords and of relationships between keywords, for example according to the feature vector formula described in Embodiment 1, and of generating a feature vector from the input search expression by the same method as described in Embodiment 1. (When a feature vector is generated from the search expression, however, the index i of Vij denotes the search expression rather than document i. In this case, TFij is usually 1.)

[0027] The search means 60 calculates the similarity between the feature vector generated from the search expression and the feature vector of every document stored in the document database 10, by the same method as described for the classifying means in Embodiment 1, and uses the result to rank the documents in the document database by similarity.
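
The ranking step described in [0027] can be sketched as follows, assuming the feature vectors for the search expression and for the documents have already been built over the same dimensions. The function names `cosine` and `rank_documents`, the tie-breaking by document number, and the sample vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity; 0.0 if either vector is all zeros (assumption)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_documents(query_vector, doc_vectors):
    """Return (document index, similarity) pairs, most similar first."""
    scored = [(cosine(query_vector, v), i) for i, v in enumerate(doc_vectors)]
    # Sort by descending similarity; ties fall back to document order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(i, score) for score, i in scored]


query = [1.0, 1.0, 0.0]
documents = [[0.0, 0.0, 1.0], [1.0, 1.0, 0.0], [1.0, 0.0, 0.0]]
for doc_index, score in rank_documents(query, documents):
    print(doc_index, round(score, 3))
```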

[0028] The ranked result can be output by the output means 40.

[0029] As in Embodiment 1, specific examples of the relationships between keywords that the vector generating means 20 may use when generating a feature vector for each document data include dependency relations between keywords and the number of keyword pairs that are close to each other in distance.

[0030] Furthermore, retrieval that reflects keyword categories, such as synonym relations, is also possible. Keywords or relationships between keywords are merged into categories in the same way as illustrated in Embodiment 1.

[0031] In addition, by increasing the weight of the dimension of a particular keyword or relationship between keywords, it is possible, as illustrated in Embodiment 1, to balance the weights of the keyword dimensions and the relationship dimensions, or to perform retrieval that emphasizes a keyword or a relationship between keywords in which the user is interested.

[0032]

[Effects of the invention] An information retrieval apparatus according to the present invention has a document database that stores a plurality of document data, vector generating means that generates a feature vector for each document data, classifying means that calculates the similarity between the feature vectors and classifies the document data, and output means that outputs the classification result of the document data. The vector generating means analyzes each document data to extract keywords and relationships between the keywords, and generates the feature vector based on the appearance frequencies of both. Therefore, in classifying document data, the similarity calculation can reflect not only the keywords of each document data but also the relationships between the keywords, and the accuracy improves.

[0033] Another information retrieval apparatus according to the present invention has a document database that stores a plurality of document data, search expression input means that inputs a search expression, vector generating means that generates a feature vector for each document data and for the search expression, similarity calculating means that calculates the similarity between the feature vector for the search expression and the feature vector for each document data, and output means that outputs document data having a feature vector with a high similarity. The vector generating means analyzes each document data and the search expression to extract keywords and relationships between the keywords, and generates the feature vectors based on their appearance frequencies. Therefore, in retrieving document data close to the search expression, the similarity calculation can reflect not only the keywords appearing in the search expression and in each document data but also the relationships between the keywords, and the accuracy of retrieval improves.

[0034] The vector generating means uses dependency relations as the relationships between keywords. Therefore, in classifying document data and in retrieving document data close to the search expression, a similarity calculation reflecting the dependency relations is performed, and the accuracy of classification and retrieval improves compared with the conventional method that uses keywords only.

[0035] The vector generating means uses closeness in distance between keywords as the relationship between keywords. Therefore, in classifying document data and in retrieving document data close to the search expression, a similarity calculation reflecting the distance between keywords is performed, and the accuracy of classification and retrieval improves compared with the conventional method that uses keywords only.

[0036] Further, in place of the appearance frequency of each keyword included in a keyword group belonging to the same category, or of each relationship between keywords that includes such a keyword, the vector generating means uses the sum of those appearance frequencies as the appearance frequency of a keyword representing the category, or of a relationship between keywords that includes the representative keyword. Therefore, in classifying document data and in retrieving document data close to the search expression, the similarity calculation can merge keywords or relationships between keywords that the user does not need to distinguish. As a result, highly accurate classification and retrieval can be made more efficient.

[0037] Furthermore, the vector generating means generates the feature vector based on weights that the user specifies for the appearance frequencies of the keywords and of the relationships between keywords. Therefore, in classifying document data and in retrieving document data close to the search expression, the similarity calculation can reflect the user's intention by giving more or less importance to particular keywords or relationships between keywords. As a result, classification and retrieval can be made more accurate in a form that better reflects the user's intention.

[Brief description of the drawings]

FIG. 1 is a block diagram showing an information retrieval apparatus for classification according to the present invention.

FIG. 2 is a block diagram showing an information retrieval apparatus for retrieval according to the present invention.

[Explanation of symbols]

10 document database, 20 vector generating means, 30 classifying means, 40 output means, 50 search expression input means, 60 similarity calculating means (search means).

Continued from the front page
(72) Inventor: Ryuichi Kobune, c/o Mitsubishi Electric Corp, 2-2-3 Marunouchi, Chiyoda-ku, Tokyo
(72) Inventor: Hidekazu Arita, c/o Mitsubishi Electric Corp, 2-2-3 Marunouchi, Chiyoda-ku, Tokyo
F-terms (reference): 5B075 ND03 NK02 NR12 PP23 PR04 PR06 QM08

Claims

[Claims]

1. An information retrieval apparatus having a document database for storing a plurality of document data, vector generating means for generating a feature vector for each of the document data, classifying means for calculating a similarity between the feature vectors to classify each of the document data, and output means for outputting a classification result of the document data, characterized in that the vector generating means analyzes each of the document data to extract keywords and relationships between the keywords, and generates the feature vector based on the appearance frequencies of both.

2. An information retrieval apparatus having a document database for storing a plurality of document data, search expression input means for inputting a search expression, vector generating means for generating a feature vector for each of the document data and for the search expression, similarity calculating means for calculating a similarity between the feature vector for the search expression and the feature vector for each of the document data, and output means for outputting document data having a feature vector with a high similarity, characterized in that the vector generating means analyzes each of the document data and the search expression to extract keywords and relationships between the keywords, and generates the feature vectors based on their appearance frequencies.

3. The information retrieval apparatus according to claim 1, wherein said vector generation means uses a dependency relation as a relation between said keywords.

4. The information retrieval apparatus according to claim 1, wherein the vector generating means uses a short distance between keywords as the relationship between the keywords.

5. The information retrieval apparatus according to claim 1, wherein the vector generating means uses, in place of the appearance frequency of each keyword included in a keyword group belonging to the same category, or of each relationship between keywords that includes such a keyword, the sum of those appearance frequencies as the appearance frequency of a keyword representing the category, or of a relationship between keywords that includes the representative keyword.

6. The information retrieval apparatus according to claim 1, wherein the vector generating means generates the feature vector based on weights specified by a user for the appearance frequencies of the keywords and of the relationships between keywords.