JP2002197095A - Keyword extracting device and information retrieving device

Info

Publication number: JP2002197095A
Application number: JP2000394194A
Authority: JP
Inventors: Kyoji Umemura; 恭司梅村; Yoshinori Takenami; 佳則武並; Masahiro Kishida; 正博岸田
Original assignee: Sumitomo Electric Industries Ltd
Current assignee: Sumitomo Electric Industries Ltd
Priority date: 2000-12-26
Filing date: 2000-12-26
Publication date: 2002-07-12

Abstract

PROBLEM TO BE SOLVED: To extract keywords from a document without requiring a dictionary. SOLUTION: A keyword extracting device includes: a suffix file generating part 22 that receives a group of documents and generates from it a suffix file (described later); a suffix file storage part 24 that stores the suffix file; a delimiting part 28 that receives an arbitrary document included in the group of documents, or a document in the same field as the group, and splits it at sentence breaks such as punctuation marks; a score calculating part 26 that, based on the suffix file and the sentences supplied from the delimiting part 28, divides each sentence as appropriate and calculates the appearance frequency α, the appearance concentration degree β, weights and the like (described later); an operation result storage part 30 that stores the calculation results; a document separating part 32 that divides the document into keyword candidates based on the calculation results; and a narrowing part 34 that narrows down the keyword candidates.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a keyword extracting device and an information retrieval device, and more particularly to a keyword extracting device and an information retrieval device capable of extracting keywords from a document group without requiring a dictionary.

[0002]

2. Description of the Related Art

When breaking reports of the latest technical information, news articles and the like are organized, keywords that identify the content of each article are assigned for retrieval. If keyword assignment can be automated, documents to which no keywords have been assigned also become easier to handle. The automatic keyword extraction studied so far performs morphological analysis using a dictionary and then discriminates keywords on the basis of part-of-speech information and frequency information.

[0003]

Problems to Be Solved by the Invention

However, dictionary-based methods are problematic for information processing in the Internet age, in which new words appear every day. First, productivity is poor, because keywords from the newest texts, which are exactly the texts whose processing needs to be automated, must be continually registered in the dictionary. Second, such methods do not generalize to entirely unknown terms that are not registered in the dictionary.

SUMMARY OF THE INVENTION

[0004] The present invention has been made to solve the problems described above, and its object is to provide a keyword extracting device capable of extracting keywords from a document without requiring a dictionary.

[0005] Another object of the present invention is to provide an information retrieval device capable of extracting, from a document group, documents related to a given document without requiring a dictionary.

[0006]

Means for Solving the Problems

A keyword extracting device according to one aspect of the present invention includes: appearance frequency calculating means for obtaining the appearance frequency, within a document group, of the partial character strings contained in each document of the group; appearance concentration degree calculating means for obtaining the appearance concentration degree of the partial character strings within the document group; and first keyword extracting means, connected to the appearance frequency calculating means and the appearance concentration degree calculating means, for extracting keywords from an input document based on the appearance frequency and the appearance concentration degree.

[0007] Keywords are extracted based on the appearance frequency and appearance concentration degree of partial character strings. Keywords can therefore be extracted from a document without requiring a dictionary.

[0008] Preferably, the first keyword extracting means includes: document dividing means for dividing the input document into partial character strings; word-likeness calculating means, connected to the document dividing means, the appearance frequency calculating means and the appearance concentration degree calculating means, for calculating the word-likeness of each partial character string based on the appearance frequency and the appearance concentration degree; and second keyword extracting means, connected to the word-likeness calculating means, for extracting keywords from the document based on the total value of the word-likeness.

[0009] More preferably, the word-likeness calculating means includes means, connected to the document dividing means, the appearance frequency calculating means and the appearance concentration degree calculating means, for calculating the word-likeness of a partial character string based on the appearance frequency, the appearance concentration degree, the length of the partial character string and the average size of the documents.

[0010] As the average document size becomes smaller, the appearance concentration degree tends to approach 0. Therefore, by changing the method of calculating word-likeness according to the average document size, an appropriate word-likeness can be calculated even when the appearance frequency is small.

[0011] More preferably, the keyword extracting device further includes narrowing means, connected to the second keyword extracting means, the appearance frequency calculating means and the appearance concentration degree calculating means, for narrowing down the keywords extracted by the second keyword extracting means based on the appearance frequency, the appearance concentration degree and the length of the partial character string.

[0012] More preferably, the keyword extracting device further includes delimiting means for splitting the input document at punctuation marks and supplying the result to the document dividing means.

[0013] More preferably, the document dividing means includes means for dividing the input document into partial character strings such that no partial character string begins with a predetermined character.

[0014] More preferably, the document dividing means includes means for dividing the input document into partial character strings such that the length of each partial character string stays below a predetermined number of characters.

[0015] An information retrieval device according to another aspect of the present invention includes: appearance frequency calculating means for obtaining the appearance frequency, within a document group, of the partial character strings contained in each document of the group; appearance concentration degree calculating means for obtaining the appearance concentration degree of the partial character strings within the document group; keyword extracting means, connected to the appearance frequency calculating means and the appearance concentration degree calculating means, for extracting keywords from an input document based on the appearance frequency and the appearance concentration degree; matching degree calculating means, connected to the keyword extracting means, for calculating, for each document in the document group, its degree of matching with each keyword extracted by the keyword extracting means; similarity calculating means, connected to the matching degree calculating means, for calculating, for each document in the document group, its similarity to the input document based on the degree of matching; and means, connected to the similarity calculating means, for extracting from the document group the documents related to the input document based on the similarity.

[0016] Keywords are extracted from the input document without using a dictionary, and documents related to those keywords are extracted. Documents related to the input document can therefore be extracted without requiring a dictionary.

[0017]

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

[Embodiment 1] Referring to FIG. 1, a keyword extracting device according to an embodiment of the present invention includes: a suffix file creation unit 22 that receives a document group and creates from it a suffix file (described later); a suffix file storage unit 24, connected to the suffix file creation unit 22, that stores the suffix file created by the suffix file creation unit 22; a delimiter unit 28 that receives an arbitrary document included in the document group, or a document in the same field as the document group, and splits it at sentence breaks such as 「、」 and 「。」; a score calculation unit 26, connected to the suffix file storage unit 24 and the delimiter unit 28, that divides each sentence as appropriate based on the suffix file stored in the suffix file storage unit 24 and the sentences supplied from the delimiter unit 28, and calculates the appearance frequency α, the appearance concentration degree β, the weights and the like (described later); a calculation result storage unit 30, connected to the score calculation unit 26, that stores the results calculated by the score calculation unit 26; a document dividing unit 32, connected to the calculation result storage unit 30, that divides the document into keyword candidates based on the results stored in the calculation result storage unit 30; and a narrowing unit 34, connected to the document dividing unit 32, that narrows down the keyword candidates and extracts the keywords.

[0018] [Summary of the Present Invention] The present invention is characterized by using, in addition to the appearance frequency of a character string, a statistic indicating its appearance concentration degree. Keyword extraction according to the present invention requires the appearance concentration degree of every partial character string, so a naive computation would be prohibitively expensive. We have already established a technique that uses a "suffix file" to obtain the appearance concentration degree of character strings across a large number of documents, and that technique is used here.

[0019] [Principle of Keyword Extraction] The appearance concentration degree is a statistic known as adaptation. It is an estimate of "the probability that a word appears once more in a document, given that it has appeared in that document: P(two occurrences | one occurrence)". To estimate this probability for a target character string x, we count "the number of documents containing x: df(x)" and "the number of documents containing x two or more times: df2(x)". The probability is then estimated from the following equation (1), which takes Bayes' rule into account. Here, N is the total number of documents.

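[0020]

(Equation 1; the formula image is missing from this text, and the form below follows from the definitions of df(x), df2(x) and N given in the surrounding paragraphs.)

$$P(\text{two occurrences} \mid \text{one occurrence}) \approx \frac{df_2(x)/N}{df(x)/N} = \frac{df_2(x)}{df(x)} \qquad (1)$$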

[0021] The argument x of df(x) and df2(x) can be any partial character string. With a naive method, either the memory requirement or the computation time becomes impractically large when handling large-scale text.

[0022] Partial character strings are therefore extracted using a data structure known as a suffix file, described later. A suffix file requires five times the memory space of the text, but the position of any partial character string can be located with a computational cost on the order of log(n), where n is the size of the text.

[0023] Details of how to create and use suffix files are disclosed in M. Yamamoto and K. W. Church, "Using Suffix Arrays to Compute Term Frequency and Document Frequency for All Substrings in a Corpus", in Proceedings of the 6th Workshop on Very Large Corpora, Ed. Eugene Charniak, Montreal, pp. 28-37, 1998.

[0024] Using a suffix file, the number of times a given character string appears in the document database can be obtained quickly. A suffix file is realized by sorting, in character-code order, the partial character strings that can occur in all the documents and assigning them serial numbers (suffixes). The number of times tf that a character string appears in the document database is obtained by counting how many entries in the suffix file match that character string.

[0025] Specifically, the minimum value min and the maximum value max of the suffixes at which an entry matching the character string appears are each found by binary search. If there is no matching entry, the character string appears 0 times in the document database. Once min and max have been found, the number of occurrences tf of the character string is obtained as tf = max - min + 1.
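The lookup described in this paragraph can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the function names `build_suffix_array` and `count_occurrences` are hypothetical, the suffix array is built naively by sorting, and positions are 0-based.

```python
def build_suffix_array(text):
    # Naive construction (assumption): sort start positions by the suffix
    # that begins at each position.
    return sorted(range(len(text)), key=lambda i: text[i:])


def count_occurrences(text, sa, pattern):
    """Return tf, the number of occurrences of pattern in text (0 if none)."""
    m = len(pattern)

    # Binary search for "min": the first rank whose suffix, truncated to
    # len(pattern) characters, is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Binary search for "max": one before the first rank whose truncated
    # suffix is > pattern.
    lo, hi = first, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    last = lo - 1

    # tf = max - min + 1, or 0 when no suffix matches.
    return last - first + 1 if last >= first else 0
```

Each search inspects O(log n) ranks, which is where the logarithmic lookup cost mentioned in the text comes from; the naive construction above is only suitable for small inputs.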

[0026] Documents in the document database are distinguished from one another by document numbers, and each partial character string registered in the suffix file is tagged with its document number. This allows documents containing a given partial character string to be retrieved efficiently. The number df of documents containing a given partial character string can be calculated by counting the duplicated document numbers among its occurrences and subtracting that count from tf. The number of documents that contain the partial character string two or more times is df2.
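As a small illustration of this counting step, the sketch below takes the document numbers attached to the occurrences of one partial character string and derives tf, df and df2. The function name `document_frequencies` and the input format (one document number per occurrence) are assumptions made for the example.

```python
from collections import Counter


def document_frequencies(doc_numbers):
    """doc_numbers: one document number per occurrence of the substring."""
    tf = len(doc_numbers)
    per_doc = Counter(doc_numbers)
    # Every occurrence after the first one in a document repeats a document
    # number; these repeats are the "duplicates" subtracted from tf.
    duplicates = sum(count - 1 for count in per_doc.values())
    df = tf - duplicates
    # Documents in which the substring occurs two or more times.
    df2 = sum(1 for count in per_doc.values() if count >= 2)
    return tf, df, df2
```

For occurrences in documents 1, 1, 2, 3, 3, 3 there are three duplicated numbers, so tf = 6 gives df = 3, and two of those documents contain the string at least twice.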

[0027] Here, let α denote df(x)/N, the estimate of the appearance probability of the character string x (the appearance frequency), and let β denote df2(x)/df(x), the estimate of adaptation (the appearance concentration degree). If occurrences of a character string are assumed to follow a Poisson distribution, α and β take the same value. In actual corpora, β is observed to be larger than α, and for character strings recognized as keywords the gap is especially large.
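To make the two statistics concrete, the following sketch computes α and β for one string by scanning every document directly. It deliberately ignores the suffix file and is only meant for tiny inputs; the function name `alpha_beta` is hypothetical, and `str.count` counts non-overlapping occurrences, which is an approximation for strings that can overlap themselves.

```python
def alpha_beta(substring, documents):
    """Return (alpha, beta) = (df/N, df2/df) for substring over documents."""
    n = len(documents)
    # Occurrences per document (non-overlapping; see the note above).
    counts = [doc.count(substring) for doc in documents]
    df = sum(1 for c in counts if c >= 1)
    df2 = sum(1 for c in counts if c >= 2)
    alpha = df / n
    beta = df2 / df if df else 0.0  # beta is undefined when df is 0; 0.0 is a convention
    return alpha, beta
```

In a toy collection of eight documents where "robot" occurs in only two of them but repeats in both, α is 0.25 while β is 1.0, the kind of gap the paragraph describes for keyword-like strings.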

[0028] As an illustration of appearance concentration, FIG. 2 shows partial character strings x of the phrase 「ロボットについて」 ("about robots") together with the corresponding df, df2, α (= df/N) and β (= df2/df). Two things can be observed. First, for character strings that make up a keyword, β is larger than α. This indicates that a word that serves as a keyword tends to appear multiple times within a document, and it is readily confirmed in FIG. 2. Second, β becomes smaller once x extends past a word boundary. This follows from the fact that a word always appears in the same form, whereas the characters that follow it can vary. It is also confirmed in FIG. 2: β stays roughly constant until x changes from 「ロボット」 ("robot") to 「ロボットに」 ("robot" followed by the particle に), at which point it drops.

[0029] Words are segmented by estimating the word-likeness (weight) of character strings from the value of β. The weight (score) is estimated case by case as shown in FIG. 3. A character string x whose df2 is too small, that is, 3 or less (NO in S2), is not regarded as a word and is given a low weight (S4). When x is accepted as a word, that is, when df2 is greater than 3 and tf is N or less (YES in S2 and NO in S6), β is estimated and its logarithm is used as the weight (S8). Where the total number of occurrences tf exceeds N, however, word-likeness is not correctly reflected in β. This means that particles and the like do not become words even though they occur many times in text. Therefore, where tf > N holds, word-likeness is assumed to saturate (YES in S6), and the logarithm of the constant 0.5 is used as the weight (S10). The weights are obtained in this way for every possible division of the character string, and the division that maximizes the sum of the weights of its pieces is selected.
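The three cases of FIG. 3 can be written as one small function. The sketch below is an illustration only: the name `word_weight` is hypothetical, the low weight of -10000 is the value of min_score given later in the description, and the natural logarithm is an assumption because the text does not state the base.

```python
import math

MIN_SCORE = -10000.0     # low weight for strings not regarded as words (S4)
MIN_DF2 = 3              # df2 must exceed this for x to count as a word (S2)
SATURATION_SCORE = 0.5   # constant whose logarithm is used when tf > N (S10)


def word_weight(tf, df, df2, n_docs):
    """Weight (word-likeness) of a string from its tf, df, df2 and N."""
    if df2 <= MIN_DF2:
        return MIN_SCORE                   # NO in S2
    if tf > n_docs:
        return math.log(SATURATION_SCORE)  # YES in S6
    return math.log(df2 / df)              # log of beta (S8)
```

Because β is at most 1, every weight is zero or negative, so maximizing the sum of weights favors divisions made of few pieces, each with β close to 1.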

[0030] Several thresholds are used here to determine word-likeness. Unlike a dictionary, which must cover every word, these values can be derived from sample words, so the procedure can also be carried out for new documents. For example, the value of β is affected by document length, and this can be accommodated through the choice of thresholds.

[0031] FIG. 4 shows an example segmented by this method, and FIG. 5 shows an example segmented by morphological analysis. Although the segmentation of particles and auxiliary verbs is unnatural, the keywords are seen to be extracted correctly.

[0032] [Selection of Keyword Candidates] A large appearance frequency α for a partial character string x indicates that x appears in a very large number of documents. Such a string has little ability to discriminate between documents. Conversely, a string with a small α, for example one that appears only once, is considered a very special string that is rarely used. Such a string is considered unable to indicate relatedness to other documents. Because a keyword is by nature a word that can identify documents, the desirable keyword strings are those whose α falls within a certain range. In addition, to estimate whether a word relates to the content of the document, the weight obtained from β is used again. The ranges of α and β are learned from a corpus, and the candidate ranges for extraction are selected.

[0033] Specifically, for example, a string that satisfies all the conditions of FIG. 6 is regarded as a keyword. That is, when the value of α for the partial character string x is greater than 0.00005 and less than 0.1, the weight of x is greater than -1.0, and the length of x is greater than 1 (YES in S12, YES in S14 and YES in S18), x is determined to be a keyword (S20). Otherwise, x is determined not to be a keyword (S16). Here, len[x] denotes the length of the character string x. The values used here are not limiting: they may vary with the target document group, and they may also be changed in order to adjust the number of keywords to be extracted.
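The narrowing test of FIG. 6 amounts to three threshold checks. In the sketch below the function name `is_keyword` and its parameter names are hypothetical; the default thresholds are the example values from the text and, as the text notes, would normally be tuned to the document group.

```python
def is_keyword(x, alpha, weight,
               alpha_min=0.00005, alpha_max=0.1,
               weight_min=-1.0, min_len=1):
    """Return True if candidate x passes all three conditions of FIG. 6."""
    in_alpha_range = alpha_min < alpha < alpha_max   # not too rare, not too common
    word_like = weight > weight_min                  # weight derived from beta
    long_enough = len(x) > min_len                   # single characters are rejected
    return in_alpha_range and word_like and long_enough
```

Raising `alpha_min` or `weight_min` would reduce the number of extracted keywords, which is one way to read the remark about adjusting the number of keywords.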

[0034] [Processing of Suffix File Creation Unit 22] The processing of the suffix file creation unit 22 is described with reference to FIG. 7. The suffix file creation unit 22 concatenates the documents forming the document group into one long character string (S22). For example, suppose that this character string is obtained as "abcabd", as shown in FIG. 8(A).

[0035] From the created character string, the partial character strings that can occur are generated by shifting the start position one character at a time, and each partial character string is assigned a serial number (suffix) (S24). Generating the partial character strings and suffixes from the character string "abcabd" gives the result shown in FIG. 8(B).

[0036] The suffix file is created by sorting the partial character strings in dictionary order (S26). Within a suffix file, the sequence of suffixes is called a suffix array. The result is a suffix file such as that shown in FIG. 8(C). The created suffix file is stored in the suffix file storage unit 24. Using this suffix file, the appearance frequency and appearance concentration degree of every character string in the original document group can be obtained with little computation.
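Steps S22 to S26 can be traced on the "abcabd" example with a few lines of Python. This is a naive sketch, not the construction used in the embodiment: the name `build_suffix_file` is hypothetical, suffixes are numbered from 0 (the numbering in FIG. 8 is not reproduced in this text), document numbers are omitted, and materializing every suffix costs quadratic memory, so it is only suitable for short strings.

```python
def build_suffix_file(documents):
    """Return (text, suffix_array) for a list of document strings."""
    # S22: concatenate all documents into one long character string.
    text = "".join(documents)
    # S24: one partial string per start position, tagged with its serial number.
    suffixes = [(i, text[i:]) for i in range(len(text))]
    # S26: sort the partial strings in dictionary (character-code) order.
    suffixes.sort(key=lambda pair: pair[1])
    # The sequence of serial numbers after sorting is the suffix array.
    suffix_array = [i for i, _ in suffixes]
    return text, suffix_array
```

For "abcabd" the sorted suffixes are abcabd, abd, bcabd, bd, cabd, d, so the suffix array under 0-based numbering is 0, 3, 1, 4, 2, 5. Entries that share a prefix, such as the two starting with "ab", end up adjacent, which is what makes range lookups by binary search possible.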

[0037] [Processing of Delimiter Unit 28] The processing of the delimiter unit 28 is described with reference to FIG. 9.

[0038] The delimiter unit 28 opens the document (S32) and clears a buffer (not shown) provided for temporarily storing a character string (S34). A character is read from the document (S36). Reading in S36 starts from the first character of the document, and each time S36 is executed the next character is read in sequence.

[0039] It is determined whether the character read is EOF (End Of File) (S38). If it is EOF (YES in S38), the sentence stored in the buffer is output from the delimiter unit 28 to the score calculation unit 26 (S40), and the processing ends.

[0040] If the character read is not EOF (NO in S38), it is determined whether it is a delimiter character such as 「。」 or 「、」 (S42). If it is not a delimiter character (NO in S42), the character is appended to the buffer (S44), and processing returns to S36.

[0041] If the character read is a delimiter character (YES in S42), the sentence stored in the buffer is output from the delimiter unit 28 to the score calculation unit 26 (S46), and the buffer is cleared (S48). Processing then returns to S36.
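The character-by-character loop of FIG. 9 can be sketched as follows. The function name `split_sentences` is hypothetical, the delimiter set contains only the two characters named in the text, and the pieces are collected in a list instead of being sent to the score calculation unit.

```python
DELIMITERS = {"。", "、"}


def split_sentences(document, delimiters=DELIMITERS):
    """Split a document at delimiter characters, following S34 to S48."""
    pieces = []
    buffer = []                                # S34: start with an empty buffer
    for ch in document:                        # S36: read the next character
        if ch in delimiters:                   # S42: delimiter character?
            pieces.append("".join(buffer))     # S46: output the buffered text
            buffer = []                        # S48: clear the buffer
        else:
            buffer.append(ch)                  # S44: append to the buffer
    pieces.append("".join(buffer))             # S40: output the remainder at EOF
    return pieces
```

Followed literally, the flow also outputs an empty string when a delimiter is immediately followed by another delimiter or by the end of the document; a caller would probably discard such empty pieces.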

[0042] [Processing of Score Calculation Unit 26] The processing of the score calculation unit 26 is described with reference to FIGS. 10 and 11.

[0043] Referring to FIG. 11, the score calculation unit 26 sets len_X to the length of the input character string X supplied from the delimiter unit 28 and creates the arrays value, table and from, each with len_X elements (S52). Here, table[i] holds the optimum (maximum) total weight for the first through i-th characters; value[i] holds the optimum (maximum) weight for the from[i]-th through i-th characters; and from[i] holds the start position of the piece that makes the weight at the i-th character optimal (maximum).

[0044] The elements of the arrays value, table and from are initialized (S54). Every element of table is initialized to min_score × len_X, where min_score is a predetermined constant, here min_score = -10000. Every element of value is initialized to 0. For the array from, the k-th element is initialized to (k - 1).

[0045] Next, the counter i, which indicates the character of interest in the input character string, is set to 1 (S56). That is, the value of i is set so that it points to the head of the input character string.

[0046] If table[i] = min_score × len_X, table[i] is set to 0; otherwise nothing is done (S58).

[0047] The counter j is set to i + 1 (S60). The character string from the i-th through j-th characters of X is taken as x (S62). It is determined whether the first character of x is 「−」 or a blank (S64). If it is neither (NO in S64), tf, df and df2 of x are calculated (S66). It is then determined whether df is 1 or more, that is, whether x appears in the document group (S68). Here, tf denotes the number of times x appears across all documents.

[0048] If x appears in the document group (YES in S68), it is determined whether df2 is greater than min_df2 (S70). min_df2 is a predetermined constant, here set to 3.

[0049] If df2 is greater than min_df2 (YES in S70), it is determined whether tf is greater than the total number of documents N (S72). If tf is greater than N (YES in S72), the weight score is obtained as log(saturation_score) (S74). Here, saturation_score is a predetermined constant, set to 0.5.

[0050] If tf is N or less (NO in S72), the weight score is obtained as log(df2/df) (S76).

【００５１】ｄｆ２がｍｉｎ＿ｄｆ２以下の場合には
（Ｓ７０でＮＯ）、重みｓｃｏｒｅがｍｉｎ＿ｓｃｏｒ
ｅとして求められる（Ｓ７８）。When df2 is equal to or less than min_df2 (NO in S70), the weight score is min_scor.
e (S78).
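The branching in S70 to S78 can be sketched as a small function. This is only an illustrative sketch: the function name `weight`, the natural-logarithm base, and the value used for `MIN_SCORE` are assumptions, since this passage does not state the base of the logarithm or the value of min_score.

```python
import math

MIN_DF2 = 3              # min_df2, as given in the text
SATURATION_SCORE = 0.5   # saturation_score, as given in the text
MIN_SCORE = -10.0        # assumed placeholder; min_score is not given in this passage


def weight(tf: int, df: int, df2: int, n_docs: int) -> float:
    """Weight of a substring that appears in the document group (df >= 1)."""
    if df2 > MIN_DF2:                          # S70
        if tf > n_docs:                        # S72
            return math.log(SATURATION_SCORE)  # S74
        return math.log(df2 / df)              # S76
    return MIN_SCORE                           # S78
```

Because df2 can never exceed df, log(df2/df) is at most 0, so a substring scores highest when it is repeated in every document in which it appears.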

【００５２】After S74, S76 or S78, it is determined whether table[j] is smaller than (score + table[i]) (S80). If table[j] is smaller than (score + table[i]) (YES in S80), i is assigned to from[j], (score + table[i]) is assigned to table[j], and score is assigned to value[j] (S82).

【００５３】If the first character of x is "-" or " " (YES in S64), if df is 0 (NO in S68), if table[j] is equal to or greater than (score + table[i]) (NO in S80), or after the processing of S82, the counter j is incremented by one (S84). It is then determined whether j has become greater than len_X (S86).

【００５４】If j is equal to or less than len_X (NO in S86), the process returns to S62. If j is greater than len_X (YES in S86), the counter i is incremented by one (S88). It is then determined whether i is greater than len_X (S90). If i is equal to or less than len_X (NO in S90), the process returns to S58. If i is greater than len_X (YES in S90), the arrays from and value are stored in the operation result storage unit 30 (S92), and the processing in the score calculation unit 26 ends.
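The double loop of S58 to S92 is a dynamic-programming pass over all substrings of X. The following is a hedged sketch rather than the patented implementation: it assumes 0-based, half-open slicing, assumes that table is initialized to min_score × len_X (the initialization step lies outside this passage), uses a hypothetical default for min_score, and takes the statistics lookup as a caller-supplied function `stats`. The array is named `frm` because `from` is a reserved word in Python.

```python
import math
from typing import Callable, Tuple

Stats = Callable[[str], Tuple[int, int, int]]  # substring -> (tf, df, df2)


def score_table(X: str, stats: Stats, n_docs: int,
                min_score: float = -10.0,      # assumed value
                min_df2: int = 3,
                saturation_score: float = 0.5,
                skip_first: Tuple[str, ...] = ("-", " ")):
    n = len(X)
    init = min_score * n               # assumed initial content of table[]
    table = [init] * (n + 1)
    frm = [0] * (n + 1)                # start index of the best piece ending at j
    value = [0.0] * (n + 1)            # weight of that piece
    for i in range(n):
        if table[i] == init:           # S58
            table[i] = 0.0
        for j in range(i + 1, n + 1):  # S60, S84, S86
            x = X[i:j]                 # S62
            if x[0] in skip_first:     # S64
                continue
            tf, df, df2 = stats(x)     # S66
            if df < 1:                 # S68
                continue
            if df2 > min_df2:          # S70
                if tf > n_docs:        # S72
                    score = math.log(saturation_score)  # S74
                else:
                    score = math.log(df2 / df)          # S76
            else:
                score = min_score      # S78
            if table[j] < score + table[i]:             # S80
                frm[j] = i                              # S82
                table[j] = score + table[i]
                value[j] = score
    return frm, value, table           # S92 stores from and value
```

Each entry table[j] ends up holding the best total weight of any division of the first j characters, and frm[j] records where the last piece of that division begins.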

【００５５】For example, when the character string "２０００年問題の対応策について" ("On countermeasures for the Year 2000 problem") is divided, the values of the arrays from, value and table are as shown in FIG. 12. From these values, the words are delimited as shown in FIG. 13. The value in parentheses is the weight of each word.

【００５６】Referring to FIG. 14, the process of obtaining the number df of documents in which a character string a appears and the number df2 of documents in which a appears two or more times (S66 in FIG. 10) will be described. To shorten the processing time for repeated character strings, this process registers the character string a together with the calculated df and df2 in a hash table for storing document counts (hereinafter the "document-count hash table"), which makes recalculation unnecessary. First, it is determined whether a is registered in the document-count hash table. If a is already registered (YES in S101), the registered df and df2 are retrieved (S102).

【００５７】If the character string a is not registered (NO in S101), the suffix file is searched for a from its beginning, and the suffix corresponding to the first occurrence of a is taken as min (S103). If the suffix min cannot be found, that is, if the suffix file does not contain a (YES in S104), a does not appear in any document. The values of df and df2 are therefore set to 0 (S105).

【００５８】If the suffix min is found (NO in S104), the suffix corresponding to the last occurrence of a at or after suffix min in the suffix file is taken as max (S106). The suffixes in the range from min to max are the character strings that match a. The number of distinct document numbers attached to these character strings is obtained and taken as df (S107). In addition, the document numbers attached to these character strings are examined, the number of document numbers that occur two or more times is obtained, and that number is taken as df2 (S108).

【００５９】After the processing of S108 or S105, the character string a and the document counts df and df2 are registered in the document-count hash table (S109). After the processing of S109 or S102, df and df2 are returned as the number of documents in which a appears and the number of documents in which a appears two or more times, respectively (S110).
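The lookup of FIG. 14 can be sketched as follows. This is a minimal sketch under stated assumptions: the suffix file is stood in for by an in-memory sorted list of (suffix, document number) pairs, the class name `DocCounter` and method name `df_df2` are hypothetical, and a binary search replaces the front-to-back search described in the text. The upper bound uses the largest Unicode code point as a sentinel, which ignores the corner case of a document that itself contains that character.

```python
from bisect import bisect_left, bisect_right
from collections import Counter


class DocCounter:
    def __init__(self, docs):
        # Stand-in for the suffix file: every suffix of every document, sorted.
        pairs = sorted((d[k:], n) for n, d in enumerate(docs) for k in range(len(d)))
        self.keys = [s for s, _ in pairs]
        self.doc_no = [n for _, n in pairs]
        self.cache = {}  # document-count hash table: a -> (df, df2)

    def df_df2(self, a):
        if a in self.cache:                                  # S101
            return self.cache[a]                             # S102
        lo = bisect_left(self.keys, a)                       # S103: min
        hi = bisect_right(self.keys, a + "\U0010ffff") - 1   # S106: max
        if lo > hi:                                          # S104
            result = (0, 0)                                  # S105
        else:
            per_doc = Counter(self.doc_no[lo:hi + 1])
            df = len(per_doc)                                # S107
            df2 = sum(1 for c in per_doc.values() if c >= 2) # S108
            result = (df, df2)
        self.cache[a] = result                               # S109
        return result                                        # S110
```

For the documents `["abab", "ab", "cd"]`, the string "ab" appears in two documents and twice in one of them, so the sketch yields df = 2 and df2 = 1.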

【００６０】Referring to FIG. 15, the process of obtaining the number of times tf that the character string a appears across all documents (S66 in FIG. 10) will be described.

【００６１】The suffix file is searched for the character string a from its beginning, and the suffix of the first occurrence of a is taken as min (S121). If the suffix min cannot be found, that is, if the suffix file does not contain a (YES in S122), 0 is assigned to tf (S123). If the suffix min is found (NO in S122), the suffix of the last occurrence of a in the suffix file is taken as max (S124). tf is then obtained according to the following equation (2) (S125).

【００６２】
tf = max − min + 1 … (2)

After S123 or S125, tf is returned as the number of occurrences of the character string a (S126).
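Equation (2) works because all suffixes that begin with a are contiguous in a sorted suffix list. A minimal sketch, assuming the suffix file is represented as a plain sorted list of suffix strings and using a hypothetical function name `term_frequency`:

```python
from bisect import bisect_left, bisect_right


def term_frequency(sorted_suffixes, a):
    """Number of occurrences of a, given all suffixes of all documents, sorted."""
    lo = bisect_left(sorted_suffixes, a)                      # S121: min
    hi = bisect_right(sorted_suffixes, a + "\U0010ffff") - 1  # S124: max
    if lo > hi:                                               # S122
        return 0                                              # S123
    return hi - lo + 1                                        # S125, equation (2)


docs = ["abab", "ab", "cd"]
suffixes = sorted(d[k:] for d in docs for k in range(len(d)))
print(term_frequency(suffixes, "ab"))  # 3
```

The sentinel `"\U0010ffff"` is an implementation convenience of this sketch, not something described in the text.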

【００６３】[Processing of Document Dividing Unit 32] The document dividing unit 32 divides the input document based on the arrays from and value stored in the operation result storage unit 30. That is, the document is divided so that the total of the weights score of the resulting pieces is maximized.
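One plausible way to recover the division from the stored arrays is to walk the from array backwards from the end of the string. The text does not spell out this step, so the following is an assumption-laden sketch: the function name `split_by_from` is hypothetical, `frm` stands for the array from, and indices are 0-based with frm[j] giving the start of the piece that ends at position j.

```python
def split_by_from(X, frm, value):
    """Return [(piece, weight), ...] by following frm back from the end of X."""
    pieces = []
    j = len(X)
    while j > 0:
        i = frm[j]                       # start of the best piece ending at j
        pieces.append((X[i:j], value[j]))
        j = i
    pieces.reverse()
    return pieces


print(split_by_from("abcd", [0, 0, 0, 2, 2], [0.0, 0.0, 0.0, -10.0, -10.0]))
# [('ab', 0.0), ('cd', -10.0)]
```

The loop always terminates because every stored start index is smaller than the end index it belongs to.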

【００６４】[Processing of Narrowing Unit 34] The narrowing unit 34 executes the processing shown in the flowchart of FIG. 6, described above under [Selection of Keyword Candidates], to narrow down the keywords.

【００６５】[Configuration Example of Keyword Extraction Device 20] The keyword extraction device 20 described above can be realized by a computer. Referring to FIG. 16, the keyword extraction device 20 includes a computer 41; a keyboard 45 and a mouse 46 for giving instructions to the computer 41; a display 42 for displaying results computed by the computer 41; and a magnetic tape device 43, a CD-ROM (Compact Disc-Read Only Memory) device 47 and a communication modem 49, each for reading the program executed by the computer 41.

【００６６】The program of the keyword extraction device 20 is recorded on a magnetic tape 44 or a CD-ROM 48, which are recording media readable by the computer 41, and is read by the magnetic tape device 43 or the CD-ROM device 47, respectively. Alternatively, it is read by the communication modem 49 via a communication line.

【００６７】Referring to FIG. 17, the computer 41 includes a CPU (Central Processing Unit) 50 for executing the program read via the magnetic tape device 43, the CD-ROM device 47 or the communication modem 49; a ROM (Read Only Memory) 51 for storing other programs and data required for the operation of the computer 41; a RAM (Random Access Memory) 52 for storing the program, parameters used during program execution, computation results and the like; and a magnetic disk 53 for storing programs, data and the like.

【００６８】The program read by the magnetic tape device 43, the CD-ROM device 47 or the communication modem 49 is executed by the CPU 50, and the keyword extraction processing is thereby performed.

【００６９】The suffix file storage unit 24 and the operation result storage unit 30 are realized by the RAM 52 or the magnetic disk 53. The other components of the keyword extraction device 20 are realized by software executed by the CPU 50.

【００７０】[Variation of the Weight Formula for Character String x] In S8 of FIG. 3 and S76 of FIG. 10, the weight of the character string x is obtained as log(df2/df). Alternatively, taking into account the appearance frequency, the appearance concentration, the substring length and the average document size, the weight may be obtained as log{(N/df) × (df2/df) × len(x)} when the average document size is greater than 200 characters, and as log{(N/df) × len(x)} when the average document size is 200 characters or less. As individual documents become smaller, the appearance concentration tends toward 0. Changing the weight calculation according to the average document size therefore makes it possible to compute an appropriate weight even when the value of the appearance frequency becomes small.
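The size-dependent variant can be written as a single function. This is only a sketch: the function name `weight_variant` and its parameter names are hypothetical, the natural logarithm is an assumption, and the code presumes df ≥ 1 (and df2 ≥ 1 on the long-document branch), since the logarithm is undefined otherwise and the text does not say how such cases are handled.

```python
import math


def weight_variant(n_docs, df, df2, length, avg_doc_size, size_threshold=200):
    """Weight of a substring of the given length, switching on average document size."""
    if avg_doc_size > size_threshold:
        # Long documents: use frequency, concentration and length.
        return math.log((n_docs / df) * (df2 / df) * length)
    # Short documents: concentration is unreliable, so it is left out.
    return math.log((n_docs / df) * length)
```

With N = 100, df = 10, df2 = 5 and a two-character substring, the long-document branch gives log(10) and the short-document branch gives log(20).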

【００７１】As described above, according to the present embodiment, keywords can be extracted without preparing a dictionary in advance, which morphological analysis would require.

【００７２】[Second Embodiment] The information retrieval device according to the present embodiment is realized by a computer similar to the one described in the first embodiment.

【００７３】The present embodiment relates to a method of calculating the similarity between character strings using extracted keywords. It is assumed that the similarity between an input character string and each of a plurality of documents registered in a database is calculated. To find the matching portions between the character strings across all the documents in the database, the documents containing a keyword extracted from the input character string are retrieved efficiently from the database by using the suffix file.

【００７４】Matching information is collected for each extracted keyword as follows. The documents containing the keyword are found from the entire document database. The occurrence location of the keyword in each of those documents, the occurrence location of the keyword in the input character string, the length of the keyword and the weight of the keyword are recorded as matching information.

【００７５】Normally, the matching information obtained is not recorded or managed; the weights are simply added as they are to calculate the similarity. By recording and managing it, however, the approach can be applied not only to calculating the similarity by adding the weights of the matched keywords but also to many other similarity calculation methods, while remaining fast.

【００７６】The similarity between the input character string and a document in the database is calculated by adding the weights assigned to the matched keywords.

【００７７】FIGS. 18 to 21 show the processing flow of a document search program according to the present invention, which uses a character string similarity calculated from the extracted keywords. Based on an input search text, the program searches the document database and retrieves a plurality of documents with high similarity.

【００７８】Referring to FIG. 18, the process of searching the document database based on a search text and selecting and outputting documents with high similarity will be described.

【００７９】First, in preparation for efficiently counting the occurrences of a character string, all documents in the document database are combined and a suffix file is created (S131).

【００８０】Next, the search text is read into the character string X (S132). The keywords extracted from X are recorded in a keyword management table (S133).

【００８１】For each keyword recorded in the keyword management table, matching information is collected and recorded in a matching information management table (S134). Matching information is information representing the occurrence location of the keyword in the character string X, the occurrence location of the keyword in a document, the length of the keyword and the weight of the keyword. In the matching information management table, the matching information is recorded as a list for each document number. The processing of S134 is described in detail later.

【００８２】The list for one document Y is taken from the matching information management table (S135).

【００８３】The similarity between the character string X and the document Y is calculated from the list taken out (S136). The processing of S136 is described in detail later.

【００８４】The calculated similarity and the document number are registered as a pair in a document management table (S137).

【００８５】It is determined whether the similarity has been calculated for every list recorded in the matching information management table (S138). If not (NO in S138), the process returns to S135.

【００８６】If the similarity has been calculated for every list (YES in S138), the pairs of similarity and document number in the document management table are sorted in descending order of similarity.

【００８７】Documents with high similarity are output (S140). A single document may be output, or a predetermined number of documents may be output. Alternatively, the documents whose similarity is equal to or greater than a predetermined value may be output.
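The sorting and output options just described can be sketched in a few lines. The function name `select_documents` and its parameters `top_k` and `min_similarity` are hypothetical; the input is assumed to be the (similarity, document number) pairs held in the document management table.

```python
def select_documents(scored, top_k=None, min_similarity=None):
    """scored: iterable of (similarity, document_number) pairs."""
    # Sort in descending order of similarity.
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    if min_similarity is not None:
        # Keep only documents at or above the similarity threshold.
        ranked = [pair for pair in ranked if pair[0] >= min_similarity]
    if top_k is not None:
        # Keep only the requested number of documents.
        ranked = ranked[:top_k]
    return ranked


pairs = [(0.2, 1), (0.9, 2), (0.5, 3)]
print(select_documents(pairs, top_k=1))             # [(0.9, 2)]
print(select_documents(pairs, min_similarity=0.5))  # [(0.9, 2), (0.5, 3)]
```

Setting `top_k=1` corresponds to outputting a single document, a larger `top_k` to a predetermined number, and `min_similarity` to the threshold option.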

【００８８】Referring to FIG. 19, the process of collecting matching information between each keyword recorded in the keyword management table and each document in the document database and recording that information in the matching information management table (S134 in FIG. 18) will be described.

【００８９】One keyword is selected from the keyword management table and taken as a (S151). All locations at which keyword a appears in the document database are found and sorted in order of location (S152).

【００９０】For each occurrence location of keyword a, the number of the document containing it is obtained. Because the occurrences are sorted by location, the resulting document numbers are also in ascending order (S153).

【００９１】The occurrence locations of keyword a are selected one at a time, starting from the front (S154). It is determined whether the selected occurrence location is the earliest occurrence in the document that contains it (S155). That is, if the document of the selected location differs from the document of the previously selected location, the selected location is the first occurrence in that document. If the two documents are the same, the selected location is the second or a later occurrence in that document.

【００９２】If the occurrence location of keyword a is determined to be the first in the document (YES in S155), the occurrence location of keyword a in the input character string X (hereinafter "startX"), the occurrence location of keyword a in the document (hereinafter "startdoc"), the length of keyword a (hereinafter "termlength") and the weight of keyword a (hereinafter "score") are recorded as a set in the matching information management table (S156).

【００９３】Referring to FIG. 20, the matching information management table consists of a list of matching information for each piece of document information. Matching information 1 and 5 are recorded as a list for document number 0002, matching information 2, 3 and 6 for document number 0100, and matching information 4 and 7 for document number 0111. Each piece of matching information stores the startX, startdoc, termlength and score of a keyword.

【００９４】When matching information 8 is newly obtained for document number 0002, as shown in FIG. 20, the head-of-list pointer that previously pointed to matching information 5 is made to point to matching information 8, a pointer is set from matching information 8 to matching information 5, and matching information 8 is thereby recorded at the head of the list for document information 0002.

【００９５】Referring again to FIG. 19, after S156, or if the occurrence location of keyword a is determined to be the second or a later occurrence in the document (NO in S155), it is determined whether all occurrence locations of keyword a have been examined (S157).

【００９６】If there is an occurrence location that has not been examined (NO in S157), the process returns to S154. If all occurrence locations have been examined (YES in S157), it is determined whether matching information has been collected for every keyword in the keyword management table (S158). If there is a keyword for which matching information has not been collected (NO in S158), the process returns to S151 to read a keyword a that has not yet been selected. If matching information has been collected for every keyword (YES in S158), the resulting matching information management table is returned (S159).
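The collection loop of FIG. 19 and the head insertion of FIG. 20 can be sketched together. This is a simplified sketch, not the described implementation: it scans each document directly instead of consulting the suffix file, represents the table as a dictionary of Python lists rather than pointer-linked lists, and uses a hypothetical record type `Match` and function name `collect_matches`. The input keywords are assumed to be (text, startX, score) triples.

```python
from collections import defaultdict, namedtuple

Match = namedtuple("Match", "startX startdoc termlength score")


def collect_matches(keywords, docs):
    """Build {document number: [Match, ...]} with the newest match at the head."""
    table = defaultdict(list)
    for text, start_x, score in keywords:            # S151, S158
        prev_doc = None
        for doc_no, doc in enumerate(docs):          # occurrences in location order
            pos = doc.find(text)
            while pos != -1:                         # S154, S157
                if doc_no != prev_doc:               # S155: first occurrence in this document
                    record = Match(start_x, pos, len(text), score)
                    table[doc_no].insert(0, record)  # S156: record at the head of the list
                prev_doc = doc_no
                pos = doc.find(text, pos + 1)
    return table                                     # S159
```

Only the first occurrence of a keyword in each document is recorded, so a keyword contributes at most one entry per document.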

【００９７】Referring to FIG. 21, the process of obtaining the similarity between the input text X and a document Y by adding the weights of the matched character strings, using the list taken from the matching information management table (S136 in FIG. 18), will be described.

【００９８】The similarity between X and Y (hereinafter "sim") is initialized to 0 (S161). One piece of matching information is selected from the list for Y recorded in the matching information management table and taken as I (S162).

【００９９】The score of matching information I is added to sim (S163). It is determined whether every piece of matching information recorded in the list of matching information for document Y has been examined (S164). If there is matching information that has not been examined (NO in S164), the process returns to S162. If all matching information has been examined (YES in S164), the resulting sim is returned as the similarity between the input text X and the document Y (S165).
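The accumulation in S161 to S165 amounts to summing the score field over one document's list. A minimal sketch, assuming each piece of matching information is a dictionary with a "score" key (the representation and the function name `similarity` are assumptions):

```python
def similarity(match_list):
    """Sum of the weights of the keywords matched in one document."""
    sim = 0.0                   # S161
    for info in match_list:     # S162, S164
        sim += info["score"]    # S163
    return sim                  # S165


matches_for_y = [
    {"startX": 0, "startdoc": 4, "termlength": 2, "score": 1.5},
    {"startX": 5, "startdoc": 9, "termlength": 3, "score": 0.5},
]
print(similarity(matches_for_y))  # 2.0
```

Because the other fields are kept alongside the score, a different similarity measure could use the positions or lengths without changing how the list is built.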

【０１００】As described above, the information retrieval device according to the present embodiment makes it possible to find, in a previously registered database, documents similar to a document input by the user. Therefore, in an FAQ (Frequently Asked Questions) system, for example, when a user gives a question as input text, the FAQ entry corresponding to that input text can be retrieved.

【０１０１】The embodiments disclosed herein are to be considered in all respects as illustrative and not restrictive. The scope of the present invention is indicated by the claims rather than by the above description, and is intended to include all modifications within the meaning and scope equivalent to the claims.

【０１０２】[0102]

【発明の効果】[Effect of the Invention] Keywords can be extracted from a document without requiring a dictionary.

[Brief description of the drawings]

【図１】FIG. 1 is a block diagram showing the configuration of a keyword extraction device according to Embodiment 1 of the present invention.

【図２】FIG. 2 is a diagram showing the substrings x of the phrase "ロボットについて" ("about robots") and the corresponding values of df, df2, α and β.

【図３】FIG. 3 is a flowchart of the process for estimating the word-likeness (weight) of a character string.

【図４】FIG. 4 is a diagram showing an example in which a document is divided using the weights of character strings.

【図５】FIG. 5 is a diagram showing an example in which a document is divided by morphological analysis.

【図６】FIG. 6 is a flowchart of the process for extracting keywords.

【図７】FIG. 7 is a flowchart of the process performed by the suffix file creation unit 22.

【図８】FIG. 8 is a diagram for explaining the suffix file.

【図９】FIG. 9 is a flowchart of the process performed by the delimiting unit 28.

【図１０】FIG. 10 is a flowchart of the process executed by the score calculation unit 26.

【図１１】FIG. 11 is a diagram for explaining the arrays value, table and from used for the weight calculation in the score calculation unit 26.

【図１２】FIG. 12 is a diagram showing the values of the arrays from, value and table when the character string "２０００年問題の対応策について" is divided.

【図１３】FIG. 13 is a diagram showing an example in which words are delimited according to their weights.

【図１４】FIG. 14 is a flowchart of the process for obtaining the number df of documents in which a character string a appears and the number df2 of documents in which a appears two or more times.

【図１５】FIG. 15 is a flowchart of the process for obtaining the number of times tf that a character string a appears across all documents.

【図１６】FIG. 16 is an external view of a computer that realizes the keyword extraction device.

【図１７】FIG. 17 is a diagram showing the hardware configuration of the computer shown in FIG. 16.

【図１８】FIG. 18 is a flowchart of the process of searching the document database based on a search text and selecting and outputting documents with high similarity.

【図１９】FIG. 19 is a flowchart of the process of collecting matching information between each keyword recorded in the keyword management table and each document in the document database and recording that information in the matching information management table.

【図２０】FIG. 20 is a diagram showing the structure of the matching information management table.

【図２１】FIG. 21 is a flowchart of the process of obtaining the similarity between an input text X and a document Y by adding the weights of the matched character strings, using a list taken from the matching information management table.

[Explanation of symbols]

20 keyword extraction device, 22 suffix file creation unit, 24 suffix file storage unit, 26 score calculation unit, 28 delimiting unit, 30 operation result storage unit, 32 document dividing unit, 34 narrowing unit, 41 computer, 42 display, 43 magnetic tape device, 44 magnetic tape, 45 keyboard, 46 mouse, 47 CD-ROM device, 48 CD-ROM, 49 communication modem, 50 CPU, 51 ROM, 52 RAM, 53 magnetic disk.

Continued from the front page
(72) Inventor: Yoshinori Takenami, c/o Osaka Works, Sumitomo Electric Industries, Ltd., 1-1-3 Shimaya, Konohana-ku, Osaka
(72) Inventor: Masahiro Kishida, c/o Osaka Works, Sumitomo Electric Industries, Ltd., 1-1-3 Shimaya, Konohana-ku, Osaka
F-terms (reference): 5B075 ND03 NK31 PP02 PP03 PP22 PQ02 PR04 PR06 QM08

Claims

[Claims]

1. A keyword extraction device comprising: appearance frequency calculating means for obtaining the appearance frequency, within a document group, of a partial character string included in each document of the document group; appearance concentration calculating means for obtaining the appearance concentration of the partial character string within the document group; and first keyword extracting means, connected to the appearance frequency calculating means and the appearance concentration calculating means, for extracting a keyword from an input document based on the appearance frequency and the appearance concentration.

2. The keyword extraction device according to claim 1, wherein the first keyword extracting means includes: document dividing means for dividing the input document into partial character strings; word-likeness calculating means, connected to the document dividing means, the appearance frequency calculating means and the appearance concentration calculating means, for calculating the word-likeness of each partial character string based on the appearance frequency and the appearance concentration; and second keyword extracting means, connected to the word-likeness calculating means, for extracting a keyword from the document based on the total value of the word-likeness.

3. The keyword extraction device according to claim 2, wherein the word-likeness calculating means includes means, connected to the document dividing means, the appearance frequency calculating means and the appearance concentration calculating means, for calculating the word-likeness of the partial character string based on the appearance frequency, the appearance concentration, the length of the partial character string and the average size of the documents.

4. The keyword extraction device according to claim 2, further comprising narrowing means, connected to the second keyword extracting means, the appearance frequency calculating means and the appearance concentration calculating means, for narrowing down the keywords extracted by the second keyword extracting means based on the appearance concentration and the length of the partial character string.

5. The keyword extraction device according to claim 2, further comprising delimiting means for delimiting the input document at punctuation marks and supplying the delimited document to the document dividing means.

6. The keyword extraction device according to claim 2, wherein the document dividing means includes means for dividing the input document into partial character strings such that no partial character string begins with a predetermined character.

7. The keyword extraction device according to claim 2, wherein the document dividing means includes means for dividing the input document into partial character strings such that the length of each partial character string does not exceed a predetermined number of characters.

8. An information retrieval device comprising: appearance frequency calculating means for obtaining the appearance frequency, within a document group, of a partial character string included in each document of the document group; appearance concentration calculating means for obtaining the appearance concentration of the partial character string within the document group; keyword extracting means, connected to the appearance frequency calculating means and the appearance concentration calculating means, for extracting a keyword from an input document based on the appearance frequency and the appearance concentration; coincidence calculating means, connected to the keyword extracting means, for calculating the degree of coincidence between each document in the document group and each keyword extracted by the keyword extracting means; similarity calculating means, connected to the coincidence calculating means, for calculating, for each document in the document group, a similarity to the input document based on the degree of coincidence; and means, connected to the similarity calculating means, for extracting a document based on the similarity.