JP2000076254A

JP2000076254A - Keyword extraction device, similar document retrieval device using the same, keyword extraction method and record medium

Info

Publication number: JP2000076254A
Application number: JP10245029A
Authority: JP
Inventors: Yasuo Tanosaki; 康雄田野崎; Yukio Nakamoto; 幸夫中本; Takuya Nishina; 卓哉仁科
Original assignee: Toshiba Corp; Toshiba Computer Engineering Corp
Current assignee: Toshiba Corp; Toshiba Computer Engineering Corp
Priority date: 1998-08-31
Filing date: 1998-08-31
Publication date: 2000-03-14

Abstract

PROBLEM TO BE SOLVED: To extract, with high accuracy, keywords that take into account the documents in a database from a text given as a keyword extraction target, without performing burdensome processing such as morphological analysis on each document in the database. SOLUTION: A word extraction unit 12b extracts words from the keyword extraction target text. The appearance frequency in the text is obtained for each extracted word and stored in a word management table 13b. A word search execution unit 12c performs a full-text search of each document in a document database storage unit 11b for each extracted word, and the resulting appearance frequency in the database is stored in the word management table 13b. An importance calculation unit 12d calculates the importance of each word based on the appearance frequency in the text and the appearance frequency in the database stored in the word management table 13b. A keyword determination unit 12e determines the keywords based on the importance of each word.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a keyword extraction device that extracts highly important words as keywords from a text given as a keyword extraction target, and in particular to a keyword extraction device capable of extracting keywords that take into account the documents in a database, as well as to a similar document retrieval device using this keyword extraction device, a keyword extraction method, and a recording medium.

[0002]

[Prior Art] In recent years, it has become important to extract highly important words from a given text as keywords, either to serve as index keys for efficiently searching document data or to search for similar documents.

[0003] There are several methods for selecting keywords by weighting each word extracted from a text given as a keyword extraction target. The simplest method refers only to the appearance frequency of each word in the keyword extraction target text. However, this method merely raises the importance of words that are used frequently in the text, so its accuracy is poor.

[0004] Another method uses keyword and unnecessary-word dictionaries together with the appearance frequency. Words to be treated as keywords or as unnecessary words are registered in a dictionary in advance, and weighting is performed according to how many times the registered words appear in the keyword extraction target text.

[0005] When a plurality of document data items are stored in a database, there is also a method in which an appearance frequency table is created in advance for each word appearing in the database, and words with a low appearance frequency are assigned a high importance.

[0006] Words with a low appearance frequency in the database are assigned a high importance because words that appear frequently in the database have a low value as keywords, whereas words that appear rarely have a high value as keywords. In a database of patent documents, for example, words such as "patent" and "invention" are of this frequent, low-value kind. In addition, methods have been attempted in which syntactic analysis is performed on the keyword extraction target text and weights are determined from the dependency relationships between words and the like.

[0007]

[Problems to Be Solved by the Invention] As described above, various methods for extracting keywords have conventionally been considered. However, in the method in which words to be used as keywords and unnecessary words are registered in a dictionary, for example, creating the dictionary is difficult, and when the target text is novel material such as a newspaper article, the registered contents of the dictionary cannot keep up with the latest vocabulary.

[0008] In the method in which a word appearance frequency table is created in advance for each document in the database, morphological analysis must be performed on every document in the database to extract words and compile appearance frequency statistics. This processing takes a great deal of time, and the appearance frequency table must be updated every time a document is added to or updated in the database. This drawback is especially pronounced when updates are frequent. A further problem is that an enormous storage capacity is required to keep the appearance frequency statistics.

[0009] In the method in which word weights are determined using syntactic analysis, the accuracy of syntactic analysis itself has not yet reached a practical level. Moreover, because the database is not referenced, no information on past appearances of the words is available, which makes it difficult to extract keywords with sufficient accuracy.

[0010] The present invention has been made in view of the above points. Its object is to provide a keyword extraction device capable of extracting, with high accuracy, keywords that take into account the documents in a database from a text given as a keyword extraction target, without performing burdensome processing such as morphological analysis on each document in the database, and to provide a similar document retrieval device using this keyword extraction device, a keyword extraction method, and a recording medium.

[0011]

[Means for Solving the Problems] The keyword extraction device of the present invention extracts words from a text given as a keyword extraction target by morphological analysis or the like, and obtains the appearance frequency in the text for each extracted word. For each extracted word, it also performs a full-text search of each document in a database to obtain the appearance frequency in the database. It then calculates the importance of each word based on the appearance frequency in the text and the appearance frequency in the database, and thereby determines the words that serve as keywords of the text.

[0012] With this configuration, the importance of a word is obtained by taking into account its appearance frequency in the database in addition to its appearance frequency in the keyword extraction target text. A relative importance can therefore be obtained, and words suitable as keywords can be extracted based on that importance.

[0013] In addition, because words are searched for in the database by full-text search, the search processing can be made faster, and less memory capacity is needed for the keyword extraction processing.

[0014] Furthermore, by searching the database using the keywords extracted from the keyword extraction target text as described above, documents similar to the text can be obtained from the database as similar documents.

[0015]

[Embodiments of the Invention] An embodiment of the present invention will be described below with reference to the drawings. FIG. 1 is a block diagram showing the configuration of a keyword extraction device according to one embodiment of the present invention. This device is built on a computer having a general architecture as one function of a similar document retrieval device.

[0016] As shown in FIG. 1, the device comprises an external storage device 11, a control device 12, and a memory 13. The external storage device 11 is, for example, a hard disk device or a DVD (Digital Video Disc) device. The external storage device 11 holds a keyword extraction target text storage unit 11a, a document database storage unit 11b, and an appearance position management table 11c.

[0017] The keyword extraction target text storage unit 11a stores text data given as a keyword extraction target. This text data is, for example, document data entered and created directly by a user or document data received from outside via a network, and here it is assumed to be separate from the document data already registered in the database. The device uses this text data as the keyword extraction target text, extracts the words that serve as keywords from it, and then performs processing such as retrieving documents containing those keywords from the database as similar documents.

[0018] As shown in FIG. 2, the document database storage unit 11b consists of a plurality of search target documents. As shown in the figure, each search target document comprises a document information data part 21 holding items such as the author name and the creation date and time, a text data part 22 that makes up the document, and a multimedia data part 23 holding the charts, images, and the like included in the document. The search target documents stored in the document database storage unit 11b are assumed not to be a mix of documents from miscellaneous fields but to have a certain homogeneity, for example all falling under one specific broad category such as economics or technology.

[0019] The appearance position management table 11c stores the positions at which words appear in each search target document stored in the document database storage unit 11b. The appearance position management table 11c will be described later with reference to FIG. 6.

[0020] The control device 12 controls the entire device. It reads a program recorded on a recording medium such as a magnetic disk and executes various kinds of processing. As processing units realized by the program, it has a text input unit 12a, a word extraction unit 12b, a word search execution unit 12c, an importance calculation unit 12d, and a keyword determination unit 12e.

[0021] The memory 13 is, for example, a RAM and stores the data needed by these processing units. Here, the memory 13 holds an input text storage buffer 13a, a word management table 13b, and a keyword storage buffer 13c.

[0022] In the control device 12, the text input unit 12a stores the keyword extraction target text held in the keyword extraction target text storage unit 11a of the external storage device 11 into the input text storage buffer 13a in the memory 13.

[0023] The word extraction unit 12b performs morphological analysis on the text data stored in the input text storage buffer 13a to extract words, and registers them in the word management table 13b in the memory 13. In doing so, it also counts the appearance frequency in the text (the number of times the word appears in the text) for each extracted word and stores the total in the word management table 13b. FIG. 3 shows the structure of the word management table 13b.

[0024] The word management table 13b is a table for comparing, for each word extracted from the text, its frequency of appearance in the text with its frequency of appearance in the database.

[0025] As shown in FIG. 3, the word management table 13b consists of an "ID number" column 31 for storing the ID number assigned to each word, a "headword" column 32 for storing the words extracted from the text, an "appearance frequency in text" column 33 for storing the appearance frequency of each word in the text, an "appearance frequency in database" column 34 for storing the appearance frequency of each word in the database, and a "word importance" column 35 for storing the word importance calculated from the appearance frequency in the text and the appearance frequency in the database.
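The five-column layout of FIG. 3 can be pictured as a list of records. The following is a minimal Python sketch of one such record; the class name `WordEntry` and all field names are illustrative assumptions and do not appear in the patent.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WordEntry:
    """One row of the word management table 13b (all names are hypothetical)."""
    word_id: int                        # "ID number" column 31
    headword: str                       # "headword" column 32
    text_freq: int = 0                  # "appearance frequency in text" column 33
    db_freq: Optional[int] = None       # "appearance frequency in database" column 34
    importance: Optional[float] = None  # "word importance" column 35


# A row as it might look right after word extraction: only columns 31 to 33 are filled.
row = WordEntry(word_id=1, headword="New York", text_freq=4)
```

Leaving columns 34 and 35 empty until the later steps run mirrors the order in which the processing units fill the table.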

[0026] Returning to FIG. 1, for each word stored in the "headword" column 32 of the word management table 13b, the word search execution unit 12c counts the appearance frequency in every document stored in the document database storage unit 11b and stores the total in the "appearance frequency in database" column 34 of the word management table 13b.

[0027] For each word stored in the "headword" column 32 of the word management table 13b, the importance calculation unit 12d calculates the importance by referring to the "appearance frequency in text" column 33 and the "appearance frequency in database" column 34, and stores the result in the "word importance" column 35 of the word management table 13b.

[0028] The keyword determination unit 12e refers to the "word importance" column 35 of the word management table 13b, selects highly important words according to a fixed criterion, determines them to be the keywords of the text, and stores them in the keyword storage buffer 13c in the memory 13.

[0029] Next, the processing flow of the device will be described with reference to FIG. 5. FIG. 5 is a flowchart showing the keyword extraction processing in this embodiment. Under the control of the control device 12, the text input unit 12a is started first and stores the text given as the keyword extraction target (step S11). That is, the text data to be subjected to keyword extraction is read from the keyword extraction target text storage unit 11a of the external storage device 11 and stored in the input text storage buffer 13a of the memory 13.

[0030] Next, the word extraction unit 12b is started and extracts the words in the text (step S12). The text data stored in the input text storage buffer 13a is morphologically analyzed, and the words are extracted from the text in order based on the analysis result.

[0031] Here, of the extracted words, the nouns and sahen nouns are sorted, for example in JIS code order, and given ID numbers. Each word is stored in the "headword" column 32 of the word management table 13b in the row for its ID number, and the tallied appearance frequency of each word in the text is stored in the "appearance frequency in text" column 33. A sahen noun is a noun that becomes a verb when the auxiliary "する" ("to do") is attached, as in "格納する" ("to store") or "指定する" ("to designate").
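Step S12 can be sketched in Python as follows. This is only an illustration under stated assumptions: the morphological analyzer is not shown, its output is assumed to be a list of (surface form, part of speech) pairs, the part-of-speech labels are made up for the example, and ordinary string ordering stands in for the JIS code ordering named in the text.

```python
from collections import Counter

# Hypothetical part-of-speech labels for the words that are kept.
KEPT_POS = {"noun", "sahen-noun"}


def build_word_table(morphemes):
    """Count the kept words, sort them, and assign ID numbers starting at 1."""
    counts = Counter(surface for surface, pos in morphemes if pos in KEPT_POS)
    # Plain string ordering stands in for the JIS code ordering used in the patent.
    return [
        {"id": i, "headword": word, "text_freq": counts[word]}
        for i, word in enumerate(sorted(counts), start=1)
    ]


# Assumed analyzer output for a short text.
morphemes = [
    ("stock price", "noun"), ("rise", "sahen-noun"), ("the", "other"),
    ("stock price", "noun"), ("market", "noun"),
]
table = build_word_table(morphemes)
```

Each dictionary in `table` corresponds to columns 31 to 33 of one row of the word management table 13b.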

[0032] Next, the word search execution unit 12c is started and searches for the words in the database (step S13). For each word stored in the "headword" column 32 of the word management table 13b, character string matching is performed by full-text search against the text data of each search target document in the document database storage unit 11b, and the number of times the word appears in each search target document is counted. The results are summed over all search target documents, and the total count is stored in the "appearance frequency in database" column 34 of the word management table 13b.
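The counting in step S13 can be sketched with plain substring matching. The function name and the sample documents below are assumptions made for illustration; the patent does not specify a particular string-matching algorithm. Python's `str.count` counts non-overlapping occurrences and ignores word boundaries, which is a reasonable stand-in for character string matching on unsegmented text.

```python
def count_in_database(word, documents):
    """Total number of non-overlapping occurrences of `word` across all documents."""
    return sum(doc.count(word) for doc in documents)


# Hypothetical text data parts of three search target documents.
documents = [
    "stock price rise in New York; stock price news",
    "market report from Tokyo",
    "New York market opens",
]
db_freq = {
    word: count_in_database(word, documents)
    for word in ["stock price", "New York", "market"]
}
```

No index or analysis of the documents is needed beforehand; only the words taken from the target text are searched for.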

[0033] Next, the importance calculation unit 12d is started and calculates the importance of each word (step S14). For each word in the word management table 13b, the word importance (evaluation value) is calculated using a function that increases as the appearance frequency in the text increases and decreases as the appearance frequency in the database increases, as shown in equation (1).

[0034]

word importance = appearance frequency in text / appearance frequency in database ... (1)

The word importance obtained in this way is stored for each word in the "word importance" column 35 of the word management table 13b.

[0035] FIG. 4 shows the word management table 13b after the above processing has stored data in each column. In the example of FIG. 4, the words "stock price" (株価), "New York" (ニューヨーク), "securities" (証券), "market" (市場), and "rise" (上昇) have been extracted from the keyword extraction target text and sorted in JIS code order. In the JIS code ordering, katakana comes before hiragana, so "New York" (ニューヨーク) comes first in this example. Kanji words are ordered by their kana readings.

[0036] For example, if the word "New York" appears 4 times in the text and 70 times in the database, its importance according to equation (1) is 4/70 = 0.057.

[0037] Similarly, as shown in FIG. 4, the importance of "stock price" is 12/220 = 0.054, since its appearance frequency in the text is 12 and its appearance frequency in the database is 220.

[0038] The importance of "market" is 6/180 = 0.033, since its appearance frequency in the text is 6 and its appearance frequency in the database is 180. The importance of "rise" is 3/24 = 0.125, since its appearance frequency in the text is 3 and its appearance frequency in the database is 24.

[0039] The importance of "securities" is 5/160 = 0.031, since its appearance frequency in the text is 5 and its appearance frequency in the database is 160. Thus, a word that appears many times in the text and few times in the database has a high importance. Conversely, a word that appears few times in the text and many times in the database has a low importance.
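The worked values above can be reproduced with a few lines of Python. This is a sketch of equation (1) applied to the FIG. 4 figures quoted in the text; the function and variable names are assumptions, and the case of a word that never appears in the database is not covered by the text and is not handled here.

```python
# (appearance frequency in text, appearance frequency in database), values from FIG. 4
FIG4 = {
    "New York": (4, 70),
    "stock price": (12, 220),
    "securities": (5, 160),
    "market": (6, 180),
    "rise": (3, 24),
}


def word_importance(text_freq, db_freq):
    """Equation (1). Assumes db_freq > 0; the zero case is not covered by the text."""
    return text_freq / db_freq


scores = {word: word_importance(tf, df) for word, (tf, df) in FIG4.items()}

# Words ordered from most to least important.
ranking = sorted(scores, key=scores.get, reverse=True)
```

The resulting order places "rise" first even though "stock price" is the most frequent word in the text, which is the relative weighting the paragraph describes.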

[0040] Next, the keyword determination unit 12e is started and determines the keywords based on the importance obtained for each word (step S15). Here, for example, a word is judged to be a keyword if its word importance places it in the top 40% of the words in the word management table 13b, or alternatively if its word importance is 0.05 or more, and the keywords are stored one after another in the keyword storage buffer 13c.

[0041] In the example of FIG. 4, under the condition that a keyword must be in the top 40%, "rise" and "New York" qualify and are stored in the keyword storage buffer 13c as the keywords of the text. Under the condition that the word importance must be 0.05 or more, "rise", "New York", and "stock price" are selected as the keywords of the text and stored in the keyword storage buffer 13c.
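The two determination conditions can be sketched as two small Python functions. The function names are hypothetical, and rounding the top-40% cut down to a whole number of words is an assumption, since the text does not say how fractional counts are handled.

```python
import math


def top_fraction(scores, fraction=0.4):
    """Keep the highest-scoring `fraction` of the words (at least one if any exist)."""
    keep = max(1, math.floor(len(scores) * fraction))
    return sorted(scores, key=scores.get, reverse=True)[:keep]


def above_threshold(scores, threshold=0.05):
    """Keep every word whose importance is at least `threshold`, best first."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [word for word in ranked if scores[word] >= threshold]


# Word importance values computed from the FIG. 4 frequencies.
scores = {
    "New York": 4 / 70,
    "stock price": 12 / 220,
    "securities": 5 / 160,
    "market": 6 / 180,
    "rise": 3 / 24,
}
```

With five words, the top 40% is two words, so `top_fraction(scores)` yields "rise" and "New York", while `above_threshold(scores)` additionally admits "stock price".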

[0042] The user can set these keyword determination conditions freely. Based on the conditions that have been set, the keyword determination unit 12e selects the words that serve as keywords from the word management table 13b and stores them in the keyword storage buffer 13c.

[0043] Once the keywords for the keyword extraction target text have been obtained, a higher-level module performs various kinds of processing using the words (keywords) stored in the keyword storage buffer 13c.

[0044] For example, documents containing all of the words (keywords) stored in the keyword storage buffer 13c are retrieved from the document database storage unit 11b and presented as documents similar to the text. In the above example, if the words "rise", "New York", and "stock price" have been obtained as keywords, documents containing all three words are retrieved from the database as similar documents.

[0045] Here the search condition is that a document contains all of the words stored in the keyword storage buffer 13c, but various other conditions are possible, such as requiring at least one of the words or using the proportion of the words that a document contains.
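The search conditions mentioned above (all keywords, at least one keyword, or a proportion of the keywords) can be expressed with a single ratio parameter. The following Python sketch is one possible formulation; the function names, the `min_ratio` parameter, and the sample documents are assumptions and not part of the patent.

```python
def match_ratio(document, keywords):
    """Fraction of the keywords that occur in the document (0.0 if no keywords)."""
    if not keywords:
        return 0.0
    return sum(kw in document for kw in keywords) / len(keywords)


def find_similar(documents, keywords, min_ratio=1.0):
    """Names of the documents that contain at least `min_ratio` of the keywords.

    min_ratio=1.0 requires every keyword; a small positive value accepts any
    document that contains at least one keyword.
    """
    return [
        name for name, text in documents.items()
        if match_ratio(text, keywords) >= min_ratio
    ]


# Hypothetical database contents and the keywords from the FIG. 4 example.
documents = {
    "doc1": "New York stock price rise continues",
    "doc2": "stock price falls in Tokyo",
    "doc3": "weather report",
}
keywords = ["rise", "New York", "stock price"]
```

Calling `find_similar(documents, keywords)` keeps only documents with all three keywords, whereas `find_similar(documents, keywords, min_ratio=0.3)` also accepts a document that contains just one of them.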

[0046] The search target is not limited to the documents in the database and may also be the input texts. That is, the words (keywords) stored in the keyword storage buffer 13c can also be used to retrieve matching texts from the keyword extraction target text storage unit 11a.

[0047] As described above, the importance of a word is obtained by taking into account its appearance frequency in the database in addition to its appearance frequency in the keyword extraction target text. A relative importance can therefore be obtained, and words suitable as keywords can be extracted based on that importance.

[0048] In addition, for each word extracted from the keyword extraction target text, all documents in the database are searched by full-text search to obtain the appearance frequency of the word in the database. Consequently, morphological analysis is needed only for the keyword extraction target text when extracting keywords. Compared with the conventional method, which required morphological analysis of every document in the database, processing is faster and less memory capacity is needed for the keyword extraction processing.

[0049] In the above embodiment, the total number of times a word extracted from the keyword extraction target text appears in the database is used as the appearance frequency in the database in step S13 of FIG. 5. Alternatively, the device may only determine whether each extracted word is present in each search target document in the database, and use the total number of documents in the database that contain the word as the appearance frequency in the database.

[0050] In this case, the importance (evaluation value) of each word extracted from the keyword extraction target text is calculated using a function that increases as the appearance frequency in the text increases and decreases as the total number of documents in the database containing the word increases.

[0051] Using the total number of matching documents as the appearance frequency in the database in this way removes the need to count every occurrence of each word, which has the advantage of increasing processing speed accordingly. However, it is less reliable than the method that uses the total number of occurrences of the word itself as the appearance frequency in the database.
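The difference between the two definitions of the appearance frequency in the database can be seen in a short Python sketch. The function names and sample documents are assumptions made for illustration.

```python
def document_frequency(word, documents):
    """Number of documents that contain `word` at least once (the variant above)."""
    return sum(1 for doc in documents if word in doc)


def total_occurrences(word, documents):
    """Total occurrence count, as used in step S13 of the main embodiment."""
    return sum(doc.count(word) for doc in documents)


# Hypothetical database in which one document repeats the word several times.
documents = ["market market market", "market report", "weather report"]

df = document_frequency("market", documents)
total = total_occurrences("market", documents)
```

The membership test can stop at the first match in each document, which is where the speed advantage comes from; the price is that the three repetitions in the first document are no longer reflected in the value.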

[0052] As a way of searching for words at high speed in step S13, an appearance position management table 11c such as that shown in FIG. 6 can also be created in advance and consulted. For each character code appearing in the documents in the database, the appearance position management table 11c stores the names of the documents in which the character code appears and its positions within those documents.

[0053] For example, if the character codes "あ" and "さ" appear in that order from the beginning of document 1, then document 1 and information indicating the appearance position (here "1", indicating the first position) are stored in the entry for the character code "あ" in the appearance position management table 11c, and document 1 and information indicating the appearance position (here "2", indicating the second position) are stored in the entry for the character code "さ".

[0054] By consulting the appearance position management table 11c, the appearance frequency of each word can be obtained without accessing the database. This speeds up the importance calculation.
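One way to picture the appearance position management table is as a mapping from each character to the set of (document, position) pairs where it occurs. The Python sketch below builds such a mapping and counts a word by checking that its characters occur at consecutive positions. The data structure, function names, and sample documents are assumptions; the patent does not describe how the table is laid out or queried beyond what is stated above.

```python
from collections import defaultdict


def build_position_table(documents):
    """Map each character to the set of (document name, 1-based position) pairs."""
    table = defaultdict(set)
    for name, text in documents.items():
        for pos, ch in enumerate(text, start=1):
            table[ch].add((name, pos))
    return table


def count_word(table, word):
    """Count occurrences of `word` using only the table, not the documents.

    Overlapping occurrences are counted separately, unlike str.count.
    """
    if not word:
        return 0
    return sum(
        all((name, pos + i) in table.get(ch, ()) for i, ch in enumerate(word))
        for name, pos in table.get(word[0], ())
    )


# Hypothetical documents; "doc1" starts with "あ" then "さ" as in the example above.
documents = {"doc1": "あさ、あさひ", "doc2": "さあ"}
table = build_position_table(documents)
```

Here the entry for "あ" contains ("doc1", 1) and the entry for "さ" contains ("doc1", 2), matching the example, and `count_word(table, "あさ")` finds both occurrences in "doc1" without rescanning the document text.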

【００５５】なお、出現位置管理表１１ｃを予め作成し
ておく必要があるが、この出現位置管理表１１ｃは各文
書の文字コードを先頭からチェックしていくだけで簡単
に作成できるものであるため、各文書毎に形態素解析を
必要とする単語表の作成に比べればそれ程手間はかから
ず、また、新たな文書が追加された場合でも簡単に更新
することができる。Although the appearance position management table 11c must be created in advance, it can be built simply by scanning the character codes of each document from the beginning. It therefore takes far less effort than creating a word table, which requires morphological analysis of every document, and it can easily be updated when a new document is added.

【００５６】また、例えばステップＳ１４で単語重要度
を計算する際に、テキスト内出現頻度およびデータベー
ス内出現頻度に係数を乗じたり、あるいは、それらの値
を２乗したり、対数を用いた処理を行うなどにしても良
い。Further, when the word importance is calculated in step S14, for example, the appearance frequency in the text and the appearance frequency in the database may be multiplied by coefficients, squared, or processed using logarithms.
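
The variations mentioned above can be illustrated with a short sketch. The concrete formulas, the `+ 1` and `+ 2` smoothing terms, and all names are assumptions chosen for illustration; the only property taken from the document is that the importance rises with the appearance frequency in the text and falls with the appearance frequency in the database.

```python
import math


def importance(tf, df, variant="ratio", tf_coeff=1.0, df_coeff=1.0):
    """Score a word from its in-text frequency `tf` and in-database frequency `df`.

    Every variant increases with `tf` and decreases with `df`.
    The smoothing constants are assumptions that avoid division by zero.
    """
    if variant == "ratio":
        # Frequencies multiplied by coefficients.
        return (tf_coeff * tf) / (df_coeff * df + 1)
    if variant == "squared":
        # Frequencies squared.
        return tf ** 2 / (df ** 2 + 1)
    if variant == "log":
        # Frequencies damped with logarithms.
        return math.log(1 + tf) / math.log(2 + df)
    raise ValueError(f"unknown variant: {variant}")
```

Squaring widens the gap between frequent and rare words, while the logarithmic form narrows it, so the choice depends on how strongly very common database words should be penalized.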

【００５７】要するに、本発明はその趣旨を逸脱しない
範囲で種々の変形が可能である。なお、上述した実施形
態において記載した手法は、コンピュータに実行させる
ことのできるプログラムとして、例えば磁気ディスク
（フロッピーディスク、ハードディスク等）、光ディス
ク（ＣＤ−ＲＯＭ、ＤＶＤ等）、半導体メモリなどの記
録媒体に書き込んで各種装置に適用したり、通信媒体に
より伝送して各種装置に適用することも可能である。本
装置を実現するコンピュータは、記録媒体に記録された
プログラムを読み込み、このプログラムによって動作が
制御されることにより、上述した処理を実行する。In short, the present invention can be modified in various ways without departing from its gist. The method described in the above embodiment can be written, as a program executable by a computer, to a recording medium such as a magnetic disk (floppy disk, hard disk, etc.), an optical disk (CD-ROM, DVD, etc.), or a semiconductor memory and applied to various devices, or it can be transmitted over a communication medium and applied to various devices. A computer that implements the present apparatus reads the program recorded on the recording medium and executes the processing described above with its operation controlled by that program.

【００５８】[0058]

【発明の効果】以上詳記したように本発明によれば、キ
ーワード抽出対象テキスト内での単語の出現頻度と共
に、データベース内での単語の出現頻度も加味して単語
の重要度を求めるようにしたため、相対的な重要度を得
ることができ、その重要度に基づいてキーワードとして
適切な単語を抽出することができる。As described in detail above, according to the present invention, the importance of a word is calculated by taking into account not only the appearance frequency of the word in the keyword extraction target text but also its appearance frequency in the database. A relative importance is thereby obtained, and words suitable as keywords can be extracted based on that importance.
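
The overall flow summarized above can be sketched end to end. This is a simplified illustration, not the embodiment itself: whitespace splitting stands in for word extraction, substring counting stands in for the full-text search, and the scoring formula and the name `extract_keywords` are assumptions.

```python
from collections import Counter


def extract_keywords(text, documents, top_n=3):
    """Return the `top_n` words of `text` ranked by relative importance."""
    # Appearance frequency in the text (whitespace splitting is an assumption).
    in_text = Counter(text.lower().split())
    scores = {}
    for word, tf in in_text.items():
        # Appearance frequency in the database via a plain substring search.
        df = sum(doc.lower().count(word) for doc in documents)
        # Assumed scoring: rises with tf, falls with df.
        scores[word] = tf / (df + 1)
    # Sort by descending score, breaking ties alphabetically.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_n]


text = "tokenizer design for the tokenizer"
documents = ["the cat and the dog", "design for the home", "the the the"]
keywords = extract_keywords(text, documents, top_n=2)
```

In this sample, "the" is frequent in the database and ranks last, while "tokenizer" appears twice in the text and never in the database, so it ranks first.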

【００５９】また、データベース内での単語の検索をフ
ルテキストサーチで行うようにしたため、検索処理を高
速化することができ、キーワードの抽出処理に必要とな
るメモリ容量も少なくて済むようになる。Furthermore, because words are searched for in the database by full-text search, the search processing can be sped up and less memory is required for the keyword extraction processing.

【００６０】また、このようにしてキーワード抽出対象
テキストから抽出されたキーワード単語を用いてデータ
ベースを検索することにより、当該テキストと類似する
文書をデータベースから類似文献として得ることができ
る。In addition, by searching the database using the keywords extracted in this way from the keyword extraction target text, documents similar to that text can be obtained from the database as similar documents.
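
A similar document search driven by the extracted keywords might look like the sketch below. The ranking rule (number of distinct keywords a document contains) and the name `find_similar_documents` are assumptions; the document itself only states that the database is searched using the extracted keywords.

```python
def find_similar_documents(keywords, documents, top_n=3):
    """Rank documents by how many of the extracted keywords they contain."""
    scored = []
    for name, text in documents.items():
        hits = sum(1 for keyword in keywords if keyword in text)
        if hits:
            scored.append((hits, name))
    # Most keyword hits first; ties are broken by document name.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [name for _, name in scored[:top_n]]


# Hypothetical sample database and keyword list.
documents = {
    "doc1": "keyword extraction from text",
    "doc2": "image compression",
    "doc3": "keyword search in a text database",
}
similar = find_similar_documents(["keyword", "database", "text"], documents)
```

Documents that contain none of the keywords are left out of the result, so "doc2" is not returned in this sample.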

[Brief description of the drawings]

【図１】本発明の一実施形態に係るキーワード抽出装置
の構成を示すブロック図。FIG. 1 is a block diagram showing a configuration of a keyword extraction device according to an embodiment of the present invention.

【図２】上記キーワード抽出装置に設けられる文書デー
タベース格納部の構成を示す図。FIG. 2 is a diagram showing a configuration of a document database storage unit provided in the keyword extraction device.

【図３】上記キーワード抽出装置に設けられる単語管理
表の構成を示す図。FIG. 3 is a diagram showing a configuration of a word management table provided in the keyword extraction device.

【図４】上記単語管理表に具体的なデータが格納された
状態を示す図。FIG. 4 is a view showing a state where specific data is stored in the word management table.

【図５】上記キーワード抽出装置にて実行されるキーワ
ード抽出処理の動作を示すフローチャート。FIG. 5 is a flowchart showing the operation of keyword extraction processing executed by the keyword extraction device.

【図６】上記キーワード抽出装置に設けられる出現位置
管理表の構成を示す図。FIG. 6 is a diagram showing a configuration of an appearance position management table provided in the keyword extraction device.

[Explanation of symbols]

１１…外部記憶装置１１ａ…キーワード抽出対象テキスト格納部１１ｂ…文書データベース格納部１１ｃ…出現位置管理表１２…制御装置１２ａ…テキスト入力部１２ｂ…単語抽出部１２ｃ…単語検索実行部１２ｄ…重要度計算部１２ｅ…キーワード決定部１３…メモリ１３ａ…入力テキスト格納バッファ１３ｂ…単語管理表１３ｃ…キーワード格納バッファ２１…文書情報データ部２２…テキストデータ部２３…マルチメディアデータ部
11 ... External storage device
11a ... Keyword extraction target text storage unit
11b ... Document database storage unit
11c ... Appearance position management table
12 ... Control device
12a ... Text input unit
12b ... Word extraction unit
12c ... Word search execution unit
12d ... Importance calculation unit
12e ... Keyword determination unit
13 ... Memory
13a ... Input text storage buffer
13b ... Word management table
13c ... Keyword storage buffer
21 ... Document information data part
22 ... Text data part
23 ... Multimedia data part

フロントページの続き (72)発明者中本幸夫東京都青梅市新町３丁目３番地の１東芝コンピュ―タエンジニアリング株式会社内 (72)発明者仁科卓哉東京都青梅市新町３丁目３番地の１東芝コンピュ―タエンジニアリング株式会社内Ｆターム(参考） 5B075 ND03 NK14 NK32 PP25 PQ36 PR06 QS01
Continued from the front page
(72) Inventor: Yukio Nakamoto, c/o Toshiba Computer Engineering Co., Ltd., 3-3-1 Shinmachi, Ome-shi, Tokyo
(72) Inventor: Takuya Nishina, c/o Toshiba Computer Engineering Co., Ltd., 3-3-1 Shinmachi, Ome-shi, Tokyo
F-term (reference): 5B075 ND03 NK14 NK32 PP25 PQ36 PR06 QS01

Claims

[Claims]

1. A keyword extraction device comprising: a database storing a plurality of documents; text acquisition means for acquiring a text given as a keyword extraction target; in-text word extraction means for extracting words from the text obtained by the text acquisition means and obtaining an appearance frequency in the text for each extracted word; in-database word search means for performing a full-text search of each document in the database for each word extracted by the in-text word extraction means and obtaining an appearance frequency in the database; importance calculation means for calculating the importance of each word based on the appearance frequency in the text obtained by the in-text word extraction means and the appearance frequency in the database obtained by the in-database word search means; and keyword determination means for determining a word to serve as a keyword of the text based on the importance of each word obtained by the importance calculation means.

2. The keyword extraction device according to claim 1, wherein the in-database word search means performs a full-text search of each document in the database for each word extracted by the in-text word extraction means and uses the total number of documents in the database that contain the word as the appearance frequency in the database.

3. The keyword extraction device according to claim 1, wherein the importance calculation means calculates the importance of each word extracted by the in-text word extraction means using a function whose value increases as the appearance frequency in the text increases and decreases as the appearance frequency in the database increases.

4. The keyword extraction device according to claim 2, wherein the importance calculation means calculates the importance of each word extracted by the in-text word extraction means using a function whose value increases as the appearance frequency in the text increases and decreases as the total number of documents in the database that contain the word increases.

5. The keyword extraction device according to claim 1, further comprising table means for storing, for each character code appearing in the database, the document in which the character code appears and its appearance position, wherein the in-database word search means searches for each word extracted by the in-text word extraction means by referring to the table means.

6. A similar document retrieval device comprising: a database storing a plurality of documents; text acquisition means for acquiring a text given as a keyword extraction target; in-text word extraction means for extracting words from the text obtained by the text acquisition means and obtaining an appearance frequency in the text for each extracted word; in-database word search means for performing a full-text search of each document in the database for each word extracted by the in-text word extraction means and obtaining an appearance frequency in the database; importance calculation means for calculating the importance of each word based on the appearance frequency in the text obtained by the in-text word extraction means and the appearance frequency in the database obtained by the in-database word search means; keyword determination means for determining a word to serve as a keyword of the text based on the importance of each word obtained by the importance calculation means; and similar document search means for searching the database for documents similar to the text based on the word determined by the keyword determination means.

7. A keyword extraction method for extracting, from a text given as a keyword extraction target, a keyword that takes into account each document in a database, the method comprising: extracting words from the text and obtaining an appearance frequency in the text for each extracted word; performing a full-text search of each document in the database for each extracted word and obtaining an appearance frequency in the database; calculating the importance of each word based on the appearance frequency in the text and the appearance frequency in the database; and determining a word to serve as a keyword of the text based on the importance of each word.

8. A computer-readable recording medium storing a keyword extraction program for extracting, from a text given as a keyword extraction target, a keyword that takes into account each document in a database, the program causing a computer to execute: a procedure for extracting words from the text and obtaining an appearance frequency in the text for each extracted word; a procedure for performing a full-text search of each document in the database for each extracted word and obtaining an appearance frequency in the database; a procedure for calculating the importance of each word based on the appearance frequency in the text and the appearance frequency in the database; and a procedure for determining a word to serve as a keyword of the text based on the importance of each word.