JP2011242844A

JP2011242844A - Device, method, program and system for keyword extraction

Info

Publication number: JP2011242844A
Application number: JP2010111946A
Authority: JP
Inventors: Masashi Nakaomi; 政司中臣
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 2010-05-14
Filing date: 2010-05-14
Publication date: 2011-12-01

Abstract

PROBLEM TO BE SOLVED: To provide a keyword extraction device, a keyword extraction method, a keyword extraction program and a keyword extraction system which are able to accurately extract keywords for each of multiple documents contained in a document set, together with keywords for the whole document set.

SOLUTION: A keyword extraction device includes: a document set keyword extraction unit 113 to extract keywords of a whole document set from the text information of the document set consisting of multiple documents; a document keyword extraction unit 114 to extract keywords of each of the documents contained in the document set from the text information of the document set, in consideration of the frequency of appearance of the words across the whole document set; and a keyword determination unit 112 to determine, for each of the documents contained in the document set, the keywords of the document to be the keywords extracted by the document set keyword extraction unit 113 that are also contained in the document, together with the keywords extracted by the document keyword extraction unit 114.

Description

The present invention relates to a keyword extraction device, a keyword extraction method, a keyword extraction program, and a keyword extraction system.

When one wants an overview of a document written in Japanese, or wants to prepare in advance the search terms used to retrieve that document, keywords are extracted from the document. A known technique for extracting keywords from a Japanese document decomposes each sentence of the document into morphemes, calculates for each morpheme an index of importance such as the TF (term frequency)-IDF (inverse document frequency) value, and takes the morphemes with high index values as the keywords of the document.

For example, Patent Document 1 discloses a document aggregation method that, for the purpose of classifying multiple documents by topic, extracts keywords from the documents and determines on the basis of those keywords whether the documents deal with the same topic.

However, a method that simply decomposes text into morphemes and calculates an importance for each morpheme cannot accurately extract the keywords specific to each document when it is applied to a collection of documents that share a common topic, such as a slide document consisting of multiple pages.

In the case of a slide document, each page can be regarded as one document, so the slide deck as a whole can be regarded as a collection of documents (= pages). Likewise, an ordinary document that is structured can be regarded as a collection of documents (= structural units) by treating each structural unit as one document.

However, because the documents that make up such a collection (also called a document set) are written about a common topic, extracting keywords for each document as keywords of Japanese in general leaves little difference between the keywords of the individual documents.

Conversely, extracting the keywords that are important in comparison with the other documents in the collection favors words that differ from those of the other documents, so keywords that express the topic shared by the whole document set are lost.

The technique described in Patent Document 1 extracts keywords from each of several documents on similar topics, but it cannot simultaneously extract both the keywords of the topic common to the documents and the keywords specific to each document, and therefore does not solve this problem.

An object of the present invention is therefore to provide a keyword extraction device, a keyword extraction method, a keyword extraction program, and a keyword extraction system that extract keywords for each document of a document set consisting of multiple documents, retaining the keywords that represent the document set itself while also extracting keywords that reveal the differences between the individual documents.

To achieve this object, the keyword extraction device according to claim 1 comprises: set word extraction means for extracting keywords of a document set as a whole from text information of the document set, which is based on a plurality of documents; document word extraction means for extracting, from the text information of the document set, keywords of each document in the document set in consideration of the frequency of appearance across the whole document set; and keyword determination means for determining, for each document in the document set, the keywords of that document to be the keywords extracted by the set word extraction means that are contained in the document, together with the keywords extracted by the document word extraction means.

The invention according to claim 2 is the keyword extraction device according to claim 1, wherein the set word extraction means refers to a cost storage table that stores associations between morphemes and their specificity as Japanese words, calculates the specificity of each morpheme constituting each document, and extracts the keywords of the document set as a whole on the basis of that specificity.

The invention according to claim 3 is the keyword extraction device according to claim 1 or 2, wherein the document word extraction means extracts the keywords of each document on the basis of the appearance frequency of a morpheme in that document and the inverse document frequency of the morpheme across the whole document set.

The invention according to claim 4 is the keyword extraction device according to any one of claims 1 to 3, further comprising keyword display means for displaying the keywords extracted by the set word extraction means and the keywords extracted by the document word extraction means so that the two can be distinguished.

The invention according to claim 5 is the keyword extraction device according to any one of claims 1 to 4, wherein the text information of the document set is based on a plurality of documents entered by a user, or on documents extracted from an electronic file entered by a user.

The invention according to claim 6 is the keyword extraction device according to claim 5, further comprising input means for entering the plurality of documents or the electronic file, and output means for displaying the extracted keywords.

The keyword extraction method according to claim 7 performs: set word extraction processing for extracting keywords of a document set as a whole from text information of the document set, which is based on a plurality of documents; document word extraction processing for extracting, from the text information of the document set, keywords of each document in the document set in consideration of the frequency of appearance across the whole document set; and keyword determination processing for determining, for each document in the document set, the keywords of that document to be the keywords extracted by the set word extraction processing that are contained in the document, together with the keywords extracted by the document word extraction processing.

The keyword extraction program according to claim 8 causes a computer to execute: set word extraction processing for extracting keywords of a document set as a whole from text information of the document set, which is based on a plurality of documents; document word extraction processing for extracting, from the text information of the document set, keywords of each document in the document set in consideration of the frequency of appearance across the whole document set; and keyword determination processing for determining, for each document in the document set, the keywords of that document to be the keywords extracted by the set word extraction processing that are contained in the document, together with the keywords extracted by the document word extraction processing.

The keyword extraction system according to claim 9 consists of a keyword extraction device and a client terminal connected to it via a network. The keyword extraction device comprises: set word extraction means for extracting keywords of a document set as a whole from text information of the document set, which is based on a plurality of documents; document word extraction means for extracting, from the text information of the document set, keywords of each document in the document set in consideration of the frequency of appearance across the whole document set; and keyword determination means for determining, for each document in the document set, the keywords of that document to be the keywords extracted by the set word extraction means that are contained in the document, together with the keywords extracted by the document word extraction means. The client terminal comprises input means for entering the plurality of documents, and output means for displaying the keywords determined by the keyword determination means.

According to the present invention, for a document set consisting of multiple documents, the keywords of each document constituting the document set can be extracted accurately, together with the keywords of the document set as a whole.

FIG. 1 is an example of a hardware configuration diagram of the keyword extraction device.
FIG. 2 is an example of a schematic configuration diagram of the keyword extraction system.
FIG. 3 is an example of a functional block diagram of the keyword extraction device and the client terminal.
FIG. 4 is a flowchart showing an outline of the keyword extraction processing.
FIG. 5 is an example of a document input screen.
FIG. 6 is a flowchart showing details of the keyword determination processing.
FIG. 7 is a flowchart showing details of the set word extraction processing.
FIG. 8 is an explanatory diagram of the cost storage table.
FIG. 9 is a flowchart showing details of the document word extraction processing.
FIG. 10 is an example of a keyword display screen.
FIG. 11 is a functional block diagram of a keyword extraction device and a client terminal according to another embodiment.
FIG. 12 is another example of a keyword display screen.
FIG. 13 is another example of a keyword display screen.
FIG. 14 is a functional block diagram of a keyword extraction device according to another embodiment.

The configuration according to the present invention is described in detail below on the basis of the embodiments shown in FIGS. 1 to 14.

The keyword extraction device according to this embodiment comprises: set word extraction means (set word extraction unit 113) for extracting keywords of a document set as a whole from text information of the document set, which is based on a plurality of documents (hereinafter also called a document set); document word extraction means (document word extraction unit 114) for extracting, from the text information of the document set, keywords of each document in the document set in consideration of the frequency of appearance across the whole document set; and keyword determination means (keyword determination unit 112) for determining, for each document in the document set, the keywords of that document to be the keywords extracted by the set word extraction means that are contained in the document, together with the keywords extracted by the document word extraction means.

More specifically, the documents of the entire Japanese document set are first decomposed into morphemes, and an index of specificity as a Japanese word is calculated for each morpheme in order to extract the keywords of the document set as a whole. The keywords of each document constituting the document set, on the other hand, are extracted by calculating an index of specificity relative to the whole document set. In addition, when a keyword of the document set as a whole is contained in a document, that keyword is also taken as a keyword of the document.

(Keyword extraction device and keyword extraction system)
FIG. 1 shows an example of the hardware configuration of the keyword extraction device according to this embodiment. The keyword extraction device 100 is, for example, a general-purpose server, workstation, or personal computer, and includes a bus 140 that interconnects a CPU 131, a memory 132, a display device (output means) such as a display screen 134 connected via a display adapter 133, a serial port 136 for connecting external input/output devices 135 such as a printer, fax machine, or scanner, a storage device (ROM or the like) 137, and input devices (input means) such as a keyboard 138 and a pointing device 139. Many other devices can also be connected, such as an audio interface 141 and a network interface 142 that includes a wireless LAN. Through the network interface 142, for example, the device can use electronic file transfer such as e-mail and FTP, and network services 143 such as the WWW.

FIG. 2 shows a schematic configuration diagram of the keyword extraction system. The keyword extraction system 300 is configured by connecting a client terminal 200 to the keyword extraction device 100 via a network 301 such as an intranet or the Internet. The client terminal 200 can be, for example, a general-purpose personal computer, and has the same hardware configuration as that shown in FIG. 1.

For the multiple documents that a user enters at the client terminal 200, the keyword extraction device 100 in the keyword extraction system 300 extracts the keywords specific to each document and the keywords of the document set as a whole, and displays the result on the client terminal 200.

Specifically, for example, the user enters multiple documents in the browser 201 of the client terminal 200, and the documents are sent to the keyword extraction device 100 via the network 301. The keyword extraction device 100 executes the keyword extraction processing described below on the document set and has the client terminal 200 display the extraction result, so that the user can check the extracted keywords in the browser 201.

FIG. 3 shows an example of a functional block diagram of the keyword extraction device and the client terminal according to this embodiment. As its processing unit 110, the keyword extraction device 100 includes: a document receiving unit 111 that receives documents as text information from the client terminal 200 via the network 301; a keyword determination unit 112 that determines the keywords for each of the documents received by the document receiving unit 111; a set word extraction unit 113 that extracts the keywords of the document set, consisting of the multiple documents, as a whole; a document word extraction unit 114 that, for each document in the document set, extracts as keywords the words characteristic of that document in comparison with the whole document set; and a keyword display unit (keyword display means) 115 that displays the keywords determined by the keyword determination unit 112. As its storage unit (DB unit) 120, the device includes a cost storage unit 121 that holds costs representing how specific each word is as a Japanese word. Because this embodiment extracts keywords common to the multiple documents, the documents should of course preferably share part of their topic (content).

The client terminal 200 receives the display result from the keyword display unit 115 and displays it in the browser 201. The client terminal 200 is not essential: the keyword extraction processing may instead be executed on documents entered from an input device of the keyword extraction device 100, such as the keyboard 138, and the display result of the keyword display unit 115 may be shown on a display device of the keyword extraction device 100, such as the display screen 134 (see FIG. 14, described later).

(Keyword extraction processing)
An outline of the keyword extraction processing executed by the keyword extraction device 100 (the keyword extraction method according to the present invention) is described next with reference to the flowchart of FIG. 4.

First, the user enters multiple documents as text in the document input screen 202 displayed on the display device of the client terminal 200, as shown in FIG. 5. The document receiving unit 111 receives these documents as text information (the text information of the document set) (S101).

The keyword determination unit 112 passes the document set to the set word extraction unit 113 and obtains the keywords of the document set as a whole as the return value (S102). The keyword determination unit 112 also passes the same document set to the document word extraction unit 114 and obtains the keywords of each document as the return value (S103).

The keyword determination unit 112 then determines the keywords for each document on the basis of the keywords obtained from the set word extraction unit 113 and the document word extraction unit 114 (S104). Finally, it passes these keywords, together with the original documents, to the keyword display unit 115, which displays the keyword extraction result (S105).

The keyword determination processing performed by the keyword determination unit 112 is described next with reference to the flowchart of FIG. 6. The keyword determination unit 112 first obtains from the document receiving unit 111 the text information of the multiple documents entered by the user (S201). Let n be the number of documents received.

Next, the text information of each document is passed to the set word extraction unit 113, which returns the keywords of the document set as a whole, word_{0,1} to word_{0,p} (S202, corresponding to S102 in FIG. 4). The subscript 0 indicates a keyword of the document set, and p is the number of document-set keywords, which is determined by the set word extraction unit 113 (described later).

The text information of the same documents is also passed to the document word extraction unit 114, which returns the keywords of each document, word_{i,1} to word_{i,q} (S203, corresponding to S103 in FIG. 4). The subscript i is the document number, i = 1, ..., n, and q is the number of keywords extracted from one document, which is determined by the document word extraction unit 114 (described later).

Then, for each document i (i = 1, ..., n), word_{i,1} to word_{i,q} are taken as the keywords specific to that document, and every word among word_{0,1} to word_{0,p} that occurs in document i is taken as a keyword of document i relating to the topic of the document set. That is, the two kinds of keywords together are determined to be the keywords of document i (S204, corresponding to S104 in FIG. 4). Each process is described in detail below.
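The determination step S204 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `determine_keywords`, the representation of a document as a list of morphemes, and the returned dictionary layout are all assumptions made for this example.

```python
def determine_keywords(documents, set_keywords, doc_keywords):
    """Combine set-level and document-level keywords per document (S204).

    documents    -- list of n documents, each a list of morphemes (assumed format)
    set_keywords -- word_{0,1..p}: keywords of the document set as a whole
    doc_keywords -- doc_keywords[i] holds word_{i,1..q} for document i
    """
    result = []
    for morphemes, specific in zip(documents, doc_keywords):
        present = set(morphemes)
        # A set-level keyword is kept only if it occurs in this document.
        shared = [w for w in set_keywords if w in present]
        result.append({"set": shared, "document": list(specific)})
    return result


docs = [["keyword", "extraction", "morpheme"], ["keyword", "display", "browser"]]
combined = determine_keywords(
    docs, ["keyword", "extraction"], [["morpheme"], ["browser"]]
)
print(combined)
```

Keeping the two kinds of keywords in separate lists is a design choice of this sketch; it would make it straightforward to display them distinguishably, as claim 4 requires.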

<Set word extraction processing>
The processing performed by the set word extraction unit 113 is described in detail with reference to the flowchart of FIG. 7. The set word extraction unit 113 first obtains the text information of the documents from the keyword determination unit 112 (S301).

Next, the following processing is executed for each document (S302 to S306). First, document i is decomposed into morphemes (S303). The decomposition into morphemes may use any known or new technique and is not particularly limited; for example, a known morphological analysis engine such as ChaSen or MeCab can be used as a library.

The number of occurrences of each morpheme in the document is then counted, and the appearance frequency of morpheme j in document i is denoted F(i, j) (S303). It is preferable to exclude from the keyword candidates, and from the frequency count, all morphemes other than those whose part of speech is likely to yield keywords, such as nouns.

Next, the TF value (term frequency) of each morpheme in that document is calculated by Equation 1 and added to the TF value tf(j) for the document set (S304). Here, the TF value is the number of times a word (morpheme j) appears in a document (document i) divided by the number of word variations (the number of morphemes) appearing in that document.

If tf(j) on the right-hand side of Equation 1 does not yet exist when the equation is evaluated, it is taken to be 0. If a value of tf(j) has already been obtained from another document, the new value is added to it.
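The accumulation in steps S302 to S306 can be sketched as below. Equation 1 itself is not reproduced in this text, so the update used here, tf(j) = tf(j) + F(i, j) / (number of distinct morphemes in document i), is an assumption based on the verbal definition of the TF value; the function name `accumulate_tf` and the list-of-morphemes document format are likewise hypothetical.

```python
from collections import Counter


def accumulate_tf(documents):
    """Return tf(j) summed over all documents of the set (S302 to S306)."""
    tf = {}
    for morphemes in documents:
        counts = Counter(morphemes)   # F(i, j) for this document
        variations = len(counts)      # assumed denominator: distinct morphemes
        for j, f_ij in counts.items():
            # tf(j) starts at 0 when no earlier document contributed to it.
            tf[j] = tf.get(j, 0.0) + f_ij / variations
    return tf


docs = [["a", "a", "b"], ["a", "c"]]
tf = accumulate_tf(docs)
print(tf)
```

If the intended denominator is the total number of morpheme tokens rather than the number of distinct morphemes, only the `variations` line would change.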

The above processing (S303 to S304) is performed for all documents (S305, S306). As a result, there are as many values of tf(j) as there are distinct morphemes in the document set, not counting the morphemes excluded from the keyword candidates.

Each resulting tf(j) is then multiplied by idf(j), a value specific to morpheme j (S307, Equation (2) below).
tf·idf(j) = tf(j) · idf(j) … (2)

Here, the IDF value (inverse document frequency) is a value determined by how rarely the morpheme appears in Japanese (an index of how specific it is).

The IDF values are preferably recorded in the storage unit 120 as a database keyed by morpheme. In this embodiment, they are recorded in the cost storage unit 121 of the storage unit 120 as the cost storage table 122 shown in FIG. 8.

The cost storage table 122 is a dictionary-style database that stores, with each morpheme as the key, the corresponding IDF value. The IDF value may be calculated by any known or new technique and is not particularly limited. For example, a set of highly general documents such as news articles can be prepared, and the total number of documents in the set divided by the number of those documents that contain morpheme j. Alternatively, using a full-text search engine on the Internet, the number of documents searchable by that engine can be divided by the number of documents returned by a search for morpheme j. In either case, the IDF value is obtained by taking the natural logarithm of the resulting quotient.
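The first option, deriving IDF values from a general reference corpus, can be sketched as below. The function name `build_cost_table`, the in-memory dictionary standing in for the cost storage table 122, and the toy corpus are assumptions for illustration only.

```python
import math


def build_cost_table(reference_docs):
    """Map each morpheme to ln(total documents / documents containing it)."""
    total = len(reference_docs)
    doc_freq = {}
    for morphemes in reference_docs:
        for j in set(morphemes):  # count each morpheme once per document
            doc_freq[j] = doc_freq.get(j, 0) + 1
    return {j: math.log(total / df) for j, df in doc_freq.items()}


# Toy stand-in for a corpus of general documents such as news articles.
corpus = [["the", "patent"], ["the", "news"], ["the"], ["the", "patent"]]
cost_table = build_cost_table(corpus)
print(cost_table)
```

A morpheme found in every reference document gets an IDF value of 0, while a morpheme found in few documents gets a large value, which matches the role of the value as an index of rarity.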

The TF-IDF value (tf·idf(j)) obtained from Equation (2) is an index of how specific morpheme j is, as a Japanese word, in the document. Finally, the set word extraction unit 113 outputs the p morphemes with the highest tf·idf(j) values (the return values word_{0,1} to word_{0,p}) as the keywords of the document set (S308). The value of p is arbitrary; for example, it can be n divided by a constant.
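Steps S307 and S308 can be sketched as below: each tf(j) is multiplied by the stored idf(j) as in Equation (2), and the p highest-scoring morphemes are returned. The function name `extract_set_keywords` is hypothetical, and skipping morphemes that are missing from the cost table is an assumption of this sketch, since the text does not say how such morphemes are handled.

```python
def extract_set_keywords(tf, idf, p):
    """Return the p morphemes with the highest tf(j) * idf(j) (S307, S308)."""
    # Assumption: morphemes without a stored IDF value are ignored.
    scores = {j: tf_j * idf[j] for j, tf_j in tf.items() if j in idf}
    ranked = sorted(scores, key=lambda j: scores[j], reverse=True)
    return ranked[:p]


tf = {"a": 1.5, "b": 0.5, "c": 0.5, "d": 3.0}
idf = {"a": 0.1, "b": 2.0, "c": 1.0}  # "d" has no cost table entry
top = extract_set_keywords(tf, idf, 2)
print(top)
```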

<Document word extraction processing>
The processing performed by the document word extraction unit 114 is described in detail with reference to the flowchart of FIG. 9. The document word extraction unit 114 first obtains the text information of the documents from the keyword determination unit 112 (S401).

Next, the following processing is executed for each document (S402 to S406). First, document i is decomposed into morphemes (S403). The decomposition into morphemes may be performed in the same way as in the set word extraction unit 113.

The number of occurrences of each morpheme in the document is then counted, and the appearance frequency of morpheme j in document i is denoted tf(i, j) (S403). It is preferable to exclude from the keyword candidates, and from the frequency count, all morphemes other than those whose part of speech is likely to yield keywords, such as nouns.

Further, for every morpheme j that is contained in the document and has not been excluded from the keyword candidates, the following equation (3) is applied (S404).
N(j) = N(j) + 1 ... (3)

If N(j) on the right-hand side of equation (3) does not yet exist when the equation is evaluated, its value is taken to be 0. Once the above processing (S403 to S404) has been performed for all documents, N(j) equals the number of documents containing morpheme j (S405, S406).
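The counting loop (S402 to S406) can be sketched in Python as below. The sketch assumes each document is already a list of candidate morphemes (that is, morphological analysis and part-of-speech filtering have been done beforehand); the function name `count_frequencies` is hypothetical.

```python
from collections import Counter

def count_frequencies(docs):
    """Return (tf, N): per-document counts and document frequencies.

    tf[i][j] is the number of times morpheme j occurs in document i.
    N[j] is the number of documents that contain morpheme j.
    """
    tf = []
    N = {}
    for morphemes in docs:
        counts = Counter(morphemes)      # tf(i, j) for this document
        tf.append(counts)
        for j in counts:                 # each distinct morpheme once
            # Equation (3); a missing N(j) is treated as 0.
            N[j] = N.get(j, 0) + 1
    return tf, N

docs = [
    ["printer", "toner", "toner"],
    ["printer", "paper"],
    ["printer", "scanner"],
]
tf, N = count_frequencies(docs)
```

Using `N.get(j, 0)` mirrors the rule that N(j) is taken to be 0 when it does not yet exist.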

Next, the IDF value of each morpheme j handled in the document set is obtained by the following equation (4) (S407).
idf(j) = log(n / N(j)) ... (4)

The value of idf(j) obtained by equation (4) represents how rarely morpheme j appears. Because it is computed here as the logarithm of (number of documents in the document set / number of documents containing morpheme j), it measures how rarely morpheme j appears within the document set. In the set word extraction processing (FIG. 7), by contrast, idf(j) measured how rarely the morpheme appears in Japanese in general.
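A small Python sketch of equation (4) follows. The base of the logarithm is not stated for equation (4); the natural logarithm is assumed here. The function name `set_idf` and the sample counts are hypothetical.

```python
import math

def set_idf(N, n):
    """Equation (4): idf(j) = log(n / N(j)) for every morpheme j.

    N: dict morpheme -> number of documents in the set containing it.
    n: number of documents in the set.
    """
    return {j: math.log(n / count) for j, count in N.items()}

# "printer" occurs in all three documents, "toner" in only one.
idf = set_idf({"printer": 3, "toner": 1}, n=3)
```

A morpheme that occurs in every document of the set gets idf(j) = 0, so it cannot become a document-specific keyword however often it appears, whereas a morpheme confined to one document gets the maximum value log(n).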

From tf(i, j) and idf(j) obtained by the above processing, the TF-IDF value of each morpheme j in each document i is calculated (S408, in the same way as equation (2) above). Because idf(j) here represents how rarely the morpheme appears within the document set, this TF-IDF value is high for morphemes that are distinctive to the document in comparison with the other documents in the set.

Finally, for each document i, the document word extraction unit 114 outputs the q morphemes j with the highest tf·idf(i, j) values (return values word_{i,1} to word_{i,q}) as the keywords of that document (S408). Here q is an arbitrary value; for example, it can be the number of words in the document divided by a constant, in which case q may differ from document to document.
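Putting steps S401 to S408 together, the document word extraction could look roughly like the Python sketch below. It assumes pre-tokenized documents, the natural logarithm, a TF-IDF value equal to the product tf(i, j) × idf(j), q equal to the document length divided by a constant (with a minimum of 1), and alphabetical tie-breaking. None of these details are fixed by the specification, and the name `document_keywords` is hypothetical.

```python
import math
from collections import Counter

def document_keywords(docs, divisor=4):
    """Return, for each document, its top-q morphemes by tf(i, j) * idf(j)."""
    n = len(docs)
    tf = [Counter(doc) for doc in docs]                   # tf(i, j)
    N = Counter(j for counts in tf for j in counts)       # equation (3)
    idf = {j: math.log(n / N[j]) for j in N}              # equation (4)
    result = []
    for doc, counts in zip(docs, tf):
        q = max(1, len(doc) // divisor)                   # q depends on i
        scores = {j: counts[j] * idf[j] for j in counts}
        ranked = sorted(scores, key=lambda j: (-scores[j], j))
        result.append(ranked[:q])
    return result

docs = [
    ["printer", "toner", "toner", "printer"],
    ["printer", "paper", "paper", "jam"],
    ["printer", "scanner", "scanner", "scanner"],
]
keywords = document_keywords(docs, divisor=4)
```

In this toy input "printer" occurs in every document, so its idf is 0 and it is never chosen as a document-specific keyword; each document instead yields the term that sets it apart from the others.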

<Keyword display processing>
The keyword display unit 115 displays the keywords extracted by the above processing (S201 to S204 of FIG. 6). FIG. 10 shows an example of the keyword display screen 203. In this example, for the documents entered by the user and the extracted keywords, the keywords of the three documents as a whole (the keywords of the document set) are shown in enclosed characters and the keywords specific to each document are underlined, so that the user can easily tell the two kinds of keyword apart. The two kinds may also be distinguished by color, for example blue for keywords of the whole document set and red for keywords specific to each document.
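The following Python sketch illustrates, in plain text, how the two kinds of keyword might be determined for one document and marked differently. It follows the rule stated in the abstract (set keywords that are contained in the document, plus the keywords extracted for the document itself). Square brackets stand in for enclosed characters and underscores for underlining; the function names and the choice to treat a word found in both lists as a set keyword are assumptions.

```python
def determine_keywords(doc, set_kw, doc_kw):
    """Split a document's keywords into set-level and document-specific ones.

    doc: list of morphemes of one document.
    set_kw: keywords extracted for the whole document set.
    doc_kw: keywords extracted for this document.
    """
    in_doc = set(doc)
    # Only set keywords that actually occur in this document are kept.
    shared = [w for w in set_kw if w in in_doc]
    # Assumption: a word in both lists is shown as a set keyword.
    specific = [w for w in doc_kw if w not in shared]
    return shared, specific

def render(doc, shared, specific):
    """Mark set keywords as [word] and document keywords as _word_."""
    out = []
    for m in doc:
        if m in shared:
            out.append(f"[{m}]")
        elif m in specific:
            out.append(f"_{m}_")
        else:
            out.append(m)
    return " ".join(out)

doc = ["printer", "toner", "is", "low"]
shared, specific = determine_keywords(doc, ["printer", "network"], ["toner"])
line = render(doc, shared, specific)
```

Here "network" is a keyword of the document set but does not occur in this document, so it is not shown for it.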

(Other embodiments)
Other embodiments of the keyword extraction device according to the present invention are described below. Description of points that are the same as in the above embodiment is omitted.

In the above embodiment, an example was described in which the user enters text on the document input screen 202. It is also preferable to allow the user to provide input as an electronic file.

FIG. 11 shows an example of a functional block diagram of the keyword extraction device and the client terminal according to this embodiment. To allow input by electronic file, text information must be extracted from the electronic file sent from the client terminal 200.

Therefore, as shown in FIG. 11, the keyword extraction device 100 according to this embodiment includes, in addition to the components of the above embodiment, a text extraction unit 116 that extracts text information from an electronic file. In this embodiment the document reception unit 111 receives an electronic file rather than text information. The document reception unit 111 passes the received electronic file to the text extraction unit 116, the text extraction unit 116 converts the electronic file into text information, and the document reception unit 111 receives that text information. The subsequent processing is the same as in the above embodiment.

FIG. 12 shows an example of the keyword display screen 203 when the user provides input as an electronic file. The example handles a file containing a plurality of slides as the document set; the keywords of the document set are shown in enclosed characters and the keywords specific to each document (slide) are underlined. Simply by looking at the keywords of an individual slide, the user can easily grasp both the topic of the slides as a whole and what that particular slide covers.

It is also preferable to use a keyword extracted by the keyword extraction device as a trigger for searching for files that have that keyword. For example, a single document can be regarded as a document set made up of its pages. When the search results are displayed, as shown in FIG. 13, showing the pages of the document that have the keyword, together with the other keywords of those pages, makes the content of each result easy to grasp. In the example of FIG. 13, the keywords of the whole document are displayed large, the keywords specific to the page are displayed small, and the search word is shown in enclosed characters, so that they are easily distinguished. The user can thus readily see what role the search word plays in the document, what kind of document it is, and what is written on the page to which the search keyword belongs.
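A minimal Python sketch of this page-level lookup is given below. It assumes the keywords of each page have already been determined and are held as one list per page; the function name `search_pages` and the data layout are hypothetical and not taken from the specification.

```python
def search_pages(page_keywords, query):
    """Return (page number, other keywords) for each page that has the query.

    page_keywords: list indexed by page, each entry a list of that page's
    keywords (set-level and page-specific keywords together).
    """
    hits = []
    for page_no, keywords in enumerate(page_keywords, start=1):
        if query in keywords:
            # Show the page's remaining keywords alongside the hit.
            others = [k for k in keywords if k != query]
            hits.append((page_no, others))
    return hits

page_keywords = [
    ["printer", "toner"],
    ["printer", "paper", "jam"],
    ["scanner", "resolution"],
]
hits = search_pages(page_keywords, "printer")
```

Each hit pairs a page number with the other keywords of that page, which is the information FIG. 13 is described as presenting.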

As shown in FIG. 14, it is also preferable to enter the text of the documents from the input device of the keyword extraction device 100, without going through the client terminal 200, and to display the keyword extraction results on the output device of the keyword extraction device 100.

The keyword extraction device 100 according to this embodiment has a document input unit 117 in place of the document reception unit 111. The document input unit 117 causes the display device of the keyword extraction device 100 to display the document input screen (see FIG. 5) on which the user enters documents, and passes the documents entered by the user to the keyword determination unit 112 as text information.

The keyword display unit 115 displays the keyword display screen (see FIG. 10) on the display device. In the example shown in FIG. 14, when the user enters documents as an electronic file, the device may further include the text extraction unit 116 as described above.

The keyword extraction processing performed by the keyword extraction device described above can be realized by a program (keyword extraction program) that causes a computer to execute the processing. The keyword extraction program is preferably provided, for example, by download over the Internet and installed on a computer. The invention also applies to a recording medium on which the keyword extraction program is recorded so as to be executable by a computer.

The embodiments described above are preferred examples of the present invention, but the invention is not limited to them, and various modifications can be made without departing from the gist of the invention.

DESCRIPTION OF SYMBOLS
100 Keyword extraction device
110 Processing unit
111 Document reception unit
112 Keyword determination unit
113 Set word extraction unit
114 Document word extraction unit
115 Keyword display unit
116 Text extraction unit
117 Document input unit
120 Storage unit
121 Cost storage unit
122 Cost storage table
131 CPU
132 Memory
133 Display adapter
134 Display screen
135 External input/output device
136 Serial port
137 Storage device
138 Keyboard
139 Pointing device
140 Bus
141 Audio interface
142 Network interface
143 Network service
200 Client terminal
201 Browser
202 Document input screen
203 Keyword display screen
300 Keyword extraction system
301 Network

JP 2006-293616 A

Claims

1. A keyword extraction device comprising:
a set word extraction means for extracting keywords of a document set as a whole from text information of the document set, the document set being based on a plurality of documents;
a document word extraction means for extracting, from the text information of the document set, keywords of each document in the document set, taking into account appearance frequency across the whole document set; and
a keyword determination means for determining, for each document in the document set, the keywords of that document to be those keywords extracted by the set word extraction means that are contained in the document, together with the keywords extracted by the document word extraction means.

2. The keyword extraction device according to claim 1, wherein the set word extraction means calculates the specificity of each morpheme constituting each document by referring to a cost storage table that stores a correspondence between morphemes and their specificity as Japanese, and extracts the keywords of the document set as a whole based on that specificity.

3. The document word extraction means extracts the keywords of each document based on the appearance frequency of a morpheme in the document and the inverse document frequency of the morpheme across the whole document set. The keyword extraction device in any one.

4. A keyword display means for displaying the keywords extracted by the set word extraction means and the keywords extracted by the document word extraction means so that they can be distinguished from each other. The keyword extraction device described in 1.

5. The text information of the document set is based on a plurality of documents input by a user or a document extracted from an electronic file input by a user. Keyword extraction device.

6. The keyword extracting apparatus according to claim 5, further comprising: an input unit that inputs the plurality of documents or the electronic file; and an output unit that displays the extracted keyword.

7. A keyword extraction method comprising:
a set word extraction process of extracting keywords of a document set as a whole from text information of the document set, the document set being based on a plurality of documents;
a document word extraction process of extracting, from the text information of the document set, keywords of each document in the document set, taking into account appearance frequency across the whole document set; and
a keyword determination process of determining, for each document in the document set, the keywords of that document to be those keywords extracted by the set word extraction process that are contained in the document, together with the keywords extracted by the document word extraction process.

8. A keyword extraction program that causes a computer to execute:
a set word extraction process of extracting keywords of a document set as a whole from text information of the document set, the document set being based on a plurality of documents;
a document word extraction process of extracting, from the text information of the document set, keywords of each document in the document set, taking into account appearance frequency across the whole document set; and
a keyword determination process of determining, for each document in the document set, the keywords of that document to be those keywords extracted by the set word extraction process that are contained in the document, together with the keywords extracted by the document word extraction process.

9. A keyword extraction system in which a keyword extraction device and a client terminal are connected via a network, wherein
the keyword extraction device comprises:
a set word extraction means for extracting keywords of a document set as a whole from text information of the document set, the document set being based on a plurality of documents;
a document word extraction means for extracting, from the text information of the document set, keywords of each document in the document set, taking into account appearance frequency across the whole document set; and
a keyword determination means for determining, for each document in the document set, the keywords of that document to be those keywords extracted by the set word extraction means that are contained in the document, together with the keywords extracted by the document word extraction means; and
the client terminal comprises:
an input means for inputting the plurality of documents; and
an output means for displaying the keywords determined by the keyword determination means.