JP5308918B2

JP5308918B2 - Keyword extraction method, keyword extraction device, and keyword extraction program

Info

Publication number: JP5308918B2
Application number: JP2009130604A
Authority: JP
Inventors: 浩之戸田; 由美子松浦; 良治片岡
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2009-05-29
Filing date: 2009-05-29
Publication date: 2013-10-09
Anticipated expiration: 2029-05-29
Also published as: JP2010277415A

Description

The present invention relates to a technique for extracting, from an electronic document stored on a computer, keywords that appropriately represent the content of that document.

Web search engines, which collect electronic documents on the Web and make them searchable, have become indispensable tools for obtaining information on the Internet. In recent years, however, the number of documents returned by Web search engines has grown steadily, making it difficult for users to find the documents they need among the search results.

For this reason, various methods have been proposed that analyze an electronic document and extract words (keywords) expressing its content.

One is a technique called “named entity extraction”, which analyzes a document to extract terms and assigns each extracted term a type such as person name, organization name, or place name. This makes it possible to extract keywords by type and to use them for document analysis and search. This technique is described in Non-Patent Document 1.

Another technique, called “noun phrase extraction”, extracts as keywords the morpheme sequences that form nouns or noun phrases, based on patterns built from part-of-speech information and on the distribution of morphemes appearing nearby. Unlike named entity extraction, it merely extracts keywords, but it may extract keywords that named entity extraction cannot. This technique is described in Non-Patent Document 2.

Non-Patent Document 1: David Nadeau, Satoshi Sekine, “A survey of named entity recognition and classification”, Journal of Linguisticae Investigationes 30-1, 2007.
Non-Patent Document 2: Megumi Ishii, Kazunari Watanabe, “Proposal and Evaluation of Search Interface Using Classification System and Noun Phrase”, Information Processing Society of Japan Research Report, HCI, Vol. 2000, No. 12.
Non-Patent Document 3: Marius Pasca, “Acquisition of Categorized Named Entities for Web Search”, Proceedings of the 13th ACM Conference on Information and Knowledge Management (CIKM-04), 2004, pp. 137-145.

However, the named entity extraction of Non-Patent Document 1 requires a large amount of manually created training data for each keyword type, so extracting keywords from a wide range of fields beyond person, place, and organization names may be difficult.

The noun phrase extraction of Non-Patent Document 2 basically extracts noun phrases exhaustively based on patterns, so it may extract keywords that are split at unnatural positions or, conversely, joined unnaturally. It may also be difficult to classify the keywords obtained in this way by type.

The present invention has been made to solve these problems, and its object is to appropriately extract keywords that express the content of an electronic document without using manually created training data.

To solve this problem, the present invention exploits two observations: a search condition (query) input to a search engine contains keywords delimited in units that a person considers appropriate, and the titles and summary sentences of the search results obtained by submitting such a keyword to the search engine are suitable examples of how that keyword is used.

One aspect of the present invention is a method for extracting keywords contained in an electronic document by applying a model generated using a search engine log. The method comprises: a first step in which list generation means analyzes a query log acquired from the search engine, extracts queries that satisfy a certain condition, and generates a list of keywords; a second step in which collection means acquires, from the search engine, search results for the keywords in the list and collects examples in which the keywords are used in the titles and summary sentences of the search results; and a third step in which model generation means generates the model based on the examples collected in the second step.

Another aspect of the present invention is a device that extracts keywords contained in an electronic document by applying a model generated using a search engine log. The device comprises: list generation means that analyzes a query log acquired from the search engine, extracts queries that satisfy a certain condition, and generates a list of keywords; collection means that acquires, from the search engine, search results for the keywords in the list and collects examples in which the keywords are used in the titles and summary sentences of the search results; and model generation means that generates the model based on the examples collected by the collection means.

The present invention may also be provided in the form of a program that causes a computer to function as the keyword extraction device.

According to the present invention, keywords that express the content of an electronic document can be appropriately extracted without using manually created training data.

FIG. 1 is a block diagram of the keyword extraction device according to an embodiment of the present invention.
FIG. 2 is the processing flow for generating the keyword extraction model in the same embodiment.
FIG. 3 is an example of classifying usage examples of keywords in the same embodiment.

Embodiments of the present invention are described below. In the present invention, an extraction model for each keyword is generated from a set of keywords obtained from a search engine's query log and the set of titles and summary sentences of the search results for each keyword.

This extraction model represents patterns, such as morphemes and parts of speech, that each keyword and its neighboring words are assumed to typically contain. Applying this extraction model to an arbitrary electronic document extracts appropriate keywords from that document.

<Example of device configuration>
As shown in FIG. 1, the keyword extraction device 1 according to an embodiment of the present invention is communicably connected to a search engine 2 via a network.

The search engine 2 is an ordinary Web search engine that searches electronic documents (Web pages) published on the Web. It includes a query log 3, which records queries received from user terminals (not shown) in time series, and search execution means 4, which searches for electronic documents matching a query and returns them to the user terminal.

The keyword extraction device 1 has the hardware resources of an ordinary computer: a CPU (Central Processing Unit), memory (RAM), a hard disk drive, a communication interface, and so on. Through the cooperation of these hardware resources with software, the keyword extraction device 1 implements keyword list generation means 5, keyword classification means 6, example collection means 7, model generation means 8, a keyword extraction model database 9, and keyword extraction means 10.

Of these, means 5 to 8 carry out the model generation process, which generates the keyword extraction model. Specifically, the keyword list generation means 5 analyzes the query log 3, acquires queries that satisfy a certain condition as keywords, and generates a list of those keywords.

The keyword classification means 6 analyzes the query log 3 and a large body of language data (a corpus) registered in advance in the keyword extraction device 1, and classifies each keyword in the list by type.

The example collection means 7 acquires the list of keywords classified by type, accesses the search execution means 4 with each keyword, and acquires usage examples of each keyword from the titles and summary sentences of the search results.

The model generation means 8 generates, based on the usage examples, an extraction model for extracting each keyword. The generated extraction model is stored in the keyword extraction model database 9, which is assumed to be built on the hard disk drive.

The keyword extraction means 10 applies the extraction model stored in the keyword extraction model database 9 to an arbitrary electronic document and carries out the keyword extraction process, which extracts keywords expressing the content of that document. The specific content of each process is described below.

<Model generation process>
First, the model generation process is described in detail with reference to the processing flow of FIG. 2. This model generation process is the main process of the keyword extraction device 1.

Here, the keyword list generation means 5 accesses the query log 3 via the communication interface and acquires it.

The query log 3 holds a log of the queries (search keywords) that user terminals have submitted to the search engine 2 in the past. The log records, in time series, entries such as the combination of each input query and the date and time it was input. An example of the data stored in the query log 3 is shown in Table 1.

S01: The keyword list generation means 5 analyzes the query log 3, extracts queries that satisfy a certain condition as keywords, and generates a keyword list. Examples of such conditions include “the query is used as a search condition at or above a certain frequency” and “at least a certain number of documents exist in the search results”. The conditions to use may be set in the program in advance according to the specification.
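The frequency condition in S01 can be sketched as follows. This is a minimal illustration, not the implementation described in the specification: the log format (a list of timestamp and query pairs), the function name `build_keyword_list`, the threshold value, and the sample log entries are all assumptions made for the example.

```python
from collections import Counter


def build_keyword_list(query_log, min_frequency=2):
    """Return queries seen at least `min_frequency` times, most frequent first.

    `query_log` is assumed to be an iterable of (timestamp, query) pairs.
    """
    counts = Counter(
        query.strip() for _timestamp, query in query_log if query.strip()
    )
    return [query for query, n in counts.most_common() if n >= min_frequency]


# Made-up log entries, for illustration only.
query_log = [
    ("2009-05-01 10:00:00", "Kyoto Cup"),
    ("2009-05-01 10:02:10", "weather"),
    ("2009-05-01 10:05:42", "Kyoto Cup"),
    ("2009-05-01 10:07:03", "Royal Crown"),
    ("2009-05-01 10:09:30", "Royal Crown"),
    ("2009-05-01 10:11:15", "Kyoto Cup"),
]
keywords = build_keyword_list(query_log, min_frequency=2)
```

The second example condition (a minimum number of documents in the search results) would need a call to the search engine for each candidate and is omitted here.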

The keyword list generated in this way and the query log 3 are transferred to the keyword classification means 6. The generated keyword list may also be stored in the memory or the like.

S02: The keyword classification means 6 analyzes the query log 3 and classifies each keyword in the keyword list transferred in S01 by type (category).

Possible classification methods include manually classifying keywords into predetermined types, or, as in Non-Patent Document 3, discovering keywords of a specific type based on examples of manually classified keywords. A large body of computer-searchable language data, that is, a corpus, may also be registered in the keyword extraction device 1 in advance and analyzed together with the query log 3.
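One simple way to discover keywords of a given type from a few manually classified seed keywords is sketched below. It is a heuristic chosen purely for illustration and is not the method of Non-Patent Document 3: queries that contain a seed keyword yield a surrounding context (prefix and suffix), and other queries sharing that context are assigned the same type. The function name, the seed dictionary, and the sample queries are hypothetical.

```python
from collections import defaultdict


def classify_by_query_context(queries, seeds):
    """Map keywords to a category when they share a query context with a seed."""
    # Step 1: learn (prefix, suffix) contexts from queries containing a seed.
    templates = defaultdict(set)
    for category, seed_keywords in seeds.items():
        for seed in seed_keywords:
            for query in queries:
                if seed in query and query != seed:
                    prefix, suffix = query.split(seed, 1)
                    templates[category].add((prefix, suffix))

    # Step 2: any query with the same context yields a candidate keyword.
    classified = {}
    for category, category_templates in templates.items():
        for prefix, suffix in sorted(category_templates):
            for query in queries:
                if query.startswith(prefix) and query.endswith(suffix):
                    candidate = query[len(prefix):len(query) - len(suffix)]
                    if candidate:
                        classified.setdefault(candidate, category)
    return classified


# Made-up queries and a single manually classified seed keyword.
queries = ["Kyoto Cup results", "Royal Crown results", "Kyoto Cup", "weather tomorrow"]
seeds = {"race name": ["Kyoto Cup"]}
classified = classify_by_query_context(queries, seeds)
```

A practical system would need to filter out overly generic contexts, which this sketch does not attempt.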

The classified list is transferred to the example collection means 7 via the keyword list generation means 5.

S03: When the keyword list for each type is transferred in S02, the example collection means 7 accesses the search execution means 4 with each keyword and acquires usage examples of each keyword from the titles and summary sentences of the search results.

Specifically, when the keyword list for each type is transferred, the example collection means 7 sends the keywords in the list to the search execution means 4 via the communication interface.

On receiving a keyword, the search execution means 4 returns to the keyword extraction device 1, via the communication interface, the title and URL of each document in the search results for that keyword, together with the portion of the document that contains the keyword as its summary sentence.

At this point, the search execution means 4 may instead perform a new document search with the keyword received from the example collection means 7 and return the title, URL, and summary sentence of each document in that search result.

The titles, URLs, and summary sentences returned to the keyword extraction device 1 are transferred to the example collection means 7, which acquires usage examples of the keywords from the transferred titles and summary sentences.
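The collection in S03 can be sketched as follows, under stated assumptions. The search engine is represented by a hypothetical callable that returns a list of dictionaries with `title`, `url`, and `summary` keys; no real search API is implied. `fake_search` and its results are made-up stand-ins so that the sketch runs without network access.

```python
def collect_usage_examples(keyword, search):
    """Return the title and summary strings that actually contain the keyword.

    `search` is assumed to be a callable taking a keyword and returning
    dictionaries with "title", "url" and "summary" keys.
    """
    examples = []
    for result in search(keyword):
        for field in ("title", "summary"):
            text = result.get(field, "")
            if keyword in text:
                examples.append(text)
    return examples


def fake_search(keyword):
    # Stand-in for the search execution means 4; the results are invented.
    return [
        {
            "title": "40th Kyoto Cup (G2) results",
            "url": "http://example.com/a",
            "summary": "Full results of the 40th Kyoto Cup (G2) held on Sunday.",
        },
        {
            "title": "Race calendar",
            "url": "http://example.com/b",
            "summary": "Upcoming races this month.",
        },
    ]


examples = collect_usage_examples("Kyoto Cup", fake_search)
```

Only texts that contain the keyword are kept, since a title or summary without the keyword carries no usage context for it.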

Note that early Web search engines used the opening part of a document as its summary sentence, but in the late 1990s Google (registered trademark) began presenting the text surrounding the search keywords, and this approach is now mainstream.

S04: The example collection means 7 classifies the acquired usage examples according to the type of their keyword.

FIG. 3 shows an example of this classification. Here, the usage examples of the keywords “○○大章典”, “○○王冠”, and “京都○○杯” are each classified under the type “race name”. The classified usage examples are transferred to the model generation means 8.
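The grouping in S04 amounts to attaching each usage example to the type of its keyword. A minimal sketch follows; the function name, the data layout (dictionaries keyed by keyword), and the English sample strings are assumptions for illustration and do not reproduce FIG. 3.

```python
from collections import defaultdict


def group_examples_by_type(examples_by_keyword, keyword_types):
    """Group (keyword, example text) pairs under the keyword's type."""
    grouped = defaultdict(list)
    for keyword, examples in examples_by_keyword.items():
        category = keyword_types.get(keyword)
        if category is None:
            continue  # keyword was not classified in S02; skip its examples
        for text in examples:
            grouped[category].append((keyword, text))
    return dict(grouped)


# Made-up data for illustration.
examples_by_keyword = {
    "Kyoto Spring Cup": ["40th Kyoto Spring Cup (G2) results"],
    "Royal Crown": ["Royal Crown (G2) entries announced"],
    "weather": ["Weather forecast for Sunday"],
}
keyword_types = {"Kyoto Spring Cup": "race name", "Royal Crown": "race name"}
grouped = group_examples_by_type(examples_by_keyword, keyword_types)
```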

S05: When the usage examples classified in S04 are transferred, the model generation means 8 generates, for each type, a model for extracting keywords. Features used to generate the model include, for example, the following:
1. The morphemes that make up the keyword
2. The morphemes near the keyword
3. The parts of speech of the morphemes that make up the keyword
4. The parts of speech of the morphemes near the keyword
5. The morphemes that appear in the context in which the keyword appears
For example, in FIG. 3, applying feature 1 (the morphemes that make up the keyword) to the keyword “京都○○杯” generates models such as “○○○杯”, which ends with the morpheme “杯” (“cup”), and “(place name)○○○”, which begins with a place name such as “京都” (Kyoto).

Similarly, applying feature 2 (the morphemes near the keyword) to the keyword “○○大章典” focuses on the morphemes near it (here, “第40回” (“40th”) and “(G2)”) and generates models such as “第○回○○○” and “○○○(G2)”. The generated models are stored in the keyword extraction model database 9.
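The five feature types listed for S05 can be computed from a morphologically analyzed usage example roughly as follows. This sketch assumes the text has already been split into (surface, part-of-speech) pairs by some morphological analyzer; the function name, the window size, the tag names, and the English sample tokens are all hypothetical.

```python
def extract_features(tokens, start, end, window=2):
    """Compute features for the keyword occupying tokens[start:end].

    `tokens` is a list of (surface, part_of_speech) pairs.
    """
    inside = tokens[start:end]
    left = tokens[max(0, start - window):start]
    right = tokens[end:end + window]
    neighbors = left + right
    return {
        "keyword_morphemes": [s for s, _ in inside],      # feature 1
        "neighbor_morphemes": [s for s, _ in neighbors],  # feature 2
        "keyword_pos": [p for _, p in inside],            # feature 3
        "neighbor_pos": [p for _, p in neighbors],        # feature 4
        # feature 5: every morpheme in the example outside the keyword
        "context_morphemes": [s for s, _ in tokens[:start] + tokens[end:]],
    }


# Hand-tagged tokens for a made-up title "40th Kyoto Spring Cup (G2)".
tokens = [
    ("40th", "NUM"), ("Kyoto", "PROPN"), ("Spring", "NOUN"), ("Cup", "NOUN"),
    ("(", "PUNCT"), ("G2", "NOUN"), (")", "PUNCT"),
]
features = extract_features(tokens, start=1, end=4)
```

Feature dictionaries of this kind could be generalized into patterns or passed to a learning algorithm; how the models are actually derived from them is left open here.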

<Keyword extraction process>
The keyword extraction means 10 extracts keywords from an arbitrary electronic document using the keyword extraction models stored in the keyword extraction model database 9.

A specific example of the extraction process is to search the entire document with a string search technique such as pattern matching and to extract, as keywords, the strings in the document that match the model.
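Extraction by pattern matching can be sketched with regular expressions. The two patterns below are hand-written English analogues of models such as “○○○杯” and “第○回○○○(G2)”, standing in for models that would be read from the keyword extraction model database 9. The `MODELS` table, the function name, and the sample sentence are assumptions for illustration.

```python
import re

# Hypothetical per-type models expressed as regular expressions.
MODELS = {
    "race name": [
        # Capitalized words ending in "Cup" (analogue of the suffix model).
        re.compile(r"((?:[A-Z][a-z]+ )+Cup)\b"),
        # Capitalized words between an ordinal and a grade such as "(G2)"
        # (analogue of the neighboring-morpheme models).
        re.compile(
            r"\d+(?:st|nd|rd|th) ((?:[A-Z][a-z]+ )+[A-Z][a-z]+)(?= \(G\d\))"
        ),
    ],
}


def extract_keywords(text, models):
    """Return (keyword, type) pairs for every model match, without duplicates."""
    found = []
    for category, patterns in models.items():
        for pattern in patterns:
            for match in pattern.finditer(text):
                item = (match.group(1), category)
                if item not in found:
                    found.append(item)
    return found


text = (
    "The 40th Kyoto Spring Cup (G2) will be run on Sunday, "
    "a week after the 12th Royal Crown (G2)."
)
extracted = extract_keywords(text, MODELS)
```

Because each model belongs to a type, every match is returned together with its type, which corresponds to extracting keywords by type.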

The extracted keywords may be output to output means such as a display or saved to storage means such as a database. The model generation means 8, the keyword extraction model database 9, and the keyword extraction means 10 may be realized using various learning algorithms such as support vector machines (SVM), conditional random fields (CRF), and decision trees.

In this way, the keyword extraction device 1 can extract keywords in natural units, by type, without manually created training data, based on the keywords submitted to the search engine and the titles and summary sentences of the search results that the search engine outputs.

Because the extracted keywords conform to an extraction model generated from search result titles and summary sentences, they are considered appropriate units of information and can be used, for example, to analyze sets of text.

Moreover, because the extracted keywords are considered to express the content of a document appropriately, using them as the document's search index can be expected to improve search accuracy.

The present invention can also be configured as a program that causes a computer to function as some or all of the means 5 to 10 of the keyword extraction device 1. In that case, the computer executes all or some of the processing steps (S01 to S05) of the embodiment.

This program can be provided over a network, for example through a Web site or e-mail. The program can also be stored on a recording medium such as a CD-ROM, DVD-ROM, CD-R, CD-RW, DVD-R, DVD-RW, MO, HDD, or Blu-ray Disc (registered trademark) for storage and distribution. The recording medium is read using a recording medium drive (such as an optical drive), and because the program code itself realizes the processing of the embodiment, the recording medium also constitutes the present invention.

DESCRIPTION OF SYMBOLS
1 ... Keyword extraction device
2 ... Search engine
3 ... Query log
4 ... Search execution means
5 ... Keyword list generation means
6 ... Keyword classification means
7 ... Example collection means
8 ... Model generation means
9 ... Keyword extraction model database
10 ... Keyword extraction means

Claims

A method for extracting keywords contained in an electronic document by applying a model generated using a search engine log,
A first step of generating a list of keywords by analyzing a query log acquired from the search engine and extracting a query that satisfies a certain condition;
A second step of collecting a search result of a keyword in the list from the search engine and collecting an example in which the keyword is used in a title and a summary sentence of the search result;
A third step in which the model generating means generates the model based on the examples collected in the second step;
A fourth step in which the classifying means classifies the keywords in the list for each type, and extracts the keywords for each type;
A keyword extraction method characterized by comprising:

A device that extracts a keyword included in an electronic document by applying a model generated using a log of a search engine,
A list generation unit that analyzes a query log acquired from the search engine, extracts a query that satisfies a certain condition, and generates a keyword list;
A collecting means for acquiring a search result of the keyword in the list from the search engine and collecting an example in which the keyword is used in a title and a summary sentence of the search result;
Model generation means for generating the model based on the examples collected by the collection means;
Classifying means for classifying keywords in the list for each type, and extracting keywords for each type;
A keyword extracting device comprising:

A keyword extraction program for causing a computer to function as the keyword extraction device according to claim 2.