JP2010040020A

JP2010040020A - Keyword extraction device, method, and program

Info

Publication number: JP2010040020A
Application number: JP2008205896A
Authority: JP
Inventors: Takeshi Masuyama; 毅司増山
Original assignee: Yahoo Japan Corp
Current assignee: Yahoo Japan Corp
Priority date: 2008-08-08
Filing date: 2008-08-08
Publication date: 2010-02-18
Anticipated expiration: 2028-08-08
Also published as: JP4934115B2

Abstract

PROBLEM TO BE SOLVED: To provide a keyword extraction server, method, and program that can cope with free-form text (including multilingual text), enable high-speed processing, and extract only words effective for classification. SOLUTION: The keyword extraction device is provided with: a division part 11 which divides an input text at punctuation; a morpheme extraction part 12 which extracts morphemes from the divided parts; a noun extraction part 13 which determines the part of speech of each extracted morpheme and extracts the morphemes determined to be nouns; an arithmetic operation part 14 which calculates, for each extracted noun, the score of the noun as a keyword on the basis of the number of characters of the noun, the appearance frequency of the noun in the text, and the ratio of the total number of sentences in the text to the appearance frequency indicating over how many sentences the noun appears; and a decision part 15 which decides whether or not the noun is to be treated as a keyword on the basis of the score resulting from the arithmetic operation. COPYRIGHT: (C)2010,JPO&INPIT

Description

The present invention relates to a keyword extraction apparatus, method, and program. More particularly, it relates to a keyword extraction apparatus, method, and program capable of efficiently extracting keywords that are effective for classifying various document data, such as data stored in a database or data acquired via the Internet.

With the spread of personal computers and the Internet and the development of electronic filing technology, an environment in which large amounts of digitized document data can be used is taking shape. At the same time, a need has arisen for systems that automatically extract important keywords from this vast amount of information.

Various methods have been proposed for classifying document data stored in a database or acquired via the Internet. For example, a widely known technique selects a plurality of characteristic keywords contained in the document data, analyzes their distribution and positions of appearance, and performs classification based on the analysis results.

What matters when classifying document data is the selection of "words effective for classification", that is, keywords. Conventionally known keyword selection methods can be broadly divided into the following three methods (1) to (3).

(1) Keyword selection method using dictionary data
In the keyword selection method using dictionary data, a group of words considered effective for classifying document data is registered in advance as dictionary data, and the registered words are used as keywords. Methods that use dictionary data as keywords in this way are described in, for example, Patent Document 1 and Patent Document 2.

(2) Keyword selection method based on grammatical analysis of the documents contained in the document data to be classified
In this method, the documents contained in the document data to be classified are subjected to morphological analysis based on their grammar, or to analysis based on custom grammatical rules, and the words extracted as a result are used as keywords or keyword candidates. This method is described in, for example, Patent Document 3 and Patent Document 4.

(3) Keyword selection method based on exhaustive comparison of the document data to be classified
In this method, the various document data to be classified are all compared with one another, the appearance frequencies of various words and data on their combinations are analyzed, and keywords or keyword candidates are extracted based on the analysis results. This method is described in, for example, Patent Document 5.

As described above, various keyword extraction methods have already been proposed. However, "(1) Keyword selection method using dictionary data", for example, requires specialized knowledge and time to create the prerequisite dictionary, and the resulting dictionary is not sufficiently effective for text in fields it did not anticipate. For example, the keywords available for classifying documents in a specialized field such as medicine or finance may be insufficient, and newly coined words cannot be handled.

In addition, "(2) Keyword selection method based on grammatical analysis of the documents contained in the document data to be classified" requires specialized knowledge to turn grammatical rules into routine processing, and is not sufficiently effective for unanticipated languages or for free-form text that is not grammatically well formed.

Furthermore, in "(3) Keyword selection method based on exhaustive comparison of the document data to be classified", as the amount of document data to be processed grows, the time required for the comparison increases exponentially and processing efficiency falls. There is also the problem that phrases other than words effective for classification are extracted, such as "desu" and "masu", which appear frequently in Japanese documents.
One algorithm that attempts to solve this problem is tf·idf (Term Frequency - Inverse Document Frequency, an algorithm for extracting characteristic words, that is, words regarded as important, in a text), which is described later.
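For reference, the general idea behind tf·idf can be sketched in Python as follows. This is one common textbook formulation (raw term count multiplied by the logarithm of the number of documents over the document frequency), not a formula given in this publication; the function name `tf_idf` and the sample documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document.

    docs: list of token lists. Weight = tf(term, doc) * log(N / df(term)),
    a common textbook variant chosen here for illustration only.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

docs = [
    ["keyword", "extraction", "is", "fast"],
    ["keyword", "search", "is", "slow"],
    ["weather", "is", "fine"],
]
weights = tf_idf(docs)
```

In this toy corpus a term that occurs in every document ("is") receives weight zero, while a term unique to one document receives the largest weight there.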

Patent Document 1: JP 2002-215647 A
Patent Document 2: JP 2002-108888 A
Patent Document 3: JP 2003-36261 A
Patent Document 4: JP 2002-245061 A
Patent Document 5: JP 2001-22752 A

The present invention has been made in view of this situation, and its object is to provide a keyword extraction apparatus, method, and program that solve the problems of the conventional keyword extraction methods described above.

Specifically, the object is to provide a keyword extraction server, method, and program that do not use prerequisite knowledge (dictionaries, grammar data), can handle free-form text (including multilingual text), allow high-speed processing, and can extract only words effective for classification.

(1) A keyword extraction device comprising:
dividing means for dividing an input text at punctuation marks;
morpheme extraction means for extracting morphemes from the divided parts produced by the dividing means;
noun extraction means for determining the part of speech of each morpheme extracted by the morpheme extraction means and extracting the morphemes determined to be nouns;
calculation means for calculating, for each noun extracted by the noun extraction means, a score of the noun as a keyword based on the number of characters of the noun, the appearance frequency of the noun in the text, and the ratio of the total number of sentences in the text to the appearance frequency indicating over how many of the sentences the noun appears; and
determination means for determining, based on the score resulting from the calculation, whether or not the noun is to be a keyword.

According to the invention of (1), nouns are extracted from the input text, and each noun is judged as a keyword or not based on the number of characters of the noun, the appearance frequency of the noun in the text, and the ratio of the total number of sentences in the text to the appearance frequency indicating over how many sentences (in the text) the noun appears.

In this way, the invention of (1) does not use prerequisite knowledge (dictionaries, grammar data), can handle free-form text (including multilingual text), allows high-speed processing, and can extract only words effective for classification.

That is, it becomes possible to extract at high speed only those keywords that are important clues for clarifying a problem or understanding content and that are effective as search terms in information retrieval.

(2) The keyword extraction device according to (1), wherein the calculation means uses as the score a value obtained by multiplying together: the number of characters of the noun, or the logarithm of a value near the number of characters of the noun; the appearance frequency of the noun in the text; and the ratio of the total number of sentences in the text to the appearance frequency indicating over how many of the sentences the noun appears, or the logarithm of a value near that ratio.

According to the invention of (2), each noun extracted in (1) is judged as a keyword or not based on the product of: the number of characters of the noun, or the logarithm of a value near that number; the appearance frequency of the noun in the text; and the ratio of the total number of sentences in the text to the appearance frequency indicating over how many of the sentences the noun appears, or the logarithm of a value near that ratio.

In this way, according to the invention of (2), keywords can be extracted using a small number of simple operations: a character count, logarithms, the appearance frequency of a noun in the text, the total number of sentences, and a ratio. It is therefore possible to provide a keyword extraction server that does not use prerequisite knowledge (dictionaries, grammar data), can handle free-form text (including multilingual text), and allows high-speed processing.

(3) The keyword extraction device according to (1) or (2), further comprising:
a character information database storing character information transmitted and received over the Internet;
selection means for selecting, from among the nouns determined to be keywords by the determination means, the maximum score noun having the largest score;
search means for searching the character information database for the maximum score noun and the noun, and obtaining the number of search results for the maximum score noun, the number of search results for the noun, and the number of search results containing both the maximum score noun and the noun;
correction coefficient calculation means for calculating a correction coefficient based on the number of search results for the maximum score noun, the number of search results for the noun, and the number of search results containing both the maximum score noun and the noun; and
corrected score calculation means for calculating a corrected score based on the correction coefficient and the score calculated by the calculation means,
wherein the determination means determines, based on the corrected score, whether or not the noun is to be a keyword.

According to the invention of (3), the character information database is searched for the maximum score noun and the noun to obtain the number of search results for the maximum score noun, the number of search results for the noun, and the number of search results containing both. Based on these counts, a correction is applied to the nouns that became keyword candidates under (1).

In this way, according to the invention of (3), the correction coefficient (score B) becomes large when a noun is closely related to the maximum score noun (the noun with the largest score A) and small when the relation is weak. As a result, a word with little value as a keyword receives a small corrected score (score C) and is appropriately processed so that it is not judged to be a keyword.

(4) The keyword extraction device according to (3), wherein the correction coefficient calculation means takes as the corrected score the value obtained by dividing the number of search results containing both the maximum score noun and the noun by the square root of the product of the number of search results for the maximum score noun and the number of search results for the noun, and
the determination means determines whether or not the noun is to be a keyword based on the product of the corrected score and the score.

According to the invention of (4), whether or not the noun is to be a keyword is determined based on the value obtained by dividing the number of search results containing both the maximum score noun and the noun by the square root of the product of the number of search results for the maximum score noun and the number of search results for the noun.

In this way, according to the invention of (4), keywords can be extracted using a small number of simple operations on search result counts, namely a multiplication and a square root. It is therefore possible to provide a keyword extraction server that does not use prerequisite knowledge (dictionaries, grammar data), can handle free-form text (including multilingual text), and allows high-speed processing.

(5) A keyword extraction method comprising:
a dividing step of dividing an input text at punctuation marks;
a morpheme extraction step of extracting morphemes from the divided parts produced in the dividing step;
a noun extraction step of determining the part of speech of each morpheme extracted in the morpheme extraction step and extracting the morphemes determined to be nouns;
a calculation step of calculating, for each noun extracted in the noun extraction step, a score of the noun as a keyword based on the number of characters of the noun, the appearance frequency of the noun in the text, and the ratio of the total number of sentences in the text to the appearance frequency indicating over how many of the sentences the noun appears; and
a determination step of determining, based on the score resulting from the calculation, whether or not the noun is to be a keyword.

(6) A program that causes a computer to execute the method according to (5).

With this configuration, causing a computer to execute the program can be expected to give the same effect as (5).
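The five steps of the method in (5) can be sketched in Python as a single function. This is a minimal sketch under stated assumptions, not the implementation of the publication: `tokenize`, `is_noun`, and `score` are hypothetical callables supplied by the caller, the text is split only at sentence-ending marks, and the toy scoring function in the usage lines is chosen purely to make the arithmetic easy to follow.

```python
import re
from collections import Counter

def extract_keywords(text, tokenize, is_noun, score, threshold):
    """Run the five steps of the method on `text`.

    tokenize, is_noun and score are hypothetical callables supplied by the
    caller; the publication does not prescribe concrete implementations.
    """
    # Dividing step: split at (sentence-ending) punctuation marks.
    sentences = [s for s in re.split(r"[.!?。！？]", text) if s.strip()]
    # Morpheme extraction step, then noun extraction step.
    nouns_per_sentence = [
        [m for m in tokenize(s) if is_noun(m)] for s in sentences
    ]
    # tf: occurrences in the text; sf: number of sentences containing the noun.
    tf = Counter(n for nouns in nouns_per_sentence for n in nouns)
    sf = Counter(n for nouns in nouns_per_sentence for n in set(nouns))
    # Calculation step and determination step.
    scores = {w: score(len(w), tf[w], sf[w], len(sentences)) for w in tf}
    return {w: s for w, s in scores.items() if s > threshold}

# Toy usage: whitespace tokenizer, a fixed noun list, and an illustrative score.
result = extract_keywords(
    "The server extracts a keyword. The server is fast. It rains.",
    tokenize=str.split,
    is_noun=lambda m: m in {"server", "keyword"},
    score=lambda n_chars, tf, sf, n: n_chars * tf * n / sf,
    threshold=20,
)
```

Passing the tokenizer and part-of-speech test in as parameters keeps the sketch independent of any particular morphological analyzer.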

The present invention does not use prerequisite knowledge (dictionaries, grammar data), can handle free-form text (including multilingual text), allows high-speed processing, and can extract only words effective for classification. That is, it becomes possible to extract at high speed only those keywords that are important clues for clarifying a problem or understanding content and that are effective as search terms in information retrieval.

Hereinafter, embodiments of the present invention will be described with reference to the drawings.

[Overall system configuration]
FIG. 1 shows an information processing system 1 composed of a keyword extraction server 10 according to the present embodiment and a user terminal 30. Although FIG. 1 shows one keyword extraction server 10 and one user terminal 30, the information processing system 1 is not limited to this and may include a plurality of each.

As shown in FIG. 2, the keyword extraction server 10 includes a CPU (Central Processing Unit) 310 constituting a control unit 300 (in a multiprocessor configuration, further CPUs such as a CPU 320 may be added), a bus line 200, a communication I/F (interface) 330, a main memory 340, a BIOS (Basic Input Output System) 350, an I/O controller 360, a hard disk 370, an optical disk drive 380, and a semiconductor memory 390. The hard disk 370, the optical disk drive 380, and the semiconductor memory 390 are collectively referred to as a storage device 410.

The control unit 300 controls the keyword extraction server 10 as a whole. By reading and executing the various programs stored on the hard disk 370 as needed, it cooperates with the hardware described above to realize the various functions of the present invention.

The communication I/F 330 is a network adapter through which the keyword extraction server 10 exchanges information with other devices, such as the user terminal 30, via a network.

The BIOS 350 stores a boot program that the CPU 310 executes when the keyword extraction server 10 starts up, programs that depend on the hardware of the keyword extraction server 10, and the like.

The storage device 410, such as the hard disk 370, the optical disk drive 380, and the semiconductor memory 390, can be connected to the I/O controller 360.

The hard disk 370 stores various programs that make this hardware function as the keyword extraction server 10, programs that execute the functions of the present invention, tables described later, and the like. The keyword extraction server 10 can also use a separately provided external hard disk (not shown) as an external storage device.

As the optical disk drive 380, for example, a DVD-ROM drive, a CD-ROM drive, a DVD-RAM drive, or a CD-RAM drive can be used. In this case, an optical disk 400 compatible with the drive is used. A program or data can also be read from the optical disk 400 by the optical disk drive 380 and provided to the main memory 340 or the hard disk 370 via the I/O controller 360.

In the present invention, a computer means an information processing apparatus including a storage device, a control unit, and the like. The keyword extraction server 10 is an information processing apparatus including the storage device 410, the control unit 300, and the like.

With the configuration described above, the keyword extraction server 10 according to the present invention has the functions of performing morphological analysis on text or the like input from the user terminal 30, extracting nouns from the input text, calculating a score as a keyword for each extracted noun, and determining based on the score whether or not to treat the noun as a keyword.

The configuration that provides these functions will now be described with reference to the functional block diagram shown in FIG. 3. The keyword extraction server 10 includes a division unit 11, a morpheme extraction unit 12, a noun extraction unit 13, a calculation unit 14, a determination unit 15, a selection unit 16, a search unit 17, a correction coefficient calculation unit 18, a corrected score calculation unit 19, and a character information database (DB) 20.

The division unit 11 has the function of dividing text at delimiters such as punctuation marks. The text may be text input directly from the user terminal 30, text already stored in the character information database 20, text created as character information from speech in a program broadcast by a broadcasting station (not shown), text obtained by speech analysis of information input as audio, text created from image information (including by OCR), or the like. The language of the text (English, Japanese, and so on) is not limited.
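The splitting performed by the division unit 11 might look like the following Python sketch. The delimiter set is an assumption (Japanese and Western sentence-ending marks); the source only says "delimiters such as punctuation marks", and the names `DELIMITERS` and `split_text` are hypothetical.

```python
import re

# Delimiters are an assumption: Japanese and Western sentence-ending marks.
# Add "、" or "," to the character class to also split at commas.
DELIMITERS = re.compile(r"[。！？.!?]+")

def split_text(text):
    """Split text at runs of delimiters and drop empty segments."""
    return [part.strip() for part in DELIMITERS.split(text) if part.strip()]

parts = split_text("今日は晴れです。明日は雨でしょう。 It is sunny today. Will it rain?")
```

Because the rule relies only on delimiter characters, the same function handles mixed Japanese and English input without any dictionary or grammar data.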

The morpheme extraction unit 12 extracts morphemes from the divided parts produced by the division unit 11, using a morphological analysis technique. For example, it has the function of extracting a plurality of morphemes from character information using part of the tf·idf algorithm (Term Frequency - Inverse Document Frequency, an algorithm for extracting characteristic words, that is, words regarded as important, in a text).

The noun extraction unit 13 has the function of determining the part of speech of each morpheme extracted by the morpheme extraction unit 12 and extracting the morphemes determined to be nouns.

The calculation unit 14 has the function of calculating, for each noun extracted by the noun extraction unit 13, a score of the noun as a keyword based on the number of characters of the noun, the appearance frequency of the noun in the text, and the ratio of the total number of sentences in the text to the appearance frequency indicating over how many of the sentences the noun appears.

Specifically, letting the score A for a noun w be Score A(w), Score A(w) is given by Equation (1):

[Equation (1)]

where |w| is the number of characters of the word w, tf(w) is the appearance frequency of the word w in the text (how many times it appears in the text), sf(w) is the sentence frequency of the word w (the number of sentences in which it appears), and N is the total number of sentences in the text.
The calculation unit 14 calculates the score A based on Equation (1).
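The quantities |w|, tf(w), sf(w), and N can be computed and combined as in the Python sketch below. Equation (1) itself is not reproduced in this text, so the exact form used here, log(|w| + 1) × tf(w) × log(N / sf(w) + 1), is an assumption chosen to be consistent with (2), which describes a product of a (possibly log-scaled) character count, the appearance frequency, and a (possibly log-scaled) ratio of N to sf(w). The function name `score_a` and the sample data are hypothetical.

```python
import math
from collections import Counter

def score_a(sentences):
    """sentences: list of noun lists, one list per sentence.

    Returns {noun: score} using the ASSUMED form
    log(|w| + 1) * tf(w) * log(N / sf(w) + 1).
    """
    n = len(sentences)                                   # N: number of sentences
    tf = Counter(w for s in sentences for w in s)        # occurrences in the text
    sf = Counter(w for s in sentences for w in set(s))   # sentences containing w
    return {
        w: math.log(len(w) + 1) * tf[w] * math.log(n / sf[w] + 1)
        for w in tf
    }

scores = score_a([["server", "keyword"], ["server"], ["server", "text"]])
```

With this assumed form, a noun gains score by being longer and more frequent, and loses some of that gain when it is spread over every sentence of the text.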

The determination unit 15 has the function of determining whether or not the noun w is to be a keyword, based on the score A obtained by the calculation unit 14 from Equation (1).
The criterion can be set in advance to any value, and that value can be chosen by trial and error.

The keywords extracted here are words that serve as important clues for clarifying a problem or understanding content. In information retrieval, they are the terms used as search clues.

The selection unit 16 has the function of selecting, from among the nouns determined to be keywords by the determination unit 15, the maximum score noun having the largest score.

The search unit 17 has the function of searching for the maximum score noun and each noun extracted by the noun extraction unit 13, either in the character information database (DB) 20 of the keyword extraction server 10 or in an external character information DB (not shown), and obtaining the number of search results for the maximum score noun, the number of search results for the noun, and the number of search results containing both the maximum score noun and the noun.
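The three counts obtained by the search unit 17 can be illustrated with a toy in-memory stand-in for the character information DB. This Python sketch is an assumption-laden illustration: a list of strings replaces the database, substring matching replaces a real search engine, and the name `hit_counts` is hypothetical.

```python
def hit_counts(documents, w_max, w):
    """Count documents containing w_max, w, and both (AND search).

    documents: list of strings standing in for the character information DB.
    Substring matching is a simplification of a real search engine.
    """
    hits_max = sum(1 for d in documents if w_max in d)
    hits_w = sum(1 for d in documents if w in d)
    hits_both = sum(1 for d in documents if w_max in d and w in d)
    return hits_max, hits_w, hits_both

db = [
    "keyword extraction server",
    "keyword search engine",
    "weather server status",
    "keyword server log",
]
counts = hit_counts(db, "keyword", "server")
```

Here "keyword" and "server" each match three documents and appear together in two of them.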

The correction coefficient calculation unit 18 has the function of calculating a correction coefficient based on the number of search results for the maximum score noun, the number of search results for the noun, and the number of search results containing both the maximum score noun and the noun.

Specifically, letting the correction coefficient B for a noun w be Score B(w), Score B(w) is given by

$$\mathrm{Score\,B}(w) = \frac{|W_{maxA} \,\&\, w|}{\sqrt{|W_{maxA}| \times |w|}}$$

where W_{maxA} is the noun with the largest score A, |W_{maxA} & w| is the number of hits for an AND search on the noun W_{maxA} and the noun w, |W_{maxA}| is the number of hits for a search on the noun W_{maxA} alone, and |w| is the number of hits for a search on the noun w alone. Each search is run against the character information database (DB) 20 of the keyword extraction server 10 or an external character information DB (not shown).
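As stated in (4), the correction coefficient is the AND-search hit count divided by the square root of the product of the two single-term hit counts. A minimal Python sketch follows; the function name `score_b` is hypothetical, and returning 0.0 when either term has no hits is an assumption, since the source does not say how that case is handled.

```python
import math

def score_b(hits_both, hits_max, hits_w):
    """Correction coefficient: AND hits / sqrt(hits_max * hits_w).

    Returning 0.0 when either term has no hits is an assumption; the
    source does not say how that case is handled.
    """
    if hits_max == 0 or hits_w == 0:
        return 0.0
    return hits_both / math.sqrt(hits_max * hits_w)

b_related = score_b(50, 100, 100)    # terms co-occur in half of their hits
b_unrelated = score_b(1, 100, 100)   # terms almost never co-occur
```

When the noun is the maximum score noun itself, the AND search and both single searches return the same count, so under this formula the coefficient is 1.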

The corrected score calculation unit 19 has the function of calculating a corrected score based on the correction coefficient and the score calculated by the calculation unit 14.

Specifically, letting the score A for a noun w be Score A(w), the correction coefficient B for the noun w be Score B(w), and the corrected score for the noun w be Corrected Score C(w), the corrected score is given by

$$\mathrm{Score\,C}(w) = \mathrm{Score\,A}(w) \times \mathrm{Score\,B}(w)$$
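Following (4), in which the determination is based on the product of the score and the correction value, the corrected score can be sketched in Python as an element-wise product of score A and correction coefficient B. The function name `corrected_scores` and all sample values are hypothetical and chosen only to keep the arithmetic exact.

```python
def corrected_scores(score_a, score_b):
    """Multiply each noun's score A by its correction coefficient B."""
    return {w: score_a[w] * score_b[w] for w in score_a}

# Hypothetical values: "keyword" is the maximum score noun, so its B is 1.0.
a = {"keyword": 12.0, "server": 9.0, "today": 8.0}
b = {"keyword": 1.0, "server": 0.5, "today": 0.25}
c = corrected_scores(a, b)
```

In this example "today" has a score A close to that of "server" but is only weakly related to the top noun, so its corrected score drops much further.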

The determination unit 15 determines whether or not the noun is to be a keyword based on the corrected score. The criterion can be set in advance to any value, and that value can be chosen by trial and error. In the embodiment described below, as one example, 10 is used as the reference, and a noun whose score A is greater than 10 can be judged to be a keyword.
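The threshold decision can be sketched in Python as a simple filter. The default of 10 follows the example value given for this embodiment; the function name `select_keywords`, the ordering of the output, and the sample scores are assumptions for illustration.

```python
def select_keywords(scores, threshold=10.0):
    """Return nouns whose score exceeds the threshold, highest score first."""
    kept = [(w, s) for w, s in scores.items() if s > threshold]
    return [w for w, _ in sorted(kept, key=lambda pair: pair[1], reverse=True)]

keywords = select_keywords(
    {"keyword": 12.0, "server": 4.5, "index": 10.0, "corpus": 15.5}
)
```

The comparison is strict, so a noun scoring exactly 10 is not selected, matching the wording "greater than 10".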

With such a configuration, the keyword extraction server 10 according to the present invention not only performs morphological analysis on text input from the user terminal 30, extracts nouns from the input text, calculates a keyword score for each extracted noun, and determines from that score whether the noun is to be a keyword. It also corrects the score of each keyword-candidate noun by its relatedness to the maximum score noun, measured by the number of hits (the number of items returned by a search) in a database of other articles. This makes it possible to appropriately extract the nouns that truly deserve to be keywords.

That is, compared with tf·idf, which is one example of the prior art, the present invention has the advantageous effect of efficiently correcting a drawback of tf·idf: even a word that is highly suitable as a keyword receives a low score as a keyword-candidate noun if it appears frequently in other articles.

Likewise, the present invention efficiently corrects the converse drawback of tf·idf: even a noun that is poorly suited as a keyword receives a high score if it rarely appears in other articles.

[Processing procedure]
A specific processing procedure that can be realized when the present invention is applied will now be described with reference to the flowchart shown in FIG. 4. The procedure shown below is only an example; many other procedures can be realized.

In step S1, the dividing unit 11 of the keyword extraction server 10 divides text at delimiters such as punctuation marks. The text may be input directly from the user terminal 30, already stored in the character information database 20, or input from another external device (not shown), and is not limited to any particular language such as English or Japanese.

In step S2, the morpheme extraction unit 12 of the keyword extraction server 10 extracts morphemes, the smallest units of sound form that carry meaning, from the divided parts produced by the dividing unit 11.

ステップＳ３において、キーワード抽出サーバ１０の名詞抽出部１３は、形態素抽出部１２によって抽出された形態素についてその形態素が名詞であるか否かを判定し、名詞であると判定された形態素を抽出する。 In step S3, the noun extraction unit 13 of the keyword extraction server 10 determines whether or not the morpheme extracted by the morpheme extraction unit 12 is a noun, and extracts the morpheme determined to be a noun.

ステップＳ４において、キーワード抽出サーバ１０の演算部１４は、ステップＳ３において抽出された名詞である形態素についてキーワードになり得るかの判断基準を示すスコアＡを演算する。 In step S4, the calculation unit 14 of the keyword extraction server 10 calculates a score A indicating a criterion for determining whether or not the morpheme that is the noun extracted in step S3 can be a keyword.

スコアＡは上述した式（１）に基づいて演算される。 The score A is calculated based on the above-described equation (1).

In step S5, when the score A calculated in step S4 is larger than a predetermined value, the determination unit 15 of the keyword extraction server 10 determines that the word (noun), that is, the morpheme having that score A, is a keyword. As an example, a word (noun) whose score A is greater than about 10 can be determined to be a keyword.

ステップＳ６において、キーワード抽出サーバ１０の選択部１６は、ステップＳ５において判断されたキーワードの中で最もスコアＡの値が大きい最大スコア名詞を選択する。 In step S6, the selection unit 16 of the keyword extraction server 10 selects the maximum score noun having the largest score A among the keywords determined in step S5.

In step S7, the search unit 17 of the keyword extraction server 10 searches the character information database 20 for the maximum score noun selected in step S6 and takes the number of hits as the search count of the maximum score noun (|WmaxA|). It likewise searches the character information database 20 for each other keyword-candidate noun (W) and takes the number of hits as the search count of that keyword-candidate noun (|W|).

さらに、キーワード抽出サーバ１０の検索部１７は、最大スコア名詞および他のキーワード候補名詞の両方が含まれる情報を文字情報データベース２０において検索し、ヒットする検索件数（｜ＷｍａｘＡ＆Ｗ｜）を求める。 Further, the search unit 17 of the keyword extraction server 10 searches the character information database 20 for information including both the maximum score noun and other keyword candidate nouns, and obtains the number of search hits (| WmaxA & W |).

In step S8, the correction coefficient calculation unit 18 of the keyword extraction server 10 calculates the correction coefficient for score A (score B(w)) from |WmaxA|, |W| and |WmaxA & W| obtained in step S7. The correction coefficient (score B(w)) is calculated based on equation (2) described above.

In step S9, the correction score calculation unit 19 of the keyword extraction server 10 calculates the corrected score C(w) for the noun w from the correction coefficient (score B(w)) calculated in step S8 and the score A(w) calculated in step S4.

ステップＳ１０において、キーワード抽出サーバ１０の判断部１５は、ステップＳ９において演算された名詞ｗに関する補正スコアＣ（ｗ）が予め定められた値よりも大きい場合には、その補正スコアＣ（ｗ）の値を示すワード（名詞）をキーワードと判断する。一例として、補正スコアＣ（ｗ）の値が１０前後よりも大きい場合に、その補正スコアＣ（ｗ）の値を示すワード（名詞）をキーワードと判断することが可能である。 In step S10, when the correction score C (w) regarding the noun w calculated in step S9 is larger than a predetermined value, the determination unit 15 of the keyword extraction server 10 determines the correction score C (w). A word (noun) indicating a value is determined as a keyword. As an example, when the value of the correction score C (w) is larger than about 10, it is possible to determine a word (noun) indicating the value of the correction score C (w) as a keyword.
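Steps S1 to S10 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment. Morphological analysis (steps S2 and S3) is replaced by a caller-supplied list of candidate nouns, and substring matching stands in for morpheme matching. The `hits` callable is a hypothetical stand-in for a search of the character information database 20. Score A uses natural logarithms in the form log(N / sf(w) + 1), which is the form that reproduces the worked values 12.0502 and 10.0928 given later in the description. The sentence delimiter set and the zero-hit handling are also assumptions.

```python
import math
import re
from typing import Callable, Dict, List


def split_sentences(text: str) -> List[str]:
    # S1: split at sentence-ending marks (an assumed delimiter set).
    return [s for s in re.split(r"[。．.!?！？\n]+", text) if s.strip()]


def score_a(noun: str, sentences: List[str]) -> float:
    # S4: log(|w| + 1) * tf(w) * log(N / sf(w) + 1), natural logarithms.
    tf = sum(s.count(noun) for s in sentences)   # occurrences in the text
    sf = sum(1 for s in sentences if noun in s)  # sentences containing w
    if sf == 0:
        return 0.0
    return math.log(len(noun) + 1) * tf * math.log(len(sentences) / sf + 1)


def extract_keywords(text: str, nouns: List[str],
                     hits: Callable[..., int],
                     threshold: float = 10.0) -> Dict[str, float]:
    sentences = split_sentences(text)                     # S1
    a = {w: score_a(w, sentences) for w in nouns}         # S2-S4 (nouns given)
    candidates = [w for w in nouns if a[w] > threshold]   # S5
    if not candidates:
        return {}
    w_max = max(candidates, key=lambda w: a[w])           # S6
    keywords: Dict[str, float] = {}
    for w in candidates:
        denom = math.sqrt(hits(w_max) * hits(w))          # S7: single searches
        b = hits(w_max, w) / denom if denom else 0.0      # S8: score B
        c = a[w] * b                                      # S9: corrected score C
        if c > threshold:                                 # S10
            keywords[w] = c
    return keywords
```

Passing the search as a callable keeps the sketch independent of any particular database: `hits("a")` is expected to return the single-search hit count and `hits("a", "b")` the AND-search hit count.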

With such a configuration, the keyword extraction server 10 according to the present invention not only performs morphological analysis on text input from the user terminal 30, extracts nouns from the input text, calculates a keyword score for each extracted noun, and determines from that score whether the noun is to be a keyword. It also corrects the score of each keyword-candidate noun by its relatedness to the maximum score noun, measured by the number of hits (the number of items returned by a search) in a database of other articles. This makes it possible to appropriately extract the nouns that truly deserve to be keywords.

[Keyword identification method]
An example of a keyword identification method is described below. Suppose the input text is the passage shown in FIG. 5 and the noun extraction unit 13 extracts the nouns 「デジカメ」 (digital camera), 「カメラ」 (camera), 「大写し」 (close-up), and 「グニャン」 (Gunyan). For these nouns, FIG. 6 shows the calculation process, and FIG. 7 the calculation results, for the corrected score C(w), score A(w), and score B(w) of the present embodiment and for the tf·idf score, which is one example of the prior art.

The following description refers to FIG. 6, which shows the calculation process for the case where the nouns that the keyword extraction server 10 has determined to be keyword candidates from the above input text are 「デジカメ」, 「カメラ」, 「大写し」, and 「グニャン」.

When the keyword-candidate noun is 「デジカメ」, obtaining score A(w) requires calculating log(|デジカメ| + 1) * tf(デジカメ) * log(N / sf(デジカメ) + 1) (from equation (1)).

Here |デジカメ| is the number of characters in the word デジカメ, which is 4, so log(|デジカメ| + 1) is log(5).

tf(デジカメ) is the frequency of the word デジカメ in the text (how many times it appears in the text), which for the above input text is 4.

sf(デジカメ) is the sentence frequency of the word デジカメ (how many sentences it appears in), which for the above input text is 4. N is the total number of sentences in the text, which for the above input text is 22.

Therefore log(N / sf(デジカメ) + 1) is log(22 / 4 + 1).

As a result, the value of score A(デジカメ) is log(5) * 4 * log(6.5) = 12.0502 (using natural logarithms). As an example, if a word (noun) whose score A is greater than about 6 is determined to be a keyword, デジカメ can be taken as a keyword of the above input text.

Next, when the keyword-candidate noun is 「グニャン」, obtaining score A(w) requires calculating log(|グニャン| + 1) * tf(グニャン) * log(N / sf(グニャン) + 1), as for デジカメ (from equation (1)).

Here |グニャン| is the number of characters in the word グニャン, which is 4, so log(|グニャン| + 1) is log(5).

tf(グニャン) is the frequency of the word グニャン in the text (how many times it appears in the text), which for the above input text is 2.

sf(グニャン) is the sentence frequency of the word グニャン (how many sentences it appears in), which for the above input text is 1. N is the total number of sentences in the text, which for the above input text is 22.

Therefore log(N / sf(グニャン) + 1) is log(22 / 1 + 1).

As a result, the value of score A(グニャン) is log(5) * 2 * log(23) = 10.0928 (using natural logarithms). As an example, if a word (noun) whose score A is greater than about 6 is determined to be a keyword, グニャン can be taken as a keyword of the above input text.
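The two score A values above can be checked with a short Python sketch. It assumes natural logarithms and the form log(N / sf(w) + 1) for the last factor, since that combination reproduces the stated 12.0502 and 10.0928. The function name and parameter names are illustrative and do not come from the description.

```python
import math


def score_a(num_chars: int, tf: int, sf: int, n_sentences: int) -> float:
    """Score A = log(|w| + 1) * tf(w) * log(N / sf(w) + 1), natural logs."""
    return math.log(num_chars + 1) * tf * math.log(n_sentences / sf + 1)


# デジカメ: 4 characters, tf = 4, sf = 4, N = 22
print(round(score_a(4, 4, 4, 22), 4))  # 12.0502
# グニャン: 4 characters, tf = 2, sf = 1, N = 22
print(round(score_a(4, 2, 1, 22), 4))  # 10.0928
```

With the alternative reading log(N / (sf(w) + 1)), the same inputs give about 9.54 and 7.72, which do not match the stated values.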

Similarly, when the keyword-candidate noun is 「カメラ」 (camera), the value of score A(カメラ) is 6.8896, and when it is 「大写し」 (close-up), the value of score A(大写し) is 4.3467.

As an example, if a word (noun) whose score A is greater than about 6 is determined to be a keyword, 「カメラ」 and 「大写し」 are unlikely to be keywords of the above input text.

Even with score A alone, the score A of 「デジカメ」 is larger than that of 「グニャン」, so 「デジカメ」 is more readily determined to be an appropriate keyword.

Next, the correction coefficient (score B) is calculated for the noun 「デジカメ」 and the noun 「グニャン」.

Score B(w) is obtained by dividing |WmaxA & W| by (|WmaxA| * |W|)^(1/2) (from equation (2)).

Here WmaxA is the noun with the highest score A, 「デジカメ」. |WmaxA & W| is the number of hits for an AND search on 「デジカメ」 and the noun in question (「デジカメ」 or 「グニャン」); |WmaxA| is the number of hits for a search on 「デジカメ」 alone; and |W| is the number of hits for a search on the noun in question alone. Each search is run against the character information database (DB) 20 of the keyword extraction server 10 or against an external character information DB (not shown).

For the noun 「デジカメ」, |WmaxA & W| and (|WmaxA| * |W|)^(1/2) have the same value, so score B(デジカメ) is 1 (see FIGS. 6 and 7).

For the noun 「グニャン」, score B(グニャン) is 24 / (113,000,000 * 727)^(1/2), or approximately 0.0001 (see FIGS. 6 and 7).

For the noun 「カメラ」, score B(カメラ) is 40,800,000 / (113,000,000 * 310,000,000)^(1/2), or approximately 0.2141 (see FIGS. 6 and 7).

For the noun 「大写し」, score B(大写し) is 32,800 / (113,000,000 * 333,000)^(1/2), or approximately 0.0056 (see FIGS. 6 and 7).
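The correction coefficient of equation (2) can be sketched as a small Python function. The function and parameter names are illustrative, and returning 0.0 when a single search has no hits is an assumption the description does not address.

```python
import math


def score_b(hits_both: int, hits_max: int, hits_w: int) -> float:
    """Score B = |WmaxA & W| / (|WmaxA| * |W|)^(1/2).

    hits_both: hits for the AND search on WmaxA and w
    hits_max:  hits for WmaxA alone
    hits_w:    hits for w alone
    """
    denom = math.sqrt(hits_max * hits_w)
    # Assumed convention: no hits for either noun gives a coefficient of 0.
    return hits_both / denom if denom else 0.0


# グニャン against the maximum score noun デジカメ, using the quoted hit counts
print(score_b(24, 113_000_000, 727))
# The maximum score noun against itself: all three counts coincide
print(score_b(113_000_000, 113_000_000, 113_000_000))
```

Evaluated on the rounded hit counts quoted for 「カメラ」 and 「大写し」, the function gives roughly 0.218 and 0.0053. These are close to, but not identical with, the 0.2141 and 0.0056 stated in the text, which may reflect rounding of the quoted counts.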

Next, the corrected score (score C) is calculated from these results.

As shown in equation (3), the corrected score (score C) is the product of the correction coefficient (score B) and score A. The corrected score of the noun 「デジカメ」 (score C(デジカメ)) is therefore 12.0502 * 1 = 12.0502; that of the noun 「カメラ」 (score C(カメラ)) is 6.8896 * 0.2141 = 1.4751; that of the noun 「大写し」 (score C(大写し)) is 4.3467 * 0.0056 = 0.0243; and that of the noun 「グニャン」 (score C(グニャン)) is 10.0928 * 0.0001 = 0.001 (see FIGS. 6 and 7).
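The multiplication of equation (3) and the subsequent threshold decision can be sketched in Python using the score A and score B values stated above. The helper names are illustrative, and the threshold of 10 is the example value mentioned in the description.

```python
from typing import Dict, List, Tuple

# (score A, score B) for each candidate noun, as stated in the text.
SCORES: Dict[str, Tuple[float, float]] = {
    "デジカメ": (12.0502, 1.0),
    "カメラ": (6.8896, 0.2141),
    "大写し": (4.3467, 0.0056),
    "グニャン": (10.0928, 0.0001),
}


def corrected_scores(scores: Dict[str, Tuple[float, float]]) -> Dict[str, float]:
    """Corrected score C(w) = score A(w) * score B(w)."""
    return {w: a * b for w, (a, b) in scores.items()}


def select_keywords(c: Dict[str, float], threshold: float = 10.0) -> List[str]:
    """Keep the nouns whose corrected score exceeds the threshold."""
    return [w for w, value in c.items() if value > threshold]


c = corrected_scores(SCORES)
print(c)
print(select_keywords(c))
```

Although グニャン passes a threshold of 10 on score A alone, only デジカメ remains once the correction coefficient is applied.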

Consequently, if keywords were determined from score A alone, the score A of the noun 「デジカメ」 and the score A of the noun 「グニャン」 would both be large (for example, 10 or more), so both nouns could be determined to be keywords.

However, the correction coefficient (score B) of the noun 「グニャン」 is very small, 0.0001, so the correction coefficient expresses numerically and precisely that this noun is inappropriate as a keyword.

As a result, under the corrected score (score C), the noun 「デジカメ」, which is considered appropriate as a keyword, has a large value (for example, 10 or more). Accordingly, it is possible to provide a keyword extraction server, method, and program that can easily extract only the nouns (keywords) effective for classification by means of the corrected score (score C), which allows high-speed processing using only formulas and searches.

また、ｔｆ（ＴｅｒｍＦｒｅｑｕｅｎｃｙ）・ｉｄｆ（ＩｎｖｅｒｓｅＤｏｃｕｍｅｎｔＦｒｅｑｕｅｎｃｙ）の値を図７に参考として示す。 Further, values of tf (Term Frequency) · idf (Inverse Document Frequency) are shown in FIG. 7 for reference.

tf indicates the frequency of a term within a document. If a term (in the present embodiment, the nouns 「デジカメ」, 「カメラ」, 「大写し」, and 「グニャン」) appears frequently in a document, the term is considered to characterize that document, so its tf value is large.

idf is the inverse document frequency. A term with a large tf is important, but a noun such as 「こと」 ("thing"), for example, appears relatively frequently in documents without characterizing any particular document. Therefore the inverse (idf) of the term's document frequency df (Document Frequency) across a plurality of documents is taken, so that the smaller the df (and hence the more likely the term characterizes the document), the larger the idf, and words that characterize the document are extracted.

Specifically, tf·idf is calculated by equation (4), in which tf(w) is the frequency of the word w in the text (how many times it appears in the text), df(w) is the frequency of the word w in the document set (how many documents it appears in), and N is the total number of documents in the document set.

Here, for the noun 「デジカメ」, tf(デジカメ) is 4, df(デジカメ) is 97,200,000, and N is 19,200,000,000 (a number regarded as the total number of Web documents), so the tf·idf value is 21.1638.

For the noun 「グニャン」, tf(グニャン) is 2, df(グニャン) is 727, and N is 19,200,000,000, so the tf·idf value is 34.1785.

Similarly, the tf·idf value is 10.5224 for the noun 「大写し」 and 8.5419 for the noun 「カメラ」.
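The tf·idf reference values can be reproduced with the sketch below. The exact form of equation (4) does not survive in this text, so the expression tf(w) * log(N / df(w) + 1) with natural logarithms is an inference: it matches the stated 21.1638 and 34.1785, whereas tf(w) * log(N / df(w)) gives about 21.14 for デジカメ. It should be read as an assumption, not as the patent's definition.

```python
import math


def tf_idf(tf: int, df: int, n_docs: int) -> float:
    """Assumed form: tf(w) * ln(N / df(w) + 1)."""
    return tf * math.log(n_docs / df + 1)


N_DOCS = 19_200_000_000  # number regarded as the total number of Web documents

print(round(tf_idf(4, 97_200_000, N_DOCS), 4))  # デジカメ
print(round(tf_idf(2, 727, N_DOCS), 4))         # グニャン
```

The comparison shows why tf·idf alone ranks the rare word グニャン above デジカメ: a df of 727 yields a far larger logarithmic factor than a df of 97,200,000.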

Comparing 「カメラ」 (camera) and 「大写し」 (close-up) shows that, under tf·idf, even a word highly suitable as a keyword, such as 「カメラ」, receives a low score (value) because it appears frequently in other articles (documents).

It also shows that, under tf·idf, even a word poorly suited as a keyword, such as 「大写し」, receives a high score (value) because it rarely appears in other articles (documents).

In the present embodiment, however, comparing 「カメラ」 and 「大写し」 shows that 「カメラ」 has the higher score (the corrected score C(カメラ) is about 1.48 and the corrected score C(大写し) is about 0.02, from the products 6.8896 * 0.2141 and 4.3467 * 0.0056), which confirms that the keyword candidates are judged appropriately.

Furthermore, under tf·idf, even a word poorly suited as a keyword, such as 「グニャン」, receives a high score (value) because it rarely appears in other articles (documents).

In the present embodiment, however, the correction coefficient (score B) becomes large when the noun is closely related to the maximum score noun, that is, the noun with the largest score A, and becomes small when the noun is only weakly related to it. Through this effect, a word poorly suited as a keyword, such as 「グニャン」, receives a small corrected score (score C), so the calculation appropriately prevents it from being determined to be a keyword.

Even with score A alone, the score A of 「デジカメ」 is larger than that of 「グニャン」, so 「デジカメ」 is more readily determined to be an appropriate keyword.

[Other keyword identification method]

さらに、他のキーワードの特定方法の一例について以下に説明する。例えば、入力されたテキストが図８に示される文章の場合に名詞抽出部１３で抽出された名詞「地震」、「災害」、「震度」および「余震」について本実施形態によるスコアＡ（ｗ）、補正係数であるスコアＢ（ｗ）、補正スコアであるスコアＣ（ｗ）について演算し、その演算結果について説明する。 Furthermore, an example of another keyword specifying method will be described below. For example, in the case where the input text is the sentence shown in FIG. 8, the score A (w) according to the present embodiment for the nouns “earthquake”, “disaster”, “seismic intensity”, and “aftershock” extracted by the noun extraction unit 13. , The score B (w) as the correction coefficient and the score C (w) as the correction score are calculated, and the calculation result will be described.

キーワード候補として判断した名詞が「災害」の場合には、スコアＡ（ｗ）は３．５４、補正係数（スコアＢ）は０．２９、補正スコア（スコアＣ）は１．０３となる。 When the noun determined as the keyword candidate is “disaster”, the score A (w) is 3.54, the correction coefficient (score B) is 0.29, and the correction score (score C) is 1.03.

また、キーワード候補として判断した名詞が「地震」の場合には、スコアＡ（ｗ）は７．２４、補正係数（スコアＢ）は１．０、補正スコア（スコアＣ）は７．２４となる。 When the noun determined as the keyword candidate is “earthquake”, the score A (w) is 7.24, the correction coefficient (score B) is 1.0, and the correction score (score C) is 7.24. .

また、キーワード候補として判断した名詞が「震度」の場合には、スコアＡ（ｗ）は３．５４、補正係数（スコアＢ）は０．２７、補正スコア（スコアＣ）は０．９４となる。 When the noun determined as the keyword candidate is “seismic intensity”, the score A (w) is 3.54, the correction coefficient (score B) is 0.27, and the correction score (score C) is 0.94. .

また、キーワード候補として判断した名詞が「余震」の場合には、スコアＡ（ｗ）は４．２８、補正係数（スコアＢ）は０．１５、補正スコア（スコアＣ）は０．６６となる。 When the noun determined as the keyword candidate is “aftershock”, the score A (w) is 4.28, the correction coefficient (score B) is 0.15, and the correction score (score C) is 0.66. .

以上のキーワード候補とした名詞「地震」、「災害」、「震度」および「余震」についてスコアＡ（ｗ）を演算すると、「地震」が最も大きな値となる。 When the score A (w) is calculated for the nouns “earthquake”, “disaster”, “seismic intensity”, and “aftershock” as the keyword candidates, “earthquake” has the largest value.

「地震」という名詞は、地震が発生した場合など、特別な場合に使用されることが多いため専門性の高い語だと言える。従って、「地震」はキーワード候補としてふさわしい名詞と考えられる。 The term “earthquake” is a highly specialized word because it is often used in special cases, such as when an earthquake occurs. Therefore, “earthquake” is considered as a noun suitable as a keyword candidate.

補正係数であるスコアＢ（ｗ）は、スコアＡ（ｗ）が最も大きな値を有する名詞に基づいて演算されるので、「地震」という名詞と共起する「災害」、「震度」、「余震」に対して、スコアＢ（ｗ）のスコア値が高くなる。 The score B (w), which is a correction coefficient, is calculated based on the noun having the largest value of the score A (w), so that “disaster”, “seismic intensity”, “aftershock” co-occurs with the noun “earthquake”. ", The score value of the score B (w) becomes higher.

さらに、補正スコアであるスコアＣ（ｗ）は補正係数であるスコアＢ（ｗ）の演算結果を利用するので（式（３）参照）、「災害」、「震度」、「余震」の補正スコア値は大きな値となり、専門性の高い語からキーワードを適切に抽出することが可能であることが示される。 Furthermore, since the score C (w) as the correction score uses the calculation result of the score B (w) as the correction coefficient (see Equation (3)), the correction scores for “disaster”, “seismic intensity”, and “aftershock” are used. The value becomes a large value, which indicates that it is possible to appropriately extract keywords from highly specialized words.

以上、この例を分析すると、スコアＡの計算により、スコアＡが最も大きい単語が「地震」となる。「地震」という語は、地震が起こったときなど、特別な場合に使われることが多いため、専門性が高い語だといえる。そのため、スコアＢの計算により、地震とよく共起する「災害」、「震度」、「余震」に高いスコアが付く。スコアCの計算では、スコアＢの計算結果を利用するため、「災害」、「震度」、「余震」に高いスコアが付くことがわかる。 As described above, when this example is analyzed, the word having the highest score A is “earthquake” by the calculation of the score A. The term “earthquake” is highly specialized because it is often used in special cases, such as when an earthquake occurs. Therefore, by calculating the score B, high scores are given to “disaster”, “seismic intensity”, and “aftershock” that often co-occur with earthquakes. In the calculation of score C, since the calculation result of score B is used, it is understood that “disaster”, “seismic intensity”, and “aftershock” have high scores.
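The figures quoted for this second example can be cross-checked with a short Python sketch that multiplies the stated score A and score B values and ranks the nouns by the result. The data are the rounded values from the text, so the products agree with the stated corrected scores only to within rounding. The function name is illustrative.

```python
from typing import Dict, List, Tuple

# (score A, score B, stated score C) as given in the text for this example.
STATED: Dict[str, Tuple[float, float, float]] = {
    "地震": (7.24, 1.0, 7.24),   # earthquake (maximum score noun)
    "災害": (3.54, 0.29, 1.03),  # disaster
    "震度": (3.54, 0.27, 0.94),  # seismic intensity
    "余震": (4.28, 0.15, 0.66),  # aftershock
}


def rank_by_corrected_score(rows: Dict[str, Tuple[float, float, float]]) -> List[str]:
    """Return the nouns sorted by score A * score B, largest first."""
    return sorted(rows, key=lambda w: rows[w][0] * rows[w][1], reverse=True)


for noun in rank_by_corrected_score(STATED):
    a, b, c = STATED[noun]
    print(noun, round(a * b, 2), c)
```

The resulting order places 地震 first, followed by 災害, 震度, and 余震, which is the same order as the stated corrected scores.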

[Modification]
There are cases where it is desired to extract keywords from information sources such as newspaper articles, magazine articles, or news information. In such cases too, once the newspaper articles, magazine articles, news information, or the like have been converted into text, the keyword extraction server 10 according to the present embodiment can extract keywords by the calculation using score A, score B, and score C. Keywords can be selected starting from those with the highest score A or score C.

また、静止画または動画等の画像情報に関連したキーワードを抽出したい場合がある。 In some cases, it is desired to extract keywords related to image information such as still images or moving images.

この場合には対象となる画像情報のＵＲＬをキーワード抽出サーバ１０が検索し、検索結果の上位の記事情報（タイトルおよびスニペットを含む。）をテキストとしてキーワード抽出サーバ１０が取得する。 In this case, the keyword extraction server 10 searches the URL of the target image information, and the keyword extraction server 10 acquires the article information (including the title and snippet) at the top of the search result as text.

対象となる画像情報のＵＲＬを紹介している記事の周辺には、関連するワードも出現していることが考えられるためである。 This is because it is considered that related words also appear around the article introducing the URL of the target image information.

キーワード抽出サーバ１０が検索し、取得した検索結果の上位の記事情報（タイトルおよびスニペットを含む。）から、キーワード抽出サーバ１０においてスコアＡ、スコアＢ、スコアＣを使用した演算によって画像情報のキーワードを抽出することができる。キーワードはスコアＡ、またはスコアＣの値が高い値から選択することができる。 The keyword extraction server 10 searches and acquires the keyword of the image information from the top article information (including title and snippet) of the acquired search result by the calculation using the score A, score B, and score C in the keyword extraction server 10. Can be extracted. The keyword can be selected from a value having a high score A or score C.

また、記事情報に検索インデックスを付与したい場合がある。この場合にも、記事情報をテキスト化しておくことによって、本実施形態によるキーワード抽出サーバ１０においてスコアＡ、スコアＢ、スコアＣを使用した演算によってキーワードを抽出することができる。 In some cases, it is desired to add a search index to article information. Also in this case, by converting the article information into text, the keyword extraction server 10 according to the present embodiment can extract the keyword by calculation using the score A, score B, and score C.

この場合、検索インデックスとするキーワードは複数選択することができ、スコアＡ、またはスコアＣの値が高いワードから順番に検索インデックスとすることができる。 In this case, a plurality of keywords can be selected as the search index, and the search index can be set in order from the word having the highest score A or score C value.
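Selecting several index keywords in descending score order can be sketched as follows. The function name and the parameter `k` (the number of index terms to keep) are illustrative assumptions, and the sample scores are the corrected scores from the earlier デジカメ example.

```python
from typing import Dict, List


def index_terms(scores: Dict[str, float], k: int) -> List[str]:
    """Return up to k words, ordered from the highest score downward."""
    return sorted(scores, key=lambda w: scores[w], reverse=True)[:k]


corrected = {"デジカメ": 12.0502, "カメラ": 1.4751, "大写し": 0.0243, "グニャン": 0.001}
print(index_terms(corrected, 2))  # ['デジカメ', 'カメラ']
```

The same function works with score A values in place of score C, since the description allows either score to drive the selection.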

Although embodiments of the present invention have been described above, the present invention is not limited to those embodiments. The effects described in the embodiments merely list the most preferable effects arising from the present invention, and the effects of the present invention are not limited to those described in the examples.

なお、本実施形態においては、サーバ１０は、ハードディスク３７０及び光ディスクドライブ３８０を有する構成として説明したが、これに限られず、これらの駆動系を有さない構成、いわゆるゼロスピンドルによる構成であっても良い。このような構成の場合には、ハードディスク３７０に記憶される内容は、大容量の半導体メモリ３９０に記憶される。 In the present embodiment, the server 10 has been described as having a hard disk 370 and an optical disk drive 380. However, the present invention is not limited to this, and the server 10 may have a structure without these drive systems, that is, a so-called zero spindle. good. In the case of such a configuration, the contents stored in the hard disk 370 are stored in the large-capacity semiconductor memory 390.

FIG. 1 is a diagram showing an information processing system composed of the server according to the present embodiment and a user terminal.
FIG. 2 is a block diagram showing the configuration of the server according to the present invention.
FIG. 3 is a functional block diagram showing the functional configuration of the server according to the present embodiment.
FIG. 4 is a flowchart used to describe the processing procedure performed by the server according to the present embodiment.
FIG. 5 is a diagram showing an example of input text according to the present embodiment.
FIG. 6 is a diagram showing an example of score calculation according to the present embodiment.
FIG. 7 is a diagram showing an example of score calculation results according to the present embodiment.
FIG. 8 is a diagram showing an example of other input text according to the present embodiment.

Explanation of symbols

1 Information processing system
10 Keyword extraction server
11 Division part
12 Morpheme extraction part
13 Noun extraction part
14 Arithmetic operation part
15 Decision part
16 Selection part
17 Search part
18 Correction coefficient calculation part
19 Correction score calculation part
20 Character information database (DB)
30 User terminal

Claims

A keyword extraction device comprising:
dividing means for dividing input text at punctuation marks;
morpheme extraction means for extracting morphemes from the divided parts produced by the dividing means;
noun extraction means for determining the part of speech of each morpheme extracted by the morpheme extraction means and extracting the morphemes determined to be nouns;
calculation means for calculating, for each noun extracted by the noun extraction means, a score of the noun as a keyword based on the number of characters of the noun, the appearance frequency of the noun in the text, and the ratio of the total number of sentences in the text to the appearance frequency indicating over how many sentences the noun appears; and
determination means for determining whether or not the noun is to be treated as a keyword based on the score resulting from the calculation.

The keyword extraction device according to claim 1, wherein the calculation means uses, as the score, a value obtained by multiplying together: the number of characters of the noun, or the logarithm of a number close to that number of characters; the appearance frequency of the noun in the text; and the ratio of the total number of sentences in the text to the appearance frequency indicating over how many sentences the noun appears, or the logarithm of a number close to that ratio.
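The multiplication described in this claim can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `keyword_score`, the `use_log` flag, the natural logarithm, and the choice of adding 1 as the "number close to" each value are all assumptions made for this example.

```python
import math


def keyword_score(num_chars, term_freq, total_sentences, sentence_freq, use_log=False):
    """Score a noun as a keyword candidate.

    num_chars       -- number of characters in the noun
    term_freq       -- how many times the noun appears in the text
    total_sentences -- total number of sentences in the text
    sentence_freq   -- number of sentences in which the noun appears
    use_log         -- assumption: damp the length and ratio terms with log(x + 1)
    """
    if sentence_freq <= 0:
        raise ValueError("sentence_freq must be positive")
    # Ratio of the total sentence count to the number of sentences containing the noun.
    ratio = total_sentences / sentence_freq
    if use_log:
        # "+ 1" is an illustrative choice of a number close to the raw value.
        return math.log(num_chars + 1) * term_freq * math.log(ratio + 1)
    return num_chars * term_freq * ratio
```

With either variant, a longer noun that appears often but is concentrated in few sentences receives a higher score than a short noun spread evenly over the text.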

The keyword extraction device according to claim 1, further comprising:
a character information database storing character information transmitted and received on the Internet;
selecting means for selecting, from among the nouns determined to be keywords by the determination means, the maximum-score noun having the highest score;
search means for searching the character information database for the maximum-score noun and the noun, and obtaining the search count of the maximum-score noun, the search count of the noun, and the search count for both the maximum-score noun and the noun together;
correction coefficient calculation means for calculating a correction coefficient based on the search count of the maximum-score noun, the search count of the noun, and the search count for both the maximum-score noun and the noun together; and
correction score calculation means for calculating a correction score based on the correction coefficient and the score calculated by the calculation means,
wherein the determination means determines whether or not the noun is to be treated as a keyword based on the correction score.

The keyword extraction device according to claim 3, wherein the correction coefficient calculation means divides the search count for both the maximum-score noun and the noun together by the square root of the product of the search count of the maximum-score noun and the search count of the noun, and uses the resulting quotient as the correction coefficient, and
wherein the determination means determines whether or not the noun is to be treated as a keyword based on the product of the correction coefficient and the score.
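The division and multiplication in this claim can be sketched as below. This is an illustrative reading only: the function and parameter names are hypothetical, the quotient is treated as the correction coefficient in line with claim 3, and returning 0.0 when either noun has no search results is an assumption the claim does not address.

```python
import math


def correction_coefficient(hits_max, hits_noun, hits_both):
    """Co-occurrence coefficient between the maximum-score noun and a candidate noun.

    hits_max  -- search count of the maximum-score noun
    hits_noun -- search count of the candidate noun
    hits_both -- search count for both nouns together
    """
    denominator = math.sqrt(hits_max * hits_noun)
    if denominator == 0:
        # Assumption: no search results means no evidence of relatedness.
        return 0.0
    return hits_both / denominator


def corrected_score(score, hits_max, hits_noun, hits_both):
    """Multiply the original keyword score by the correction coefficient."""
    return correction_coefficient(hits_max, hits_noun, hits_both) * score
```

Because the joint count cannot exceed either individual count, the coefficient lies between 0 and 1, so a noun that rarely co-occurs with the maximum-score noun has its score reduced.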

A keyword extraction method comprising:
a dividing step of dividing input text at punctuation marks;
a morpheme extraction step of extracting morphemes from the divided parts produced in the dividing step;
a noun extraction step of determining the part of speech of each morpheme extracted in the morpheme extraction step and extracting the morphemes determined to be nouns;
a calculation step of calculating, for each noun extracted in the noun extraction step, a score of the noun as a keyword based on the number of characters of the noun, the appearance frequency of the noun in the text, and the ratio of the total number of sentences in the text to the appearance frequency indicating over how many sentences the noun appears; and
a determination step of determining whether or not the noun is to be treated as a keyword based on the score resulting from the calculation.
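The sequence of steps in this method can be sketched end to end as below. This is a rough illustration, not the claimed implementation: `count_nouns`, `extract_keywords`, the delimiter set, and the fixed `threshold` are assumptions, and `extract_nouns` is a hypothetical caller-supplied function standing in for the morpheme extraction and part-of-speech determination, which in practice would need a morphological analyzer. The score here is the plain product without logarithms.

```python
import re
from collections import Counter


def count_nouns(text, extract_nouns, delimiters="。．.!?！？"):
    """Split text at punctuation and count nouns.

    extract_nouns -- hypothetical callable mapping one sentence to a list of nouns
    delimiters    -- assumed set of sentence-ending punctuation marks
    Returns (total_sentences, term_freq, sentence_freq).
    """
    pattern = "[" + re.escape(delimiters) + "]"
    sentences = [s.strip() for s in re.split(pattern, text) if s.strip()]
    term_freq = Counter()      # occurrences of each noun in the whole text
    sentence_freq = Counter()  # number of sentences containing each noun
    for sentence in sentences:
        nouns = extract_nouns(sentence)
        term_freq.update(nouns)
        sentence_freq.update(set(nouns))
    return len(sentences), term_freq, sentence_freq


def extract_keywords(text, extract_nouns, threshold):
    """Return {noun: score} for nouns whose score reaches the threshold."""
    total, term_freq, sentence_freq = count_nouns(text, extract_nouns)
    keywords = {}
    for noun, freq in term_freq.items():
        score = len(noun) * freq * (total / sentence_freq[noun])
        if score >= threshold:
            keywords[noun] = score
    return keywords
```

Passing the noun extractor in as a parameter keeps the sketch independent of any particular analyzer or language; how the threshold is chosen is left open here.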

A program for causing a computer to execute the method according to claim 5.