JP2022029461A - Keyword extraction device, keyword extraction method, and program

Info

Publication number: JP2022029461A
Application number: JP2021191504A
Authority: JP
Inventors: Shumpei Okura (大倉俊平); Shingo Hoshino (星野真吾)
Original assignee: Yahoo Japan Corp
Current assignee: Yahoo Japan Corp
Priority date: 2018-02-28
Filing date: 2021-11-25
Publication date: 2022-02-17
Anticipated expiration: 2038-02-28
Also published as: JP7297855B2

Abstract

PROBLEM TO BE SOLVED: To provide a keyword extraction device, a keyword extraction method, and a program that improve the efficiency with which users collect information.

SOLUTION: A keyword extraction device comprises a processing unit that extracts a keyword candidate contained in a document of interest as a keyword on the basis of the number of similar documents, among a plurality of similar documents similar to the document of interest, in which the keyword candidate appearing in the document of interest also appears.

SELECTED DRAWING: Figure 4

Description

The present invention relates to a keyword extraction device, a keyword extraction method, and a program.

Among documents distributed on the Internet, such as news articles, there are many related documents, such as follow-up articles, that presuppose matters covered in previously distributed documents. In this connection, a technique is known for distributing follow-up articles related to the subject matter of previously distributed articles (see, for example, Patent Document 1).

Patent Document 1: Japanese Unexamined Patent Publication No. 2005-242758

However, with the conventional technique, when a user searches for a document, related documents cannot always be retrieved accurately, and as a result the efficiency of the user's information collection may be reduced.

The present invention has been made in view of the above problem, and its object is to provide a keyword extraction device, a keyword extraction method, and a program capable of improving the efficiency of a user's information collection.

One aspect of the present invention is a keyword extraction device comprising a processing unit that extracts a keyword candidate included in a document of interest as a keyword, based on the number of similar documents, among a plurality of similar documents similar to the document of interest, in which the keyword candidate appearing in the document of interest appears.

According to one aspect of the present invention, the efficiency of a user's information collection can be improved.

FIG. 1 is a diagram showing an example of the information processing system 1 including the information processing device 100 in the first embodiment.
FIG. 2 is a diagram showing an example of a web page provided by the service providing device 20.
FIG. 3 is a diagram showing an example of related pages.
FIG. 4 is a diagram showing an example of the configuration of the information processing device 100 in the first embodiment.
FIG. 5 is a flowchart showing the flow of a series of processes performed by the information processing device 100 in the first embodiment.
FIG. 6 is a diagram showing an example of a document classification result.
FIG. 7 is a diagram showing an example of the evaluation result of the keyword extractor EX.
FIG. 8 is a diagram showing an example of the configuration of the information processing device 100A in the second embodiment.
FIG. 9 is a flowchart showing the flow of a series of processes performed by the information processing device 100A in the second embodiment.
FIG. 10 is a diagram showing an example of a scene in which keywords extracted by the keyword extractor EX are used.
FIG. 11 is a diagram showing an example of the hardware configuration of the information processing devices 100 and 100A of the embodiments.

Hereinafter, a keyword extraction device, a keyword extraction method, and a program to which the present invention is applied will be described with reference to the drawings.

[Overview]
The information processing device is realized by one or more processors. The information processing device acquires a plurality of documents in which related documents have been manually classified into the same group, together with the keywords extracted from those documents by a keyword extractor that extracts keywords having a predetermined feature within a document. Having acquired the documents and the keywords, the information processing device evaluates the performance of the keyword extractor based on the degree to which keywords match between documents in the same group. The higher the performance of the keyword extractor, the better the extracted keywords represent the original meaning and concept of the document. When a document search is performed using such keywords, documents related to the document from which the keyword extractor extracted the keywords can be retrieved. As a result, the efficiency of a user's information collection can be improved.

<First Embodiment>
[Overall Configuration]
FIG. 1 is a diagram showing an example of the information processing system 1 including the information processing device 100 in the first embodiment. The information processing system 1 in the first embodiment includes, for example, one or more terminal devices 10, a service providing device 20, and the information processing device 100. These devices are connected via a network NW.

Each device shown in FIG. 1 transmits and receives various information via the network NW. The network NW includes, for example, the Internet, a WAN (Wide Area Network), a LAN (Local Area Network), a provider terminal, a wireless communication network, a wireless base station, a dedicated line, and the like. Not every combination of the devices shown in FIG. 1 needs to be able to communicate with each other, and the network NW may partly include a local network.

The terminal device 10 is a terminal device that includes an input device, a display device, a communication device, a storage device, and an arithmetic unit, for example a mobile phone such as a smartphone, a tablet terminal, or any of various personal computers. The communication device includes a network card such as a NIC (Network Interface Card), a wireless communication module, and the like. On the terminal device 10, a UA (User Agent) such as a web browser or an application program is launched and transmits a request corresponding to the user's input to the service providing device 20. The terminal device 10 on which the UA has been launched also displays various images on the display device based on information acquired from the service providing device 20.

The service providing device 20 is, for example, a web server that provides a web page to the terminal device 10 in response to a request from a web browser. The web page is, for example, a web page that provides a search service (hereinafter referred to as a search page). The search page includes, for example, documents (text data) such as news articles, and content such as moving image data, still image data, and audio data. The service providing device 20 may also provide the terminal device 10 with web pages that offer various services such as Internet shopping, SNS (Social Networking Service), and mail services. The service providing device 20 may also be an application server that provides content to the terminal device 10 in response to a request from an application program.

For example, when the user enters a query on the search page displayed on the terminal device 10, the service providing device 20 provides the terminal device 10, as the search result for the query, with web pages whose documents contain a word or phrase corresponding to the query. When the user then selects a desired web page from the search results, the service providing device 20 includes keywords extracted from the selected web page in that web page and provides it to the terminal device 10. The service providing device 20 may use the keyword extractor EX described later to extract keywords from the web pages to be provided in advance, or may extract keywords from the web page to be provided at search time.

FIG. 2 is a diagram showing an example of a web page provided by the service providing device 20. As in the illustrated example, when a web page carrying a news article related to the Olympic Games is provided, the web page includes keywords KW extracted from the news article, such as "○○五輪" ("○○ Olympics," using the abbreviated Japanese term), "○○オリンピック" ("○○ Olympics"), and "□□□□選手" ("athlete □□□□"). Each keyword KW is linked to a URL (Uniform Resource Locator) for accessing the search results obtained with that keyword KW as the query. For this reason, the keyword KW preferably expresses the content of the document concisely and, when used as a query, preferably makes it easy to retrieve other documents. By providing the user with a web page that includes links to the search results for the keywords KW in this way, the user can also obtain information related to the information the user searched for. Hereinafter, a web page obtained by searching with a keyword KW as the query is referred to as a "related page."
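As a rough illustration of how a keyword KW might be linked to its search results, the following sketch builds a search URL from a keyword. The endpoint `https://search.example.com/search`, the query parameter name `q`, and the function name `keyword_link` are all assumptions made for this example; the source does not specify how the URL is formed.

```python
from urllib.parse import urlencode

# Hypothetical search endpoint; not taken from the source document.
SEARCH_ENDPOINT = "https://search.example.com/search"


def keyword_link(keyword: str) -> str:
    """Return a URL whose search results use `keyword` as the query."""
    # urlencode percent-encodes the keyword so that spaces and
    # non-ASCII characters are safe to embed in a URL.
    return f"{SEARCH_ENDPOINT}?{urlencode({'q': keyword})}"


print(keyword_link("Tokyo Olympics"))
```

A page renderer could attach such a URL to each extracted keyword so that selecting the keyword leads to the related pages.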

FIG. 3 is a diagram showing an example of related pages. The illustrated example shows the search results obtained when the keyword KW "○○五輪" illustrated in FIG. 2 is used as the query. Such search results list, as items, the title, URL, summary (snippet), image, and so on of each related page concerning "○○五輪." In the illustrated example, the topmost related page contains document A, the second related page contains document B, and the third related page contains document C. Each of these related pages contains a document, which is a set of words and phrases that are keyword candidates. A document contained in a related page (hereinafter, a related document) and the document from which the keyword was extracted have the property of sharing the same keyword (keyword or key-phrase sharing). The higher the degree of key-phrase sharing, that is, the more documents share the same keyword, the more related pages can be provided to the user.

The information processing device 100 evaluates the keyword extractor EX used by the service providing device 20 by comparing the keywords that the keyword extractor EX extracted from the individual documents with one another.

[Configuration of the Information Processing Device]
FIG. 4 is a diagram showing an example of the configuration of the information processing device 100 in the first embodiment. As illustrated, the information processing device 100 includes, for example, a communication unit 102, a control unit 110, and a storage unit 130.

The communication unit 102 includes, for example, a communication interface such as a NIC. The communication unit 102 communicates with the terminal device 10, the service providing device 20, and the like via the network NW.

The control unit 110 includes, for example, a keyword assignment unit 112, a document classification unit 114, and an extractor evaluation unit 116. These components are realized, for example, by a processor such as a CPU (Central Processing Unit) executing a program stored in the storage unit 130. Some or all of the components of the control unit 110 may instead be realized by hardware (circuitry) such as an LSI (Large Scale Integration), an ASIC (Application Specific Integrated Circuit), an FPGA (Field-Programmable Gate Array), or a GPU (Graphics Processing Unit), or by cooperation between software and hardware.

The storage unit 130 is realized by a storage device such as an HDD (Hard Disc Drive), a flash memory, an EEPROM (Electrically Erasable Programmable Read Only Memory), a ROM (Read Only Memory), or a RAM (Random Access Memory). In addition to various programs such as firmware and application programs, the storage unit 130 stores keyword extractor data 132 and document data 134.

The keyword extractor data 132 is information (a program) that defines what kind of extractor the keyword extractor EX is, and may, for example, define each of a plurality of keyword extractors EX. For example, the keyword extractor EX uses morphological analysis to divide a document containing a plurality of words and phrases into a plurality of morphemes that serve as keyword candidates, assigns a weight to each morpheme, or to each combination of morphemes, using a method that evaluates word occurrence frequency such as TF (Term Frequency)-IDF (Inverse Document Frequency), and extracts those with large weights as keywords. A weight based on TF-IDF is an example of the "predetermined feature."
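The TF-IDF weighting described above can be sketched as follows. This is a minimal illustration rather than the extractor defined by the keyword extractor data 132: it assumes that morphological analysis has already produced a token list for each document, it uses one common TF-IDF variant (relative term frequency multiplied by the logarithm of the inverse document frequency), and the function name `tfidf_keywords` and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_keywords(tokenized_docs, index, top_k=2):
    """Return the top_k tokens of tokenized_docs[index] ranked by TF-IDF weight."""
    n_docs = len(tokenized_docs)

    # Document frequency: in how many documents does each token occur?
    df = Counter()
    for tokens in tokenized_docs:
        df.update(set(tokens))

    target = tokenized_docs[index]
    tf = Counter(target)
    weights = {
        token: (count / len(target)) * math.log(n_docs / df[token])
        for token, count in tf.items()
    }

    # Largest weight first; ties are broken alphabetically for determinism.
    ranked = sorted(weights, key=lambda t: (-weights[t], t))
    return ranked[:top_k]


# Hypothetical, already-tokenized documents.
docs = [
    ["pitcher", "baseball", "game", "baseball"],
    ["striker", "soccer", "game"],
    ["game", "schedule"],
]
print(tfidf_keywords(docs, 0))
```

In this sample, "game" occurs in every document and therefore receives a weight of zero, which is the behavior that makes TF-IDF favor terms distinctive to one document.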

For example, the keyword extractor EX need not limit a keyword candidate to a single morpheme and may extract keywords of a predetermined length by allowing a predetermined number of morphemes (for example, three). The keyword extractor EX may also restrict candidate morphemes to specific parts of speech such as nouns, adjectives, and verbs. The keyword extractor EX may also restrict candidate morphemes to half-width or full-width characters or, for alphabetic characters, to lowercase. The keyword extractor EX may also convert the inflected form of a candidate part of speech into a predetermined inflected form. Specifically, when auxiliary verbs are treated as candidate morphemes, the keyword extractor EX may convert their inflection from the polite "desu/masu" style to the plain "de aru" style. Settings such as the keyword length, the permitted parts of speech, full-width versus half-width and uppercase versus lowercase characters, and the inflected form are treated as hyperparameters that the designer of the keyword extractor EX (for example, a system engineer) determines in advance.
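One way to picture these hyperparameters is as a small configuration object that controls candidate generation. The sketch below is illustrative only: the names `ExtractorConfig` and `candidates`, the part-of-speech labels, and the choice to join morphemes with a space (suitable for English sample data, whereas Japanese morphemes would normally be concatenated) are assumptions that do not come from the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractorConfig:
    """Hypothetical hyperparameters fixed by the extractor's designer."""
    max_morphemes: int = 3  # longest candidate, counted in morphemes
    allowed_pos: frozenset = frozenset({"noun", "adjective", "verb"})
    lowercase: bool = True  # normalize alphabetic characters to lowercase


def candidates(morphemes, config):
    """Build keyword candidates from (surface, part_of_speech) pairs."""
    result = []
    for start in range(len(morphemes)):
        for length in range(1, config.max_morphemes + 1):
            span = morphemes[start:start + length]
            if len(span) < length:
                break  # ran past the end of the document
            if any(pos not in config.allowed_pos for _, pos in span):
                break  # longer spans from this start would also be rejected
            text = " ".join(surface for surface, _ in span)
            result.append(text.lower() if config.lowercase else text)
    return result


# Hypothetical output of a morphological analyzer.
sample = [("Tokyo", "noun"), ("Olympic", "noun"), ("is", "auxiliary"), ("held", "verb")]
print(candidates(sample, ExtractorConfig()))
```

Changing a field such as `max_morphemes` or `allowed_pos` yields a different extractor, which is why several keyword extractors EX can be defined and compared.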

The document data 134 is data containing a plurality of documents (for example, 10,000 documents). It may contain, for example, the documents included in the web pages provided by the service providing device 20, or documents prepared separately. The document data 134 may comprehensively cover documents of various genres and themes, or may contain only documents of a specific genre or theme. The number of documents per genre or theme need not be equal; the data may be skewed, for example with many documents of one genre or theme and few of another.

[Processing Flow]
The flow of a series of processes performed by the information processing device 100 in the first embodiment is described below with reference to a flowchart. FIG. 5 is a flowchart showing the flow of a series of processes performed by the information processing device 100 in the first embodiment. The processing of this flowchart may, for example, be repeated at a predetermined cycle.

First, the keyword assignment unit 112 constructs (generates) one or more keyword extractors EX based on the keyword extractor data 132, causes each constructed keyword extractor EX to extract one or more keywords from each document included in the document data 134, and assigns the extracted keywords to the document from which they were extracted (S100). The keyword assignment unit 112 is an example of the "acquisition unit." The keyword extractor EX constructed by the keyword assignment unit 112, that is, the keyword extractor EX realized by the processor referring to the keyword extractor data 132, is an example of the "keyword extraction device."

Next, the document classification unit 114 controls the communication unit 102 to transmit the plurality of documents to which keywords have been assigned by the keyword assignment unit 112 to a predetermined terminal device 10, and requests the user of that terminal device 10 to group the documents (S102). The predetermined terminal device 10 may be, for example, the computer of a crowdsourcing participant. The user who is asked to group the documents, for example, reads the documents, classifies documents with related content into the same group, and transmits the classification result to the information processing device 100 using the terminal device 10.

FIG. 6 is a diagram showing an example of a document classification result. The document ID in the figure is the identification information of each of the documents for which grouping was requested. In the illustrated example, the document with document ID "DOC_A" (hereinafter, document A), the document with document ID "DOC_B" (hereinafter, document B), and the document with document ID "DOC_C" (hereinafter, document C) are classified into group X, which relates to baseball, and the document with document ID "DOC_D" (hereinafter, document D) and the document with document ID "DOC_E" (hereinafter, document E) are classified into group Y, which relates to soccer. Two keywords are extracted from each document; the number of keywords to extract is assumed to be fixed in advance as a hyperparameter.

Next, the extractor evaluation unit 116 waits until the communication unit 102 acquires the classification result for the plurality of documents from the predetermined terminal device 10 (S104). Once the communication unit 102 has acquired the classification result, the extractor evaluation unit 116 selects one document to focus on (hereinafter referred to as the document of interest) from among the plurality of documents classified into groups (the population) (S106).

Next, the extractor evaluation unit 116 compares the one or more keywords extracted from the selected document of interest with the one or more keywords extracted from the other documents classified into the same group as the document of interest, and derives an F value (F-measure) based on the degree to which these keywords match. The F value may be derived as the harmonic mean of precision and recall. For example, the extractor evaluation unit 116 derives the F value based on formula (1), the precision based on formula (2), and the recall based on formula (3).

$$F = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \tag{1}$$

$$\mathrm{precision} = \frac{R}{N} \tag{2}$$

$$\mathrm{recall} = \frac{R}{C} \tag{3}$$

In the above formulas, R is the number of other documents, among the one or more other documents classified into the same group as the document of interest, from which the same keyword as that of the document of interest was extracted; N is the number of other documents, among all documents submitted for grouping except the document of interest, from which the same keyword as that of the document of interest was extracted; and C is the number of other documents classified into the same group as the document of interest.

Given the classification result illustrated in FIG. 6 and taking document A as the document of interest, the extractor evaluation unit 116 derives the precision of the keyword "野球" (yakyū, the native Japanese word for baseball) extracted from document A as the quotient of R, the number of other documents in the same group as document A (B and C) from which the keyword "野球" was extracted, and N, the number of other documents regardless of group (B to E) from which the keyword "野球" was extracted. In the example of FIG. 6, N is 1 because the keyword "野球" is extracted from document D in group Y, and R is 0 because the keyword "野球" is not extracted from any other document in group X. The precision is therefore 0/1, that is, 0%.

The extractor evaluation unit 116 also derives the recall of the keyword "野球" extracted from document A as the quotient of R, the number of other documents in the same group as document A (B and C) from which the keyword "野球" was extracted, and C, the number of other documents classified into the same group as document A. In the example of FIG. 6, C is 2 because two documents besides document A are classified into group X, and R is 0 because the keyword "野球" is not extracted from either of those two documents. The recall is therefore 0/2, that is, 0%.

Similarly, the precision of the keyword "ベースボール" (bēsubōru, the loanword for baseball) extracted from document A is 100% because R is 2 and N is 2. The recall of the keyword "ベースボール" extracted from document A is also 100% because R is 2 and C is 2.

The extractor evaluation unit 116 derives an F value for each keyword extracted from document A, the selected document of interest. The F value of each keyword extracted from the document of interest is an evaluation value of the performance of the keyword extractor EX for that keyword. In the numerical example above, the keyword "野球" has a precision of 0% and a recall of 0%, so its F value, (2 × 0% × 0%) / (0% + 0%), is taken to be 0%. The keyword "ベースボール" has a precision of 100% and a recall of 100%, so its F value is (2 × 100% × 100%) / (100% + 100%) = 100%.

The extractor evaluation unit 116 then derives the F value of the document of interest by averaging the F values of its keywords. The F value of the document of interest is an evaluation value of the performance of the keyword extractor EX for that document. In the numerical example above, the F value of document A is (0% + 100%) / 2 = 50%.

Next, the extractor evaluation unit 116 determines whether every document in the population has been selected as the document of interest (S110). If not all documents have been selected yet, it changes the document of interest and repeats the processing of S106 and S108.

For example, when the document of interest is changed from document A to document E, the extractor evaluation unit 116 derives the precision of the keyword "サッカー" (soccer) extracted from document E as 50% because R is 1 and N is 2, and derives the recall of that keyword as 100% because R is 1 and C is 1. The extractor evaluation unit 116 derives the F value of the keyword "サッカー" in document E as (2 × 50% × 100%) / (50% + 100%) ≈ 66.7%.

The extractor evaluation unit 116 also derives the precision of the keyword "野球" extracted from document E as 0% because R is 0 and N is 1, and derives the recall of that keyword as 0% because R is 0 and C is 1. The F value of the keyword "野球" in document E, (2 × 0% × 0%) / (0% + 0%), is accordingly taken to be 0%. The extractor evaluation unit 116 then derives 33.3%, the average of the F values of the keywords of document E, as the F value of document E.

このように、抽出器評価部１１６は、着目文書を変更しながら、母集団に含まれる全ての文書のＦ値を求めることを繰り返す。 In this way, the extractor evaluation unit 116 repeatedly obtains the F value of all the documents included in the population while changing the document of interest.

次に、抽出器評価部１１６は、母集団に含まれる全ての文書のＦ値に基づいて、キーワード抽出器ＥＸを評価する（Ｓ１１２）。例えば、抽出器評価部１１６は、文書のＦ値をグループ毎に平均し、グループ毎に求めたＦ値の平均値を更に平均した値を、母集団に対するキーワード抽出器ＥＸの性能を評価した評価値として導出する。 Next, the extractor evaluation unit 116 evaluates the keyword extractor EX based on the F values of all the documents included in the population (S112). For example, the extractor evaluation unit 116 averages the document F values within each group, then averages those per-group averages, and derives the result as the evaluation value of the performance of the keyword extractor EX on the population.

図７は、キーワード抽出器ＥＸの評価結果の一例を示す図である。図示の例では、複数のキーワード抽出器ＥＸの其々についての評価結果を表している。図示の例のように、グループＸに分類された文書Ａ、Ｂ、Ｃの其々のＦ値は、５０［％］である場合、抽出器評価部１１６は、３つの文書のＦ値の平均値である５０［％］を、グループＸに対するキーワード抽出器ＥＸの性能を評価した評価値として導出する。また、グループＹに分類された文書Ｄ、Ｅの其々のＦ値は、３３［％］である場合、抽出器評価部１１６は、２つの文書のＦ値の平均値である３３［％］を、グループＹに対するキーワード抽出器ＥＸの性能を評価した評価値として導出する。 FIG. 7 is a diagram showing an example of the evaluation result of the keyword extractor EX. The illustrated example shows the evaluation results for each of a plurality of keyword extractors EX. As in the illustrated example, when the F value of each of documents A, B, and C classified into group X is 50 [%], the extractor evaluation unit 116 derives 50 [%], the average of the F values of the three documents, as the evaluation value of the performance of the keyword extractor EX on group X. Similarly, when the F value of each of documents D and E classified into group Y is 33 [%], the extractor evaluation unit 116 derives 33 [%], the average of the F values of the two documents, as the evaluation value of the performance of the keyword extractor EX on group Y.

そして、抽出器評価部１１６は、グループＸのＦ値とグループＹのＦ値との平均（（５０＋３３）／２）である４２［％］を、母集団に対するキーワード抽出器ＥＸの性能を評価した評価値として導出する。 The extractor evaluation unit 116 then derives 42 [%], the average ((50 + 33) / 2) of the F value of group X and the F value of group Y, as the evaluation value of the performance of the keyword extractor EX on the population.
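The two-stage averaging of S112 can be sketched as follows, using the values quoted for FIG. 7. The function name `population_score` is a hypothetical helper, not a name from the description. The exact average is 41.5, which the text reports rounded to 42 [%].

```python
from statistics import mean


def population_score(group_f_values: dict):
    """Average document F values per group, then average the group means."""
    group_means = {group: mean(values) for group, values in group_f_values.items()}
    return mean(list(group_means.values())), group_means


# Document F values [%] from the FIG. 7 example.
figure7 = {"X": [50, 50, 50], "Y": [33, 33]}
score, per_group = population_score(figure7)
print(per_group, score)
```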

次に、抽出器評価部１１６は、通信部１０２を制御して、キーワード抽出器ＥＸの評価結果（例えば母集団に対するＦ値）を、サービス提供装置２０に送信する（Ｓ１１４）。これに受けて、サービス提供装置２０は、例えば、複数のキーワード抽出器ＥＸが存在する場合、Ｆ値が最も大きいキーワード抽出器ＥＸを利用して、ウェブページなどからキーワードを抽出する。この結果、関連ページの検索に利用可能な汎用的なキーワード、すなわち文書間での共有性が高いキーワードが抽出されやすくなるため、より多くの関連ページをユーザに提供することができる。 Next, the extractor evaluation unit 116 controls the communication unit 102 to transmit the evaluation result of the keyword extractor EX (for example, the F value for the population) to the service providing device 20 (S114). In response to this, the service providing device 20 extracts a keyword from a web page or the like by using the keyword extractor EX having the largest F value, for example, when a plurality of keyword extractors EX exist. As a result, it becomes easy to extract general-purpose keywords that can be used for searching related pages, that is, keywords that are highly shared between documents, so that more related pages can be provided to the user.

以上説明した第１実施形態によれば、関連する文書同士が人手によって同じグループに分類された複数の文書と、キーワード抽出器ＥＸによって文書から抽出されたキーワードとを取得し、グループ内の文書間のキーワードの一致度合に基づいて、キーワード抽出器ＥＸの性能を評価するため、性能が良いキーワード抽出器ＥＸを利用することができ、文書間での共有性が高いキーワードを抽出することができる。これによって、ユーザが文書を検索したときに、その文書に関連した関連文書を容易に検索することができ、ユーザが検索した文書により関連し、且つより多くの関連文書を提供することができる。この結果、ユーザの情報収集の効率を向上させることができる。 According to the first embodiment described above, a plurality of documents in which related documents are manually classified into the same group and keywords extracted from the documents by the keyword extractor EX are acquired, and between the documents in the group. Since the performance of the keyword extractor EX is evaluated based on the degree of matching of the keywords in the above, the keyword extractor EX with good performance can be used, and keywords with high commonality between documents can be extracted. Thereby, when the user searches for a document, the related document related to the document can be easily searched, and the document searched by the user can be more related and more related documents can be provided. As a result, the efficiency of user information collection can be improved.

一般的に、キーワード抽出器ＥＸは、予め、人間がこういった文書であればこういったキーワードが抽出される、という正解データを用意しておき、その正解データと、キーワード抽出器ＥＸが抽出したキーワードとに基づいて、教師あり学習がなされる。このような場合、仮に、図６に例示した文書を想定した場合、人間が、グループＸに分類された文書の正解データ（正解キーワード）を「野球」とした場合、キーワード抽出器ＥＸによって「ベースボール」というキーワードが抽出された場合、そのキーワードは不正解となる。同様に、人の名前のフルネーム（氏名）を正解データとした場合、「名字」だけをキーワードとして抽出したり、「名前」だけをキーワードとして抽出したりした場合、それらは不正解となる。 In general, a keyword extractor EX is trained by supervised learning: a human prepares correct answer data in advance that specifies which keywords should be extracted from which kind of document, and training is performed based on that correct answer data and the keywords extracted by the keyword extractor EX. In such a case, taking the documents illustrated in FIG. 6 as an example, if a human sets the correct answer data (correct keyword) for the documents classified into group X to "野球" (baseball) and the keyword extractor EX extracts the keyword "ベースボール" (the katakana loanword for baseball), that keyword is judged incorrect. Similarly, if a person's full name is used as the correct answer data, extracting only the family name or only the given name as a keyword is judged incorrect.

これに対して、上述した実施形態では、人間が正解データとして定めたキーワードと、キーワード抽出器ＥＸが抽出したキーワードとを比較するのではなく、人間が定めたグループ内でキーワード抽出器ＥＸが抽出したキーワード同士を比較するため、人間が定めた正解データの意味的な揺れに左右されずに、同じグループに分類された文書間でキーワードが同じであるのか異なっているのかという観点でキーワード抽出器ＥＸを評価することができる。 In contrast, the embodiment described above does not compare keywords defined by a human as correct answer data with the keywords extracted by the keyword extractor EX. Instead, it compares the keywords extracted by the keyword extractor EX with one another within a group defined by a human. The keyword extractor EX can therefore be evaluated from the viewpoint of whether documents classified into the same group have the same or different keywords, without being affected by semantic variation in human-defined correct answer data.

また、例えば、複数の単語を組み合わせた比較的長いキーワードをキーワード抽出器ＥＸが抽出するようにハイパーパラメータが決定されている場合、学習データもまた、キーワード抽出器ＥＸが抽出するキーワードの長さに合わせる必要がある。この場合、ハイパーパラメータを変更して、キーワード抽出器ＥＸに抽出させるキーワードの長さを調整した場合、学習データをその都度変える必要があり、学習データの作成コストが大きくなりやすい。 Further, for example, when the hyperparameters are determined so that the keyword extractor EX extracts a relatively long keyword that combines a plurality of words, the learning data is also set to the length of the keyword extracted by the keyword extractor EX. It needs to be matched. In this case, when the hyperparameters are changed to adjust the length of the keyword to be extracted by the keyword extractor EX, it is necessary to change the learning data each time, and the cost of creating the learning data tends to increase.

これに対して、上述した実施形態では、人間が定めたグループ内でキーワード抽出器ＥＸが抽出したキーワード同士を比較するため、ハイパーパラメータを変更してキーワード抽出器ＥＸに抽出させるキーワードの長さを変更したとしても、比較対象とするキーワード同士が共通して同じ長さとなり、更にグループ分け自体は変更されないため、学習データの作成コストを削減することができる。 On the other hand, in the above-described embodiment, in order to compare the keywords extracted by the keyword extractor EX within the group defined by humans, the length of the keywords to be extracted by the keyword extractor EX by changing the hyperparameters is set. Even if they are changed, the keywords to be compared have the same length in common, and the grouping itself is not changed, so that the cost of creating learning data can be reduced.

また、人間によって決められた正解データに対して、抽出するキーワードが近づくようにキーワード抽出器ＥＸを学習する場合、正解データとして指定する全てのキーワードに対して、半角文字や小文字に統一したり、文末の助動詞の活用を「です、ます」調から、「である」調に変換したりするような前処理を行う必要がある。 Further, when the keyword extractor EX is trained so that the extracted keywords approach correct answer data determined by a human, preprocessing must be applied to every keyword specified as correct answer data, such as normalizing to half-width characters or lowercase, or converting sentence-final auxiliary verbs from the polite "desu/masu" style to the plain "de aru" style.

これに対して、上述した実施形態では、キーワード抽出器ＥＸによって抽出されるキーワードの長さや各品詞の活用形を予めハイパーパラメータとして定義しておくだけで、上記のような前処理を省略することができる。 In contrast, in the embodiment described above, such preprocessing can be omitted simply by defining in advance, as hyperparameters, the length of the keywords extracted by the keyword extractor EX and the inflected form of each part of speech.

このように、上述した実施形態によれば、複数の文書を事前にグループ分けするだけで、文書ごとに正解データを作成する必要がなくなり、学習に要するコスト（作業負担など）を削減することができる。また、上述した実施形態によれば、抽出すべきキーワードが、漢字がよいのか、英字などの外来語（横文字）がよいのか、フルネームがよいのか、といった種々のコンセプトについて考慮する必要がなくなる。また、上述した実施形態によれば、同じグループの他文書から抽出されるキーワードを正解データとするため、その文書に特有（固有）のキーワード（例えば、文書作成者が作った造語など）が含まれている場合、同じグループの他文書からも特有のキーワードが抽出されなければＦ値が小さくなるため、文書特有のキーワードを抽出しやすいキーワード抽出器ＥＸほど利用され難くなり、文書間での共有性が高いキーワードを抽出しやすいキーワード抽出器ＥＸほど利用され易くなる。 Thus, according to the embodiment described above, simply grouping a plurality of documents in advance removes the need to create correct answer data for each document, which reduces the cost of learning (such as workload). There is also no need to weigh design choices such as whether the keywords to be extracted should be written in kanji or as loanwords (foreign-derived terms), or whether full names should be used. Furthermore, because keywords extracted from other documents in the same group serve as the correct answer data, when a document contains keywords peculiar to it (for example, words coined by the author), the F value becomes small unless the same peculiar keywords are also extracted from other documents in the same group. Consequently, a keyword extractor EX that tends to extract document-specific keywords is less likely to be used, and one that tends to extract keywords highly shared between documents is more likely to be used.

また、上述した実施形態によれば、グループ毎に文書のＦ値の平均を求めるため、母集団のグループ間でのサンプル数（文書数）の偏りの影響を抑制することができる。例えば、特定のジャンル或いはテーマの文書が多く、他のジャンル或いはテーマの文書が少ない、といったような偏りが生じている場合、グループ単位ではなく、全ての文書でＦ値を平均した場合、サンプル数が多いグループのＦ値が全体の評価値に大きく反映され、サンプル数が多いグループに対してキーワードの抽出精度が高くなるようにキーワード抽出器が学習される傾向にある。これに対して、上述した実施形態では、先にグループ毎にＦ値の平均をとることで、グループ間のサンプル数の差をなくしてから、キーワード抽出器ＥＸを学習することができる。この結果、どのグループからも、文書間での共有性が高いキーワードを精度良く抽出することができる。 Further, according to the embodiment described above, because the average of the document F values is obtained for each group, the influence of imbalance in the number of samples (documents) between groups of the population can be suppressed. For example, suppose that documents of a particular genre or theme are numerous while documents of other genres or themes are few. If the F value were averaged over all documents rather than per group, the F values of groups with many samples would dominate the overall evaluation value, and the keyword extractor would tend to be trained so that extraction accuracy is high for those groups. In contrast, in the embodiment described above, averaging the F values per group first removes the difference in sample counts between groups before the keyword extractor EX is trained. As a result, keywords that are highly shared between documents can be extracted accurately from any group.
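The effect of group imbalance described above can be illustrated with a small Python sketch. The group names and F values are made-up illustrative numbers, not data from the description, and both helper functions are hypothetical.

```python
def per_group_average(groups: dict) -> float:
    """Average within each group first, then across groups (the embodiment's approach)."""
    group_means = [sum(values) / len(values) for values in groups.values()]
    return sum(group_means) / len(group_means)


def all_documents_average(groups: dict) -> float:
    """Average over every document, ignoring group boundaries."""
    flat = [f for values in groups.values() for f in values]
    return sum(flat) / len(flat)


# Illustrative only: eight documents in one group, two in the other.
skewed = {"large_group": [90.0] * 8, "small_group": [10.0] * 2}
flat_score = all_documents_average(skewed)  # dominated by the large group
grouped_score = per_group_average(skewed)   # both groups count equally
print(flat_score, grouped_score)
```

With these numbers the all-document average is 74.0 while the per-group average is 50.0, so the weak result on the small group is no longer hidden by the large group.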

＜第２実施形態＞ <Second Embodiment>
以下、第２実施形態について説明する。第２実施形態では、キーワードの抽出対象となる文書に類似する複数の類似文書のうち、キーワードの抽出対象となる文書に出現するキーワードの候補が出現する類似文書の数に基づいて、キーワードの抽出対象となる文書からキーワードを抽出する点で上述した第１実施形態と相違する。以下、第１実施形態との相違点を中心に説明し、第１実施形態と共通する点については説明を省略する。なお、第２実施形態の説明において、第１実施形態と同じ部分については同一符号を付して説明する。 Hereinafter, the second embodiment will be described. The second embodiment differs from the first embodiment in that keywords are extracted from the keyword extraction target document based on the number of similar documents, among a plurality of similar documents similar to that document, in which a keyword candidate appearing in that document also appears. The following description focuses on the differences from the first embodiment and omits points common to the first embodiment. In the description of the second embodiment, parts that are the same as in the first embodiment are given the same reference numerals.

図８は、第２実施形態における情報処理装置１００Ａの構成の一例を示す図である。図示のように、情報処理装置１００Ａは、例えば、通信部１０２と、制御部１１０Ａと、記憶部１３０Ａとを備える。 FIG. 8 is a diagram showing an example of the configuration of the information processing apparatus 100A according to the second embodiment. As shown in the figure, the information processing apparatus 100A includes, for example, a communication unit 102, a control unit 110A, and a storage unit 130A.

第２実施形態における制御部１１０Ａは、例えば、上述したキーワード付与部１１２と、文書分類部１１４と、抽出器評価部１１６とに加えて、更に、類似文書選択部１１８と、学習処理部１２０とを備える。 The control unit 110A in the second embodiment includes, for example, a similar document selection unit 118 and a learning processing unit 120 in addition to the keyword assignment unit 112, the document classification unit 114, and the extractor evaluation unit 116 described above.

第２実施形態における記憶部１３０Ａには、ファームウェアやアプリケーションプログラムなどの各種プログラムと、キーワード抽出器データ１３２と、文書データ１３４とに加えて、更に、類似文書データ１３６が格納される。 In addition to various programs such as firmware and application programs, keyword extractor data 132, and document data 134, the storage unit 130A in the second embodiment further stores similar document data 136.

類似文書データ１３６は、キーワードの抽出対象となる文書（文書データ１３４に含まれる各文書）に類似し得る複数の文書を含むデータである。文書同士が「類似する」とは、比較対象とする其々の文書をベクトル化したときに、あるベクトル空間において、それらの各文書のベクトルが互いに近い関係であることをいう。 The similar document data 136 is data containing a plurality of documents that may be similar to a document from which keywords are to be extracted (each document included in the document data 134). Documents are "similar" when, after each document being compared is vectorized, their vectors are close to one another in some vector space.

［処理フロー］ [Processing flow]
以下、第２実施形態における情報処理装置１００Ａによる一連の処理の流れをフローチャートに即して説明する。図９は、第２実施形態における情報処理装置１００Ａによる一連の処理の流れを示すフローチャートである。本フローチャートの処理は、例えば、所定の周期で繰り返し行われてもよい。 Hereinafter, the flow of a series of processes by the information processing apparatus 100A in the second embodiment will be described with reference to a flowchart. FIG. 9 is a flowchart showing the flow of a series of processes by the information processing apparatus 100A in the second embodiment. The processing of this flowchart may be repeated, for example, at a predetermined cycle.

まず、類似文書選択部１１８は、文書データ１３４に含まれる複数の文書のうち、キーワード抽出器ＥＸにキーワードを抽出させる対象の文書（以下、キーワード抽出対象文書と称する）と類似する類似文書を、類似文書データ１３６に含まれる複数の文書の中から選択する（Ｓ２００）。キーワード抽出対象文書は、「着目文書」の他の例である。 First, the similar document selection unit 118 selects a similar document similar to a target document (hereinafter referred to as a keyword extraction target document) for which a keyword is extracted by the keyword extractor EX among a plurality of documents included in the document data 134. Select from a plurality of documents included in the similar document data 136 (S200). The keyword extraction target document is another example of the “document of interest”.

例えば、類似文書選択部１１８は、キーワード抽出対象文書に含まれる各単語の出現頻度などの統計量を各要素とする多次元ベクトルを、キーワード抽出対象文書をベクトル化したキーワード抽出対象文書ベクトルとして生成する。また、類似文書選択部１１８は、ある着目する単語の前後に出現する単語を予測するタスクを学習するｗｏｒｄ２ｖｅｃやｄｏｃ２ｖｅｃといったアルゴリズムを利用したり、他の既存の手法を利用したりすることで、キーワード抽出対象文書ベクトルを生成してもよい。 For example, the similar document selection unit 118 generates, as the keyword extraction target document vector, a multidimensional vector whose elements are statistics such as the frequency of occurrence of each word in the keyword extraction target document. The similar document selection unit 118 may also generate the keyword extraction target document vector with algorithms such as word2vec or doc2vec, which learn the task of predicting the words that appear around a given word of interest, or with other existing methods.

類似文書選択部１１８は、生成したキーワード抽出対象文書ベクトルと、類似文書データ１３６に含まれる、類似文書の候補となる各文書のベクトル（以下、類似文書候補ベクトルと称する）との類似度を導出する。類似文書候補ベクトルは、上述したキーワード抽出対象文書ベクトルの生成手法を利用して予め生成されているものとする。 The similar document selection unit 118 derives the degree of similarity between the generated keyword extraction target document vector and the vector of each document in the similar document data 136 that is a candidate similar document (hereinafter referred to as a similar document candidate vector). The similar document candidate vectors are assumed to have been generated in advance by the same method used to generate the keyword extraction target document vector.

例えば、類似文書選択部１１８は、キーワード抽出対象文書ベクトルと類似文書候補ベクトルとのコサイン類似度を導出し、複数の類似文書候補ベクトルのうち、キーワード抽出対象文書ベクトルとのコサイン類似度が大きい上位所定数（例えば１０個）の類似文書候補ベクトルを抽出したり、キーワード抽出対象文書ベクトルとのコサイン類似度が閾値以上の全ての類似文書候補ベクトルを抽出したりする。そして、類似文書選択部１１８は、抽出した類似文書候補ベクトルの元となった文書を、類似文書として選択する。 For example, the similar document selection unit 118 derives the cosine similarity between the keyword extraction target document vector and each similar document candidate vector, and extracts either a predetermined number (for example, 10) of the similar document candidate vectors with the highest cosine similarity, or all similar document candidate vectors whose cosine similarity is equal to or greater than a threshold. The similar document selection unit 118 then selects the documents from which the extracted similar document candidate vectors were generated as the similar documents.
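The selection in S200 can be sketched in Python as follows. This is a minimal sketch under stated assumptions: documents are reduced to term-frequency vectors over whitespace-separated tokens (Japanese text would first need a morphological analyzer), and the function names, document names, and sample texts are all hypothetical.

```python
import math
from collections import Counter


def to_vector(text: str) -> Counter:
    """Term-frequency vector; assumes whitespace-separated tokens."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_similar(target, candidates, top_k=None, threshold=None):
    """Keep candidates above a threshold and/or the top_k most similar ones."""
    scored = sorted(
        ((cosine_similarity(target, vec), name) for name, vec in candidates.items()),
        reverse=True,
    )
    if threshold is not None:
        scored = [(s, name) for s, name in scored if s >= threshold]
    if top_k is not None:
        scored = scored[:top_k]
    return [name for _, name in scored]


target = to_vector("soccer match goal soccer")
candidates = {
    "d1": to_vector("soccer goal keeper"),
    "d2": to_vector("baseball pitcher home run"),
    "d3": to_vector("soccer match goal report"),
}
top_two = select_similar(target, candidates, top_k=2)
above_threshold = select_similar(target, candidates, threshold=0.8)
print(top_two, above_threshold)
```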

次に、キーワード付与部１１２は、キーワード抽出器ＥＸに対して、キーワード抽出対象文書に出現するある単語Ｘが出現した類似文書の数をカウントさせ、そのカウントさせた数に基づいてＴＦ‐ＩＤＦを計算させ、キーワード抽出対象文書に含まれる各キーワードの候補の単語や語句に重みを付与させる（Ｓ２０２）。 Next, the keyword assignment unit 112 causes the keyword extractor EX to count the number of similar documents in which a word X appearing in the keyword extraction target document appears, to calculate TF-IDF based on that count, and to assign a weight to each candidate keyword word or phrase in the keyword extraction target document (S202).

第２実施形態におけるキーワード抽出器ＥＸは、例えば、数式（４）に基づいて、キーワード抽出対象文書ごとにＴＦ‐ＩＤＦを計算する。 The keyword extractor EX in the second embodiment calculates TF-IDF for each keyword extraction target document based on, for example, the mathematical formula (4).

キーワード抽出器ＥＸは、複数の類似文書のうち、キーワード抽出対象文書に出現する単語Ｘが出現する類似文書の数を、全類似文書の数で除算した割合を求め、更に、その割合を、類似文書問わず類似文書データ１３６に含まれる全文書のうち、キーワード抽出対象文書に出現する単語Ｘが出現する文書数の対数値で除算することで、単語ＸについてのＴＦ‐ＩＤＦを導出する。キーワード抽出器ＥＸは、単語Ｘを変更しながら、キーワード抽出対象文書に含まれる各キーワード候補についてＴＦ‐ＩＤＦを導出する。このような処理によって、キーワードを付与したい文書と、その文書に類似する類似文書との双方では出現し易く、それら以外の他文書では出現し難い単語Ｘほど、重みを大きくすることができる。 The keyword extractor EX obtains the ratio of the number of similar documents, among the plurality of similar documents, in which the word X appearing in the keyword extraction target document appears to the total number of similar documents. It then derives the TF-IDF for the word X by dividing that ratio by the logarithm of the number of documents in which the word X appears among all documents in the similar document data 136, whether or not they are similar documents. The keyword extractor EX derives the TF-IDF for each keyword candidate in the keyword extraction target document while changing the word X. Through this processing, a larger weight is given to a word X that tends to appear both in the document to be given keywords and in the documents similar to it, and rarely in other documents.
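Equation (4) itself is not reproduced in this text. Based only on the verbal description above, it can plausibly be written as follows. The symbols are assumptions introduced here: S is the set of selected similar documents, D is the set of all documents in the similar document data 136, and d ∋ X means that the word X appears in document d. The base of the logarithm is not stated in the description.

```latex
\mathrm{TFIDF}(X) \;=\;
\frac{\dfrac{\left|\{\, d \in S : d \ni X \,\}\right|}{|S|}}
     {\log \left|\{\, d \in D : d \ni X \,\}\right|}
```

Note that the denominator is zero when exactly one document in D contains X. The description does not say how this case is handled, so any smoothing term would be an additional assumption.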

一般的なＴＦ-ＩＤＦは、キーワードを付与したい文書では出現し易く、類似文書を含む他文書では出現し難い単語Ｘほど重みを大きくするものである。そのため、キーワード抽出対象文書に関して特有の単語や語句がキーワードとして抽出されやすい。特有の単語や語句とは、例えば、その文書において特有の言い回しの表現や、文書作成者が作った造語などである。このような特有の単語や語句は、他の単語や語句と比べてＩＤＦが大きくなるため、キーワードとして抽出されやすく、仮に、このキーワードを文書検索に利用した場合、キーワードの抽出元の文書に類似した文書を検索することが難しい場合がある。 Conventional TF-IDF gives a larger weight to a word X that tends to appear in the document to be given keywords and rarely appears in other documents, including similar documents. As a result, words and phrases peculiar to the keyword extraction target document tend to be extracted as keywords. Such peculiar words and phrases include, for example, turns of phrase specific to that document and words coined by its author. Because these have a larger IDF than other words and phrases, they are readily extracted as keywords, and if such a keyword is used for document search, it may be difficult to find documents similar to the document from which the keyword was extracted.

これに対して、本実施形態では、ＴＦ－ＩＤＦの分子式を、単語が自文書で何回出現したかということから、複数の類似文書のうち、どの程度の類似文書に自文書に含まれる単語が含まれているのかということに置き換えるため、より文書間での共有性が高いキーワードを抽出することができる。 In contrast, in the present embodiment, the numerator of TF-IDF is changed from how many times a word appears in the document itself to the proportion of the plurality of similar documents that contain the word, so that keywords more highly shared between documents can be extracted.

次に、キーワード付与部１１２は、キーワード抽出器ＥＸに、計算させたＴＦ‐ＩＤＦを基に、文書データ１３４に含まれる各文書から一以上のキーワードを抽出させ、そのキーワードを抽出元の文書に付与する（Ｓ２０４）。 Next, the keyword assignment unit 112 causes the keyword extractor EX to extract one or more keywords from each document in the document data 134 based on the calculated TF-IDF, and assigns those keywords to the documents from which they were extracted (S204).
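Steps S202 and S204 can be sketched in Python as follows. This is an illustration under stated assumptions, not the patented implementation: documents are modeled as sets of words, the function names and sample data are hypothetical, and 1 is added inside the logarithm so that the denominator never becomes zero (the description does not specify how that case is handled).

```python
import math


def shared_tfidf(word, similar_docs, all_docs):
    """Weight = (share of similar docs containing word) / log(document count)."""
    if not similar_docs:
        return 0.0
    similar_hits = sum(1 for doc in similar_docs if word in doc)
    all_hits = sum(1 for doc in all_docs if word in doc)
    ratio = similar_hits / len(similar_docs)
    # Assumption: log(1 + count) instead of log(count) to avoid log(1) == 0.
    denominator = math.log(1 + all_hits)
    return ratio / denominator if denominator > 0.0 else 0.0


def extract_keywords(target_words, similar_docs, all_docs, top_k=2):
    """Return the top_k candidates of the target document by weight."""
    weights = {w: shared_tfidf(w, similar_docs, all_docs) for w in set(target_words)}
    ranked = sorted(weights, key=lambda w: (-weights[w], w))
    return ranked[:top_k]


# Hypothetical product descriptions, each reduced to a set of words.
similar_docs = [{"lcd", "tv", "4k"}, {"lcd", "tv", "hdr"}, {"lcd", "monitor"}]
other_docs = [{"camera", "lens"}, {"laptop", "inch"}, {"sofa"}]
all_docs = similar_docs + other_docs
target_words = ["abcdef-24", "lcd", "tv", "inch"]

keywords = extract_keywords(target_words, similar_docs, all_docs, top_k=2)
model_number_weight = shared_tfidf("abcdef-24", similar_docs, all_docs)
print(keywords, model_number_weight)
```

In this toy data the model number appears in no similar document, so its weight is zero, while the words shared with the similar documents rank first.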

以降のＳ２０６の処理からＳ２１６の処理は、上述したＳ１０２の処理からＳ１１２の処理と同じであるため説明を省略する。 Since the subsequent processes from S206 to S216 are the same as the processes from S102 to S112 described above, the description thereof will be omitted.

次に、学習処理部１２０は、抽出器評価部１１６によるキーワード抽出器ＥＸの評価結果に基づいて、キーワード抽出器ＥＸのハイパーパラメータを学習（決定）する（Ｓ２１８）。例えば、学習処理部１２０は、キーワード抽出器ＥＸのＦ値が大きくなるように、ＴＦ‐ＩＤＦを計算する際に参照する類似文書の数（上述した所定数）や、ベクトル同士の類似度を導出手法、抽出するキーワードの長さ、キーワードの品詞、といったハイパーパラメータを決定する。 Next, the learning processing unit 120 learns (determines) the hyperparameters of the keyword extractor EX based on the evaluation result of the keyword extractor EX by the extractor evaluation unit 116 (S218). For example, the learning processing unit 120 determines hyperparameters such as the number of similar documents referred to when calculating TF-IDF (the predetermined number described above), the method of deriving the similarity between vectors, the length of the keywords to extract, and the part of speech of the keywords, so that the F value of the keyword extractor EX becomes large.
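The description does not state how the hyperparameters are searched. One simple possibility, shown here purely as an assumption, is an exhaustive grid search that keeps the setting with the largest F value. The parameter names, the grid, and `toy_evaluate` (a stand-in for running S200 to S216 and returning the population F value) are all hypothetical.

```python
from itertools import product


def grid_search(grid: dict, evaluate):
    """Try every combination in the grid and keep the best-scoring one."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_evaluate(params: dict) -> float:
    """Placeholder score; a real evaluator would return the measured F value."""
    return (
        50.0
        - abs(params["num_similar_docs"] - 10)
        - 5.0 * abs(params["max_keyword_length"] - 2)
    )


grid = {"num_similar_docs": [5, 10, 20], "max_keyword_length": [1, 2, 3]}
best_params, best_score = grid_search(grid, toy_evaluate)
print(best_params, best_score)
```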

また、学習処理部１２０は、類似文書が与えられなくても、上述した手法で得られたキーワードが抽出できるように、キーワード抽出器ＥＸを学習してもよい。より具体的には、学習処理部１２０は、Ｓ２０４の処理で得られたキーワードを正解データとして、キーワード抽出器ＥＸを教師あり学習する。これによって、類似文書を予め用意しておかなくとも、文書間での共有性が高いキーワードを精度良く抽出することができる。 The learning processing unit 120 may also train the keyword extractor EX so that the keywords obtained by the method described above can be extracted even when no similar documents are given. More specifically, the learning processing unit 120 trains the keyword extractor EX by supervised learning, using the keywords obtained in the processing of S204 as correct answer data. This makes it possible to accurately extract keywords highly shared between documents without preparing similar documents in advance.

なお、上述した説明では、キーワード抽出器ＥＸが、キーワード抽出対象文書に出現する単語Ｘが類似文書にも出現する回数をカウントするものとして説明したがこれに限られない。例えば、キーワード抽出器ＥＸは、キーワード抽出対象文書により類似する類似文書ほど（類似度が大きい類似文書ほど）、ＴＦ-ＩＤＦの分子式の寄与度を大きくしてよい。例えば、類似文書として、文書Ｘ、Ｙ、Ｚが存在する場合、数式（５）に基づいて、ＴＦ-ＩＤＦを求めてよい。 In the above description, the keyword extractor EX counts the number of times a word X appearing in the keyword extraction target document also appears in similar documents, but the processing is not limited to this. For example, the keyword extractor EX may increase the contribution to the numerator of TF-IDF for similar documents that are more similar to the keyword extraction target document (that is, similar documents with a higher similarity). For example, when documents X, Y, and Z exist as similar documents, TF-IDF may be obtained based on equation (5).

式中、Ｗ_Ｘは、文書Ｘの類似度を表し、Ｗ_Ｙは、文書Ｙの類似度を表し、Ｗ_Ｚは、文書Ｚの類似度を表している。キーワード抽出器ＥＸは、キーワード抽出対象文書に出現する単語Ｘが出現する類似文書の各類似度の平均をＴＦ-ＩＤＦの分子とすることで、より文書間での共有性が高いキーワードを抽出することができる。 In the equation, W_X, W_Y, and W_Z represent the similarity of documents X, Y, and Z, respectively. By using as the numerator of TF-IDF the average of the similarities of the similar documents in which the word X appearing in the keyword extraction target document appears, the keyword extractor EX can extract keywords that are more highly shared between documents.

また、類似文書が、キーワード抽出対象文書との類似度に応じてランクが付けられている場合、キーワード抽出器ＥＸは、そのランクの大きさに応じて重みを付けてもよい。例えば、キーワード抽出器ＥＸは、キーワード抽出対象文書と最も類似するランク１位の類似文書には、１．０の重みを付与し、２番目にキーワード抽出対象文書と類似するランク２位の類似文書には、０．９の重みを付与し、３番目にキーワード抽出対象文書と類似するランク３位の類似文書には、０．８の重みを付与する、といったようにしてもよい。これによって、より文書間での共有性が高いキーワードを抽出することができる。 When the similar documents are ranked according to their similarity to the keyword extraction target document, the keyword extractor EX may apply weights according to rank. For example, the keyword extractor EX may assign a weight of 1.0 to the first-ranked similar document, which is most similar to the keyword extraction target document, a weight of 0.9 to the second-ranked similar document, a weight of 0.8 to the third-ranked similar document, and so on. This makes it possible to extract keywords that are more highly shared between documents.
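The weighted variants can be sketched in Python as follows. Equation (5) is not reproduced in this text, so the details are assumptions: the weights of the similar documents that contain the word are summed and divided by the number of similar documents, which reduces to the unweighted ratio when every weight is 1. The function names, sample similarities, and the step of 0.1 for rank weights (matching the 1.0, 0.9, 0.8 example) are hypothetical.

```python
def weighted_numerator(word, weighted_docs):
    """Average weight of similar documents containing word.

    weighted_docs is a list of (set_of_words, weight) pairs; the weight can be
    a similarity such as W_X or a rank-based weight.
    """
    if not weighted_docs:
        return 0.0
    total = sum(weight for words, weight in weighted_docs if word in words)
    return total / len(weighted_docs)


def rank_weights(count: int, step: float = 0.1):
    """Weights 1.0, 0.9, 0.8, ... for ranks 1, 2, 3, ... (never below zero)."""
    return [max(0.0, round(1.0 - step * i, 10)) for i in range(count)]


# Similar documents X, Y, Z with illustrative similarities W_X, W_Y, W_Z.
docs = [({"soccer", "goal"}, 0.9), ({"baseball"}, 0.8), ({"soccer"}, 0.4)]
by_similarity = weighted_numerator("soccer", docs)

# Same documents, weighted by rank instead of raw similarity.
ranked = [(words, w) for (words, _), w in zip(docs, rank_weights(len(docs)))]
by_rank = weighted_numerator("soccer", ranked)
print(by_similarity, by_rank)
```

The returned value would take the place of the plain ratio in the numerator, with the logarithmic denominator left unchanged.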

［利用場面］ [Usage scene]
図１０は、キーワード抽出器ＥＸにより抽出されたキーワードの利用場面の一例を示す図である。図示の例では、ショッピングサイトの一ページを模式的に表している。図中Ｒ１で示す領域には、商品の紹介文が掲載されている。このような紹介文は、キーワードの抽出対象の文書として扱われる。例えば、紹介文には、商品の型番（図の例では「ＡＢＣＤＥＦ‐２４」）などが含まれているが、類似文書の単語の出現回数を考慮しない一般的なＴＦ-ＩＤＦの場合、型番を表す単語や語句の重みが大きくなり、その型番がキーワードとして抽出されやすい。しかしながら、その商品に似た商品を探すときには、型番よりも概念的に上位の意味をもつ単語や語句がキーワードとして相応しい。概念的に上位の意味をもつ単語や語句とは、他の商品紹介文に含まれる単語や語句と共起し易いものであり、図示の例では、「液晶テレビ」などの単語が該当する。 FIG. 10 is a diagram showing an example of a usage scene of keywords extracted by the keyword extractor EX. The illustrated example schematically represents one page of a shopping site. The area indicated by R1 in the figure contains an introductory text for a product. Such an introductory text is treated as a document from which keywords are extracted. For example, the introductory text includes the model number of the product ("ABCDEF-24" in the illustrated example). With conventional TF-IDF, which does not consider how often words occur in similar documents, words and phrases representing the model number receive a large weight, and the model number tends to be extracted as a keyword. However, when searching for products similar to that product, words and phrases with a conceptually broader meaning than the model number are more suitable as keywords. Such words and phrases tend to co-occur with words and phrases in other product introductions. In the illustrated example, a word such as "LCD TV" is one of them.

本実施形態では、キーワード抽出対象文書に出現する単語Ｘが類似文書にも出現する回数（割合）に基づいてＴＦ-ＩＤＦを求めるため、型番のような、そのページの特有の単語や語句（汎用的でない単語や語句）が抽出され難くなり、ショッピングサイト間での共有性が高いキーワードを抽出することができる。この結果、例えば、抽出したキーワードを、商品カテゴリを表す単語とした場合、商品が分類され得る商品カテゴリを網羅的に用意しておく必要がなくなる。例えば、商品がショッピングサイトに追加されるごとに、その商品が掲載されるウェブページの紹介文からキーワードを抽出し、その抽出したキーワードが既存の商品カテゴリを表す単語や語句であれば、新規追加された商品を既存の商品カテゴリに分類し、抽出したキーワードが既存の商品カテゴリを表す単語や語句でなければ、そのキーワードを基に新たな商品カテゴリを作成し、新規追加された商品を新規作成した商品カテゴリに分類する、といった運用を行うことができる。 In the present embodiment, TF-IDF is obtained based on the number of times (the proportion) that a word X appearing in the keyword extraction target document also appears in similar documents. Words and phrases peculiar to the page (that is, not general-purpose), such as a model number, therefore become less likely to be extracted, and keywords highly shared across shopping sites can be extracted. As a result, for example, when the extracted keyword is used as a word representing a product category, it is no longer necessary to prepare in advance an exhaustive set of the product categories into which products may be classified. For example, each time a product is added to a shopping site, a keyword is extracted from the introductory text of the web page on which the product is listed. If the extracted keyword is a word or phrase representing an existing product category, the newly added product is classified into that category. Otherwise, a new product category is created based on the keyword, and the newly added product is classified into it.
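The category operation described above can be sketched in Python as follows. The data structure (a dictionary from category keyword to product IDs), the function name, and the sample products are hypothetical and serve only to illustrate the flow.

```python
def assign_category(keyword: str, categories: dict, product_id: str) -> bool:
    """Put the product under the keyword's category; return True if the
    category had to be newly created."""
    is_new = keyword not in categories
    categories.setdefault(keyword, []).append(product_id)
    return is_new


# Existing categories, keyed by the keyword that names them.
categories = {"LCD TV": ["p1"]}

created_first = assign_category("LCD TV", categories, "p2")         # existing category
created_second = assign_category("robot vacuum", categories, "p3")  # new category
print(created_first, created_second, categories)
```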

以上説明した第２実施形態によれば、キーワードの抽出対象とする文書に類似する複数の類似文書のうち、キーワードの抽出対象とする文書に出現するキーワードの候補が出現する類似文書の数に基づいて、キーワードの抽出対象とする文書からキーワードを抽出するため、より文書間での共有性が高いキーワードを抽出することができる。この結果、ユーザが文書を検索したときに、文書間での共有性が高いキーワードを利用することで、その文書に関連した関連文書を容易に検索することができ、ユーザが検索した文書により関連し、且つより多くの関連文書を提供することができる。この結果、ユーザの情報収集の効率を更に向上させることができる。 According to the second embodiment described above, among a plurality of similar documents similar to the document to be extracted by the keyword, the number of similar documents in which the candidate of the keyword appearing in the document to be extracted by the keyword appears is based on the number of similar documents. Since the keywords are extracted from the documents to be extracted, it is possible to extract the keywords that are more shared among the documents. As a result, when a user searches for a document, by using keywords that are highly shared between the documents, it is possible to easily search for related documents related to the document, and the document searched by the user is more related. And more relevant documents can be provided. As a result, the efficiency of user information collection can be further improved.

＜ハードウェア構成＞ <Hardware configuration>
上述した実施形態の情報処理装置１００は、例えば、図１１に示すようなハードウェア構成により実現される。図１１は、実施形態の情報処理装置１００、１００Ａのハードウェア構成の一例を示す図である。 The information processing apparatus 100 of the above-described embodiments is realized by, for example, a hardware configuration as shown in FIG. 11. FIG. 11 is a diagram showing an example of the hardware configuration of the information processing apparatuses 100 and 100A of the embodiments.

情報処理装置１００、１００Ａは、ＮＩＣ１００－１、ＣＰＵ１００－２、ＲＡＭ１００－３、ＲＯＭ１００－４、フラッシュメモリやＨＤＤなどの二次記憶装置１００－５、およびドライブ装置１００－６が、内部バスあるいは専用通信線によって相互に接続された構成となっている。ドライブ装置１００－６には、光ディスクなどの可搬型記憶媒体が装着される。二次記憶装置１００－５、またはドライブ装置１００－６に装着された可搬型記憶媒体に格納されたプログラムがＤＭＡコントローラ（不図示）などによってＲＡＭ１００－３に展開され、ＣＰＵ１００－２によって実行されることで、制御部１１０または１１０Ａが実現される。制御部１１０または１１０Ａが参照するプログラムは、ネットワークＮＷを介して他の装置からダウンロードされてもよい。 Information processing devices 100 and 100A include NIC100-1, CPU100-2, RAM100-3, ROM100-4, secondary storage devices 100-5 such as flash memory and HDD, and drive devices 100-6 as internal buses or dedicated devices. It is configured to be connected to each other by a communication line. A portable storage medium such as an optical disk is mounted on the drive device 100-6. A program stored in a portable storage medium mounted on the secondary storage device 100-5 or the drive device 100-6 is expanded in the RAM 100-3 by a DMA controller (not shown) or the like, and executed by the CPU 100-2. As a result, the control unit 110 or 110A is realized. The program referenced by the control unit 110 or 110A may be downloaded from another device via the network NW.

Although modes for carrying out the present invention have been described above using embodiments, the present invention is in no way limited to these embodiments, and various modifications and substitutions can be made without departing from the gist of the present invention.

1 ... Information processing system, 10 ... Terminal device, 20 ... Service providing device, 100, 100A ... Information processing device, 102 ... Communication unit, 110, 110A ... Control unit, 112 ... Keyword assigning unit, 114 ... Document classification unit, 116 ... Extractor evaluation unit, 118 ... Similar document selection unit, 120 ... Learning processing unit, 130, 130A ... Storage unit

Claims

A keyword extraction device comprising a processing unit that extracts, as keywords, keyword candidates included in a document of interest, based on the number of similar documents, among a plurality of similar documents similar to the document of interest, in which the keyword candidates appearing in the document of interest appear.

The keyword extraction device according to claim 1, wherein the processing unit:
calculates a weighting coefficient for each keyword candidate based on the number of similar documents in which the keyword candidate appearing in the document of interest appears; and
extracts the keywords from among a plurality of keyword candidates included in the document of interest based on the calculated weighting coefficients.
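As an illustration of the two steps in this claim, the following is a minimal Python sketch, not the implementation described in the specification. It assumes that each document is represented as a set of tokens, that the weighting coefficient is simply the raw count of similar documents containing the candidate, and that extraction means taking the `top_k` highest-weighted candidates. The function name `extract_keywords`, the `top_k` rule, the alphabetical tie-break, and the sample data are all hypothetical.

```python
def extract_keywords(focus_tokens, similar_docs, top_k=2):
    """Rank candidates from the document of interest by how many similar
    documents contain them (illustrative weighting, assumed here)."""
    # Candidates are the distinct tokens of the document of interest.
    candidates = sorted(set(focus_tokens))
    # Weight = number of similar documents in which the candidate appears.
    weights = {
        c: sum(1 for doc in similar_docs if c in doc) for c in candidates
    }
    # Assumed selection rule: highest weight first, ties broken alphabetically.
    ranked = sorted(candidates, key=lambda c: (-weights[c], c))
    return ranked[:top_k]


# Hypothetical sample data: each similar document is a set of tokens.
focus_tokens = ["election", "tokyo", "mayor", "scandal"]
similar_docs = [
    {"election", "tokyo", "mayor"},
    {"election", "mayor"},
    {"tokyo", "weather"},
    {"election", "budget"},
]

print(extract_keywords(focus_tokens, similar_docs, top_k=2))
```

A candidate such as "scandal", which appears in the document of interest but in none of the similar documents, receives a weight of zero and is ranked last under this assumed rule.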

The keyword extraction device according to claim 2, wherein the processing unit:
calculates a ratio by dividing the number of similar documents in which the keyword candidate appearing in the document of interest appears by the number of the plurality of similar documents;
divides the calculated ratio by a logarithmic value of the number of documents in which the keyword candidate appearing in the document of interest appears, among all documents including the plurality of similar documents and a plurality of dissimilar documents not similar to the document of interest; and
calculates the quotient of the ratio and the logarithmic value as the weighting coefficient.
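The calculation recited in this claim can be sketched in Python as follows. This is a hedged illustration under stated assumptions, not the patented implementation: documents are modeled as sets of tokens, and a `smoothing` term (default 1.0) is added inside the logarithm so that a candidate appearing in exactly one document does not cause division by log(1) = 0. The claim itself does not mention any smoothing; the function name `ratio_weight` and the sample data are likewise hypothetical.

```python
import math


def ratio_weight(candidate, similar_docs, all_docs, smoothing=1.0):
    """Weight = (share of similar documents containing the candidate)
    divided by log(document frequency over all documents + smoothing)."""
    if not similar_docs:
        raise ValueError("at least one similar document is required")
    # Ratio: similar documents containing the candidate / all similar documents.
    hits = sum(1 for doc in similar_docs if candidate in doc)
    ratio = hits / len(similar_docs)
    # Document frequency over similar AND dissimilar documents.
    doc_freq = sum(1 for doc in all_docs if candidate in doc)
    # The smoothing term is an assumption, added to avoid log(1) == 0.
    return ratio / math.log(doc_freq + smoothing)


# Hypothetical sample data.
similar_docs = [
    {"election", "tokyo", "mayor"},
    {"election", "mayor"},
    {"tokyo", "weather"},
]
dissimilar_docs = [{"weather", "rain"}, {"tokyo", "baseball"}]
all_docs = similar_docs + dissimilar_docs

weights = {
    c: ratio_weight(c, similar_docs, all_docs)
    for c in ["election", "tokyo", "mayor"]
}
print(weights)
```

In this sample, "election" and "tokyo" each appear in two of the three similar documents, but "tokyo" also appears in a dissimilar document, so its larger logarithmic denominator gives it the lower weight.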

The keyword extraction device according to claim 2 or 3, wherein the processing unit repeatedly calculates the weighting coefficient, changing the keyword candidate each time, for each of the keyword candidates included in the document of interest.

The keyword extraction device according to any one of claims 1 to 4, wherein the processing unit:
calculates a weighting coefficient for each keyword candidate based on the degree of similarity, to the document of interest, of the similar documents in which the keyword candidate appearing in the document of interest appears; and
extracts the keywords from among a plurality of keyword candidates included in the document of interest based on the calculated weighting coefficients.

The keyword extraction device according to claim 5, wherein the processing unit:
calculates an average of the degrees of similarity of the plurality of similar documents;
divides the average of the degrees of similarity by a logarithmic value of the number of documents in which the keyword candidate appearing in the document of interest appears, among all documents including the plurality of similar documents and a plurality of dissimilar documents not similar to the document of interest; and
calculates the quotient of the average of the degrees of similarity and the logarithmic value as the weighting coefficient.
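A similarity-based weighting of this kind might look like the Python sketch below. Several points are assumptions rather than statements of the claimed method: the average is taken only over the similar documents that contain the candidate (one reading of claims 5 and 6 together), a candidate found in no similar document is given weight 0.0, and a `smoothing` term is added inside the logarithm to avoid dividing by log(1) = 0. The function name `similarity_weight`, the similarity scores, and the sample documents are hypothetical.

```python
import math


def similarity_weight(candidate, similar_docs, similarities, all_docs,
                      smoothing=1.0):
    """Weight = (mean similarity of similar documents containing the
    candidate) divided by log(document frequency + smoothing)."""
    # Similarities of the similar documents in which the candidate appears
    # (averaging only over these documents is an assumption).
    hit_sims = [
        sim for doc, sim in zip(similar_docs, similarities) if candidate in doc
    ]
    if not hit_sims:
        return 0.0  # assumed convention: no supporting similar document
    mean_sim = sum(hit_sims) / len(hit_sims)
    # Document frequency over similar AND dissimilar documents.
    doc_freq = sum(1 for doc in all_docs if candidate in doc)
    # The smoothing term is an assumption, added to avoid log(1) == 0.
    return mean_sim / math.log(doc_freq + smoothing)


# Hypothetical sample data: one similarity score per similar document.
similar_docs = [
    {"election", "tokyo", "mayor"},
    {"election", "mayor"},
    {"tokyo", "weather"},
]
similarities = [0.9, 0.8, 0.3]
dissimilar_docs = [{"weather", "rain"}, {"tokyo", "baseball"}]
all_docs = similar_docs + dissimilar_docs

sim_weights = {
    c: similarity_weight(c, similar_docs, similarities, all_docs)
    for c in ["election", "tokyo", "rain"]
}
print(sim_weights)
```

Here "election" is supported by the two most similar documents and so outranks "tokyo", whose support includes a weakly similar document; this matches the behavior in which a larger similarity yields a larger weighting coefficient.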

The keyword extraction device according to claim 5 or 6, wherein the processing unit increases the weighting coefficient as the degree of similarity becomes larger and decreases the weighting coefficient as the degree of similarity becomes smaller.

A keyword extraction method in which a computer extracts, as keywords, keyword candidates included in a document of interest, based on the number of similar documents, among a plurality of similar documents similar to the document of interest, in which the keyword candidates appearing in the document of interest appear.

A program for causing a computer to execute extracting, as keywords, keyword candidates included in a document of interest, based on the number of similar documents, among a plurality of similar documents similar to the document of interest, in which the keyword candidates appearing in the document of interest appear.