JP2007140603A

JP2007140603A - Early adapter extraction method and device and program and topic word prediction method and device and program

Info

Publication number: JP2007140603A
Application number: JP2005329269A
Authority: JP
Inventors: Kunihiro Takiuchi; 邦弘滝内; Toru Sadakata; 徹定方; Masahiro Oku; 雅博奥
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2005-11-14
Filing date: 2005-11-14
Publication date: 2007-06-07

Abstract

PROBLEM TO BE SOLVED: To extract authors who are sensitive to topic words (early adapters), to extract the information description documents written by those early adapters, and to predict topic words by analyzing the latest such documents and extracting words that are likely to become topics.

SOLUTION: Information description documents containing a topic word are extracted from a plurality of information description documents; from these, a fixed number of documents with the oldest creation dates, i.e. those written at an early stage, are extracted; and the authors of those documents are extracted as early adapters for the topic word. In addition, information description documents whose author and creation date can be identified are collected; from the documents written by an early adapter for a certain topic word, those created during a specified period are extracted; and topic words are extracted from those documents.

COPYRIGHT: (C)2007, JPO&INPIT

Description

The present invention relates to an early adapter extraction method, device, and program, and to a topic word prediction method, device, and program. More particularly, it relates to an early adapter extraction method, device, and program for extracting, from information description documents, the authors who present the latest topic words, and to a topic word prediction method, device, and program for extracting topic words.

As a technique for extracting topic words from a wide variety of collected documents, there is a technique that, when a plurality of categorized documents with creation time information are input, analyzes the documents, tallies the phrases appearing in them together with the accompanying information for each phrase, and uses that accompanying information to calculate, for each phrase, its degree of relevance to the category in which it appears (see, for example, Patent Document 1).
Patent Document 1: JP 2005-135311 A

Conventional topic word extraction techniques analyze the whole of a large collection of documents with methods based on extracting words that appear frequently. The accuracy of topic word extraction has been improved by refining the analysis method itself while keeping frequent-word extraction as its basis.

However, although conventional topic word extraction techniques can extract the words that are topics at present, they cannot predict and extract the words that are going to become topics.

The present invention has been made in view of the above points. Its object is to provide an early adapter extraction method, device, and program, and a topic word prediction method, device, and program, that extract authors who are sensitive to topic words (early adapters), make it possible to extract the information description documents written by those early adapters, and predict topic words by analyzing recent information description documents and extracting the words in them that are about to become topics.

The present invention (claim 1) is an early adapter extraction method for extracting authors who are early adapters for a topic word from a plurality of information description documents whose author and creation date can be identified, the method comprising:
a topic word filtering step in which topic word filtering means extracts, from the plurality of input information description documents, the information description documents that contain the topic word;
a creation date filtering step in which creation date filtering means extracts, from those information description documents, a fixed number of documents with the oldest creation dates, i.e. those written at an early stage; and
an early adapter extraction step in which early adapter extraction means extracts, as early adapters for the topic word, the authors who wrote the old-dated information description documents obtained in the creation date filtering step.

FIG. 1 is a diagram for explaining the principle of the present invention.

The present invention (claim 2) is a topic word prediction method for predicting the latest topic words from a plurality of information description documents whose author and creation date can be identified, the method comprising:
a document collection step (step 1) in which document collection means collects information description documents whose author and creation date and time can be identified and stores them in document storage means;
an early adapter document extraction step (step 2) in which early adapter document extraction means extracts, from the information description documents stored in the document storage means, the documents created during a specified period among the documents of a document author who is an early adapter for a certain topic word, and stores them in early adapter document storage means; and
a topic word extraction step (step 3) in which topic word extraction means extracts topic words from the documents stored in the early adapter document storage means.

Further, in the present invention (claim 3), the early adapter document extraction step (step 2) comprises:
a topic word filtering step in which topic word filtering means of the early adapter document extraction means extracts, from the information description documents stored in the document storage means, the information description documents that contain the topic word;
a creation date filtering step in which creation date filtering means of the early adapter document extraction means extracts, from the information description documents extracted in the topic word filtering step, a fixed number of documents with the oldest creation dates, i.e. those written at an early stage;
an early adapter description document extraction step in which early adapter description document extraction means of the early adapter document extraction means extracts the information description documents of the authors who wrote the old-dated documents extracted in the creation date filtering step, these authors being the early adapters for the topic word; and
a period filtering step in which period filtering means of the early adapter document extraction means extracts, from the early adapters' information description documents obtained in the early adapter description document extraction step, the documents created within a fixed period.

The present invention (claim 4) is an early adapter extraction device for extracting authors who are early adapters for a topic word from a plurality of information description documents whose author and creation date can be identified, the device comprising:
topic word filtering means for extracting, from the plurality of input information description documents, the information description documents that contain the topic word;
creation date filtering means for extracting, from those information description documents, a fixed number of documents with the oldest creation dates, i.e. those written at an early stage; and
early adapter extraction means for extracting, as early adapters for the topic word, the authors who wrote the old-dated information description documents obtained by the creation date filtering means.

FIG. 2 is a diagram showing the principle configuration of the present invention.

The present invention (claim 5) is a topic word prediction device for predicting the latest topic words from a plurality of information description documents whose author and creation date can be identified, the device comprising:
document storage means 102 for storing collected documents;
early adapter document storage means 104 for storing the documents of document authors who are early adapters;
document collection means 101 for collecting information description documents whose author and creation date and time can be identified and storing them in the document storage means 102;
early adapter document extraction means 103 for extracting, from the information description documents stored in the document storage means 102, the documents created during a specified period among the documents of a document author who is an early adapter for a certain topic word, and storing them in the early adapter document storage means 104; and
topic word extraction means 105 for extracting topic words from the documents stored in the early adapter document storage means 104.

Further, in the present invention (claim 6), the early adapter document extraction means 103 includes:
topic word filtering means for extracting, from the information description documents stored in the document storage means 102, the information description documents that contain the topic word;
creation date filtering means for extracting, from the information description documents extracted by the topic word filtering means, a fixed number of documents with the oldest creation dates, i.e. those written at an early stage;
early adapter description document extraction means for extracting the information description documents of the authors who wrote the old-dated documents extracted by the creation date filtering means, these authors being the early adapters for the topic word; and
period filtering means for extracting, from the early adapters' information description documents obtained by the early adapter description document extraction means, the documents created within a fixed period.

The present invention (claim 7) is an early adapter extraction program that causes a computer to execute the steps according to claim 1.

The present invention (claim 8) is a topic prediction program that causes a computer to execute the steps according to claim 2 or 3.

As described above, according to the present invention, the early adapter extraction device can extract the authors who are sensitive to topic words and the information description documents of those authors. Furthermore, the recent information description documents of such authors contain words that are about to become topics, so analyzing these documents and extracting those words makes it possible to predict topic words.

Embodiments of the present invention are described below with reference to the drawings.

The present invention does not analyze the whole of a wide variety of documents. Instead, it targets information description documents whose author and creation date can be identified. It takes one topic word from among the topic words, buzzwords, and attention-attracting words already published in newspapers and other media (hereinafter, topic words), finds the information description documents that contain articles related to that topic word, and extracts the authors of the documents that took up the topic word at an early stage. These authors are the document authors sensitive to topic words (early adapters). The information description documents created by these multiple Web page authors are then extracted and analyzed to extract topic words. By extracting the documents of multiple authors who noticed a topic word early and wrote articles related to it, and then extracting topic words from those documents, the words that are about to become topics can be predicted.

In this embodiment, for simplicity of description, blogs are used as the information description documents whose author and creation date can be identified.

FIG. 3 is a configuration diagram of the latest topic word prediction device in one embodiment of the present invention.

The latest topic word prediction device 300 shown in the figure consists of a blog collection unit 301 that collects blogs from blog sites on the Internet (corresponding to the document collection means in the claims); a blog entry storage unit 302 that stores the collected entries (corresponding to the document storage means in the claims); an early adapter document extraction device 303 that extracts, from the collected blog entries, the entries written by bloggers sensitive to topic words, who correspond to the authors who are early adapters in the claims (the device corresponds to the early adapter extraction means in the claims); an early adapter blog entry storage unit 304 that stores these blog entries (corresponding to the early adapter document storage means in the claims); and a topic word extraction unit 305 that extracts topic words. Of these, the blog entry storage unit 302 and the early adapter blog entry storage unit 304 are storage media such as disk devices.

In the latest topic word prediction device 300 of the present invention, the blog collection unit 301 first collects blog entries from blog sites on the Internet and stores them in the blog entry storage unit 302.

Next, the early adapter document extraction device 303 identifies, among the blog entries stored in the blog entry storage unit 302 that deal with a topic word, those created at an early stage, and extracts the latest blog entries from the blogs of the multiple bloggers who wrote them by applying a specified period. For example, specifying "blog entries created in the last three days" lets the early adapter document extraction device 303 collect the recently created entries among the blog entries of early adapters who noticed the topic word early.

Finally, the topic word extraction unit 305 extracts topic words. Because the early adapters' blog entries obtained above include entries containing topic words that have not yet been noticed by the general public, the latest topic words can be predicted.

Next, the early adapter document extraction device 303 is described in detail.

FIG. 4 shows the configuration of the early adapter extraction device in one embodiment of the present invention.

The early adapter document extraction device 303 consists of a topic word filtering unit 401 that extracts blog entries related to a topic word by running a keyword search over the blog entries stored in the blog entry storage unit 302, using a topic word already known to the public as the query; a creation date filtering unit 402 that extracts a specified number of blog entries from the search results, starting with those with the earliest creation dates; an early adapter blog extraction unit 403 that extracts the entire blogs of the bloggers whose entries have old creation dates, i.e. who took up the topic word early; and a period filtering unit 404 that extracts, from the early adapters' blogs, the entries of a specified period preceding a specified date.

With the above configuration, the early adapter document extraction device 303 extracts, for example, the latest blog entries from the blogs of bloggers who focused on the topic word at an early stage, and stores them in the early adapter blog entry storage unit 304.

Next, the operation of each element of the latest topic word prediction device 300 is described with reference to FIG. 3, FIG. 4, and FIG. 5. FIG. 5 is a flowchart of the operation of the latest topic word prediction device in one embodiment of the present invention.

The blog collection unit 301 collects blog entries from blog sites on the Internet (step 101) and saves them in the blog entry storage unit 302. At this time, date data is extracted from the blog body and stored. The date can be identified, for example, by using tags in the blog body. The blog entries are also indexed, so that full-text keyword search is possible, and saved in the blog entry storage unit 302 (step 102).
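The indexing in step 102 can be pictured with a small inverted index. The following is only a minimal sketch under stated assumptions: the `Entry` record, the `build_index` and `search` helpers, and the sample data are hypothetical names introduced for illustration, and whitespace tokenization stands in for whatever indexing the actual device uses (Japanese text would need a proper tokenizer).

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Entry:
    # Hypothetical record for one stored blog entry.
    entry_id: int
    author: str
    created: date
    text: str


def build_index(entries):
    """Map each lowercase whitespace-separated token to the ids of entries containing it."""
    index = defaultdict(set)
    for e in entries:
        for token in e.text.lower().split():
            index[token].add(e.entry_id)
    return index


def search(index, keyword):
    """Return the ids of entries that contain the keyword, in ascending order."""
    return sorted(index.get(keyword.lower(), set()))


entries = [
    Entry(1, "alice", date(2005, 10, 1), "trying the new podcast player"),
    Entry(2, "bob", date(2005, 10, 3), "podcast episodes worth hearing"),
    Entry(3, "carol", date(2005, 10, 4), "autumn hiking notes"),
]
index = build_index(entries)
hits = search(index, "podcast")
```

Keeping the creation date on each record alongside the index is what lets the later steps sort and filter the search results by date.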

The topic word filtering unit 401 searches the blog entries saved in the blog entry storage unit 302 using as the query one word X chosen from the topic words, buzzwords, and attention-attracting words (hereinafter, topic words) that have already been made public in newspapers, on television, or on Web sites on the Internet (step 103).

Let Gx be the set of entries obtained as a result. Gx is a document set consisting of multiple blog entries. From the document set Gx, the creation date filtering unit 402 extracts K entries Dk (k = 1, 2, ..., K) in order, starting with the blog entry with the oldest creation date and time (step 104).
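Steps 103 and 104 can be sketched as two small filters. This is an illustrative sketch, not the patent's implementation: the `Entry` record, the function names, and the sample entries are assumptions, and a plain substring test stands in for the keyword search.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Entry:
    # Hypothetical record: one blog entry with its author and creation date.
    author: str
    created: date
    text: str


def topic_word_filter(entries, topic_word):
    """Step 103: the set Gx of entries whose text contains topic word X."""
    return [e for e in entries if topic_word in e.text]


def creation_date_filter(gx, k):
    """Step 104: the K entries of Gx with the oldest creation dates."""
    return sorted(gx, key=lambda e: e.created)[:k]


entries = [
    Entry("alice", date(2005, 9, 1), "first look at podcast feeds"),
    Entry("bob", date(2005, 9, 20), "everyone talks about podcast now"),
    Entry("carol", date(2005, 9, 5), "recording a podcast at home"),
    Entry("dave", date(2005, 9, 2), "gardening in september"),
]
gx = topic_word_filter(entries, "podcast")
dk = creation_date_filter(gx, k=2)
```

With K = 2, the two earliest mentions (alice and carol) survive, while bob's later entry is dropped even though it also contains the topic word.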

Each blog entry Dk contains the topic word X and is therefore an entry in which X is discussed, so Dk is a blog entry written by a blogger sensitive to topic words. The early adapter blog extraction unit 403 extracts the N bloggers Pn (n = 1, 2, ..., N) who are the authors of the K entries (step 105).
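Step 105 amounts to collecting the distinct authors of the K earliest entries, so N can be smaller than K when one blogger wrote several of them. A minimal sketch, with the `Entry` tuple, the `extract_early_adapters` helper, and the sample data all introduced here as hypothetical names:

```python
from collections import namedtuple
from datetime import date

# Hypothetical record: one blog entry with its author and creation date.
Entry = namedtuple("Entry", ["author", "created", "text"])


def extract_early_adapters(earliest_entries):
    """Step 105: distinct authors of the K earliest entries, in order of first appearance."""
    return list(dict.fromkeys(e.author for e in earliest_entries))


dk = [
    Entry("alice", date(2005, 9, 1), "first look at podcast feeds"),
    Entry("carol", date(2005, 9, 5), "recording a podcast at home"),
    Entry("alice", date(2005, 9, 6), "more podcast experiments"),
]
pn = extract_early_adapters(dk)
```

Here K = 3 entries yield N = 2 bloggers, because alice wrote two of the early entries.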

For each of the N bloggers Pn, all of the blog entries that the blogger has written, EPn, are collected (step 106). The period filtering unit 404 then extracts from EPn the set of entries covering the most recent R days by update date and time as OPn (OPn ⊆ EPn) (step 107).
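Steps 106 and 107 can be sketched as follows. The window definition is an assumption: here "the most recent R days" is taken to mean the R days ending on a caller-supplied `end_date`, and the `Entry` tuple, helper names, and sample data are hypothetical.

```python
from collections import namedtuple
from datetime import date, timedelta

# Hypothetical record: one blog entry with its author and creation date.
Entry = namedtuple("Entry", ["author", "created", "text"])


def entries_by_authors(entries, authors):
    """Step 106: EPn, every entry written by one of the early adapters Pn."""
    wanted = set(authors)
    return [e for e in entries if e.author in wanted]


def period_filter(epn, end_date, r_days):
    """Step 107: OPn, the entries of EPn from the R days ending on end_date."""
    start = end_date - timedelta(days=r_days)
    return [e for e in epn if start < e.created <= end_date]


entries = [
    Entry("alice", date(2005, 9, 1), "first look at podcast feeds"),
    Entry("alice", date(2005, 11, 12), "trying out a video blog"),
    Entry("bob", date(2005, 11, 13), "podcast is everywhere"),
    Entry("carol", date(2005, 11, 9), "old notes on radio"),
    Entry("carol", date(2005, 11, 14), "video blog cameras compared"),
]
epn = entries_by_authors(entries, ["alice", "carol"])
opn = period_filter(epn, end_date=date(2005, 11, 14), r_days=3)
```

Note that OPn is not restricted to entries about the original topic word X: it holds whatever the early adapters wrote recently, which is where new topic words are expected to appear.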

The period filtering unit 404 saves the extracted set of blog entries OPn in the early adapter blog entry storage unit 304, which is a database (step 108).

The operations from step 103 to step 108 are performed by the early adapter document extraction device 303. The order of step 107 and step 108 may also be swapped: the early adapters' blogs are first stored in the early adapter blog entry storage unit 304, and then the R days' worth of entries with the newest update dates are extracted and sent to the topic word extraction unit 305.

The topic word extraction unit 305 extracts topic words by analyzing the document set OPn. One possible analysis method is frequent-word extraction over OPn: morphological analysis is performed on the text of OPn to split it into words, nouns are selected from the resulting set of words on the basis of their morpheme attributes (step 109), and the word with the highest occurrence frequency is displayed as the topic word (step 110).
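The frequent-word extraction of steps 109 and 110 can be sketched as below. This is a simplified stand-in, not the described method in full: a regular-expression tokenizer and a small illustrative stop-word list replace the morphological analysis and noun filtering, and `extract_words`, `most_frequent_word`, and the sample texts are hypothetical.

```python
import re
from collections import Counter

# Illustrative stop-word list standing in for the noun filter of step 109.
STOP_WORDS = {"a", "an", "the", "on", "at", "out", "my", "is", "and", "of", "for", "with"}


def extract_words(text):
    """Lowercase alphabetic tokens minus stop words (stand-in for morphological analysis)."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def most_frequent_word(texts):
    """Step 110: the word with the highest occurrence count across the texts of OPn."""
    counts = Counter(w for t in texts for w in extract_words(t))
    return counts.most_common(1)[0]


opn_texts = [
    "Trying out a vlog camera",
    "My first vlog is online",
    "Editing the vlog with free software",
]
word, count = most_frequent_word(opn_texts)
```

For Japanese blog text, a morphological analyzer with part-of-speech output would replace `extract_words`; the counting step stays the same. The sketch assumes OPn is non-empty.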

In step 109, words may instead be extracted using named entity extraction in place of morphological analysis. In step 110, words suitable as topic words may be extracted using an existing topic word extraction method, for example one that uses TF-IDF (Term Frequency, Inverse Document Frequency); the method is not particularly limited.
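As one way to picture the TF-IDF alternative for step 110, the sketch below scores the words of the early adapters' recent entries against a background collection. The weighting shown (raw term count times log(N / df)) is one common variant chosen for illustration; the source does not specify a formula, and `tokenize`, `tfidf_scores`, and the sample texts are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase alphabetic tokens (stand-in for morphological analysis)."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_scores(target_texts, background_texts):
    """Score each word of the target set by tf * idf.

    tf  = occurrences of the word in the target texts
    idf = log(N / df), where N is the number of background documents and
          df is the number of background documents containing the word
          (words absent from the background are given df = 1 here).
    """
    tf = Counter(w for t in target_texts for w in tokenize(t))
    n_docs = len(background_texts)
    df = Counter(w for t in background_texts for w in set(tokenize(t)))
    return {w: c * math.log(n_docs / max(df[w], 1)) for w, c in tf.items()}


background = [
    "the blog about the weather",
    "the blog about lunch",
    "the vlog about travel",
    "the blog about music",
]
opn_texts = ["the vlog about the camera", "the vlog about editing"]
scores = tfidf_scores(opn_texts, background)
best = max(scores, key=scores.get)
```

In this toy data "the" is the most frequent word in the target texts, but it appears in every background document and scores zero, so the rarer "vlog" ranks first.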

A configuration in which the blog entries collected by the blog collection unit 301 are processed directly by the early adapter document extraction device 303, without being stored in the blog entry storage unit 302, is also readily conceivable.

The present invention can also be used for the purpose of extracting the authors who are early adapters. This is achieved by having the early adapter blog extraction unit 403 perform the processing up to step 105 above. In this case, the processing from step 106 onward, i.e. outputting the early adapters' blog entries and extracting the recently created blog entries, is unnecessary.

In the present invention, the processing of steps 101 to 105 and the processing of steps 101 to 110 can each be built as a program, which can be installed and executed on a computer used as the early adapter extraction device or the topic word prediction device, or distributed over a network.

The program thus built can also be stored on a hard disk device or on a portable storage medium such as a flexible disk or CD-ROM, and installed and executed on a computer.

The present invention is not limited to the above embodiment, and various modifications and applications are possible within the scope of the claims.

Some search portal sites provide a service that presents topic words as anchor text linking to keyword search results over documents on the Internet. The topic words presented by such services are generally supplied by hand: experts look through newspapers, television, and other media and pick out the words that catch their attention. The present invention makes it possible to provide such services automatically.

FIG. 1 is a diagram for explaining the principle of the present invention.
FIG. 2 is a diagram showing the principle configuration of the present invention.
FIG. 3 is a configuration diagram of the latest topic word prediction device in one embodiment of the present invention.
FIG. 4 is a configuration diagram of the early adapter extraction device in one embodiment of the present invention.
FIG. 5 is a flowchart of the operation of the latest topic word prediction device in one embodiment of the present invention.

Explanation of symbols

100 Latest topic word prediction device
101 Document collection means
102 Document storage means
103 Early adapter document extraction means
104 Early adapter document storage means
105 Topic word extraction means
300 Latest topic word prediction device
301 Blog collection unit
302 Blog entry storage unit
303 Early adapter document extraction device
304 Early adapter blog entry storage unit
305 Topic word extraction unit
401 Topic word filtering unit
402 Creation date filtering unit
403 Early adapter blog extraction unit
404 Period filtering unit

Claims

An early adapter extraction method for extracting authors who are early adapters for a topic word from a plurality of information description documents whose author and creation date can be identified, the method comprising:
a topic word filtering step in which topic word filtering means extracts, from the plurality of input information description documents, the information description documents that contain the topic word;
a creation date filtering step in which creation date filtering means extracts, from those information description documents, a fixed number of documents with the oldest creation dates, i.e. those written at an early stage; and
an early adapter extraction step in which early adapter extraction means extracts, as early adapters for the topic word, the authors who wrote the old-dated information description documents obtained in the creation date filtering step.

A topic word prediction method for predicting the latest topic words from a plurality of information description documents whose author and creation date can be identified, the method comprising:
a document collection step in which document collection means collects information description documents whose author and creation date and time can be identified and stores them in document storage means;
an early adapter document extraction step in which early adapter document extraction means extracts, from the information description documents stored in the document storage means, the documents created during a specified period among the documents of a document author who is an early adapter for a certain topic word, and stores them in early adapter document storage means; and
a topic word extraction step in which topic word extraction means extracts topic words from the documents stored in the early adapter document storage means.

The topic word prediction method according to claim 2, wherein the early adapter document extraction step comprises:
a topic word filtering step in which topic word filtering means of the early adapter document extraction means extracts, from the information description documents stored in the document storage means, the information description documents that contain the topic word;
a creation date filtering step in which creation date filtering means of the early adapter document extraction means extracts, from the information description documents extracted in the topic word filtering step, a fixed number of documents with the oldest creation dates, i.e. those written at an early stage;
an early adapter description document extraction step in which early adapter description document extraction means of the early adapter document extraction means extracts the information description documents of the authors who wrote the old-dated documents extracted in the creation date filtering step, these authors being the early adapters for the topic word; and
a period filtering step in which period filtering means of the early adapter document extraction means extracts, from the early adapters' information description documents obtained in the early adapter description document extraction step, the documents created within a fixed period.

An early adapter extraction device for extracting authors who are early adapters for a topic word from a plurality of information description documents whose author and creation date can be identified, the device comprising:
topic word filtering means for extracting, from the plurality of input information description documents, the information description documents that contain the topic word;
creation date filtering means for extracting, from those information description documents, a fixed number of documents with the oldest creation dates, i.e. those written at an early stage; and
early adapter extraction means for extracting, as early adapters for the topic word, the authors who wrote the old-dated information description documents obtained by the creation date filtering means.

A topic word prediction device for predicting the latest topic words from a plurality of information description documents whose author and creation date can be identified, the device comprising:
document storage means for storing collected documents;
early adapter document storage means for storing the documents of document authors who are early adapters;
document collection means for collecting information description documents whose author and creation date and time can be identified and storing them in the document storage means;
early adapter document extraction means for extracting, from the information description documents stored in the document storage means, the documents created during a specified period among the documents of a document author who is an early adapter for a certain topic word, and storing them in the early adapter document storage means; and
topic word extraction means for extracting topic words from the documents stored in the early adapter document storage means.

The topic word prediction device according to claim 5, wherein the early adapter document extraction means includes:
topic word filtering means for extracting, from the information description documents stored in the document storage means, the information description documents that contain the topic word;
creation date filtering means for extracting, from the information description documents extracted by the topic word filtering means, a fixed number of documents with the oldest creation dates, i.e. those written at an early stage;
early adapter description document extraction means for extracting the information description documents of the authors who wrote the old-dated documents extracted by the creation date filtering means, these authors being the early adapters for the topic word; and
period filtering means for extracting, from the early adapters' information description documents obtained by the early adapter description document extraction means, the documents created within a fixed period.

An early adapter extraction program that causes a computer to execute the steps according to claim 1.

A topic prediction program that causes a computer to execute the steps according to claim 2 or 3.