JP5955817B2

JP5955817B2 - Extraction apparatus, extraction method and program

Info

Publication number: JP5955817B2
Application number: JP2013154872A
Authority: JP
Inventors: 敏羅; 正圭韓
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2013-07-25
Filing date: 2013-07-25
Publication date: 2016-07-20
Anticipated expiration: 2033-07-25
Also published as: JP2015026206A

Description

The present invention relates to an extraction device, an extraction method, and a program.

In recent years, with the spread of portable information devices such as smartphones, communication via social networking services (hereinafter, SNS) has permeated people's daily lives and spread explosively. On an SNS, users casually post and share their recent circumstances and impressions through portable information devices. Because the opinions of the general public can therefore be obtained from an SNS in a timely manner, its use in marketing measures is attracting attention.

For example, SNSs publish APIs (Application Programming Interfaces) for retrieving posted data, and analysis and prediction using data posted on SNSs have begun. Because many users write this data, the total volume is enormous, and its content is a collection of time-series data that differs from one posting user to another (hereinafter, stream data). The volume of stream data and its content (the keywords (words) contained in the stream data) change with users' life cycles, general events and topics, and so on.

Known extraction methods calculate and detect the temporal transition of the appearance frequency of keywords (words) contained in this enormous stream data, or the magnitude of its time-series change, in order to extract keywords attracting high attention (see Patent Document 1) or to extract latent keywords (see Patent Document 2).

Patent Document 1: JP 2005-316899 A
Patent Document 2: JP 2008-152634 A

However, with the conventional extraction methods described above, when there are keywords that are unrelated in the long term but become strongly related within a specific period because of some trigger (factor), those keywords could not be extracted even as latent keywords. Such omissions in keyword extraction could lower the accuracy of analysis in marketing measures.

The present invention has been made in view of the above, and its object is to provide an extraction device, an extraction method, and a program capable of accurately extracting relevant keywords from stream data.

To solve the problem described above and achieve the object, an extraction device according to an embodiment includes: a totaling unit that, for each predetermined word, totals at every predetermined time interval the frequency with which the word appears in input stream data; an appearance frequency recording unit that accumulates the totaled per-interval appearance frequency of each word and records the appearance frequency of each word up to the present; a burst detection unit that detects a burst of a word in the predetermined time interval based on the totaled per-interval appearance frequency of each word and the recorded appearance frequency of each word up to the present; a burst feature information recording unit that stores, for each word for which a burst is detected, burst feature information including the appearance frequency in the predetermined time interval; and a related word extraction unit that receives a period serving as an extraction condition and a reference word, normalizes the appearance frequency of each word in the burst feature information within the period, and extracts as related words those words whose normalized appearance frequency is similar to the normalized appearance frequency of the reference word.

The extraction device according to the embodiment makes it possible to accurately extract related words from stream data.

FIG. 1 is a block diagram illustrating a functional configuration of the extraction device according to the embodiment.
FIG. 2 is a diagram explaining the data structure of stream data.
FIG. 3 is a diagram explaining the data structure of dictionary information.
FIG. 4 is a diagram explaining the data structure of long-time frequency information.
FIG. 5 is a diagram explaining the data structure of short-time frequency information.
FIG. 6 is a diagram explaining the data structure of burst feature information.
FIG. 7 is a flowchart illustrating an example of the operation of the extraction device according to the embodiment.
FIG. 8 is a flowchart illustrating an example of the operation of the extraction device according to the embodiment.
FIG. 9 is a flowchart illustrating an example of the operation of the extraction device according to the embodiment.
FIG. 10 is a diagram illustrating that the processing in the extraction device according to the embodiment is concretely realized using a computer.

Hereinafter, an extraction device, an extraction method, and a program according to an embodiment are described in detail with reference to the accompanying drawings. In the following description, like components are given the same reference numerals, and redundant description is omitted.

The extraction device according to the embodiment described below takes stream data, which is a collection of data continuously entered on an SNS or the like, totals the short-time appearance frequency of each word appearing in the stream data, and updates and records the latest long-time appearance frequency of each word based on those totals. Then, based on the long-time appearance frequency and the appearance frequency of each short time, it detects word bursts, a burst being the phenomenon in which the appearance frequency of a given word increases sharply within that short time. It records burst feature information that includes the occurrence time and appearance frequency of each detected burst. Finally, it receives a period serving as an extraction condition and a reference word, analyzes the relationship of each word to the reference word based on the burst feature information with appearance frequencies normalized within that period (a word is related when its normalized appearance frequency is similar to that of the reference word), and extracts the related words.

The "short time" mentioned above may be any predetermined time set to a value appropriate for the device; in this embodiment it is set to 10 minutes. "Long time" does not mean a fixed time span longer than the "short time"; it means the total time over which a word has been recorded up to the present, that is, up to the point at which extraction is performed. The "period serving as an extraction condition" and the "reference word" may be values set in advance in the device's memory or values specified by a user through an operating device such as a keyboard or mouse.

FIG. 1 is a block diagram illustrating the functional configuration of the extraction device according to the embodiment. As shown in FIG. 1, the extraction device 100 includes, as a functional configuration realized using a computer (described in detail later), a short-time frequency totaling unit 101, a burst detection/determination unit 102, a long-time frequency information recording unit 103, a long-time frequency calculation/update unit 104, a burst feature information recording unit 105, and a related word extraction unit 106.

For each word predetermined in the dictionary information 300, the short-time frequency totaling unit 101 totals, for every short time, the frequency with which the word appears in the input stream data 200. It outputs the totaled appearance frequency of each word to the burst detection/determination unit 102 as short-time frequency information 500.

The burst detection/determination unit 102 detects word bursts in a short time based on the short-time frequency information 500 totaled by the short-time frequency totaling unit 101 and the long-time frequency information 400 recorded in the long-time frequency information recording unit 103. It outputs burst feature information 510, which includes the appearance frequency of each word for which a burst was detected, to the burst feature information recording unit 105.

The long-time frequency information recording unit 103 records long-time frequency information 400, which accumulates the totaled per-short-time appearance frequency of each word (the frequency information for each short time). The long-time frequency calculation/update unit 104 calculates the appearance frequency of each word up to the present from the per-short-time appearance frequency of each word and the current long-time frequency information 400 that the long-time frequency information recording unit 103 has recorded so far, and updates the long-time frequency information 400 recorded by the long-time frequency information recording unit 103.

The burst feature information recording unit 105 records the burst feature information 510 output from the burst detection/determination unit 102. The related word extraction unit 106 receives the extraction condition (period and reference word) and acquires, from the burst feature information 510 recorded by the burst feature information recording unit 105, all burst feature information within the period. It then normalizes the appearance frequency of each word in the burst feature information within the period, and extracts as related words those words whose normalized appearance frequency is similar to the normalized appearance frequency of the reference word.

FIG. 2 is a diagram explaining the data structure of the stream data 200. As shown in FIG. 2, the stream data 200 has a document ID 201 uniquely assigned to identify a document, a document announcement period 202, which is the period in which the document was entered on the SNS or the like, and the entered document content 203.

FIG. 3 is a diagram explaining the data structure of the dictionary information 300. As shown in FIG. 3, a large number of words 302 are registered in advance in the dictionary information 300. Specifically, the dictionary information 300 has a word 302 for each uniquely assigned word ID 301.

FIG. 4 is a diagram explaining the data structure of the long-time frequency information 400. As shown in FIG. 4, the long-time frequency information 400 has, for each uniquely assigned word ID 401, a word 402, an average frequency 403 obtained by aggregating the per-short-time appearance frequencies, and an observation time 404 indicating the period over which the word 402 has been totaled.

FIG. 5 is a diagram explaining the data structure of the short-time frequency information 500. As shown in FIG. 5, the short-time frequency information 500 has, for each uniquely assigned word ID 501, a word 502, a period 503, an appearance count 504 (the appearance frequency), and a normalized value 505. The normalized value 505 is left blank (N/A, Not Available).

FIG. 6 is a diagram explaining the data structure of the burst feature information 510. As shown in FIG. 6, the burst feature information 510 has, for each uniquely assigned word ID 511, a word 512, a period 513, an appearance count 514 (the appearance frequency), and a normalized value 515. The normalized value 515 holds the value produced when the related word extraction unit 106 performs normalization.
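The record layouts of FIGS. 2 to 6 can be pictured as simple typed records. The following is a minimal sketch only: the class names, field names, and field types are assumptions made for illustration and do not appear in the source, which specifies only the numbered fields.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StreamDocument:
    """One entry of stream data 200 (FIG. 2). Names are hypothetical."""
    doc_id: str    # document ID 201
    period: str    # document announcement period 202
    content: str   # document content 203


@dataclass
class DictionaryEntry:
    """One entry of dictionary information 300 (FIG. 3)."""
    word_id: str   # word ID 301
    word: str      # word 302


@dataclass
class LongTimeFrequency:
    """One entry of long-time frequency information 400 (FIG. 4)."""
    word_id: str               # word ID 401
    word: str                  # word 402
    average_frequency: float   # average frequency 403
    observation_time: float    # observation time 404 (unit is an assumption)


@dataclass
class FrequencyRecord:
    """Shape shared by short-time frequency information 500 (FIG. 5)
    and burst feature information 510 (FIG. 6)."""
    word_id: str                        # word ID 501 / 511
    word: str                           # word 502 / 512
    period: str                         # period 503 / 513
    count: int                          # appearance count 504 / 514
    normalized: Optional[float] = None  # 505 stays N/A; 515 is filled in later
```

Using one record type for both 500 and 510 reflects that the two tables list the same columns; a real system might keep them separate.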

The operation of the extraction device 100 with the functional configuration described above is now explained in detail with reference to FIGS. 7 to 9. FIGS. 7 to 9 are flowcharts illustrating an example of the operation of the extraction device according to the embodiment.

More specifically, FIG. 7 is a flowchart showing the real-time calculation of word burst features. FIG. 8 is a flowchart showing the extraction of related words according to the extraction condition (period and reference word). FIG. 9 is a flowchart showing the normalization of the appearance frequency of each word in the burst feature information.

As shown in FIG. 7, when the process of calculating word burst features in real time starts, the short-time frequency totaling unit 101 reads the stream data 200 (S11). Next, the short-time frequency totaling unit 101 performs morphological analysis on the stream data 200 it has read and extracts the list of words in the stream data 200 (S12).

For example, in FIG. 2, a word list including "tomorrow", "annular", "solar eclipse", and "expectation" is extracted from the document content 203 of document ID "0001".

Because the processing illustrated in FIGS. 7 to 9 is time-consuming, the use of a framework capable of real-time distributed processing (for example, Jubatus (registered trademark)) is assumed.

Next, the loop process of S13 to S21 is performed for each word in the word list extracted in S12.

When the loop process starts (S13), the short-time frequency totaling unit 101 reads the dictionary information 300 and determines whether the word in the word list currently being processed in the loop exists in the dictionary (S14). If the word does not exist in the dictionary (S14: NO), the short-time frequency totaling unit 101 proceeds to S21 and moves on to the loop process for the next word.

If the word exists in the dictionary (S14: YES), the short-time frequency totaling unit 101 calculates the number of appearances of the word within the designated short time and holds the word ID, the word, the designated short time, and the calculated number of appearances in memory as short-time frequency information 500 (S15).

In this embodiment, the short time is a partial period of 10 minutes, and when the document announcement period 202 of the stream data 200 matches the period 503 of a word's short-time frequency information 500, the occurrence is added to that word's short-time appearance frequency. For example, for the stream data 200 with document ID "0001", the document announcement period 202 is "2012/05/20 18:31:01 to 2012/05/20 18:40:00" (see FIG. 2), so the appearance count 504 for the period "2012/05/20 18:31:01 to 2012/05/20 18:40:00" of the word "solar eclipse" in the short-time frequency information 500 is incremented by one. The 10-minute value used in this example is only one possible setting; an appropriate value can be chosen for the system to which the device is applied.
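The dictionary check and per-window counting of S14 and S15 can be sketched as follows. This is an illustration under stated assumptions, not the patent's implementation: documents are assumed to arrive already tokenized (the morphological analysis of S12 is omitted), each document carries a single timestamp, and windows are aligned by flooring the timestamp to a 10-minute boundary, which differs from the period boundaries shown in the FIG. 2 example. The function names are hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta


def window_start(ts: datetime, minutes: int = 10) -> datetime:
    """Floor a timestamp to the start of its window (assumed alignment)."""
    return ts - timedelta(
        minutes=ts.minute % minutes,
        seconds=ts.second,
        microseconds=ts.microsecond,
    )


def count_short_time(docs, dictionary, minutes: int = 10) -> Counter:
    """Count dictionary words per short-time window.

    docs: iterable of (timestamp, list_of_words) pairs, already tokenized.
    dictionary: set of words registered in advance (dictionary information 300).
    Returns a Counter keyed by (word, window_start).
    """
    counts = Counter()
    for ts, words in docs:
        start = window_start(ts, minutes)
        for word in words:
            if word in dictionary:            # S14: skip words not in the dictionary
                counts[(word, start)] += 1    # S15: add to the window's count
    return counts
```

For instance, two documents mentioning a dictionary word at 18:31:01 and 18:39:00 fall into the same window under this alignment, while one at 18:41:00 starts a new window.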

Next, the long-time frequency calculation/update unit 104 acquires the current long-time frequency information for the word from the long-time frequency information 400 recorded in the long-time frequency information recording unit 103 (S16). Specifically, when the word is "solar eclipse", the average frequency 403 and the observation time 404 for "solar eclipse" with word ID "0001" are acquired (see FIG. 4).

Next, based on the short-time appearance count calculated in S15 and the long-time frequency information acquired in S16, the long-time frequency calculation/update unit 104 uses the following equations (1) and (2) to calculate the latest long-time appearance frequency and the latest observation time (the latest long-time frequency information) of the word (S17).

Next, the long-time frequency calculation/update unit 104 outputs the latest long-time frequency information calculated in S17 to the long-time frequency information recording unit 103, which stores it as the long-time frequency information 400 (S18).
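Equations (1) and (2) are not reproduced in this text, so the update of S16 to S18 can only be illustrated with a stand-in. The sketch below assumes an incremental mean over the number of short-time windows observed so far; this is an assumption chosen for illustration and should not be read as the patent's formulas. The function name and the choice to measure observation time in windows are likewise hypothetical.

```python
from typing import Tuple


def update_long_time(
    average: float, windows_observed: int, new_count: int
) -> Tuple[float, int]:
    """Fold one short-time count into a running average.

    ASSUMPTION: a plain incremental mean stands in for equations (1) and (2),
    which are not available in this text.
    """
    total = average * windows_observed + new_count  # reconstruct the running sum
    windows_observed += 1                           # one more window observed
    return total / windows_observed, windows_observed
```

An incremental mean needs only the current average and the observation time, which matches the two fields kept in the long-time frequency information 400, but other update rules (for example, exponential decay) would fit the same fields.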

Next, the burst detection/determination unit 102 detects a burst of the word in the short time using the word's short-time appearance count and its current long-time appearance frequency (appearance count). Specifically, the burst detection/determination unit 102 determines whether the following expression (3) is satisfied (S19) and thereby judges whether the short-time appearance count constitutes a word burst. If expression (3) is satisfied and the short-time appearance count is a word burst (S19: YES), the process proceeds to S20. If expression (3) is not satisfied (S19: NO), the process proceeds to S21.

Although α in expression (3) is set to "2" in this embodiment, an appropriate value can be chosen for the system to which the device is applied.
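Expression (3) itself is not reproduced in this text. Given that it compares the short-time count with the long-time frequency and contains a factor α set to 2, one plausible reading is a simple threshold test, sketched below. The exact inequality (strict comparison, multiplication by α) is an assumption, and the function name is hypothetical.

```python
def is_burst(short_count: int, long_time_average: float, alpha: float = 2.0) -> bool:
    """Judge whether a short-time count is a burst (S19).

    ASSUMPTION: a burst is a short-time count exceeding alpha times the
    long-time average; the patent's actual expression (3) is not available here.
    """
    return short_count > alpha * long_time_average
```

Under this reading, with a long-time average of 100 per window, a window with 250 appearances would count as a burst and one with 150 would not.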

In S20, the burst detection/determination unit 102 outputs burst feature information 510, in which the short-time appearance count of the word judged to be a burst is the burst feature, to the burst feature information recording unit 105, and the burst feature information recording unit 105 stores the output burst feature information 510.

As shown in FIG. 8, when the process of extracting related words according to the extraction condition (period and reference word) starts, the related word extraction unit 106 acquires the period and the reference word specified as the extraction condition by the user of the extraction device 100 (another system or a person) (S31).

In this embodiment, the process is described on the assumption that the extraction condition specifies the target period "2012/05/19 00:00:00 to 2012/06/09 24:00:00" and the reference word "sunglasses".

Next, the related word extraction unit 106 acquires, from the burst feature information 510 in the burst feature information recording unit 105, the burst feature information (records) of all words within the period specified as the extraction condition (S32). The related word extraction unit 106 then normalizes the burst features of each word within that period (S33).

Specifically, as shown in FIG. 9, when the processing of S33 starts, a first loop process of S41 to S48 is performed for each word.

When the first loop process starts (S41), the related word extraction unit 106 takes, from all records of the word, the record whose period 513 is the minimum (that is, the earliest) as the reference record (S42). For example, in FIG. 6, for the word "solar eclipse" with word ID "0001", the record with the minimum period within the designated period "2012/05/19 00:00:00 to 2012/06/09 24:00:00" is the first record, so the first record of the burst feature information 510 becomes the reference record for "solar eclipse". Reference records are determined in the same way for "lunar eclipse", "Venus", "sunglasses", and so on.

Next, the related word extraction unit 106 sets the normalized value 515 of the reference record to "1" (S43). For example, for the word "solar eclipse", the normalized value 515 of the first record of the burst feature information 510 is set to "1".

Next, to normalize the appearance counts relative to the reference record taken as "1", the related word extraction unit 106 performs a second loop process of S44 to S47 for each record other than the reference record.

Specifically, the related word extraction unit 106 computes the quotient "appearance count of the current record / appearance count of the reference record" (S45). For example, in FIG. 6, the appearance count 514 of the second record of the burst feature information 510 is "250", and the appearance count 514 of the reference record (the first record) is "230". Dividing the current record's "250" by the reference record's "230" gives 250/230 = 1.087.

Next, the related word extraction unit 106 sets the result of S45 as the normalized value 515 of the current record (S46). In the example above, "1.087" is set as the normalized value 515 of the second record of the burst feature information 510.
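The normalization of S41 to S48 for a single word can be sketched as below. The steps follow the description (earliest record as reference, each count divided by the reference count); the function name and the representation of a record as a (period, count) pair are assumptions for illustration, and periods are assumed to sort chronologically.

```python
def normalize_word_records(records):
    """Normalize one word's burst records against its earliest record.

    records: list of (period, count) pairs for a single word within the
    target period; periods must be comparable so that the minimum is the
    earliest (e.g. datetimes or zero-padded timestamp strings).
    Returns a list of (period, normalized_value) in chronological order.
    """
    if not records:
        return []
    ordered = sorted(records, key=lambda r: r[0])  # S42: earliest period first
    base_count = ordered[0][1]                     # reference record's count
    # S43 sets the reference record to 1 (count / itself);
    # S45-S46 divide every other count by the reference count.
    return [(period, count / base_count) for period, count in ordered]
```

With the FIG. 6 values, a reference count of 230 followed by a count of 250 yields normalized values of 1.0 and approximately 1.087.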

Returning to FIG. 8, after S33, the related word extraction unit 106 calculates, for each word other than the reference word (hereinafter, non-reference word), the similarity between the non-reference word and the reference word based on the following equation (4), using the normalized burst feature values of each word for each short time (S34).

Specifically, as shown in equation (4), with the variable k indexing the short times from 1 to n, the normalized value of the non-reference word at short time k is compared with the normalized value of the reference word at short time k in time-series order, and the similarity of the time-series transitions of the non-reference word and the reference word is calculated.

Next, the related word extraction unit 106 sorts the words in descending order of the similarity calculated in S34 (S35). The related word extraction unit 106 then returns the sorted words as the extracted related words (S36). Specifically, a fixed number of words (for example, 30) from the head of the sorted word list are returned as the related words associated with the reference word within the period given as the extraction condition. Although this embodiment is set to output the first 30 words of the sorted list as related words, this is only one example; an appropriate value can be chosen for the system to which the device is applied. When the related words are returned to the system in S36, they are reported to the user, for example by being displayed on the system.
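The scoring, sorting, and top-N selection of S34 to S36 can be sketched as follows. Equation (4) is not reproduced in this text, so the similarity used here (the reciprocal of one plus the mean absolute difference of normalized values at matching positions k) is a stand-in chosen for illustration, not the patent's measure. The sketch also assumes each word's normalized values are already aligned as equal-position lists; all names are hypothetical.

```python
def similarity(a, b):
    """Stand-in similarity between two normalized series.

    ASSUMPTION: 1 / (1 + mean absolute difference) over positions k = 1..n;
    the patent's equation (4) is not available here.
    """
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    mean_abs_diff = sum(abs(x - y) for x, y in zip(a, b)) / n
    return 1.0 / (1.0 + mean_abs_diff)


def related_words(series, reference_word, top_n=30):
    """Rank non-reference words by similarity to the reference word.

    series: dict mapping each word to its list of normalized values.
    Returns the top_n words in descending order of similarity (S35, S36).
    """
    reference = series[reference_word]
    scored = [
        (word, similarity(values, reference))      # S34
        for word, values in series.items()
        if word != reference_word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # S35
    return [word for word, _ in scored[:top_n]]          # S36
```

With a reference series of [1.0, 1.2, 0.8] for "sunglasses", a word whose series is [1.0, 1.1, 0.8] would rank above one whose series is [1.0, 3.0, 5.0], since its transition tracks the reference more closely.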

As described above, for each predetermined word, the extraction device 100 totals, for every short time, the frequency with which the word appears in the input stream data 200. It accumulates the totaled per-short-time appearance frequency of each word and records the long-time frequency information 400 of each word up to the present. Based on the totaled per-short-time appearance frequency of each word and the recorded appearance frequency of each word up to the present, it detects word bursts in a short time and records, for each word for which a burst is detected, burst feature information 510 including the appearance frequency in that short time. It then receives a period serving as an extraction condition and a reference word, normalizes the appearance frequency of each word in the burst feature information 510 within the received period, and extracts as related words those words whose normalized appearance frequency is similar to the normalized appearance frequency of the reference word.

Consequently, the extraction device 100 can accurately extract keywords (words) related to the reference word. Moreover, because the stream data 200 is large in volume, recording all time-series change information within the observed period for every word contained in the stream data 200 would make the discovery of related words slow and could become a bottleneck in time-series analysis. However, because the device records the short-time appearance frequencies of words as burst feature information 510, latent related words can be found efficiently.

In addition, the extraction device 100 extracts as related words those words whose normalized appearance frequency changes in time-series order in a way similar to that of the reference word. By analyzing the time-series rate of change of the appearance frequencies of the words in the stream data 200, it can, even when a large number of words are present, effectively extract temporarily related words that would normally be buried because they have no linguistic relationship to the reference word.

For example, suppose that a "solar eclipse", a "lunar eclipse", and a "transit of Venus" all occur within a single period such as three months, as examples of general events and topics. In this case, a conventional extraction method can extract "lunar eclipse", which is directly related to "solar eclipse", as a related word. However, it could not extract "Venus" as a related word. "Venus" pertains to the "transit of Venus", which is not directly related to "solar eclipse" but is likewise observed using "sunglasses".

In contrast, the embodiment described above can extract "Venus" as a related word on the basis of the time-series similarity between the number of appearances of "solar eclipse", which is observed using "sunglasses", and the number of appearances of "Venus", which is likewise observed using "sunglasses".
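The solar eclipse and Venus example can be illustrated with a small sketch of the normalization and similarity comparison. The normalization (dividing by the peak count in the period), the similarity measure (one minus the mean absolute difference), the threshold, and all counts below are assumptions made for illustration; the sketch only shows why normalization lets a low-volume word match a high-volume reference word with the same time-series shape.

```python
def normalize(series):
    """Scale a count series so that its peak is 1.0 (ASSUMED normalization)."""
    peak = max(series)
    if peak == 0:
        return [0.0] * len(series)
    return [x / peak for x in series]


def similarity(a, b):
    """ASSUMED measure: 1 minus the mean absolute difference of two
    normalized series of equal length (1.0 means identical shape)."""
    return 1.0 - sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def related_words(burst_counts, reference, threshold=0.8):
    """Return (word, score) pairs similar to the reference, best first."""
    ref = normalize(burst_counts[reference])
    scored = []
    for word, series in burst_counts.items():
        if word == reference:
            continue
        score = similarity(ref, normalize(series))
        if score >= threshold:
            scored.append((word, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up counts per short window within the accepted period;
# windows without a recorded burst are treated as 0.
burst_counts = {
    "solar eclipse": [0, 40, 100, 20],
    "Venus": [0, 4, 10, 2],          # far fewer posts, same shape
    "lunar eclipse": [0, 30, 100, 40],
    "ramen": [10, 0, 0, 10],         # unrelated pattern
}

result = related_words(burst_counts, "solar eclipse")
for word, score in result:
    print(word, round(score, 3))
```

In this toy data "Venus" appears ten times less often than "solar eclipse", yet it ranks first after normalization because only the shape of the series is compared. A scale-invariant measure such as cosine similarity would be another reasonable choice.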

FIG. 10 is a diagram showing that the processing in the extraction device 100 according to the embodiment is concretely realized using a computer. As illustrated in FIG. 10, the computer 1000 includes, for example, a memory 1010, a CPU 1020, a hard disk drive interface 1030, a disk drive interface 1040, a serial port interface 1050, a video adapter 1060, and a network interface 1070, and these units are connected by a bus 1080.

The memory 1010 includes a ROM (Read Only Memory) 1011 and a RAM (Random Access Memory) 1012. The ROM 1011 stores, for example, a boot program such as a BIOS (Basic Input Output System). The hard disk drive interface 1030 is connected to a hard disk drive 1031. The disk drive interface 1040 is connected to a disk drive 1041. A removable storage medium such as a magnetic disk or an optical disk is inserted into the disk drive 1041. The serial port interface 1050 is connected to, for example, a mouse 1051 and a keyboard 1052. The video adapter 1060 is connected to, for example, a display 1061.

The hard disk drive 1031 stores, for example, an OS 1091, an application program 1092, a program module 1093, and program data 1094. That is, the program described above is stored, for example in the hard disk drive 1031, as a program module 1093 in which the instructions to be executed by the computer 1000 are described. For example, a program module 1093 for executing information processing equivalent to the functional configuration illustrated in FIG. 1 is stored in the hard disk drive 1031.

The setting data required for the processing in the embodiment described above is stored as the program data 1094, for example in the memory 1010 or the hard disk drive 1031. The CPU 1020 reads the program module 1093 and the program data 1094 stored in the memory 1010 or the hard disk drive 1031 into the RAM 1012 as necessary and executes them.

The program module 1093 and the program data 1094 need not be stored in the hard disk drive 1031. For example, they may be stored in a removable storage medium and read by the CPU 1020 via the disk drive or the like. Alternatively, the program module 1093 and the program data 1094 may be stored in another computer connected via a network (a LAN (Local Area Network), a WAN (Wide Area Network), or the like) and read by the CPU 1020 via the network interface 1070.

DESCRIPTION OF SYMBOLS: 100 ... extraction device, 101 ... short-term frequency counting unit, 102 ... burst detection/determination unit, 103 ... long-term frequency information recording unit, 104 ... long-term frequency calculation/update unit, 105 ... burst feature information recording unit, 106 ... related word extraction unit, 200 ... stream data, 201 ... document ID, 202 ... document publication period, 203 ... document content, 300 ... dictionary information, 301, 401, 501, 511 ... word ID, 302, 402, 502, 512 ... word, 400 ... long-term frequency information, 403 ... average frequency, 404 ... observation time, 500 ... short-term frequency information, 503, 513 ... period, 504, 514 ... number of appearances, 505, 515 ... normalized value, 510 ... burst feature information, 1000 ... computer.

Claims

An extraction device comprising:
a counting unit that counts, for each predetermined word and for each predetermined time period, the appearance frequency of the word in input stream data;
an appearance frequency recording unit that accumulates the counted appearance frequency of each word for each predetermined time period and records the appearance frequency of each word up to the present;
a burst detection unit that detects a burst of a word in the predetermined time period based on the counted appearance frequency of each word for each predetermined time period and the recorded appearance frequency of each word up to the present;
a burst feature information recording unit that records, for each word in which the burst is detected, burst feature information including the appearance frequency in the predetermined time period; and
a related word extraction unit that accepts a period and a reference word as extraction conditions, normalizes the appearance frequency of each word in the burst feature information within the period, and extracts, as related words, words whose normalized appearance frequency is similar to the normalized appearance frequency of the reference word.

The extraction device according to claim 1, wherein the related word extraction unit extracts, as related words, words whose normalized appearance frequency changes in time-series order in a way similar to the time-series transition of the normalized appearance frequency of the reference word.

The extraction device according to claim 1, wherein the related word extraction unit sorts the extracted related words by their rank of similarity to the reference word and outputs them.

An extraction method executed by an extraction device, the method comprising:
a counting step of counting, for each predetermined word and for each predetermined time period, the appearance frequency of the word in input stream data;
an appearance frequency recording step of accumulating the counted appearance frequency of each word for each predetermined time period and recording the appearance frequency of each word up to the present;
a burst detection step of detecting a burst of a word in the predetermined time period based on the counted appearance frequency of each word for each predetermined time period and the recorded appearance frequency of each word up to the present;
a burst feature information recording step of recording, for each word in which the burst is detected, burst feature information including the appearance frequency in the predetermined time period; and
a related word extraction step of accepting a period and a reference word as extraction conditions, normalizing the appearance frequency of each word in the burst feature information within the period, and extracting, as related words, words whose normalized appearance frequency is similar to the normalized appearance frequency of the reference word.

A program that causes a computer of an extraction device to execute:
a counting step of counting, for each predetermined word and for each predetermined time period, the appearance frequency of the word in input stream data;
an appearance frequency recording step of accumulating the counted appearance frequency of each word for each predetermined time period and recording the appearance frequency of each word up to the present;
a burst detection step of detecting a burst of a word in the predetermined time period based on the counted appearance frequency of each word for each predetermined time period and the recorded appearance frequency of each word up to the present;
a burst feature information recording step of recording, for each word in which the burst is detected, burst feature information including the appearance frequency in the predetermined time period; and
a related word extraction step of accepting a period and a reference word as extraction conditions, normalizing the appearance frequency of each word in the burst feature information within the period, and extracting, as related words, words whose normalized appearance frequency is similar to the normalized appearance frequency of the reference word.