JP2014160358A

JP2014160358A - Time series data component decomposition apparatus, method and program and recording medium

Info

Publication number: JP2014160358A
Application number: JP2013030469A
Authority: JP
Inventors: Kyosuke Nishida; 京介西田; Hiroyuki Toda; 浩之戸田; Masashi Uchiyama; 匡内山
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2013-02-19
Filing date: 2013-02-19
Publication date: 2014-09-04
Anticipated expiration: 2033-02-19
Also published as: JP5937529B2

Abstract

PROBLEM TO BE SOLVED: To decompose time-series data into components with high accuracy without using keywords. SOLUTION: Each observed value in the time-series data is aggregated for each predetermined time interval (S106). A document set with time information is clustered on the basis of the content of each document, without using an external criterion, to detect a plurality of clusters, and the degree to which each document belongs to each cluster is calculated (S102). These degrees of membership are aggregated for each time interval (S104). A regression equation relating an explanatory variable to an objective variable is estimated by regression analysis, with the document aggregation result as the explanatory variable and the time-series data aggregation result as the objective variable (S108). The component ratio of each cluster is calculated for each time interval from the regression estimates and the document aggregation result (S110), and each observed value in the time-series data is decomposed into per-cluster components (S112).

Description

The present invention relates to a time-series data component decomposition apparatus, method, program, and recording medium for decomposing time-series data into components.

Microblogs, typified by Twitter (registered trademark), in which users post short texts mainly about their own situation and passing impressions, have spread widely because they are easy to update and highly real-time, and they continue to develop as a new information infrastructure. Against this background, techniques that use microblogs as a "sensor" for capturing events occurring in the real world have been developed in recent years.

For example, Non-Patent Document 1 below discloses a technique for predicting the extent of an influenza epidemic from the volume of tweets containing influenza-related keywords. Such a technique gives a clue as to what background generated the time-series data observed in the real world.

Aron Culotta: Towards detecting influenza epidemics by analyzing Twitter messages. SOMA Workshop 2010.

The conventional method of Non-Patent Document 1 counts tweets that match specified keywords, so the keywords related to the generation of the time-series data must be prepared exhaustively. Preparing keywords exhaustively is difficult, however, and when the keywords are inadequate, the understanding of the process that generates the time-series data is inadequate as well.

For example, consider decomposing the number of users of a certain airport into components, that is, breaking down the reasons why people used the airport. People visit an airport for many reasons, such as "going on a business trip", "leaving on a honeymoon", "having fun", "having a meal", and "shopping", so it is very difficult to prepare a keyword set that covers them all, and the accuracy of the component decomposition of the time-series data drops.

Conventional keyword-matching methods also have the problem that information is lost from documents that do not contain the keywords but are nevertheless related to the time-series data.

The present invention has been made to solve the above problems, and its object is to provide a time-series data component decomposition apparatus, method, program, and recording medium that can decompose time-series data into components with high accuracy without using keywords.

To achieve the above object, a time-series data component decomposition apparatus according to the present invention includes: a time-series data aggregation unit that aggregates, for each predetermined time interval, the observed values included in time-series data, which is a series of observed values obtained by observing the temporal change of a certain phenomenon; a document clustering unit that clusters a document set with time information, which is a set of documents to which creation time information has been attached, on the basis of the content of each document and without using an external criterion, to find a plurality of clusters, and calculates, for each document, the degree to which it belongs to each of the plurality of clusters; a document aggregation unit that aggregates, for each time interval, the degree to which each document in the document set with time information belongs to each cluster, on the basis of the degrees of membership calculated by the document clustering unit and the creation time information attached to each document; a regression analysis unit that, taking for each time interval the aggregation result of the document aggregation unit as the explanatory variable and the aggregation result of the time-series data aggregation unit as the objective variable, estimates by regression analysis a regression equation representing the relationship between the explanatory variable and the objective variable; a component ratio calculation unit that calculates the ratio of the component of each cluster for each time interval, using the constant term and regression coefficients of the regression equation estimated by the regression analysis unit and the per-interval aggregation result of the document aggregation unit; and a component decomposition unit that decomposes each observed value included in the time-series data into the components of the clusters, using the cluster component ratios calculated by the component ratio calculation unit for the time interval corresponding to the observation time of that observed value.

Thus, the time-series data component decomposition apparatus according to the present invention decomposes the time-series data according to the component ratios of clusters derived from document content rather than according to individual keywords, so the time-series data can be decomposed into components with high accuracy.

The apparatus may further include an emotion estimation unit that estimates, for each document in the document set with time information, the degree of each emotion contained in the document, as a value indicating the emotion on which the document was written. In that case, the document aggregation unit aggregates, for each time interval, the degree to which each document in the document set with time information belongs to each pair of cluster and emotion, on the basis of the degrees of cluster membership calculated by the document clustering unit, the creation time information attached to each document, and the estimation result of the emotion estimation unit; the component ratio calculation unit calculates the ratio of the component of each pair of cluster and emotion for each time interval, using the constant term and regression coefficients of the regression equation estimated by the regression analysis unit and the per-interval aggregation result of the document aggregation unit; and each observed value included in the time-series data is decomposed into the components of the pairs of cluster and emotion, using the component ratios calculated by the component ratio calculation unit for the time interval corresponding to the observation time of that observed value.

Because the time-series data is thus decomposed according to the component ratios of pairs of cluster and emotion derived from document content, the time-series data can be decomposed into components with still higher accuracy.

A time-series data component decomposition method according to the present invention is a method performed in a time-series data component decomposition apparatus that includes a time-series data aggregation unit, a document clustering unit, a document aggregation unit, a regression analysis unit, a component ratio calculation unit, and a component decomposition unit. In the method, the time-series data aggregation unit aggregates, for each predetermined time interval, the observed values included in time-series data, which is a series of observed values obtained by observing the temporal change of a certain phenomenon; the document clustering unit clusters a document set with time information, which is a set of documents to which creation time information has been attached, on the basis of the content of each document and without using an external criterion, to find a plurality of clusters, and calculates, for each document, the degree to which it belongs to each of the plurality of clusters; the document aggregation unit aggregates, for each time interval, the degree to which each document in the document set with time information belongs to each cluster, on the basis of the degrees of membership calculated by the document clustering unit and the creation time information attached to each document; the regression analysis unit, taking for each time interval the aggregation result of the document aggregation unit as the explanatory variable and the aggregation result of the time-series data aggregation unit as the objective variable, estimates by regression analysis a regression equation representing the relationship between the explanatory variable and the objective variable; the component ratio calculation unit calculates the ratio of the component of each cluster for each time interval, using the constant term and regression coefficients of the regression equation estimated by the regression analysis unit and the per-interval aggregation result of the document aggregation unit; and the component decomposition unit decomposes each observed value included in the time-series data into the components of the clusters, using the cluster component ratios calculated by the component ratio calculation unit for the time interval corresponding to the observation time of that observed value.

The document set with time information may be a set of documents each of which contains a specific keyword.

The document set with time information may also be a set of documents created and posted from terminals located within a position range designated in advance.

The document set with time information may also be a set of microblog documents, in which users write and post texts of no more than a predetermined number of characters describing their own real-time situation or passing impressions.

A program according to the present invention is a program for causing a computer to function as each unit of the above time-series data component decomposition apparatus.

A recording medium according to the present invention is a computer-readable recording medium on which is recorded a program for causing a computer to function as each unit of the above time-series data component decomposition apparatus.

As described above, the present invention has the effect that time-series data can be decomposed into components with high accuracy without using keywords.

A diagram showing the functional configuration of the time-series data component decomposition apparatus according to the first embodiment.
A schematic block diagram of a computer that functions as the time-series data component decomposition apparatus.
A flowchart showing the processing routine executed by the time-series data component decomposition apparatus according to the first embodiment.
A diagram schematically showing the processing performed by the time-series data component decomposition apparatus according to the first embodiment.
A diagram showing the functional configuration of the time-series data component decomposition apparatus according to the second embodiment.
A flowchart showing the processing routine executed by the time-series data component decomposition apparatus according to the second embodiment.

<First Embodiment>

FIG. 1 is a diagram showing the functional configuration of a time-series data component decomposition apparatus 10 according to one embodiment of the present invention.

The time-series data component decomposition apparatus 10 shown in FIG. 1 takes time-series data and a document set with time information as input, decomposes the time-series data into a plurality of components, and outputs them. It includes an input unit 20, a calculation unit 22, and an output unit 24.

The input unit 20 includes a time-series data input unit 30 and a document set input unit 32.

The time-series data input unit 30 accepts input of time-series data.

Here, time-series data is a series of values (observed values) obtained by observing the temporal change of a certain phenomenon. Specific examples include stock price data and sales data. Each observed value in the time-series data is associated with observation date and time information indicating when it was observed. The time-series data can, for example, be collected automatically from the Internet or a similar source.

The document set input unit 32 accepts input of each document in the document set with time information.

The document set with time information is a set of document data (hereinafter simply "documents"), and each document carries creation time information (hereinafter simply "time information") indicating when it was created, attached for example as metadata. The document set with time information may be a set of documents each of which contains a specific keyword. It may also be a set of documents created and posted from terminals located within a position range designated in advance. It may furthermore be a set of microblog documents, in which users write and post texts of no more than a predetermined number of characters describing their own real-time situation or passing impressions.

The time information attached to each document may be, for example, information indicating a date and time, a day of the week, a combination of day of the week and time, or a date. The document set with time information can, for example, be collected from the Internet or a similar source and built in advance.

The calculation unit 22 includes a word division unit 40, a document clustering unit 42, a document aggregation unit 44, a time-series data aggregation unit 46, a regression analysis unit 48, a component ratio calculation unit 50, and a component decomposition unit 52.

The word division unit 40 divides each document in the document set with time information into words to generate a word set.

The document clustering unit 42 first finds, from the document set with time information whose documents have each been turned into a word set by the word division unit 40, a set of clusters each forming a semantic group, on the basis of the content of the documents and without using an external criterion. It then calculates and outputs, for each document in the document set, the degree to which the document belongs to each cluster. Clustering without an external criterion can use any technique generally known as an unsupervised data classification method.

The document aggregation unit 44 aggregates, for each time interval, the degree to which each document in the document set with time information belongs to each cluster. A time interval here is one of the sub-periods obtained by dividing the target period into a plurality of periods. For example, time intervals can be defined as the hour-long slots obtained by dividing a day evenly into 24 periods (the 0 o'clock hour (00:00:00 to 00:59:59), the 1 o'clock hour, ..., the 23 o'clock hour). They may also be dates (365 intervals), days of the week (7 intervals), or combinations of day of the week and hour slot (Monday 7 o'clock hour, Tuesday 10 o'clock hour, and so on; 7 × 24 = 168 intervals). The aggregation result of the document aggregation unit 44 is called the document aggregation result.
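The interval definitions listed above can be sketched as small mapping functions. This is only an illustration: the function names and the use of Python's `datetime` are assumptions, and numbering Monday as 0 is a convention of the standard library rather than anything stated in the text.

```python
from datetime import datetime

def hour_of_day(t: datetime) -> int:
    """24 intervals: 0 for 00:00:00 to 00:59:59, ..., 23 for the 23 o'clock hour."""
    return t.hour

def day_of_week(t: datetime) -> int:
    """7 intervals: 0 = Monday, ..., 6 = Sunday (datetime.weekday convention)."""
    return t.weekday()

def weekday_hour(t: datetime) -> int:
    """7 x 24 = 168 intervals: one per combination of day of week and hour slot."""
    return t.weekday() * 24 + t.hour

def delta(t: datetime, h: int, interval_of=hour_of_day) -> int:
    """Indicator used in the aggregation steps: 1 if t falls in interval h, else 0."""
    return 1 if interval_of(t) == h else 0
```

Whichever definition is chosen, the same one has to be applied to both the documents and the observed values so that the two aggregation results line up.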

The time-series data aggregation unit 46 aggregates the time-series data for each time interval. The definition of the time intervals used by the time-series data aggregation unit 46 is the same as the one used for aggregation by the document aggregation unit 44. The aggregation result of the time-series data aggregation unit 46 is called the time-series data aggregation result.

The regression analysis unit 48 performs regression analysis, taking for each time interval the document aggregation result of the document aggregation unit 44 as the explanatory variable and the time-series data aggregation result of the time-series data aggregation unit 46 as the objective variable.

The component ratio calculation unit 50 calculates the component ratio of each cluster for each time interval, using the regression analysis result of the regression analysis unit 48.

The component decomposition unit 52 decomposes the time-series data into a plurality of components, using the per-interval component ratios calculated by the component ratio calculation unit 50.

The output unit 24 outputs the decomposition result (component decomposition values) of the component decomposition unit 52.

As shown in FIG. 2, the time-series data component decomposition apparatus 10 can be implemented as a computer 200 that includes a CPU (Central Processing Unit) 201, a ROM (Read Only Memory) 202 storing, among other things, the programs with which the CPU executes the processing routines described later, and a RAM (Random Access Memory) 203. The computer 200 also includes a communication interface (IF) 204, an input/output IF 205, and a hard disk drive 206. The communication IF 204 is an interface for connecting to a network 210. The input/output IF 205 is connected to a display 208 and a keyboard 209.

The CPU 201 reads and executes a program stored in a recording medium such as the ROM 202 or a hard disk, so that the hardware and the program cooperate to realize the functions described above.

Next, the operation of the time-series data component decomposition apparatus 10 of this embodiment is described in detail. When the time-series data input unit 30 and the document set input unit 32 accept input of time-series data and a document set with time information, the time-series data component decomposition apparatus 10 executes the processing routine shown in FIG. 3.

In step 100, the word division unit 40 divides each document d_i in the document set with time information accepted by the document set input unit 32 into words to generate a word set. A morphological analyzer may be used to extract only nouns as the word set, or only nouns, verbs, and adjectives. Words of other parts of speech may also be added to the word set. Although morphological analysis is described here, all character n-grams (runs of n consecutive characters) contained in each document d_i may be used as the word set instead.
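The character n-gram alternative mentioned at the end of step 100 needs no morphological analyzer and can be sketched in a few lines. The function name and the default n = 2 are illustrative assumptions; the text does not fix a value of n.

```python
from typing import List

def char_ngrams(text: str, n: int = 2) -> List[str]:
    """Return every run of n consecutive characters in text, in order.

    A text shorter than n characters yields an empty list.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    return [text[i:i + n] for i in range(len(text) - n + 1)]
```

For instance, `char_ngrams("空港へ行く", 2)` yields the four bigrams 空港, 港へ, へ行, 行く, which would then stand in for the word set of that document.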

In step 102, the document clustering unit 42 finds, from the document set with time information for which the word set of each document has been generated, a set of clusters each forming a semantic group, on the basis of document content and without using an external criterion, and outputs, for each document d_i in the document set, the degree P(c|d_i) to which it belongs to each cluster. Here c is an index identifying a cluster and takes values from 1 to the number of clusters C.

Here, the relationship shown in formula (1) below holds.
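The formula itself does not survive in this text. Since P(c|d_i) is a degree of membership spread over the C clusters, formula (1) is presumably the normalization constraint; the following is a reconstruction under that assumption:

```latex
\sum_{c=1}^{C} P(c \mid d_i) = 1 \qquad (1)
```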

Here, any algorithm capable of outputting the degree to which each data item belongs to each cluster, known as soft clustering (see, for example, Reference 1: Hiroyuki Shinno, "Cluster Analysis Learned by R", Ohmsha, 2007), can be used, such as fuzzy c-means, mixture distribution models, pLSI, LDA, or NMF. The number of clusters C is a parameter given in advance and is set, for example, to C = 20.

In step 104, the document aggregation unit 44 aggregates, for each time interval h, the value indicating the degree to which each document d_i in the document set with time information belongs to each cluster (for convenience, this value is hereinafter called the number of documents). Specifically, with t_i denoting the time at which document d_i was posted, the number of documents n(c,h) belonging to cluster c in time interval h is defined as in formula (2) below and is calculated for every combination <c,h>.

Here, D is the number of documents in the document set with time information, and δ(t_i,h) is a function that returns 1 when time t_i falls within time interval h and 0 otherwise.

The document aggregation unit 44 outputs the calculated numbers of documents n(c,h) as the document aggregation result.

As described above, time interval h may be an hour-long slot, a date (365 intervals), a day of the week (7 intervals), a combination of day of the week and hour slot (Monday 7 o'clock hour, Tuesday 10 o'clock hour, and so on; 7 × 24 = 168 intervals), or the like.
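Formula (2) is not reproduced in this text. Read from the definitions of D and δ given for step 104, it is presumably $n(c,h) = \sum_{i=1}^{D} P(c \mid d_i)\,\delta(t_i,h)$, that is, a sum of soft memberships over the documents posted in interval h. A minimal sketch of that reading follows; the function name and the list-based input layout are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

def aggregate_documents(
    memberships: Sequence[Sequence[float]],  # memberships[i][c-1] = P(c | d_i)
    intervals: Sequence[int],                # intervals[i] = interval h containing t_i
) -> Dict[Tuple[int, int], float]:
    """Return n[(c, h)], the summed membership of cluster c in interval h.

    Clusters are numbered 1..C as in the text. Looking up each document's
    interval plays the role of delta(t_i, h).
    """
    n: Dict[Tuple[int, int], float] = defaultdict(float)
    for p_i, h in zip(memberships, intervals):
        for c, p in enumerate(p_i, start=1):
            n[(c, h)] += p
    return dict(n)
```

Because each document's memberships sum to 1, the values n(c,h) for a fixed h add up to the number of documents posted in that interval, which is why the text can call them "numbers of documents".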

In step 106, the time-series data aggregation unit 46 aggregates the observed values x_j of the time-series data, where x_j is the j-th observed value, by taking their average for each time interval h.

Specifically, with t_j denoting the time at which observed value x_j was observed (the observation time), the average m(h) of the observed values x_j in time interval h is defined as in formula (3) below, and formula (3) is calculated for every time interval h.

Here, J is the number of observed values in the time-series data, and N_h is the number of observed values x_j for which δ(t_j,h) = 1. Besides the average, the aggregate may be the maximum, the minimum, the sum, or the like.

Here the time-series data is aggregated according to the observation time t_j of each observed value x_j, following the definition of time interval h; depending on that definition, it may instead be aggregated according to the observation date, the day of the week of observation, and so on.

The time-series data aggregation unit 46 outputs the calculated m(h) as the time-series data aggregation result.
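Formula (3) is likewise missing from this text. From the definitions of J and N_h it is presumably $m(h) = \frac{1}{N_h}\sum_{j=1}^{J} x_j\,\delta(t_j,h)$. The sketch below implements that reading together with the alternative aggregates the text mentions; the function name and the `how` parameter are illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Sequence

def aggregate_series(
    values: Sequence[float],   # values[j] = observed value x_j
    intervals: Sequence[int],  # intervals[j] = interval h containing t_j
    how: str = "mean",         # "mean", "max", "min" or "sum"
) -> Dict[int, float]:
    """Return m[h], the aggregate of the observed values falling in interval h."""
    groups: Dict[int, List[float]] = defaultdict(list)
    for x, h in zip(values, intervals):
        groups[h].append(x)
    reducers = {
        "mean": lambda xs: sum(xs) / len(xs),  # len(xs) plays the role of N_h
        "max": max,
        "min": min,
        "sum": sum,
    }
    reduce_fn = reducers[how]
    return {h: reduce_fn(xs) for h, xs in groups.items()}
```

Intervals with no observed value are simply absent from the result, so they drop out of the regression in the next step.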

In step 108, the regression analysis unit 48 performs regression analysis with the document aggregation result n(c,h) of the document aggregation unit 44 as the explanatory variable and the time-series data aggregation result m(h) of the time-series data aggregation unit 46 as the objective variable. The linear regression model relating m(h) to n(c,h) is defined by formula (4) below.

The constant term and regression coefficients (β_0, β_1, ..., β_C) can be derived by, for example, multiple linear regression, lasso regression, or ridge regression as described in Reference 2 (Shuichi Kawano, Kei Hirose, Shohei Tateishi, Sadanori Konishi, "Recent Developments in Regression Modeling and L1-Type Regularization", Journal of the Japan Statistical Society, 39(2), pp. 211-242, 2010).
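Formula (4) is not reproduced here. Given that the coefficients are listed as (β_0, β_1, ..., β_C) with no index h, the model is presumably $m(h) = \beta_0 + \sum_{c=1}^{C} \beta_c\, n(c,h)$ up to an error term, with each time interval supplying one sample. The sketch below fits that model by ordinary least squares, with an optional ridge penalty, using only the standard library. It is an illustration under those assumptions, not the estimator the authors used, and it does not cover lasso.

```python
from typing import List, Sequence

def fit_linear(X: Sequence[Sequence[float]], y: Sequence[float],
               ridge: float = 0.0) -> List[float]:
    """Fit y ~ b0 + sum_c b_c * x_c and return [b0, b1, ..., bC].

    X[s] holds n(1,h), ..., n(C,h) for the s-th interval and y[s] holds m(h).
    ridge > 0 adds an L2 penalty on b1..bC (the constant term is not penalized).
    """
    rows = [[1.0] + [float(v) for v in r] for r in X]  # prepend intercept column
    k = len(rows[0])
    # Normal equations (Z^T Z + ridge * I') b = Z^T y, stored as an augmented matrix.
    M = [
        [sum(r[i] * r[j] for r in rows) + (ridge if i == j and i > 0 else 0.0)
         for j in range(k)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(M[r][col]))
        if abs(M[piv][col]) < 1e-12:
            raise ValueError("singular system; add intervals or use ridge > 0")
        M[col], M[piv] = M[piv], M[col]
        for r in range(k):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * b for a, b in zip(M[r], M[col])]
    return [M[i][k] / M[i][i] for i in range(k)]
```

With C = 20 clusters and only 24 hourly intervals the unregularized fit has very few degrees of freedom, which is one reason a regularized variant such as ridge or lasso may be preferable in practice.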

In step 110, the component ratio calculation unit 50 uses the regression coefficients estimated by the regression analysis unit 48 and the document totaling result to calculate the cluster component ratio r(c, h) for each time interval h according to equation (5) below, and outputs the result.

Here, equation (6) below is satisfied.
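Equations (5) and (6) are not reproduced in this text, so the following is only one plausible reading, stated as an assumption: each cluster's share is its fitted contribution β_c · n(c, h) divided by the sum of all such contributions, which makes the ratios for an interval sum to one. The claims list the constant term as an input to this step as well, so the published formula may treat β_0 differently; the sketch simply ignores it. It also assumes non-negative contributions.

```python
def component_ratios(beta, n_row):
    """Return [r(1,h), ..., r(C,h)] for one time interval h.

    ASSUMPTION (not the published equation (5)):
        r(c,h) = beta_c * n(c,h) / sum_c' beta_c' * n(c',h)
    beta is [beta_0, beta_1, ..., beta_C]; beta_0 is ignored here.
    n_row is [n(1,h), ..., n(C,h)].
    """
    contributions = [b * n for b, n in zip(beta[1:], n_row)]
    total = sum(contributions)
    if total == 0:
        raise ValueError("no cluster contributes in this interval")
    return [c / total for c in contributions]


# Hypothetical coefficients and document counts for C = 2 clusters.
ratios = component_ratios([1.0, 2.0, 3.0], [3.0, 2.0])
```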

In step 112, the component decomposition unit 52 decomposes the value x_j observed at time t_j into C components, as shown in equation (7) below, using the cluster component ratios r(c, h) that the component ratio calculation unit 50 calculated for the time interval h corresponding to time t_j.
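A minimal sketch of step 112 follows, assuming that each component is simply the observed value scaled by the corresponding ratio, so that the C components add back up to x_j whenever the ratios sum to one. Equation (7) is not reproduced in this text, and the function name and data are hypothetical.

```python
def decompose_series(observations, interval_of, ratios_by_interval):
    """Split each observation (t_j, x_j) into per-cluster components.

    ratios_by_interval[h] is the list [r(1,h), ..., r(C,h)].
    ASSUMPTION: component c of x_j is r(c,h) * x_j, with h the interval
    that contains t_j.
    """
    result = []
    for t_j, x_j in observations:
        ratios = ratios_by_interval[interval_of(t_j)]
        result.append((t_j, [r * x_j for r in ratios]))
    return result


# Hypothetical ratios for two 12-hour intervals and C = 3 clusters.
ratios_by_interval = {0: [0.5, 0.25, 0.25], 1: [0.25, 0.25, 0.5]}
parts = decompose_series([(3, 120.0), (15, 80.0)], lambda t: t // 12,
                         ratios_by_interval)
```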

The output unit 24 outputs the component decomposition result of the component decomposition unit 52.

FIG. 4 schematically shows the processing performed by the time-series data component decomposition apparatus 10 described above. As shown in FIG. 4, the time-series data component decomposition apparatus 10 of the present embodiment can decompose time-series data, which is a series of values obtained by observing the temporal change of a phenomenon, into a plurality of semantic components, and can be used for data mining, marketing, and the like.

As described above, the time-series data component decomposition apparatus according to the first embodiment totals the observed values included in the time-series data for each predetermined time interval; finds a plurality of clusters in the document set with time information on the basis of the content of each document, without using an external criterion; calculates, for each document, the degree to which it belongs to each of the clusters; totals, for each time interval, the degree to which the documents of the document set with time information belong to each cluster, on the basis of those degrees and the creation time information assigned to each document; estimates by regression analysis a regression equation representing the relationship between an explanatory variable and an objective variable, with the totaling result of the document set for each time interval as the explanatory variable and the totaling result of the time-series data as the objective variable; calculates the component ratio of each cluster for each time interval from the regression estimate and the totaling result of the document set; and decomposes each observed value included in the time-series data into cluster components on the basis of the calculated ratios. Component decomposition of the time-series data can therefore be performed with high accuracy without using keywords.

In the first embodiment, an example has been described in which steps 100 to 104 are performed before step 106 (see FIG. 3). However, steps 100 to 104 may be performed between step 106 and step 108, or steps 100 to 104 may be performed in parallel with step 106.

<Second Embodiment>

The second embodiment describes an example in which component decomposition of time-series data is performed by a time-series data component decomposition apparatus provided with an emotion estimation unit that estimates the emotion of each document in the document set with time information.

FIG. 5 is a diagram showing the functional configuration of the time-series data component decomposition apparatus 60 according to the second embodiment. In FIG. 5, parts that are the same as or equivalent to those in FIG. 1 are given the same reference signs, and their description is omitted.

The time-series data component decomposition apparatus 60 includes an input unit 20, a calculation unit 26, and an output unit 24. The calculation unit 26 according to the second embodiment includes a word division unit 40, an emotion estimation unit 41, a document clustering unit 42, a document totaling unit 62, a time-series data totaling unit 46, a regression analysis unit 64, a component ratio calculation unit 66, and a component decomposition unit 68.

The emotion estimation unit 41 uses an emotion estimator to estimate and output, for each document in the document set with time information for which a word set has been generated, the degree of each emotion component contained in the document, as a value indicating the emotion on which the document was written. Here, the emotion estimation unit 41 estimates the degree of each of three emotions: positive, negative, and neutral.

The document totaling unit 62 totals, for each time interval, the degree to which each document of the document set with time information belongs to each pair of a cluster and an emotion. In the present embodiment, the result produced by the document totaling unit 62 is referred to as the document totaling result.

The regression analysis unit 64 performs regression analysis with the document totaling result of the document totaling unit 62 as the explanatory variables and the time-series data totaling result of the time-series data totaling unit 46 as the objective variable.

The component ratio calculation unit 66 uses the regression analysis result of the regression analysis unit 64 to calculate the ratio of each component (each pair of a cluster and an emotion) for each time interval.

The component decomposition unit 68 decomposes the time-series data into a plurality of components using the component ratios calculated by the component ratio calculation unit 66.

The time-series data component decomposition apparatus 60 of the second embodiment is also implemented by the computer illustrated in FIG. 2. The CPU 201 reads and executes a program stored in a recording medium such as the ROM 202 or the hard disk, so that the hardware and the program cooperate to realize the functions described above.

Next, the operation of the time-series data component decomposition apparatus 60 in the present embodiment is described in detail. When the time-series data input unit 30 and the document set with time information input unit 32 accept the time-series data and the document set with time information, the time-series data component decomposition apparatus 60 executes the processing routine shown in FIG. 6.

In step 300, the word division unit 40 generates a word set for each document d_i included in the document set with time information accepted by the document set with time information input unit 32. The word set is generated as described in the first embodiment (see step 100 in FIG. 3), so a detailed description is omitted here.

In step 302, the emotion estimation unit 41 uses the emotion estimator to estimate, for each document d_i of the document set with time information for which a word set has been generated, the degrees P(s=1|d_i), P(s=2|d_i), and P(s=3|d_i) of the positive, negative, and neutral emotions, respectively. The emotion estimator can be built by, for example, the method described in Reference 3 (Alexander Pak and Patrick Paroubek, "Twitter as a Corpus for Sentiment Analysis and Opinion Mining," LREC 2010), using a corpus prepared separately from the input data supplied to the time-series data component decomposition apparatus 60. Specifically, the estimator can be built by training on emotion data to which correct labels have been assigned in advance.

Here, the relationship shown in equation (8) below holds.
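Equation (8) is not reproduced in this text. Because the three degrees are written as conditional probabilities over the emotion index s, the relationship is presumably the normalization that the degrees sum to one for each document. The following is a reconstruction under that assumption, not the published formula:

```latex
\sum_{s=1}^{3} P(s \mid d_i) = 1
```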

Any method other than that of Reference 3 may be used for the emotion estimator, provided that it can estimate the degree of each of the positive, negative, and neutral emotions.

In step 304, as described in the first embodiment (see also step 102 in FIG. 3), the document clustering unit 42 finds, without any external criterion and on the basis of document content, a set of clusters each forming a semantic group in the document set with time information for which word sets have been generated, and outputs, for each document d_i included in the document set with time information, the degree P(c|d_i) to which the document belongs to each cluster.

In step 306, the document totaling unit 62 totals, for each time interval h, a value indicating the degree to which each document d_i of the document set with time information belongs to each pair of a cluster and an emotion (hereinafter, this value is referred to as the number of documents). Specifically, with t_i denoting the time at which document d_i was posted, the number of documents n(s, c, h) belonging to the cluster-emotion pair <s, c> in time interval h is defined as shown in equation (9) below and is calculated for every combination <s, c, h>.

Here, as in equation (2) of the first embodiment, D is the number of documents included in the document set with time information, and δ(t_i, h) is a function that returns 1 when time t_i falls within time interval h and 0 otherwise.
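The aggregation in step 306 can be sketched as below. Equation (9) is not reproduced in this text, so the sketch rests on a labeled assumption: the degree to which a document belongs to the pair <s, c> is taken to be the product P(s|d_i) · P(c|d_i), summed over the documents whose posting time falls in interval h. The function name, the tuple layout of `docs`, and the data are hypothetical.

```python
from collections import defaultdict


def aggregate_documents(docs, interval_of):
    """Return {(s, c, h): n(s, c, h)}.

    Each document is (t_i, p_emotion, p_cluster), where p_emotion[s-1] is
    P(s|d_i) and p_cluster[c-1] is P(c|d_i).
    ASSUMPTION: a document's degree for the pair <s, c> is the product
    P(s|d_i) * P(c|d_i); interval_of stands in for delta(t_i, h).
    """
    counts = defaultdict(float)
    for t_i, p_emotion, p_cluster in docs:
        h = interval_of(t_i)
        for s, p_s in enumerate(p_emotion, start=1):
            for c, p_c in enumerate(p_cluster, start=1):
                counts[(s, c, h)] += p_s * p_c
    return dict(counts)


# Two hypothetical documents in the same 12-hour interval, C = 2 clusters.
docs = [
    (3, [0.5, 0.25, 0.25], [1.0, 0.0]),
    (7, [0.0, 1.0, 0.0], [0.5, 0.5]),
]
n = aggregate_documents(docs, interval_of=lambda t: t // 12)
```

Under this assumption each document contributes a total weight of one to its interval, so the values of n(s, c, h) for a given h add up to the number of documents posted in that interval.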

The document totaling unit 62 outputs the calculated number of documents n(s, c, h) as the document totaling result.

In step 308, as described in the first embodiment, the time-series data totaling unit 46 computes the mean value m(h) for each time interval h from the j-th observed value x_j (observation time t_j) of the time-series data, and outputs the calculated m(h) as the time-series data totaling result. Although m(h) is the mean here, it may instead be the minimum, the maximum, or the sum. The totaling method is as described in the first embodiment with reference to step 106 in FIG. 3, so a detailed description is omitted here.

In step 310, the regression analysis unit 64 performs regression analysis with the document totaling result n(s, c, h) of the document totaling unit 62 as the explanatory variables and the time-series data totaling result m(h) of the time-series data totaling unit 46 as the objective variable. The linear regression model representing the relationship between m(h) and n(s, c, h) is defined by equation (10) below.

As described in the first embodiment, the constant term and the regression coefficients (β_0, β_{1,1}, …, β_{3,C}) can be derived by multiple linear regression, lasso regression, ridge regression, or the like as described in Reference 2 (Shuichi Kawano, Kei Hirose, Shohei Tateishi, and Sadanori Konishi, "Recent Developments in Regression Modeling and L1-Type Regularization," Journal of the Japan Statistical Society, vol. 39, no. 2, pp. 211-242, 2010).
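To connect the aggregated counts to a regression routine, the sketch below flattens n(s, c, h) for one interval into a row of 3 × C explanatory variables, ordered to match the coefficient list (β_{1,1}, …, β_{3,C}), and evaluates a linear model on it. Equation (10) is not reproduced in this text; the form m(h) ≈ β_0 + Σ_s Σ_c β_{s,c} n(s, c, h) is an assumption inferred from that coefficient list, and the helper names are hypothetical.

```python
def design_row(n, h, num_clusters, num_emotions=3):
    """Flatten n(s, c, h) for interval h into a list ordered as
    (1,1), (1,2), ..., (1,C), (2,1), ..., (3,C). Missing keys count as 0."""
    return [
        n.get((s, c, h), 0.0)
        for s in range(1, num_emotions + 1)
        for c in range(1, num_clusters + 1)
    ]


def predict(beta0, beta, row):
    """Evaluate the assumed linear model beta0 + sum_k beta[k] * row[k]."""
    return beta0 + sum(b * x for b, x in zip(beta, row))


# Hypothetical counts for one interval (h = 0) and C = 2 clusters.
n = {(1, 1, 0): 0.5, (2, 1, 0): 0.75, (2, 2, 0): 0.5}
row = design_row(n, h=0, num_clusters=2)
m_hat = predict(1.0, [2.0] * 6, row)
```

Stacking one such row per interval gives the design matrix that any of the regression methods named above would take as input.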

In step 312, the component ratio calculation unit 66 uses the regression coefficients estimated by the regression analysis unit 64 to calculate the ratio r(s, c, h) of each component (each pair of a cluster and an emotion) for each time interval h according to equation (11) below, and outputs the result.

Here, equation (12) below is satisfied.

In step 314, the component decomposition unit 68 decomposes the value x_j observed at time t_j into 3 × C components, as shown in equation (13) below, using the component ratios r(s, c, h) that the component ratio calculation unit 66 calculated for the time interval h corresponding to time t_j.
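Steps 312 and 314 can be sketched together as follows. Equations (11) to (13) are not reproduced in this text, so the normalization used here is an assumption: each pair's share is its fitted contribution β_{s,c} · n(s, c, h) divided by the total over all pairs, and each component is that share of x_j. The constant term is ignored, contributions are assumed non-negative, and all names and values are hypothetical.

```python
def decompose_by_pair(x_j, beta, n_h):
    """Split x_j into one component per cluster-emotion pair <s, c>.

    beta maps (s, c) to the regression coefficient beta_{s,c};
    n_h maps (s, c) to n(s, c, h) for the interval h that contains t_j.
    ASSUMPTION: r(s,c,h) = beta_{s,c} * n(s,c,h) / total contribution.
    """
    contributions = {key: beta[key] * n_h.get(key, 0.0) for key in beta}
    total = sum(contributions.values())
    if total == 0:
        raise ValueError("no pair contributes in this interval")
    return {key: x_j * value / total for key, value in contributions.items()}


# Hypothetical case with C = 1 cluster, hence 3 x 1 = 3 components.
beta = {(1, 1): 2.0, (2, 1): 1.0, (3, 1): 1.0}
n_h = {(1, 1): 1.0, (2, 1): 1.0, (3, 1): 1.0}
components = decompose_by_pair(100.0, beta, n_h)
```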

The output unit 24 outputs the component decomposition result of the component decomposition unit 68.

Like the time-series data component decomposition apparatus 10 described in the first embodiment, the time-series data component decomposition apparatus 60 described in the present embodiment can decompose time-series data, which is a series of values obtained by observing the temporal change of a phenomenon, into a plurality of semantic components, and can be used for data mining, marketing, and the like.

As described above, the time-series data component decomposition apparatus according to the second embodiment totals the observed values included in the time-series data for each predetermined time interval; estimates the degree of each emotion contained in each document of the document set with time information; finds a plurality of clusters in the document set with time information on the basis of the content of each document, without using an external criterion; calculates, for each document, the degree to which it belongs to each of the clusters; totals, for each time interval, the degree to which the documents of the document set with time information belong to each pair of a cluster and an emotion, on the basis of those degrees and the creation time information assigned to each document; estimates by regression analysis an equation representing the relationship between an explanatory variable and an objective variable, with the totaling result of the document set for each time interval as the explanatory variable and the totaling result of the time-series data as the objective variable; calculates the component ratio of each pair of a cluster and an emotion for each time interval from the regression estimate and the totaling result of the document set; and decomposes each observed value included in the time-series data into components for each pair of a cluster and an emotion on the basis of the calculated ratios. Component decomposition of the time-series data can therefore be performed with high accuracy without using keywords.

In the present embodiment, an example has been described in which steps 300 to 306 are performed before step 308 (see FIG. 6). However, steps 300 to 306 may be performed between step 308 and step 310, or steps 300 to 306 may be performed in parallel with step 308. Likewise, the order of step 302 and step 304 is not limited to the example shown in FIG. 6; either may be performed first, or the two may be performed in parallel.
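The parallel variant mentioned above can be sketched with two workers, one for the document-side steps and one for the time-series-side step, joined before the regression. The two stage functions are placeholders introduced only for illustration; they do not implement the actual steps.

```python
from concurrent.futures import ThreadPoolExecutor


def run_document_side(documents):
    """Placeholder for steps 300 to 306 (hypothetical stand-in)."""
    return {"n": len(documents)}


def run_time_series_side(observations):
    """Placeholder for step 308 (hypothetical stand-in)."""
    return {"m": sum(observations) / len(observations)}


def run_pipeline(documents, observations):
    """Run both sides concurrently and return their results together."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        doc_future = pool.submit(run_document_side, documents)
        ts_future = pool.submit(run_time_series_side, observations)
        # Step 310 (regression) needs both results, so wait for both here.
        return doc_future.result(), ts_future.result()


doc_result, ts_result = run_pipeline(["d1", "d2"], [2.0, 4.0])
```

The two sides share no intermediate data until the regression step, which is why the ordering between them is free.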

The time-series data component decomposition apparatus described in the first and second embodiments can be applied to any document set with time information, but it is particularly effective for document sets from highly real-time microblogs such as Twitter (registered trademark). The document set with time information may also be restricted to documents that include a specific keyword, such as a place name or a product name, or to documents created and posted from a terminal located within a specified range.

Further, the operations of the components shown in FIG. 1 or FIG. 5, as described in the first and second embodiments, can be implemented as a program, which can be installed and executed on a computer used as the time-series data component decomposition apparatus or distributed over a network.

Although this specification describes embodiments in which the program is installed in advance, the program can also be provided stored on a computer-readable recording medium. For example, the program can be stored not only in the ROM or on the hard disk described above but also on a portable storage medium such as a flexible disk or a CD-ROM, and then installed on a computer or distributed. Alternatively, the program may be stored in another storage device on a network and downloaded over the network for execution.

Although the time-series data component decomposition apparatus described above contains a computer system, the term "computer system" also includes a website providing environment (or display environment) when a WWW system is used.

The time-series data component decomposition apparatus described in the first and second embodiments may also be implemented in hardware such as an ASIC.

10 Time-series data component decomposition apparatus
20 Input unit
22 Calculation unit
24 Microblog user attribute estimator construction unit
24 Output unit
26 Calculation unit
30 Time-series data input unit
32 Document set with time information input unit
40 Word division unit
41 Emotion estimation unit
42 Document clustering unit
44 Document totaling unit
46 Time-series data totaling unit
48 Regression analysis unit
50 Component ratio calculation unit
52 Component decomposition unit
60 Time-series data component decomposition apparatus
62 Document totaling unit
64 Regression analysis unit
64 Document totaling unit
66 Component ratio calculation unit
68 Component decomposition unit
200 Computer
201 CPU
202 ROM
203 RAM
204 Communication IF
205 Input/output IF
206 Hard disk drive

Claims

A time-series data component decomposition apparatus comprising:
a time-series data totaling unit that totals, for each predetermined time interval, the observed values included in time-series data, which is a series of observed values obtained by observing the temporal change of a phenomenon;
a document clustering unit that clusters a document set with time information, which is a set of documents each assigned creation time information, on the basis of the content of each document without using an external criterion, thereby finding a plurality of clusters, and that calculates, for each document, the degree to which the document belongs to each of the plurality of clusters;
a document totaling unit that totals, for each time interval, the degree to which each document of the document set with time information belongs to each cluster, on the basis of the degree of belonging to each cluster calculated by the document clustering unit and the creation time information assigned to each document;
a regression analysis unit that estimates by regression analysis a regression equation representing the relationship between an explanatory variable and an objective variable, with the totaling result of the document totaling unit for each time interval as the explanatory variable and the totaling result of the time-series data totaling unit as the objective variable;
a component ratio calculation unit that calculates the ratio of each cluster component for each time interval, using the constant term and regression coefficients of the regression equation estimated by the regression analysis unit and the totaling result for each time interval from the document totaling unit; and
a component decomposition unit that decomposes each observed value included in the time-series data into a component for each cluster, using the cluster component ratios calculated by the component ratio calculation unit for the time interval corresponding to the observation time of that observed value.

The time-series data component decomposition apparatus according to claim 1, further comprising an emotion estimation unit that estimates, for each document of the document set with time information, the degree of each emotion contained in the document as a value indicating the emotion on which the document was written, wherein
the document totaling unit totals, for each time interval, the degree to which each document of the document set with time information belongs to each pair of a cluster and an emotion, on the basis of the degree of belonging to each cluster calculated by the document clustering unit, the creation time information assigned to each document, and the estimation result of the emotion estimation unit,
the component ratio calculation unit calculates the ratio of the component of each pair of a cluster and an emotion for each time interval, using the constant term and regression coefficients of the regression equation estimated by the regression analysis unit and the totaling result for each time interval from the document totaling unit, and
the component decomposition unit decomposes each observed value included in the time-series data into a component for each pair of a cluster and an emotion, using the component ratios calculated by the component ratio calculation unit for the time interval corresponding to the observation time of that observed value.

The time-series data component decomposition apparatus according to claim 1 or claim 2, wherein the document set with time information is a set of documents each including a specific keyword.

The time-series data component decomposition apparatus according to any one of claims 1 to 3, wherein the document set with time information is a set of documents created and posted from terminals located within a pre-specified location range.

The time-series data component decomposition apparatus according to any one of claims 1 to 4, wherein the document set with time information is a set of microblog documents, each being text of no more than a predetermined number of characters that a user creates and posts to express the user's own real-time situation or passing thoughts.

A time-series data component decomposition method in a time-series data component decomposition apparatus including a time-series data totaling unit, a document clustering unit, a document totaling unit, a regression analysis unit, a component ratio calculation unit, and a component decomposition unit, the method comprising:
totaling, by the time-series data totaling unit, for each predetermined time interval, the observed values included in time-series data, which is a series of observed values obtained by observing the temporal change of a phenomenon;
clustering, by the document clustering unit, a document set with time information, which is a set of documents each assigned creation time information, on the basis of the content of each document without using an external criterion, thereby finding a plurality of clusters, and calculating, for each document, the degree to which the document belongs to each of the plurality of clusters;
totaling, by the document totaling unit, for each time interval, the degree to which each document of the document set with time information belongs to each cluster, on the basis of the degree of belonging to each cluster calculated by the document clustering unit and the creation time information assigned to each document;
estimating by regression analysis, by the regression analysis unit, a regression equation representing the relationship between an explanatory variable and an objective variable, with the totaling result of the document totaling unit for each time interval as the explanatory variable and the totaling result of the time-series data totaling unit as the objective variable;
calculating, by the component ratio calculation unit, the ratio of each cluster component for each time interval, using the constant term and regression coefficients of the regression equation estimated by the regression analysis unit and the totaling result for each time interval from the document totaling unit; and
decomposing, by the component decomposition unit, each observed value included in the time-series data into a component for each cluster, using the cluster component ratios calculated by the component ratio calculation unit for the time interval corresponding to the observation time of that observed value.

A program for causing a computer to function as each means of the time-series data component decomposition apparatus according to any one of claims 1 to 5.

A computer-readable recording medium recording a program for causing a computer to function as each means of the time-series data component decomposition apparatus according to any one of claims 1 to 5.