JP6204261B2

JP6204261B2 - Topic modeling apparatus, topic modeling method, and topic modeling program

Info

Publication number: JP6204261B2
Application number: JP2014093253A
Authority: JP
Inventors: 結城遠藤; 浩之戸田; 義昌小池
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2014-04-30
Filing date: 2014-04-30
Publication date: 2017-09-27
Anticipated expiration: 2034-04-30
Also published as: JP2015210741A

Description

The present invention relates to a technique for modeling topics in time-series text data.

With the spread of microblogging and similar services, it has become increasingly important, particularly in fields such as marketing, to build topic models that can extract topics from highly real-time time-series text data and capture what is currently being talked about. Here, a topic means information about a specific subject. A topic model is a model (a function or formula) that describes the relationship between topics and character strings, such as words, contained in text data.

Prior art for topic modeling that captures topics in time-series text data includes an extension of LDA (Latent Dirichlet Allocation) (Non-Patent Document 1) and an extension of NMF (Non-negative Matrix Factorization) (Non-Patent Document 2). NMF factorizes a document-word feature matrix under non-negativity constraints, thereby reducing its dimensionality, to obtain a model from which topics are estimated. Non-Patent Document 2 adds to NMF a constraint that takes into account the amount of change in topics over time and estimates topics that are trending. NMF and the method of Non-Patent Document 2 are outlined below.

<NMF>
NMF represents documents by two matrices obtained by factorizing the document-word feature matrix X under non-negativity constraints. One is the document topic matrix W, whose elements give the degree of association between the document corresponding to each row and the topic corresponding to each column. The other is the topic word matrix H, whose elements give the degree of association between the topic corresponding to each row and the word corresponding to each column. NMF factorizes the feature matrix X into the document topic matrix W and the topic word matrix H as in the following equation.
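The equation itself is not reproduced in this text. A form consistent with the surrounding definitions (X the feature matrix, W the document topic matrix, H the topic word matrix, i and j the row and column indices) is sketched below; the topic index k is an assumed name.

```latex
X_{i,j} \approx (WH)_{i,j} = \sum_{k} W_{i,k} H_{k,j}, \qquad W \ge 0, \; H \ge 0
```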

In the above equation, i and j are matrix indices. To factorize the matrix X in this way, the matrices W and H are calculated on the basis of the squared error, for example as in the following equation.
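The squared-error objective is likewise not reproduced in this text. A sketch consistent with the Frobenius norm mentioned next is given below; whether the original squares the norm or includes a constant factor is an assumption.

```latex
\min_{W \ge 0, \; H \ge 0} \; \lVert X - WH \rVert_F^{2}
```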

In the above equation, ‖X−WH‖_F is the Frobenius norm of X−WH.

<Method of Non-Patent Document 2>
The method of Non-Patent Document 2 extends the above NMF to obtain topics that are trending in time-series text data. Specifically, W and H are calculated on the basis of the following equation.

In the above equation, w_i is the i-th column vector of W. S is a matrix that computes, for topic i, the sum of the elements of w_i belonging to the same time period. W^em is the submatrix of W corresponding to the topics from which surges are to be extracted. μ is a hyperparameter. L(Sw_i) is a function that imposes a large penalty when the variation of the topic from one time to the next is small. This penalty makes it possible to extract topics that are trending temporally.

Non-Patent Document 1: Diao, Q., Jiang, J., Zhu, F., Lim, E.-P.: Finding bursty topics from microblogs, In Proc. of ACL'12, 2012, pp. 536-544.
Non-Patent Document 2: Saha, A. and Sindhwani, V.: Learning evolving and emerging topics in social media: a dynamic NMF approach with temporal regularization, In Proc. of WSDM'12, 2012, pp. 693-702.
Non-Patent Document 3: Mei, Q., Cai, D., Zhang, D., Zhai, C.: Topic modeling with network regularization, In Proc. of WWW'08, 2008, pp. 101-110.
Non-Patent Document 4: Jun Suzuki and Hideki Isozaki, "Training Conditional Random Fields Based on Training Error Minimization: Application to Language Analysis" (in Japanese), Proc. of the 12th Annual Meeting of the Association for Natural Language Processing, 2006, pp. 548-551.

However, the topic model described in Non-Patent Document 2 does not take geographical information into account. For example, when an event such as an impromptu live music performance is suddenly held in Yokosuka, many users post messages to social media with position information obtained by GPS (Global Positioning System) attached, or post messages containing the place name. Social media contains many such region-specific trending subjects, including events of this kind. It has been difficult to detect them by considering only the temporal surge, as the method of Non-Patent Document 2 does.

Non-Patent Document 3 discloses a technique that extracts region-specific topics using LDA that takes into account position information attached to text. However, because this technique does not take both position information and time information into account, it extracts topics whose content is routinely posted in large volume in a given region. That is, it cannot extract topics that are trending locally at a particular time and in a particular region, and therefore cannot extract topics about events that suddenly occur in a specific region, as in the example above.

In view of the above problems of the prior art, an object of the present invention is to provide a technique capable of generating a topic model from which topic information that is trending both geographically and temporally can be obtained.

Accordingly, the topic modeling apparatus of the present invention is a topic modeling apparatus that creates a topic model dependent on position information and time information, comprising: position information extraction means for extracting position information from time-series texts; position information dependency matrix calculation means for calculating, on the basis of the extracted position information, a position information dependency matrix indicating the position-based similarity between the time-series texts; and model calculation means for successively calculating a document topic matrix indicating the degree of association between texts and topics and a topic word matrix indicating the degree of association between topics and words, by factorizing the document feature matrix of the time-series texts, thereby reducing its dimensionality, under a topic extraction constraint based on the amount of change in topics over time and a topic extraction constraint based on the position information dependency matrix, and for determining, as the topic model, the latest document topic matrix and topic word matrix for which the amount of change between successive calculations is equal to or less than a specified value.

The topic modeling method of the present invention is a topic modeling method executed by a topic modeling apparatus that creates a topic model dependent on position information and time information, comprising the steps of: extracting position information from time-series texts; calculating, on the basis of the extracted position information, a position information dependency matrix indicating the position-based similarity between the time-series texts; and successively calculating a document topic matrix indicating the degree of association between texts and topics and a topic word matrix indicating the degree of association between topics and words, by factorizing the document feature matrix of the time-series texts, thereby reducing its dimensionality, under a topic extraction constraint based on the amount of change in topics over time and a topic extraction constraint based on the position information dependency matrix, and determining, as the topic model, the latest document topic matrix and topic word matrix for which the amount of change between successive calculations is equal to or less than a specified value.

The present invention may also take the form of a program that causes a computer to function as each means of the above apparatus, or a program that causes a computer to execute the steps of the above method.

According to the present invention, it is possible to provide a topic model from which topic information that is trending both geographically and temporally can be obtained.

FIG. 1 is a block diagram of the topic modeling apparatus according to an embodiment of the present invention.
FIG. 2 is a flow diagram of the topic modeling process in the embodiment.
FIG. 3 shows an example of a document feature matrix.
FIG. 4 is a flowchart of the position information extraction process.
FIG. 5 shows an example of a position information dependency matrix.
FIG. 6 is a flowchart of the position information dependency matrix calculation process.
FIG. 7 is a flowchart of the model calculation process.

Embodiments of the present invention are described below with reference to the drawings; however, the present invention is not limited to these embodiments.

[Overview]
The topic modeling apparatus 1 of this embodiment, shown in FIG. 1, adds a constraint for extracting geographical surges to the conventional NMF that extracts temporal surges, and thereby generates a topic model that takes both the geographical and the temporal surge of topics into account.

[Apparatus configuration]
As shown in FIG. 1, the topic modeling apparatus 1 includes an input unit 10, a word feature amount calculation unit 20, a document feature matrix calculation unit 30, a position information extraction unit 40, a position information dependency matrix calculation unit 50, a model calculation unit 60, and an output unit 70.

Each of the functional units 10 to 70 is realized by the hardware resources of a computer. That is, the topic modeling apparatus 1 includes at least computer hardware resources such as a processing unit (CPU), storage (memory, a hard disk drive, and the like), and a communication interface. The functional units 10 to 70 are implemented by these hardware resources working together with software resources (an OS, applications, and the like). The functional units 10 to 70 may also each be implemented on a separate computer.

The input unit 10 is a means for accepting input of time-series text data and consists of, for example, a keyboard, a mouse, a disk drive, or the like. In the time-series text data, each character string is associated with time information. In addition, when a geotag representing position information measured by GPS is present, the geotag is also associated with the character string.

The word feature amount calculation unit 20 calculates the word feature values of the character string of each text in the time-series text data accepted by the input unit 10.

The document feature matrix calculation unit 30 calculates the document feature matrix of the texts on the basis of the calculated word feature values of each text document.

The position information extraction unit 40 extracts the position information of each text in the time-series text data and calculates the position information vector of the texts.

The position information dependency matrix calculation unit 50 calculates, on the basis of the calculated position information vector, a position information dependency matrix indicating the position-based similarity between the texts.

On the basis of the position information dependency matrix and the document feature matrix of the time-series texts, the model calculation unit 60 calculates, as a topic model dependent on position information and time information, a document topic matrix indicating the degree of association between texts and topics and a topic word matrix indicating the degree of association between topics and words.

More specifically, the model calculation unit 60 factorizes the document feature matrix, thereby reducing its dimensionality, under a topic extraction constraint based on the amount of change in topics over time and a topic extraction constraint based on the position information dependency matrix, and in this way successively calculates the document topic matrix and the topic word matrix. It then determines, as the topic model, the latest document topic matrix and topic word matrix for which the amount of change between successive calculations is equal to or less than a specified value.

The output unit 70 outputs the determined document topic matrix and topic word matrix as a topic model dependent on position and time.

[Description of the topic modeling process]
The topic modeling process is described below with reference to FIGS. 1 to 7.

S1: The input unit 10 accepts time-series text data as input data. The accepted time-series text data is sent to the word feature amount calculation unit 20 and the position information extraction unit 40.

S2: The word feature amount calculation unit 20 calculates the word feature values of the character string of each text in the time-series text data accepted by the input unit 10.

In this step, the character string of each text is split by a morphological analyzer into words such as nouns, verbs, and adjectives, and then f_{d_i,w}, which represents the feature value of word w in text document d_i, is calculated from the words that appear. One specific calculation method is TF-IDF, computed by the following equation.
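The TF-IDF equation is not reproduced in this text. The textbook form, written with the symbols defined in the next paragraph, is sketched below; the original may use a variant (for example, a smoothed logarithm), so this is an assumption.

```latex
f_{d_i, w} = \mathrm{TF}(d_i, w) \cdot \log \frac{N}{\mathrm{DF}(w)}
```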

In the above equation, TF(d_i, w) is the number of occurrences of word w in document d_i, DF(w) is the number of documents in the data set in which word w appears, and N is the total number of documents in the data set.

The calculated word feature values are sent to the document feature matrix calculation unit 30.

S3: The document feature matrix calculation unit 30 calculates the document feature matrix on the basis of the calculated f_{d_i,w}, which represents the feature value of word w in document d_i. Letting f_{d_i} be the feature vector representing the feature values of the individual words in document d_i, the document feature matrix X is defined as follows.

FIG. 3 shows an example of the document feature matrix X. In this example, the feature values of "Yokosuka", "curry", and "Niigata" in document 1 are 2, 2, and 0, so the elements in the row for document 1 and the columns for those words are 2, 2, and 0.
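Steps S2 and S3 can be illustrated with a minimal sketch, assuming the texts have already been split into words and assuming the textbook weighting TF × log(N / DF). The function name `tfidf_matrix` and the toy documents are hypothetical and not part of the original.

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Build a document feature matrix X (rows: documents, columns: words).

    docs: list of token lists, one per document (already split into words).
    Returns (X, vocab) with X[i][j] = TF(d_i, w_j) * log(N / DF(w_j)).
    The exact TF-IDF variant is an assumption of this sketch.
    """
    n_docs = len(docs)
    vocab = sorted({w for doc in docs for w in doc})
    # DF(w): number of documents in which w appears at least once
    df = Counter(w for doc in docs for w in set(doc))
    X = []
    for doc in docs:
        tf = Counter(doc)  # TF(d_i, w); missing words count as 0
        X.append([tf[w] * math.log(n_docs / df[w]) for w in vocab])
    return X, vocab

# Toy data (hypothetical): three already-tokenized posts
docs = [
    ["yokosuka", "curry", "yokosuka", "curry"],
    ["niigata", "rice"],
    ["yokosuka", "live"],
]
X, vocab = tfidf_matrix(docs)
```

Each row of `X` corresponds to one document and each column to one word of `vocab`, matching the layout described for FIG. 3.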

The calculated document feature matrix X is passed to step S6, performed by the model calculation unit 60.

S4: The position information extraction unit 40 extracts the position information of each text in the time-series text data accepted by the input unit 10 and calculates the position information vector p. Position information may be extracted not only from a geotag attached to a time-series text but also from a place name (region name) in the character string of the text.

FIG. 4 shows a flowchart of a specific example of the position information extraction process in this step. Each text document d ∈ D in the time-series text data is processed as follows.

S401: It is determined whether a geotag is attached to the text document d_i.

S402: If a geotag is attached to the text document d_i, the latitude and longitude information of the geotag is assigned to p_i, where p_i denotes the i-th element of the position information vector p.

S403: If no geotag is attached to the text document d_i, it is determined whether a word denoting a place name is present in the character string of the document d_i. Known techniques, such as a dependency analysis method based on conditional random fields (Non-Patent Document 4), can be used to identify words denoting place names.

S404: If a word denoting a place name is present in the character string of the text document d_i, the place name is converted into latitude and longitude information using a geocoder. If there is more than one place name, one is selected at random, for example by using a random number. The latitude and longitude information obtained by this conversion is then passed to step S402 above and assigned to p_i.

S405: If no word denoting a place name is present in the character string of the text document d_i, null is assigned to p_i.

Steps S401 to S405 above are executed for all the time-series texts, and the resulting position information vector p is sent to the position information dependency matrix calculation unit 50.
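The branching in steps S401 to S405 can be sketched as follows. The place-name detector and the geocoder are passed in as callables because the text only says that known techniques may be used; `find_place_names`, `geocode`, the toy gazetteer, and its approximate coordinates are all hypothetical stand-ins.

```python
import random

def extract_position(doc, find_place_names, geocode, rng=random):
    """Return (lat, lon) for one document, or None (steps S401 to S405).

    doc: dict with "text" and an optional "geotag" of (lat, lon).
    find_place_names: hypothetical callable, text -> list of place names.
    geocode: hypothetical callable, place name -> (lat, lon).
    """
    geotag = doc.get("geotag")
    if geotag is not None:                   # S401 yes -> S402
        return geotag
    places = find_place_names(doc["text"])   # S403
    if places:                               # S404: pick one name at random
        return geocode(rng.choice(places))
    return None                              # S405: no position available

def position_vector(docs, find_place_names, geocode):
    """Position information vector p, one entry per document."""
    return [extract_position(d, find_place_names, geocode) for d in docs]

# Toy stand-ins (assumptions): a two-entry gazetteer with rough coordinates
GAZETTEER = {"Yokosuka": (35.28, 139.67), "Niigata": (37.92, 139.04)}

def toy_find_place_names(text):
    return [name for name in GAZETTEER if name in text]

docs = [
    {"text": "Great curry today", "geotag": (35.0, 139.0)},
    {"text": "Surprise live show in Yokosuka"},
    {"text": "Nothing special"},
]
p = position_vector(docs, toy_find_place_names, GAZETTEER.get)
```

Representing the null of step S405 as `None` keeps the later check in step S502 a simple identity test.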

S5: The position information dependency matrix calculation unit 50 calculates the position information dependency matrix G on the basis of the calculated position information vector p. The position information dependency matrix G is an m × m matrix, where m is the total number of documents, that represents the position-based similarity between each pair of documents.

FIG. 5 shows an example of the position information dependency matrix G. Because the position-based similarity between document 1 and document 2 is 0.02, the corresponding element is 0.02.

FIG. 6 shows a flowchart of the calculation of the position information dependency matrix G. Every pair of texts is processed as follows.

S501: 0 is assigned to G_{i,j}, where G_{i,j} denotes the element in row i and column j of the matrix G.

S502: It is determined whether i = j, p_i = null, or p_j = null. If the determination is YES, processing moves on to the next iteration of the loop.

S503: If the determination in step S502 is NO, the distance between p_i and p_j is calculated. The distance dist_{i,j} between p_i and p_j can be calculated, for example, as the Hubeny distance by the following equation.
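The equation is not reproduced in this text. The commonly cited simplified Hubeny formula, written with the symbols explained in the next paragraph and with dP and dR taken in radians, is sketched below as an assumption about the original.

```latex
\mathrm{dist}_{i,j} = \sqrt{ (dP \cdot M)^{2} + (dR \cdot N \cos P)^{2} }
```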

In the above equation, M is the meridian radius of curvature, dP is the latitude difference between the two points, N is the radius of curvature in the prime vertical, P is the mean latitude of the two points, and dR is the longitude difference between the two points.

S504: The distance dist_{i,j} between p_i and p_j calculated above is substituted into the following equation to calculate the position information dependency matrix G.
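The equation is not reproduced in this text. Given that G_{i,j} is a similarity derived from a distance and a constant σ^2, a Gaussian kernel of the following kind is a plausible reading; the exact scaling (for example, 2σ^2 in the denominator) is an assumption.

```latex
G_{i,j} = \exp\!\left( - \frac{\mathrm{dist}_{i,j}^{2}}{\sigma^{2}} \right)
```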

In the above equation, σ^2 is a constant, and G_{i,j} represents the position-based similarity between document d_i and document d_j. The variance of the distances dist_{i,j} between documents, calculated in advance, may also be used as σ^2.

Steps S501 to S504 above are executed for all pairs of texts, and the resulting position information dependency matrix G is passed to step S6, performed by the model calculation unit 60.
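Steps S501 to S504 can be sketched as below. The sketch assumes the WGS84 ellipsoid for the Hubeny distance and a Gaussian kernel exp(-dist^2 / σ^2) for the similarity; neither choice is stated in the text, and the function names are hypothetical.

```python
import math

A = 6378137.0           # WGS84 semi-major axis in metres (assumed ellipsoid)
E2 = 0.00669437999014   # WGS84 first eccentricity squared

def hubeny(p, q):
    """Hubeny distance in metres between (lat, lon) pairs given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    d_lat, d_lon = lat1 - lat2, lon1 - lon2       # dP, dR
    mean_lat = (lat1 + lat2) / 2.0                # P
    w = math.sqrt(1.0 - E2 * math.sin(mean_lat) ** 2)
    m = A * (1.0 - E2) / w ** 3                   # meridian radius of curvature M
    n = A / w                                     # prime vertical radius of curvature N
    return math.hypot(d_lat * m, d_lon * n * math.cos(mean_lat))

def position_dependency_matrix(p, sigma2):
    """G[i][j] = exp(-dist^2 / sigma2); 0 on the diagonal or if a position is None."""
    size = len(p)
    G = [[0.0] * size for _ in range(size)]               # S501
    for i in range(size):
        for j in range(size):
            if i == j or p[i] is None or p[j] is None:    # S502
                continue
            d = hubeny(p[i], p[j])                        # S503
            G[i][j] = math.exp(-d * d / sigma2)           # S504
    return G

# Toy positions (hypothetical); sigma2 corresponds to a 100 km length scale
positions = [(35.0, 139.0), (35.0, 139.0), (36.0, 139.0), None]
G = position_dependency_matrix(positions, sigma2=1.0e10)
```

Because the distance is symmetric, `G` is symmetric, which the graph Laplacian used in step S6 relies on.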

S6: The model calculation unit 60 learns the topic model using the document feature matrix X calculated in step S3 and the position information dependency matrix G calculated in step S5.

To detect topics that are trending geographically, the topic model learned in this step constrains the document topic matrix W so that the value of the following expression becomes small.
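The expression is not reproduced in this text. A graph regularizer of the following standard form fits the description (small when documents that are close in position have similar topic vectors); it is an assumption, stated here for a symmetric G with L = D − G and with W_{i,:} denoting row i of W.

```latex
\mathrm{Tr}\!\left( W^{\top} L W \right)
  = \frac{1}{2} \sum_{i,j} G_{i,j} \, \lVert W_{i,:} - W_{j,:} \rVert^{2}
```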

In the above expression, L is a matrix representing a graph structure based on the distances between texts and can be calculated, for example, as the following Laplacian matrix.
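The equation is not reproduced in this text. The usual graph Laplacian, consistent with the description of D in the next paragraph, is sketched below.

```latex
L = D - G, \qquad D_{i,i} = \sum_{j} G_{i,j}, \qquad D_{i,j} = 0 \;\; (i \ne j)
```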

In the above equation, D is an m × m matrix, where m is the total number of texts, whose diagonal elements are the sums of the elements of the corresponding row vectors of the position information dependency matrix G and whose other elements are 0. This constraint has the property that the closer the positions of two texts are, the more similar the topics they are given. This makes it possible to extract topics that are trending geographically. Taking both the time-dependent constraint and the position-dependent constraint into account, the model calculation unit 60 calculates the document topic matrix W and the topic word matrix H by the following equation.
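Equation (1) is not reproduced in this text. One form consistent with the surrounding description, combining the squared error, the temporal penalty of Non-Patent Document 2, and the graph regularizer with weights λ_t and λ_g, is sketched below. The temporal penalty is written as a calligraphic L to keep it distinct from the Laplacian L; the precise form of each term in the original is an assumption.

```latex
\min_{W \ge 0, \; H \ge 0} \;
  \lVert X - WH \rVert_F^{2}
  + \lambda_t \sum_{w_i \in W^{em}} \mathcal{L}(S w_i)
  + \lambda_g \, \mathrm{Tr}\!\left( W^{\top} L W \right)
```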

In equation (1) above, λ_t and λ_g are hyperparameters that determine the strength of the time-dependent constraint and the position-dependent constraint, respectively.

As a specific processing flow, FIG. 7 shows a flowchart of the process of calculating the document topic matrix W and the topic word matrix H by alternately optimizing them.

S601: t is set to 0, and the initial value W^(0) of the document topic matrix W and the initial value H^(0) of the topic word matrix H are set. The initial values may be arbitrary; for example, the matrices are initialized with random values between 0 and 1.

S602: The topic word matrix H^(t) is substituted into equation (1) to calculate the document topic matrix W^(t+1). A known technique such as steepest descent or Newton's method can be used for the calculation.

S603: The document topic matrix W^(t+1) is substituted into equation (1) to calculate the topic word matrix H^(t+1). A known technique such as steepest descent or Newton's method can be used for the calculation.

S604: It is determined whether the termination condition is satisfied. Examples of the termination condition are given below.

[Termination condition example 1]
The termination condition is that the amount of change between the document topic matrix W^(t) and topic word matrix H^(t) obtained in the t-th iteration and the document topic matrix W^(t+1) and topic word matrix H^(t+1) obtained in the (t+1)-th iteration is equal to or less than a specified value. For example, the termination condition is that the sum of the squared element-wise differences between W^(t), H^(t) and W^(t+1), H^(t+1) is equal to or less than a specified value.

[Termination condition example 2]
The termination condition is that t has reached a predetermined number of iterations.

[Termination condition example 3]
The termination condition is that both termination condition examples 1 and 2 are satisfied.

If it is determined that the termination condition is not satisfied, t + 1 is taken as the new t and processing returns to step S602.

If, on the other hand, it is determined that the termination condition is satisfied, the processing of step S6 ends, and the calculated document topic matrix W^(t+1) and topic word matrix H^(t+1) are passed to the output unit 70 as the document topic matrix W and the topic word matrix H, respectively.
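The alternating loop of steps S601 to S604 can be sketched as follows, under two stated simplifications. First, the temporal penalty is omitted, so the objective is only ||X − WH||_F^2 + λ_g Tr(W^T L W). Second, instead of steepest descent or Newton's method, the sketch substitutes the multiplicative update rules commonly used for graph-regularized NMF, which keep W and H non-negative. The loop stops when the squared change falls to `tol` or when `max_iter` is reached. All function names, parameter values, and the toy data are hypothetical.

```python
import random

def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def objective(X, W, H, G, lam_g):
    """||X - WH||_F^2 + lam_g * Tr(W^T L W), with L = D - G and G symmetric."""
    WH = matmul(W, H)
    err = sum((x - y) ** 2 for rx, ry in zip(X, WH) for x, y in zip(rx, ry))
    m = len(W)
    reg = 0.5 * sum(G[i][j] * sum((a - b) ** 2 for a, b in zip(W[i], W[j]))
                    for i in range(m) for j in range(m))
    return err + lam_g * reg

def fit(X, G, k, lam_g=0.1, tol=1e-8, max_iter=500, seed=0, eps=1e-12):
    rng = random.Random(seed)
    m, n = len(X), len(X[0])
    # S601: random initial values in [0, 1)
    W = [[rng.random() for _ in range(k)] for _ in range(m)]
    H = [[rng.random() for _ in range(n)] for _ in range(k)]
    deg = [sum(row) for row in G]   # diagonal of D (row sums of G)
    for _ in range(max_iter):       # iteration cap (termination example 2)
        # S602: update W with H fixed
        Ht = transpose(H)
        num, GW = matmul(X, Ht), matmul(G, W)
        den = matmul(W, matmul(H, Ht))
        W_new = [[W[i][c] * (num[i][c] + lam_g * GW[i][c])
                  / (den[i][c] + lam_g * deg[i] * W[i][c] + eps)
                  for c in range(k)] for i in range(m)]
        # S603: update H with the new W fixed
        Wt = transpose(W_new)
        num_h = matmul(Wt, X)
        den_h = matmul(matmul(Wt, W_new), H)
        H_new = [[H[c][j] * num_h[c][j] / (den_h[c][j] + eps)
                  for j in range(n)] for c in range(k)]
        # S604: sum of squared element-wise changes (termination example 1)
        change = sum((a - b) ** 2 for old, new in ((W, W_new), (H, H_new))
                     for ro, rn in zip(old, new) for a, b in zip(ro, rn))
        W, H = W_new, H_new
        if change <= tol:
            break
    return W, H

# Toy data (hypothetical): 4 documents, 3 words, two geographically close pairs
X = [[2.0, 2.0, 0.0], [2.0, 1.0, 0.0], [0.0, 0.0, 3.0], [0.0, 1.0, 2.0]]
G = [[0.0, 1.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 1.0], [0.0, 0.0, 1.0, 0.0]]
W, H = fit(X, G, k=2)
```

Multiplicative updates are used here only because they preserve non-negativity without a separate projection step; a projected gradient or Newton-type solver would fit the same loop structure.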

S7: The output unit 70 outputs the document topic matrix W and the topic word matrix H calculated in step S6 as a position-dependent and time-dependent topic model. This topic model is displayed on a monitor or the like.

[Effects of this embodiment]
As described above, in the topic modeling apparatus 1 of this embodiment, the position information extraction unit 40 extracts position information from the time-series texts. The position information dependency matrix calculation unit 50 calculates, on the basis of the extracted position information, the position information dependency matrix G indicating the position-based similarity between the time-series texts. The model calculation unit 60 then calculates, on the basis of the position information dependency matrix G and the document feature matrix X of the time-series texts, the document topic matrix W indicating the degree of association between texts and topics and the topic word matrix H indicating the degree of association between topics and words, as a topic model dependent on position information and time information. Because a topic model dependent on position information and time information is generated in this way, topics that are trending both geographically and temporally can be extracted.

Furthermore, when the time-series text does not carry position information, the position information is extracted based on an area name contained in the text. Position information can therefore be obtained even for text to which no position information is attached.
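The fallback described above can be sketched as a simple gazetteer lookup. This is only an illustration under stated assumptions: the dictionary `GAZETTEER`, its approximate coordinates, the post layout with `text` and `position` keys, and the substring matching rule are all hypothetical and are not taken from the embodiment.

```python
# Hypothetical gazetteer: area name -> approximate (latitude, longitude) in degrees.
GAZETTEER = {
    "Tokyo": (35.6895, 139.6917),
    "Osaka": (34.6937, 135.5023),
}


def extract_position(post):
    """Return the post's own position if present, else one inferred from an area name."""
    if post.get("position") is not None:
        return post["position"]
    # Fallback: look for a known area name in the text (assumed matching rule).
    for name, coords in GAZETTEER.items():
        if name in post["text"]:
            return coords
    return None  # neither attached position information nor a known area name


tagged = {"text": "Great view here", "position": (43.0621, 141.3544)}
untagged = {"text": "Fireworks in Osaka tonight", "position": None}
print(extract_position(tagged))
print(extract_position(untagged))
```

A practical system would likely need morphological analysis and disambiguation of area names; plain substring matching is used here only to keep the sketch short.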

In addition, when the position information dependency matrix G is calculated, the similarity is computed from the distance between the position information of the respective time-series texts, which yields a position information dependency matrix G that reflects geographical relationships more clearly.
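One way a distance-based similarity matrix of this kind could be built is sketched below. The choice of great-circle (haversine) distance, the Gaussian kernel, and the bandwidth `sigma_km` are assumptions made for illustration; this section does not state which distance or similarity function the embodiment uses.

```python
import math


def haversine_km(p, q):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def location_dependency_matrix(positions, sigma_km=50.0):
    """Build a similarity matrix G; texts posted close together get values near 1."""
    n = len(positions)
    G = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            d = haversine_km(positions[i], positions[j])
            # Gaussian kernel (assumed): similarity decays with squared distance.
            G[i][j] = math.exp(-(d ** 2) / (2 * sigma_km ** 2))
    return G


# Hypothetical positions: two texts in central Tokyo and one in Osaka.
positions = [(35.6895, 139.6917), (35.7000, 139.7000), (34.6937, 135.5023)]
G = location_dependency_matrix(positions)
```

With these assumed settings, the two Tokyo texts receive a similarity close to 1, while the Tokyo and Osaka texts receive a similarity close to 0.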

Moreover, because the determination of step S604 is executed within step S6, a topic model that depends on position information and time information can be obtained with any desired accuracy.
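A termination test of the kind performed in step S604 might look like the following sketch. It assumes that the amount of change is measured as the Frobenius norm of the difference between successive iterates; the names `frobenius_change` and `should_stop`, the tolerance `tol`, and the iteration limit `max_iter` are hypothetical. Tightening `tol` is what would allow the accuracy to be chosen freely.

```python
import math


def frobenius_change(new, old):
    """Frobenius norm of the element-wise difference of two equally shaped matrices."""
    return math.sqrt(sum((a - b) ** 2
                         for row_new, row_old in zip(new, old)
                         for a, b in zip(row_new, row_old)))


def should_stop(W_new, W_old, H_new, H_old, iteration, tol=1e-4, max_iter=200):
    """Stop when both matrices have almost stopped changing or the iteration limit is hit."""
    small_change = (frobenius_change(W_new, W_old) <= tol
                    and frobenius_change(H_new, H_old) <= tol)
    return small_change or iteration >= max_iter


# Hypothetical iterates: W still changes noticeably, so iteration continues.
print(should_stop([[1.0]], [[0.5]], [[2.0]], [[2.0]], iteration=1))
```

The iteration limit guards against update sequences that never fall below the tolerance.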

[Other aspects of the present invention]
The present invention can be realized by configuring a program that causes a computer to function as some or all of the functional units 10 to 70 constituting the topic modeling device 1 and having the computer execute that program. Alternatively, it can be realized by configuring a program that causes a computer to execute some or all of the processes S1 to S7, S401 to S405, S501 to S504, and S601 to S604 performed by the device 1 and having the computer execute that program. The program can be provided stored on a well-known computer-readable recording medium (for example, a hard disk, a flexible disk, or a CD-ROM). Alternatively, the program can be provided over a network, for example via the Internet or e-mail.

The present invention is not limited to the embodiment described above, and various modifications and applications are possible within the scope of the claims.

DESCRIPTION OF SYMBOLS
1 ... Topic modeling device
20 ... Word feature calculation unit
30 ... Document feature matrix calculation unit
40 ... Position information extraction unit
50 ... Position information dependency matrix calculation unit
60 ... Model calculation unit

Claims

1. A topic modeling device for creating a topic model that depends on position information and time information, comprising:
position information extraction means for extracting position information from time-series text;
position information dependency matrix calculation means for calculating, based on the extracted position information, a position information dependency matrix indicating the position-based similarity between the time-series texts; and
model calculation means for calculating over time, by subjecting the document feature matrix of the time-series text to matrix decomposition and dimension compression under a topic extraction constraint based on a temporal topic change amount and a topic extraction constraint based on the position information dependency matrix, a document topic matrix indicating the degree of relationship between texts and topics and a topic word matrix indicating the degree of relationship between topics and words, and for determining, as the topic model, the latest document topic matrix and topic word matrix when the amount of change over time of the document topic matrix and the topic word matrix becomes equal to or less than a predetermined value.

2. The topic modeling device according to claim 1, wherein the model calculation means determines, as the topic model, the document topic matrix and the topic word matrix obtained when the number of calculations of the document topic matrix and the topic word matrix reaches a predetermined number.

3. The topic modeling device according to claim 1, wherein, when the time-series text does not include position information, the position information extraction means extracts the position information based on an area name included in the text.

4. The topic modeling device according to any one of claims 1 to 3, wherein the position information dependency matrix calculation means calculates the similarity based on the distance between the position information of the respective time-series texts.

5. A topic modeling method executed by a topic modeling device that creates a topic model depending on position information and time information, the method comprising:
extracting position information from time-series text;
calculating, based on the extracted position information, a position information dependency matrix indicating the position-based similarity between the time-series texts; and
calculating over time, by subjecting the document feature matrix of the time-series text to matrix decomposition and dimension compression under a topic extraction constraint based on a temporal topic change amount and a topic extraction constraint based on the position information dependency matrix, a document topic matrix indicating the degree of relationship between texts and topics and a topic word matrix indicating the degree of relationship between topics and words, and determining, as the topic model, the latest document topic matrix and topic word matrix when the amount of change over time of the document topic matrix and the topic word matrix becomes equal to or less than a predetermined value.

6. A topic modeling program for causing a computer to function as each of the means constituting the topic modeling device according to any one of claims 1 to 4.