JP2013069141A

JP2013069141A - Document analysis device and document analysis method

Info

Publication number: JP2013069141A
Application number: JP2011207562A
Authority: JP
Inventors: Hiroshi Fujimoto; 拓藤本; Minoru Eto; 稔栄藤
Original assignee: NTT Docomo Inc
Current assignee: NTT Docomo Inc
Priority date: 2011-09-22
Filing date: 2011-09-22
Publication date: 2013-04-18

Abstract

PROBLEM TO BE SOLVED: To optimize the mixture ratio when generating a prior distribution by mixing present topics and past topics. SOLUTION: Topic distribution generation means 301 generates a word distribution, which is a multinomial distribution of the number of occurrences of each word included in a second document set obtained after a first document set, on the basis of a first prior distribution, which is a multinomial distribution of the number of occurrences of each word included in the first document set obtained in the past. Mixture ratio calculation means 302 calculates a mixture ratio between the first prior distribution and the word distribution on the basis of the first prior distribution. Prior distribution generation means 303 generates a second prior distribution by mixing the first prior distribution and the word distribution at the mixture ratio. Prior distribution accumulation means 304 accumulates the second prior distribution. When a third document set is obtained after the second document set, the topic distribution generation means 301 generates a word distribution, which is a multinomial distribution of the number of occurrences of each word included in the third document set, on the basis of the second prior distribution.

Description

The present invention relates to a technique for extracting topics from documents.

A large number of documents are generated on the Internet every day, including text posted to microblogs such as Twitter (registered trademark), articles provided by news sites, and comments exchanged on SNSs (social networking services).
Buzztter is known as one technique for extracting matters that have become topical on the Internet. Buzztter extracts frequently occurring words from text posted to Twitter in real time and distributes them to users' terminals.

Latent Dirichlet Allocation (hereinafter "LDA"; see, for example, Non-Patent Document 1) is one topic model for extracting topics from documents, where a topic is a latent factor that governs the frequency with which words occur in a document. FIG. 1 is a diagram showing the graphical model of LDA. Table 1 shows the meaning of each symbol in FIG. 1.

LDA assumes that each document contains a plurality of topics, that each topic occurs in each document with a certain probability, that a plurality of words correspond to each topic, and that each word occurs in each topic with a certain probability. Under these assumptions, LDA characterizes each topic by a multinomial distribution of the number of occurrences of each word (hereinafter "word distribution") and characterizes each document by a multinomial distribution of the number of occurrences of each topic (hereinafter "topic distribution"). For example, suppose that a single text posted to Twitter (hereinafter "tweet") is treated as a document. Immediately after an earthquake, many documents containing topics related to earthquakes and tsunamis are generated. The earthquake topic is characterized by a multinomial distribution of the number of occurrences of words (including word combinations) such as "earthquake", "seismic intensity", and "earthquake early warning", and the tsunami topic is characterized by a multinomial distribution of the number of occurrences of words such as "tsunami", "height", and "arrival time".
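The generative assumptions described above can be illustrated with a minimal sketch. It is not taken from the publication: the function names, the Dirichlet parameter `alpha`, and the word probabilities are hypothetical values chosen only to show how a document's topic distribution and each topic's word distribution interact.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, alpha, length, rng):
    """Generate one document under the LDA assumptions.

    topic_word: one dict per topic, mapping word -> probability (word distribution)
    alpha:      Dirichlet parameters for the per-document topic distribution
    """
    theta = sample_dirichlet(alpha, rng)  # topic distribution of this document
    words = []
    for _ in range(length):
        # Choose a topic according to the document's topic distribution.
        k = rng.choices(range(len(theta)), weights=theta)[0]
        vocab = list(topic_word[k])
        probs = [topic_word[k][w] for w in vocab]
        # Choose a word according to that topic's word distribution.
        words.append(rng.choices(vocab, weights=probs)[0])
    return theta, words

# Hypothetical two-topic example (earthquake and tsunami); probabilities are invented.
topic_word = [
    {"earthquake": 0.5, "seismic intensity": 0.3, "earthquake early warning": 0.2},
    {"tsunami": 0.5, "height": 0.3, "arrival time": 0.2},
]
rng = random.Random(0)
theta, words = generate_document(topic_word, alpha=[0.5, 0.5], length=8, rng=rng)
```

Topic extraction runs this process in reverse: given only the observed words, it estimates the word distributions and topic distributions that most plausibly produced them.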

Conventional LDA is designed to extract topics from a specific document set given in advance. It therefore cannot capture changes in topics in a document set whose content changes over time. The Dynamic Topic Model (hereinafter "DTM"; see, for example, Non-Patent Document 2) extends LDA so that it can be applied to the analysis of such document sets. FIG. 2 is a diagram showing the graphical model of DTM. Here, t denotes the time at which a document set is acquired. A document set D^(t) is input at each acquisition time t, and a word distribution β^(t) and a topic distribution θ^(t) are generated for each acquisition time t. The distinguishing feature of DTM is that the word distribution β^(t-1) at acquisition time t-1 serves as the prior distribution of the word distribution β^(t) at acquisition time t; in other words, the current topics are extracted from the past topics. Comparing β^(t) with β^(t-1) therefore makes it possible to observe how topics change over time. In addition, because the influence of past topics is reflected in the current topics, the result is less overfitted to the current document set than when topics are extracted from the document set of a single acquisition time alone. This allows the topics contained in the document set at the next acquisition time t+1 to be predicted with high accuracy.

Non-Patent Document 1: D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 2003.
Non-Patent Document 2: D. M. Blei and J. D. Lafferty. Dynamic topic models. In Proc. of the 23rd International Conference on Machine Learning, 2006.
Non-Patent Document 3: G. Kitagawa. Monte Carlo filter and smoother for non-Gaussian nonlinear state space models. Journal of Computational and Graphical Statistics, 1996.

Patent Document 1: JP 2010-244264 A

In DTM, however, the prior distribution of the topics at the next acquisition time is generated from the current topic distribution, so the topic distribution at the next acquisition time may overfit the current topic distribution.
One conceivable way to address this problem is to prevent overfitting by mixing the current topic distribution with past topic distributions and thereby smoothing it. For example, Patent Document 1 proposes a technique that generates a prior distribution by mixing several past topics. However, it does not propose in what proportion past topics and current topics should be mixed for the result to be optimal.
The present invention has been made in view of the above background, and its object is to optimize the mixture ratio when a prior distribution is generated by mixing current topics and past topics.

The document analysis device according to claim 1 comprises: acquisition means for acquiring a document set from an external device; topic distribution generation means for generating, on the basis of a first prior distribution that is a multinomial distribution of the number of occurrences of each word included in a first document set acquired in the past by the acquisition means, a word distribution that is a multinomial distribution of the number of occurrences of each word included in a second document set acquired by the acquisition means after the first document set, and for generating a topic distribution that is a multinomial distribution of the number of occurrences of each topic characterized by the word distribution; mixture ratio calculation means for calculating, on the basis of the first prior distribution, a mixture ratio between the first prior distribution and the word distribution; prior distribution generation means for generating a second prior distribution by mixing the first prior distribution and the word distribution at the mixture ratio; and prior distribution accumulation means for accumulating the second prior distribution. When a third document set is acquired by the acquisition means after the second document set, the topic distribution generation means generates, on the basis of the second prior distribution, a word distribution that is a multinomial distribution of the number of occurrences of each word included in the third document set.

The document analysis device according to claim 2 is the document analysis device according to claim 1, wherein the mixture ratio calculation means assumes a first particle group corresponding to a plurality of mutually different mixture ratios that are candidates for the mixture ratio, and a first likelihood group consisting of likelihoods associated with the respective particles in the first particle group; scatters the first particle group as particles of a particle filter; calculates a second likelihood group by rescattering the first particle group on the basis of the first likelihood group; and calculates the mixture ratio as a weighted average of the first particle group with the second likelihood group as weights.

The document analysis method according to claim 3 comprises: a topic distribution generation step of generating, on the basis of a first prior distribution that is a multinomial distribution of the number of occurrences of each word included in a first document set acquired in the past, a word distribution that is a multinomial distribution of the number of occurrences of each word included in a second document set acquired after the first document set, and generating a topic distribution that is a multinomial distribution of the number of occurrences of each topic characterized by the word distribution; a mixture ratio calculation step of calculating, on the basis of the first prior distribution, a mixture ratio between the first prior distribution and the word distribution; a prior distribution generation step of generating a second prior distribution by mixing the first prior distribution and the word distribution at the mixture ratio; and a prior distribution accumulation step of accumulating the second prior distribution. In the topic distribution generation step, when a third document set is acquired after the second document set, a word distribution that is a multinomial distribution of the number of occurrences of each word included in the third document set is generated on the basis of the second prior distribution.

The program according to claim 4 causes a computer to function as: acquisition means for acquiring a document set from an external device; topic distribution generation means for generating, on the basis of a first prior distribution that is a multinomial distribution of the number of occurrences of each word included in a first document set acquired in the past by the acquisition means, a word distribution that is a multinomial distribution of the number of occurrences of each word included in a second document set acquired by the acquisition means after the first document set, and for generating a topic distribution that is a multinomial distribution of the number of occurrences of each topic characterized by the word distribution; mixture ratio calculation means for calculating, on the basis of the first prior distribution, a mixture ratio between the first prior distribution and the word distribution; prior distribution generation means for generating a second prior distribution by mixing the first prior distribution and the word distribution at the mixture ratio; and prior distribution accumulation means for accumulating the second prior distribution. When a third document set is acquired by the acquisition means after the second document set, the topic distribution generation means generates, on the basis of the second prior distribution, a word distribution that is a multinomial distribution of the number of occurrences of each word included in the third document set.

According to the present invention, the mixture ratio used when generating a prior distribution by mixing current topics and past topics can be optimized.

FIG. 1 is a diagram showing the graphical model of LDA.
FIG. 2 is a diagram showing the graphical model of DTM.
FIG. 3 is a diagram showing the configuration of the communication system 1.
FIG. 4 is a block diagram showing the hardware configuration of the document analysis device 30.
FIG. 5 is a block diagram showing the functional configuration of the document analysis device 30.
FIG. 6 is a diagram showing the graphical model of the present embodiment.
FIG. 7 is a diagram showing a method for estimating the mixture ratio using a particle filter.

An embodiment of the present invention will now be described.
(1) Configuration of the embodiment
FIG. 3 is a diagram showing the configuration of the communication system 1 according to the embodiment of the present invention. The communication system 1 includes a mobile communication network 10, a mobile communication device 20, the Internet 50 connected to the mobile communication network 10 via a gateway device 60, a document analysis device 30 connected to the gateway device 60, and a plurality of web server devices 40 connected to the Internet 50.

The mobile communication device 20 is a computer capable of communication, such as a mobile phone. It includes a control unit consisting of an arithmetic device such as a CPU (Central Processing Unit) and storage devices such as a ROM (Read Only Memory) and a RAM (Random Access Memory); a storage unit such as an EEPROM (Electrically Erasable Programmable ROM) or an SRAM (Static Random Access Memory) with a backup power supply; a wireless communication unit consisting of an antenna and a wireless communication circuit; an audio input/output unit consisting of a speaker, a microphone, and an audio processing circuit; an operation unit with operating elements such as a plurality of keys and a touch screen; and a display unit consisting of a liquid crystal panel and a liquid crystal drive circuit. In the mobile communication device 20, the control unit controls the communication unit in accordance with a user operation received by the operation unit, accesses a web server device 40 via the mobile communication network 10 and the Internet 50, acquires information stored in that web server device 40, and displays it on the display unit. The user can thereby access and browse various information on the Internet 50.

The mobile communication network 10 is a network that provides communication services to the mobile communication device 20. The mobile communication network 10 includes various nodes, namely base stations that perform wireless communication with mobile communication devices 20 located in their own radio cells, switching stations that route data transmitted within the network, and control stations that perform tasks such as location registration of the mobile communication devices 20, together with communication lines that interconnect these nodes.

The web server device 40 is a computer and includes a control unit consisting of an arithmetic device such as a CPU and storage devices such as a ROM and a RAM, a storage unit such as a hard disk device, and a communication unit connected to the Internet 50. The web server device 40 has a function of performing data communication with the mobile communication device 20 via the Internet 50 and the mobile communication network 10. The web server device 40 is a server that provides a microblog service. When a user registers with the web server device 40 as a user of this service, the user is permitted to post to the microblog. When the user posts text to the microblog using the mobile communication device 20, the web server device 40 returns the text to the mobile communication device 20, and the text is displayed on the display unit of the mobile communication device 20. If the user has registered other users, text posted by those users is also displayed on the display unit. In this way, a plurality of users can communicate with one another.

(2) Configuration of the document analysis device
FIG. 4 is a block diagram showing the hardware configuration of the document analysis device 30. The document analysis device 30 is a computer and includes a control unit 31, a communication unit 32, and a storage unit 33. The control unit 31 includes an arithmetic device such as a CPU and storage devices such as a ROM and a RAM. The CPU controls the operation of each unit of the document analysis device 30 by executing programs stored in the ROM or the storage unit 33, using the RAM as a work area.

The communication unit 32 includes a communication interface and is connected to the gateway device 60. The communication unit 32 acquires from the gateway device 60 the document sets that the gateway device 60 relays from the web server device 40 to the mobile communication device 20, and inputs them to the document analysis device 30. In other words, the communication unit 32 is an example of acquisition means for acquiring a document set from an external device.
The storage unit 33 is writable nonvolatile storage means, for example a hard disk device. The storage unit 33 stores programs describing the procedures of the processes executed by the control unit 31. The storage unit 33 also has a storage area for storing the information, acquired from the gateway device 60, that the web server device 40 distributes to the mobile communication device 20.

FIG. 5 is a block diagram showing the functional configuration of the document analysis device 30 according to the present embodiment. Table 2 shows the symbols used in the description of the embodiment and their meanings.

The document analysis device 30 has topic distribution generation means 301, mixture ratio calculation means 302, prior distribution generation means 303, and prior distribution accumulation means 304. The document analysis device 30 mainly executes four processes: generation of the topic distribution, calculation of the mixture ratio, generation of the prior distribution, and accumulation of the prior distribution.
The terms used in the following description are defined here.
In the present embodiment, text (text data) posted to a microblog is the object of analysis, and the text transmitted from the mobile communication device 20 to the web server device 40 in a single post is called a document. A document set is a set consisting of one or more documents.
A word is a word constituting a document and corresponds to a morpheme in Japanese. A word may be of any part of speech.

The acquisition time t is the time at which the document analysis device 30 acquires, via the gateway device 60, a document set distributed from the web server device 40 to the mobile communication device 20. Acquisition of a document set triggers the document analysis device 30 to execute the series of processes described below. The acquisition times may be set at fixed intervals, for example every 24 hours or every 6 hours. Alternatively, without fixing an interval, the administrator of the document analysis device 30 or the user of the mobile communication device 20 may instruct the document analysis device 30 to acquire a document set at any time.
The acquisition time t is represented by an integer. That is, if the document set currently being processed by the document analysis device 30 was acquired at time t, the document set processed in the previous round was acquired at time t-1.

The documents constituting a document set may be selected in any manner. For example, the 100,000 most recent posts may be acquired, going back from the latest posting time, or 100,000 posts may be sampled at random from the posts of the past 24 hours.
A new word is a word that is not included in the vocabulary sets generated in the past. A known word is a word that is included in a vocabulary set generated in the past.
The prior distribution is the prior probability distribution in Bayesian estimation; the posterior distribution is obtained by multiplying the prior distribution by the likelihood function.
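As a worked illustration of "prior times likelihood gives posterior", the sketch below uses a Dirichlet prior over word probabilities and multinomial word counts, for which the update reduces to adding counts. This pairing is an assumption made for illustration; the publication's own update equations are not reproduced in this text, and the function name and numbers are hypothetical.

```python
def dirichlet_posterior_mean(prior_pseudo_counts, observed_counts):
    """Posterior mean of word probabilities for a Dirichlet prior and multinomial counts.

    Multiplying a Dirichlet prior by a multinomial likelihood yields another
    Dirichlet whose parameters are the prior pseudo-counts plus the observed counts.
    """
    vocab = set(prior_pseudo_counts) | set(observed_counts)
    posterior = {
        w: prior_pseudo_counts.get(w, 0.0) + observed_counts.get(w, 0)
        for w in vocab
    }
    total = sum(posterior.values())
    return {w: c / total for w, c in posterior.items()}

# Hypothetical numbers: the prior treats both words as equally likely,
# then six occurrences of "earthquake" are observed.
mean = dirichlet_posterior_mean(
    {"earthquake": 2.0, "tsunami": 2.0},
    {"earthquake": 6},
)
```

With these numbers the posterior pseudo-counts are 8 and 2, so the posterior mean shifts toward "earthquake" while still retaining some probability for "tsunami" from the prior.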

Next, the contents of each process will be described.

(2.1) Generation of the topic distribution

(2.2) Calculation of the mixture ratio

The advantage of a particle filter is that, when estimating a value that changes over time, it does not require a state equation relating the state at the previous time to the input value at the current time. It can therefore be regarded as the most suitable method for the present problem, in which the input value at the current time is difficult to predict from the state at the previous time.

FIG. 7 is a diagram showing a method for estimating the mixture ratio using a particle filter. At acquisition time t-1, N candidates ρ_i^(t-1) (i = 1, 2, ..., N) corresponding to mutually different mixture ratios are assumed, and each is scattered as a particle of the particle filter. Each particle is associated with a likelihood w_i^(t-1) with respect to the optimal estimate ρ_i^(t) of that particle at acquisition time t.
At time t, the mixture ratio calculation means 302 first rescatters the particles on the basis of their likelihoods. Specifically, treating the likelihoods as probabilities, it draws particles N times with replacement. Particles with high likelihood may therefore be scattered several times, while particles with low likelihood may not be scattered at all and disappear.
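The rescattering step described above amounts to sampling N particles with replacement, with probabilities proportional to the likelihoods. The following is a minimal sketch under that reading; the function name and the example values are hypothetical.

```python
import random

def resample(particles, likelihoods, rng):
    """Draw len(particles) particles with replacement, proportionally to likelihood.

    High-likelihood particles tend to be drawn several times;
    low-likelihood particles may not be drawn at all and thus disappear.
    """
    return rng.choices(particles, weights=likelihoods, k=len(particles))

# Hypothetical candidates for the mixture ratio and their likelihoods.
particles = [0.1, 0.3, 0.5, 0.7, 0.9]
likelihoods = [0.05, 0.10, 0.60, 0.20, 0.05]
resampled = resample(particles, likelihoods, random.Random(0))
```

`random.choices` normalizes the weights internally, so the likelihoods do not need to sum to one.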

Next, the mixture ratio calculation means 302 moves each particle by a random-walk step. The step distance is drawn from a normal distribution whose standard deviation is σ, a fixed value given in advance. Because ρ_i^(t) must lie between 0 and 1 inclusive, a distance that would take a particle outside this range is discarded and a new distance is drawn in the same way. The particles obtained in this manner represent the state of the particles at acquisition time t.
Next, a new likelihood w_i^(t) is calculated for each particle. The method of calculating the likelihood is described later. The mixture ratio ρ^(t) at time t is calculated as the weighted average of the particles with the likelihoods as weights, that is, ρ^(t) = Σ_i w_i^(t) ρ_i^(t) / Σ_i w_i^(t).
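The random-walk step and the weighted average can be sketched as follows. The zero mean of the normal distribution and the function names are assumptions made for illustration; the likelihood calculation itself is not shown here, so the weights in the usage example are invented.

```python
import random

def random_walk(particles, sigma, rng):
    """Move each particle by a normally distributed step, keeping it within [0, 1].

    A step that would leave [0, 1] is rejected and redrawn.
    Assumes every input particle already lies in [0, 1].
    """
    moved = []
    for rho in particles:
        while True:
            candidate = rho + rng.gauss(0.0, sigma)  # zero-mean step (assumption)
            if 0.0 <= candidate <= 1.0:
                moved.append(candidate)
                break
    return moved

def weighted_mean(particles, likelihoods):
    """Estimate the mixture ratio as the likelihood-weighted average of the particles."""
    total = sum(likelihoods)
    return sum(w * p for w, p in zip(likelihoods, particles)) / total

rng = random.Random(0)
moved = random_walk([0.1, 0.5, 0.9], sigma=0.1, rng=rng)
estimate = weighted_mean(moved, [0.2, 0.5, 0.3])  # invented likelihoods
```

With σ = 0 the particles never move, so the filter can only reweight and resample the initial candidates; a larger σ lets the estimate track faster changes at the cost of more noise.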

In short, the mixture ratio calculation means 302 is an example of means for calculating, on the basis of the first prior distribution, the mixture ratio between the first prior distribution and the word distribution.
The mixture ratio calculation means 302 can also be specified as means that assumes a first particle group corresponding to a plurality of mutually different mixture ratios that are candidates for the mixture ratio, and a first likelihood group consisting of likelihoods associated with the respective particles in the first particle group; scatters the first particle group as particles of a particle filter; calculates a second likelihood group by rescattering the first particle group on the basis of the first likelihood group; and calculates the mixture ratio as a weighted average of the first particle group with the second likelihood group as weights.

(2.3) Generation of the prior distribution
Generation of the prior distribution is executed by the prior distribution generation means 303.
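The mixing formula itself is not reproduced in this text. One plausible reading, sketched below, is a convex combination of the first prior distribution and the word distribution. Which of the two is weighted by ρ and which by 1 - ρ is an assumption, as are the function name and the example values.

```python
def mix_prior(first_prior, word_distribution, rho):
    """Mix two word distributions into a second prior distribution.

    Assumption: rho weights the first (past) prior and 1 - rho weights the
    current word distribution. Words missing from one side count as 0 there.
    """
    vocab = set(first_prior) | set(word_distribution)
    return {
        w: rho * first_prior.get(w, 0.0) + (1.0 - rho) * word_distribution.get(w, 0.0)
        for w in vocab
    }

# Hypothetical distributions over a two-word vocabulary.
second_prior = mix_prior(
    {"earthquake": 0.5, "tsunami": 0.5},
    {"earthquake": 1.0},
    rho=0.4,
)
```

If both inputs sum to one, so does the result, which lets the second prior distribution be accumulated and reused directly at the next acquisition time.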

(2.4) Accumulation of the prior distribution

In addition, when a third document set is acquired by the acquisition means after the second document set, the topic distribution generation means 301 generates, on the basis of the second prior distribution, a word distribution that is a multinomial distribution of the number of occurrences of each word included in the third document set.

As described above, by smoothing the prior distribution, the present embodiment can generate a more accurate topic distribution than conventional DTM. Here, "accurate" means that the documents are modeled well. This can be evaluated quantitatively by perplexity (Non-Patent Document 2). Perplexity is an index of a model's accuracy on test documents; the smaller the value, the better the model. Here, we evaluate how well the topic distribution at acquisition time t-1 models the documents acquired at acquisition time t. In this case, the perplexity is expressed by the following equation.
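The equation from the publication is not reproduced in this text. As a point of reference, the sketch below computes perplexity by its standard definition, the exponential of the negative average log-probability per word; it may differ in detail from the publication's formula. The function names and the `word_prob` callback are hypothetical.

```python
import math

def perplexity(documents, word_prob):
    """Standard perplexity: exp(-(sum of log p(w)) / (number of words)).

    documents: list of documents, each a list of words
    word_prob: function returning the model probability (> 0) of a word
    """
    log_likelihood = 0.0
    n_words = 0
    for doc in documents:
        for w in doc:
            log_likelihood += math.log(word_prob(w))
            n_words += 1
    return math.exp(-log_likelihood / n_words)

# A model that assigns probability 1/4 to every word is as uncertain
# as a uniform choice among four words.
uniform_perplexity = perplexity([["a", "b"], ["c"]], lambda w: 0.25)
```

For a topic model, `word_prob` would combine the topic distribution and the word distributions estimated at acquisition time t-1, and `documents` would be the set acquired at time t.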

Table 3 shows the results of implementing the system of the present embodiment and evaluating its perplexity. For this evaluation, 200,000 Twitter posts per day were collected from February to March 2011, and perplexity was measured for three methods applied to that data: ordinary LDA, which does not take changes in topics over time into account; DTM; and the present embodiment. DTM and the present embodiment were run from February 1 to March 31, 2011, and perplexity was derived using the topic distribution for March 30 and the documents of March 31. For ordinary LDA, the topic distribution was derived from the March 30 documents only, and perplexity was likewise derived using the documents of March 31. Furthermore, because the performance of the present embodiment depends on the random-walk distance σ of the particle filter, perplexity was calculated for four values: σ = 0, 0.05, 0.1, and 0.15.

These results show that DTM performs far better than ordinary LDA, and that the system of the present embodiment performs better still than DTM. They also show that, for the data set used in this evaluation, the best performance is obtained when σ = 0.1.

(3) Modifications
The above embodiment may be modified as follows. The following modifications may also be combined.
(3.1) Modification 1
The embodiment describes an example of analyzing a document set posted to a microblog, but other types of document sets may be analyzed.
For example, when the web server device 40 is a server that distributes news articles, the articles distributed during a specific period may be acquired as a document set, and the same processing as in the embodiment may be performed on that document set. Likewise, when the web server device 40 is a server that manages an SNS, the comments exchanged on the SNS during a specific period may be acquired as a document set, and the same processing as in the embodiment may be performed on that document set.

(3.2) Modification 2
The embodiment describes an example in which the document delivery destination is a mobile communication device, but the delivery destination may be any device. For example, it may be a stationary computer connected to the Internet.
The embodiment describes an example in which the control unit 31 of the document analysis device 30 performs the processing by executing a program, but the same functions may be implemented in hardware. The program may also be provided recorded on a computer-readable recording medium such as an optical recording medium or a semiconductor memory, and read from that recording medium and stored in the storage unit 33 of the document analysis device 30. The program may also be provided via a telecommunication line.

DESCRIPTION OF SYMBOLS 1 ... Communication system, 10 ... Mobile communication network, 20 ... Mobile communication device, 30 ... Document analysis device, 40 ... Web server device, 50 ... Internet, 60 ... Gateway device, 31 ... Control unit, 32 ... Communication unit, 33 ... Storage unit, 301 ... Topic distribution generation unit, 302 ... Mixture ratio calculation unit, 303 ... Prior distribution generation unit, 304 ... Prior distribution accumulation unit

Claims

A document analysis apparatus comprising:
an acquisition means for acquiring a document set from an external device;
a topic distribution generation means for generating, based on a first prior distribution that is a multinomial distribution of the number of occurrences of each word included in a first document set acquired by the acquisition means in the past, a word distribution that is a multinomial distribution of the number of occurrences of each word included in a second document set acquired by the acquisition means after the first document set, and for generating a topic distribution that is a multinomial distribution of the number of occurrences of each topic characterized by the word distribution;
a mixing ratio calculating means for calculating a mixing ratio between the first prior distribution and the word distribution based on the first prior distribution;
a prior distribution generating means for generating a second prior distribution by mixing the first prior distribution and the word distribution at the mixing ratio; and
a prior distribution accumulation means for accumulating the second prior distribution,
wherein, when the acquisition means acquires a third document after the second document set, the topic distribution generation means generates, based on the second prior distribution, a word distribution that is a multinomial distribution of the number of occurrences of each word included in the third document set.

The document analysis apparatus according to claim 1, wherein, taking a first particle group to be particles corresponding to a plurality of different mixture ratios that are candidates for the mixture ratio, and a first likelihood group to be the likelihoods associated with the respective particles included in the first particle group, the mixture ratio calculation means scatters the first particle group as particles in a particle filter, re-scatters the first particle group based on the first likelihood group to calculate a second likelihood group, and calculates the mixture ratio as a weighted average of the first particle group using the second likelihood group as weights.

A document analysis method comprising:
a topic distribution generation step of generating, based on a first prior distribution that is a multinomial distribution of the number of occurrences of each word included in a first document set acquired in the past, a word distribution that is a multinomial distribution of the number of occurrences of each word included in a second document set acquired after the first document set, and generating a topic distribution that is a multinomial distribution of the number of occurrences of each topic characterized by the word distribution;
a mixing ratio calculating step of calculating a mixing ratio between the first prior distribution and the word distribution based on the first prior distribution;
a prior distribution generation step of generating a second prior distribution by mixing the first prior distribution and the word distribution at the mixing ratio; and
a prior distribution accumulation step of accumulating the second prior distribution,
wherein, in the topic distribution generation step, when a third document is acquired after the second document set, a word distribution that is a multinomial distribution of the number of occurrences of each word included in the third document set is generated based on the second prior distribution.

A program for causing a computer to function as:
an acquisition means for acquiring a document set from an external device;
a topic distribution generation means for generating, based on a first prior distribution that is a multinomial distribution of the number of occurrences of each word included in a first document set acquired by the acquisition means in the past, a word distribution that is a multinomial distribution of the number of occurrences of each word included in a second document set acquired by the acquisition means after the first document set, and for generating a topic distribution that is a multinomial distribution of the number of occurrences of each topic characterized by the word distribution;
a mixing ratio calculating means for calculating a mixing ratio between the first prior distribution and the word distribution based on the first prior distribution;
a prior distribution generating means for generating a second prior distribution by mixing the first prior distribution and the word distribution at the mixing ratio; and
a prior distribution accumulation means for accumulating the second prior distribution,
wherein, when the acquisition means acquires a third document after the second document set, the topic distribution generation means generates, based on the second prior distribution, a word distribution that is a multinomial distribution of the number of occurrences of each word included in the third document set.